Questions tagged [hadoop]
14856 questions
0 votes · 0 answers · 7 views
Run shell script from local directory instead of HDFS via Oozie
I want to run a shell script from the local path (edge node) instead of an HDFS directory via Oozie. My local shell script contains ssh steps which I can't run from an HDFS directory.
XYZ is the user ID and xxxx is the server (edge node). I used the below action in the workflow, but it is not working. Please hel...
0 votes · 0 answers · 3 views
Hive: escape NULL or blank strings with concat_ws
Is there a way to escape NULL or blank values while using concat_ws? I have data that comes out like ,20000 and I want to remove the comma for that single value.
Eg: ID value
1 AAA
1 BBBB
2
2 CCCC
3 AAA
4 CCCD
4 DEDED
4
Current result: after using concat_ws with , as the separator and c...
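Hive's concat_ws skips NULL elements but not empty strings, which is why an empty value leaves a dangling comma. A minimal pure-Python sketch of that behavior, and of the usual workaround of turning empty strings into NULLs first (e.g. via nullif(value, '') or a CASE expression) before concatenating:

```python
def concat_ws(sep, values):
    # Mimic Hive's concat_ws: NULLs (None) are skipped, empty strings are not.
    return sep.join(v for v in values if v is not None)

def nullif_empty(values):
    # Mimic nullif(value, ''): turn empty strings into None so concat_ws skips them.
    return [None if v == "" else v for v in values]

# An empty string reproduces the dangling-comma result from the question...
print(concat_ws(",", ["", "20000"]))                # ,20000
# ...while converting empties to NULL first removes it.
print(concat_ws(",", nullif_empty(["", "20000"])))  # 20000
```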
1 vote · 2 answers · 6.9k views
Error: Could not find or load main class org.apache.hadoop.hdfs.server.datanode.DataNode
I am new to Apache Hadoop. I am installing a multi-node cluster, but I am getting two errors. I do not know what kind of errors these are or why they are generated. I have googled a lot about the errors but was not able to find the reason behind them.
Error: Could not fi...
1 vote · 2 answers · 35 views
Update a dataframe with nested fields - Spark
I have two dataframes like the ones below:
Df1
+----------------------+---------+
|products |visitorId|
+----------------------+---------+
|[[i1,0.68], [i2,0.42]]|v1 |
|[[i1,0.78], [i3,0.11]]|v2 |
+----------------------+---------+
Df2
+---+----------+
| id| name|
+---+----------...
1 vote · 2 answers · 4.1k views
Saving files in Spark
There are two operations on an RDD for saving: one is saveAsTextFile and the other is saveAsObjectFile. I understand saveAsTextFile, but not saveAsObjectFile. I am new to Spark and Scala, so I am curious about saveAsObjectFile.
1) Is it a sequence file from Hadoop or something different?
2) Can I read...
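For background: saveAsObjectFile stores serialized objects inside Hadoop SequenceFiles (Java serialization in the Scala/Java API; PySpark's closest analogue is saveAsPickleFile). The difference from saveAsTextFile, sketched in plain Python with pickle as an assumed stand-in for the serialization step:

```python
import pickle

# saveAsTextFile keeps each element's string form; saveAsObjectFile keeps
# serialized objects, so arbitrary types round-trip without hand parsing.
records = [(1, "AAA"), (2, "BBBB")]

as_text = [str(r) for r in records]              # what a text save would keep
as_objects = [pickle.dumps(r) for r in records]  # what an object save keeps, conceptually

# Object-style records deserialize back to the original values...
assert [pickle.loads(b) for b in as_objects] == records
# ...while text-style records are just strings that must be parsed.
print(as_text[0])  # (1, 'AAA')
```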
1 vote · 1 answer · 5k views
How to compute the intersections and unions of two arrays in Hive?
For example, the intersection
select intersect(array('A','B'), array('B','C'))
should return
['B']
and the union
select union(array('A','B'), array('B','C'))
should return
['A','B','C']
What's the best way to do this in Hive? I have checked the Hive documentation but cannot find any relevant info...
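Hive itself has no built-in intersect or union for arrays (Spark SQL added array_intersect and array_union in 2.4), so answers typically explode the arrays with a lateral view or use a UDF. The set logic such a UDF would implement, sketched in Python (the function names mirror the question and are not Hive built-ins):

```python
def array_intersect(a, b):
    # Elements of a that also occur in b, in a's order, without duplicates.
    b_set, seen, out = set(b), set(), []
    for x in a:
        if x in b_set and x not in seen:
            out.append(x)
            seen.add(x)
    return out

def array_union(a, b):
    # All distinct elements of a then b, in first-seen order.
    seen, out = set(), []
    for x in a + b:
        if x not in seen:
            out.append(x)
            seen.add(x)
    return out

print(array_intersect(["A", "B"], ["B", "C"]))  # ['B']
print(array_union(["A", "B"], ["B", "C"]))      # ['A', 'B', 'C']
```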
1 vote · 2 answers · 1.1k views
Hadoop installation on Windows?
Could anyone provide me with a step-by-step tutorial on how to install Hadoop on Windows?
I read the official manual, but I can't tell from it where I should enter the scripts it describes.
1 vote · 1 answer · 1.2k views
How to merge small files in spark while writing into hive orc table
I am reading CSV files from S3 and writing into a Hive table as ORC. While writing, it creates a lot of small files. I need to merge all these files. I have the following properties set:
spark.sql('SET hive.merge.sparkfiles = true')
spark.sql('SET hive.merge.mapredfiles = true')
spark.sql('SET hive.mer...
1 vote · 1 answer · 687 views
How to combine multiple ORC files (belonging to each partition) in a Partitioned Hive ORC table into a single big ORC file
I have a partitioned ORC table in Hive. After loading the table with all possible partitions, I get multiple ORC files on HDFS, i.e. each partition directory on HDFS has an ORC file in it. I need to combine all these ORC files under each partition into a single big ORC file for a use case.
Can some...
1 vote · 1 answer · 303 views
Error on check-env.sh installing Apache Kylin on Hortonworks
I'm trying to install Apache Kylin on a Hortonworks Sandbox following the instructions provided in the Apache Kylin install guide.
I set export KYLIN_HOME='/root/kylin' in my .bashrc (this folder contains the Kylin binaries).
In step 3 it says to run bin/check-env.sh to check for an environment issu...
1 vote · 2 answers · 1.1k views
Hive View Not Opening
In the Ambari UI of the Hortonworks sandbox, I was trying to open Hive View through the maria_dev account. However, I was getting the following error:
Service Hive check failed:
Cannot open a hive connection with connect string
jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscovery...
1 vote · 2 answers · 980 views
How can we delete specific rows from HDFS?
We have a huge number of text files containing information about clients. We have to delete specific rows from these HDFS files; for example, rows associated with clients X, Y, and Z, keeping the others.
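HDFS files are append-only, so "deleting rows" is normally done by filtering the data into new files and replacing the originals (e.g. a Hive INSERT OVERWRITE or a Spark filter-and-rewrite job). The filtering step itself is just a predicate over lines, sketched in Python with a hypothetical record format where the client ID is the first comma-separated field:

```python
EXCLUDED_CLIENTS = {"X", "Y", "Z"}

def client_of(line):
    # Hypothetical format: client ID is the first comma-separated field.
    return line.split(",", 1)[0]

def keep(line):
    # True for rows that should survive the rewrite.
    return client_of(line) not in EXCLUDED_CLIENTS

lines = ["X,order1", "A,order2", "Y,order3", "B,order4"]
kept = [l for l in lines if keep(l)]
print(kept)  # ['A,order2', 'B,order4']
```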
1 vote · 1 answer · 620 views
Hadoop BlockMissingException
I am getting the error below:
Diagnostics: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-467931813-10.3.20.155-1514489559979:blk_1073741991_1167 file=/user/oozie/share/lib/lib_20171228193421/oozie/hadoop-auth-2.7.2-amzn-2.jar
Failing this attempt. Failing the application.
Alth...
1 vote · 0 answers · 851 views
Airflow HiveOperator Result Set
I'm new to both Airflow and Python, and I'm trying to configure a scheduled report. The report needs to pull data from Hive and email the results.
My code thus far:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.hive_operator import HiveOperator
default_args...
1 vote · 1 answer · 215 views
Not able to use HBaseTestingUtility with CDH 5.7
I am trying to use HBaseTestingUtility with CDH 5.7 as described in the blog post and GitHub repo below:
http://blog.cloudera.com/blog/2013/09/how-to-test-hbase-applications-using-popular-tools/
https://github.com/sitaula/HBaseTest
I have modified my pom.xml for CDH 5.7 as below:
4.0.0
HBaseTest
Test
0.0.1-SN...
1 vote · 0 answers · 250 views
How to convert a DataFrame to a JavaRDD with a distributed copy?
I'm new to Spark optimization.
I'm trying to read Hive data into a DataFrame, then converting the DataFrame to a JavaRDD and running a map function on top of it.
The problem I'm facing is that the transformation running on top of the JavaRDD runs with a single task. Also, the transformations running on...
1 vote · 0 answers · 57 views
MarkLogic 9 with Hadoop?
I integrated MarkLogic 9 with Hadoop using the MarkLogic connector. I want to load data from my local machine into MarkLogic using Hadoop. The documentation mentions two ways to load data using Hadoop:
Importing data from HDFS to MarkLogic using MLCP
Exporting data from MarkLogic to HDFS using MLCP
What I w...
1 vote · 0 answers · 166 views
Insert JSON file into HBase using Hive
I have a simple JSON file that I would like to insert into an HBase table.
My JSON file has the following format:
{
  "word1": {
    "doc_01": 4,
    "doc_02": 7
  },
  "word2": {
    "doc_06": 1,
    "doc_02": 3,
    "doc_12": 8
  }
}
The HBase table is called inverted_index; it has one column family, matches.
I would like to...
1 vote · 2 answers · 144 views
Nutch 1.14 deduplication failed
I have integrated Nutch 1.14 with Solr 6.6.0 on CentOS Linux release 7.3.1611. I put about 10 URLs in a seed list at /usr/local/apache-nutch-1.13/urls/seed.txt and followed the tutorial.
[[email protected] apache-nutch-1.14]# bin/nutch dedup http://ip:8983/solr/
DeduplicationJob: starting...
1 vote · 0 answers · 1.4k views
GPU resource for hadoop 3.0 / yarn
I am trying to use the Hadoop 3.0 GA release with GPUs, but when I execute the shell command below, there is an error and the GPUs are not used. Please check the output and shell command below. I suspect there is a misconfiguration on my side.
2018-01-09 15:04:49,256 INFO [main] distributedshe...
1 vote · 0 answers · 162 views
Crystal Reports integration with Hadoop/Hive/HPLSQL
We are migrating data from Oracle to Hadoop, and there is a requirement to continue using the existing reporting tool (Crystal Reports) to generate reports from Hadoop instead of Oracle.
In the current scenario we use an Oracle stored procedure to do a few aggregations and pieces of logic.
Now with the above requir...
1 vote · 1 answer · 213 views
Hive on tez in EMR schedule tasks very slow
I'm trying to use Hive on Tez to query ORC-format data stored in S3. The Tez AM schedules tasks very slowly; a lot of map tasks remain in 'PENDING' for a long time.
There were enough resources in the cluster (more than enough, I would say: over 6 TB of memory and more than a thousand vcores avail...
1 vote · 1 answer · 483 views
Launch TDCH to load data from a Hive Parquet table to Teradata
I need to load data from Hive tables stored as Parquet files into a Teradata database using TDCH (Teradata Connector for Hadoop). I use TDCH 1.5.3, CDH 5.8.3, and Hive 1.1.0.
I try to start TDCH using the hadoop jar command and get the error:
java.lang.ClassNotFoundException:
org.apache.parquet.h...
1 vote · 0 answers · 45 views
Hadoop Streaming with Python doesn't work
I'm following Edureka's tutorial on Hadoop Streaming with Python. Everything is fine until I run the script:
`hadoop jar /home/carlos/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -file /home/carlos/mapper.py -mapper mapper.py -file /home/carlos/reducer.py -reducer reducer.py -inpu...
1 vote · 0 answers · 184 views
Spark : Why some executors are having 0 active tasks and some 13 tasks?
I am trying to read from S3 and do a count on the data frame. I have a cluster of 76 r3.4xlarge nodes (1 master and 75 slaves). I set:
spark.dynamicAllocation.enabled 'true'
maximizeResourceAllocation 'true'
When I checked the Spark UI, I saw:
Just 25 executors, and of those only 7 have active ta...
1 vote · 1 answer · 588 views
Spark Small ORC Stripes
We use Spark to flatten out clickstream data and then write it to S3 in ORC+zlib format. I have tried changing many settings in Spark, but the resulting stripe sizes of the ORC files being created are still very small (
1 vote · 0 answers · 117 views
Unable to load data from multiple level directories into Hive table
I created a table in the following way:
CREATE TABLE `default.tmptbl` (id int, name string) ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES (
'escapeChar'='\\','quoteChar'='\'','separatorChar'=',');
And I have data in HDFS that has been structured in the following way...
1 vote · 1 answer · 1.2k views
Can not connect to ZooKeeper/Hive from host to Sandbox Hortonworks HDP VM
I downloaded the HDP Sandbox (in an Oracle VirtualBox VM) a while ago, never used it much, and I'm now trying to access its data from the outside world using Hive JDBC.
I use hive-jdbc 1.2.2 from Apache, which I got from mvnrepository, with all the dependencies on the classpath, or the Hortonworks JDBC driver got fr...
1 vote · 1 answer · 297 views
Hive sql struct mismatch
I have a table with a column of array type whose items are structs with these fields:
cars (string)
isRed (boolean)
information (bigint)
When I perform the following query
select myfield1.isRed
from mytable
where myfield1.isRed = true
I get an error:
Argument type mismatch '': The 1st argument of EQUAL is expected to...
1 vote · 1 answer · 99 views
Hadoop datanode process listens on random port on localhost
Each DataNode process in my Hadoop cluster is listening on 4 ports.
Three of them are well known (50010, 50020, 50075), but the fourth is chosen at random and bound to localhost. Can anyone shed some light on what uses this port, and whether it is a configurable parameter?
Here are the relevant lines from...
1 vote · 0 answers · 293 views
Hadoop: Secondary NameNode Permission Denied
I'm attempting to run Hadoop in pseudo-distributed mode to learn how the system works. To install it, I downloaded Hadoop 3.0.0 from the site and untarred it. I've done my configuration as follows (leaving out the configuration tags for brevity):
core-site.xml
fs.defaultFS
hdfs://localhost/
hdfs-sit...
1 vote · 1 answer · 352 views
Sample Pyspark program returns [WinError 2] The system cannot find the file
Here is the code I am trying to run. I have set the paths for Spark, Hadoop, Java, and Python. I am using Java 8, Spark 2.2.1, and Hadoop 2.7.5.
import random
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('MyFirstStandaloneApp')
sc = SparkContext(conf=conf)
NUM_SAMPLES = 20
def...
1 vote · 0 answers · 257 views
hadoop master host fails to connect to localhost: connection refused
I've set up HDFS with two nodes on different hosts in the same network. I'm using the HDFS C++ API. The HDFS NameNode and DataNodes start normally, but when I try to read any data or open a file, I get the following error:
Call From master/192.168.X.X to localhost:54310 failed on connection exception: c...
1 vote · 0 answers · 233 views
How to compare elements of an array with string in hive
I have created a table with the complex data type array in Hive. The query is:
create table testivr (
mobNo string,
callTime string,
refNo int,
callCat string,
menus array ,
endType string,
duration int,
transferNode string
)
row format delimited
fields terminated by ','
collection items terminated by...
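For comparing array elements with a string, Hive's built-in array_contains(array, value) is the usual tool, e.g. WHERE array_contains(menus, 'billing'). Its semantics, sketched in Python (the menu values here are made-up illustrations, not from the question):

```python
def array_contains(arr, value):
    # Hive's array_contains: true if value equals any element of the array.
    return value in arr

menus = ["main", "billing", "support"]
print(array_contains(menus, "billing"))  # True
print(array_contains(menus, "sales"))    # False
```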
1 vote · 1 answer · 815 views
Get HDP version through Spark
We installed a new Spark version, so all the folders are named similar to:
ls /etc/hadoop/
2.6.4.0-91 conf conf.backup
and from spark-submit we get
spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0.2.6.4...
1 vote · 1 answer · 491 views
Kafka Connect HDFS Sink with Azure Blob Storage
I want to connect to Azure Blob Storage with the Kafka HDFS Sink Connector.
So far I have:
Set the kafka-connect properties:
hdfs.url=wasbs://
hadoop.conf.dir={hadoop_3_home}/etc/hadoop/
hadoop.home={hadoop_3_home}
And in core-site.xml added support for wasbs:
fs.wasbs.impl
org.apache.hadoop.fs.azure.N...
1 vote · 1 answer · 80 views
Unable to set up hadoop on my local machine
I am trying to install Hadoop as a pseudo-distributed standalone application on my MacBook, and I have been seeing errors.
When I try to execute sbin/start-dfs.sh, I get the following error:
$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: Connection closed by ::1 port 22
Starting datanodes
loc...
1 vote · 1 answer · 285 views
Flink write to S3 on EMR
I am trying to write some output to S3 using EMR with Flink. I am using Scala 2.11.7, Flink 1.3.2, and EMR 5.11. However, I got the following error:
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V
at com.amazon.ws.emr.hadoop.fs....
1 vote · 2 answers · 735 views
COUNT() OVER possible using DISTINCT and WINDOWING IN HIVE
I want to calculate the number of distinct port numbers that exist between the current row and the X previous rows (a sliding window), where X can be any integer.
For instance,
If the input is:
ID PORT
1 21
2 22
3 23
4 25
5 25
6 21
The outpu...
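Hive does not allow COUNT(DISTINCT ...) with an OVER clause; a common workaround is size(collect_set(port) OVER (ROWS BETWEEN X PRECEDING AND CURRENT ROW)). The intended computation, sketched in Python for X = 3 over the sample ports from the question:

```python
def distinct_in_window(ports, x):
    # For each row, count distinct values among the current row and the x previous rows.
    return [len(set(ports[max(0, i - x): i + 1])) for i in range(len(ports))]

ports = [21, 22, 23, 25, 25, 21]
print(distinct_in_window(ports, 3))  # [1, 2, 3, 4, 3, 3]
```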
1 vote · 0 answers · 357 views
How can I find the latest partition in Impala tables?
I need to collect incremental stats on a table frequently; for that, I need to determine the latest partitions for the variable below:
compute incremental stats someSchema.someTable partition (partitionColName=${value});
I have a few options, which I don't want to use for stability and perfo...
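One common approach is to run SHOW PARTITIONS (e.g. from impala-shell or a client library), then take the maximum partition value and substitute it into the COMPUTE INCREMENTAL STATS statement. The selection step, sketched in Python over hypothetical SHOW PARTITIONS values:

```python
def latest_partition(partition_values):
    # partition_values: the partition-key column from SHOW PARTITIONS output.
    # Assumes the values sort correctly as strings (zero-padded dates do).
    return max(partition_values)

print(latest_partition(["20231230", "20240101", "20231115"]))  # 20240101
```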