Questions tagged [hadoop]

14856 questions
0 votes · 0 answers · 7 views

Run shell script from local directory instead of HDFS via Oozie

I want to run a shell script from a local path (edge node) instead of an HDFS directory via Oozie. My local shell script contains SSH steps which I can't run from an HDFS directory. XYZ is the user ID and xxxx is the server (edge node). I used the action below in the workflow, but it is not working. Please hel...
vjrock99
0 votes · 0 answers · 3 views

Hive: escape null or blank strings with concat_ws

Is there a way to escape a null separator while using concat_ws? I have data that is populating like ,20000 and I want to remove the comma for the single population. E.g.: ID value 1 AAA 1 BBBB 2 2 CCCC 3 AAA 4 CCCD 4 DEDED 4 Current result: after using concat_ws with , as separator and c...
Seshi Kumar
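
One caveat worth noting here: Hive's concat_ws already skips NULL elements, so a stray leading comma usually comes from empty strings. A minimal PySpark sketch of one workaround, assuming a hypothetical table t(id, value): filter out blanks before collecting.

    from pyspark.sql import SparkSession

    # Hypothetical table/column names; drop empty strings so concat_ws
    # never emits a bare separator (it skips NULLs, but not '').
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    spark.sql("""
        SELECT id, concat_ws(',', collect_list(value)) AS value_csv
        FROM t
        WHERE value IS NOT NULL AND value != ''
        GROUP BY id
    """).show()
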
1 vote · 2 answers · 6.9k views

Error: Could not find or load main class org.apache.hadoop.hdfs.server.datanode.DataNode

I am new to Apache Hadoop. I am installing a multi-node cluster, but I am getting two errors. I am not aware of what kind of errors these are or why they were generated. I have googled a lot about the errors, but I was not able to find the reason behind them. Error: Could not fi...
1 vote · 2 answers · 35 views

Update a dataframe with nested fields - Spark

I have two dataframes like below Df1 +----------------------+---------+ |products |visitorId| +----------------------+---------+ |[[i1,0.68], [i2,0.42]]|v1 | |[[i1,0.78], [i3,0.11]]|v2 | +----------------------+---------+ Df2 +---+----------+ | id| name| +---+----------...
yAsH
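
A hedged sketch of one way to do this kind of nested update: explode the products array, join each item id against Df2, and re-collect per visitor. The struct field names (_1, _2) are assumptions, since the excerpt truncates the schema.

    from pyspark.sql import functions as F

    # Assumed: df1(visitorId, products: array<struct<_1:string,_2:double>>),
    # df2(id, name). Explode, join on the item id, rebuild the array.
    exploded = df1.select("visitorId", F.explode("products").alias("p"))
    joined = exploded.join(df2, F.col("p._1") == F.col("id"), "left")
    result = joined.groupBy("visitorId").agg(
        F.collect_list(F.struct("p._1", "name", "p._2")).alias("products"))
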
1 vote · 2 answers · 4.1k views

Saving files in Spark

There are two save operations on an RDD. One is saveAsTextFile and the other is saveAsObjectFile. I understand saveAsTextFile, but not saveAsObjectFile. I am new to Spark and Scala, hence I am curious about saveAsObjectFile. 1) Is it a sequence file from Hadoop or something different? 2) Can I read...
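
For what it's worth: in Scala/Java, saveAsObjectFile writes a Hadoop SequenceFile whose values are Java-serialized objects, so objectFile can read it back, but it is not portable across languages. The PySpark analogue, as a minimal sketch (assumes an existing SparkContext sc; the path is illustrative):

    # saveAsPickleFile is PySpark's counterpart: pickled objects in a
    # SequenceFile container, readable back with sc.pickleFile.
    rdd = sc.parallelize([("a", 1), ("b", 2)])
    rdd.saveAsPickleFile("/tmp/obj_demo")
    back = sc.pickleFile("/tmp/obj_demo").collect()
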
1 vote · 1 answer · 5k views

How to compute the intersections and unions of two arrays in Hive?

For example, the intersection select intersect(array('A','B'), array('B','C')) should return ['B'] and the union select union(array('A','B'), array('B','C')) should return ['A','B','C'] What's the best way to do this in Hive? I have checked the Hive documentation, but cannot find any relevant info...
Osiris
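
Older Hive has no built-in array intersect/union (UDF collections such as Brickhouse add them). One pure-SQL workaround, sketched via spark.sql and assuming a hypothetical table t(id, a, b) with two array columns:

    # Intersection: explode one array and keep elements the other contains.
    spark.sql("""
        SELECT id, collect_set(x) AS intersection
        FROM t LATERAL VIEW explode(a) e AS x
        WHERE array_contains(b, x)
        GROUP BY id
    """).show()
    # Union: explode both arrays (UNION ALL) and collect_set the result.
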
1 vote · 2 answers · 1.1k views

Hadoop installation on Windows?

Could anyone provide me with a step-by-step tutorial on how to install Hadoop on Windows? I read the official manual, but I can't tell from it where I should enter the scripts it mentions.
Ph0en1x
1 vote · 1 answer · 1.2k views

How to merge small files in Spark while writing into a Hive ORC table

I am reading CSV files from S3 and writing into a Hive table as ORC. While writing, it produces a lot of small files. I need to merge all these files. I have the following properties set: spark.sql('SET hive.merge.sparkfiles = true') spark.sql('SET hive.merge.mapredfiles = true') spark.sql('SET hive.mer...
doitright
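
A caveat that may explain the behavior: the hive.merge.* flags apply to Hive's own execution engines, and Spark's native ORC writer generally ignores them. A hedged workaround is to cap the file count directly before writing (table name and partition count are illustrative):

    df = spark.read.csv("s3://bucket/input/", header=True)
    # coalesce(10) bounds the output to roughly 10 files per write.
    df.coalesce(10).write.insertInto("db.my_orc_table")
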
1 vote · 1 answer · 687 views

How to combine multiple ORC files (belonging to each partition) in a partitioned Hive ORC table into a single big ORC file

I have a partitioned ORC table in Hive. After loading the table with all possible partitions, I get multiple ORC files on HDFS, i.e. each partition directory on HDFS has an ORC file in it. I need to combine all these ORC files under each partition into a single big ORC file for a use case. Can some...
Anchit Jatana
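
Hive can merge ORC files in place with ALTER TABLE ... CONCATENATE (ORC tables only). A hedged sketch assuming a table db.t partitioned by dt; Spark SQL may not accept this statement, so the generated HiveQL is printed for execution in beeline or the Hive CLI:

    # Partition values are illustrative; loop over SHOW PARTITIONS output.
    for dt in ["2018-01-01", "2018-01-02"]:
        print("ALTER TABLE db.t PARTITION (dt='{0}') CONCATENATE;".format(dt))
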
1 vote · 1 answer · 303 views

Error on check-env.sh installing Apache Kylin on Hortonworks

I'm trying to install Apache Kylin on a Hortonworks Sandbox following the instructions provided in the Apache Kylin install guide. I set export KYLIN_HOME='/root/kylin' in my .bashrc (inside this folder are the Kylin binaries). In step 3 it says to run bin/check-env.sh to check for an environment issu...
Nadia Bastidas
1 vote · 2 answers · 1.1k views

Hive View Not Opening

In the Ambari UI of the Hortonworks sandbox, I was trying to open Hive View through the maria_dev account. However, I was getting the following error: Service Hive check failed: Cannot open a hive connection with connect string jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscovery...
Witty Counsel
1 vote · 2 answers · 980 views

How can we delete specific rows from HDFS?

We have a huge number of text files containing information about clients. We have to delete specific rows from these HDFS files; for example, rows associated with clients X, Y, and Z, while keeping the others.
Youssef Mchich
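
Since HDFS files are immutable, "deleting rows" in practice means filtering and rewriting. A minimal PySpark sketch, with the paths and the client-id column position assumed:

    df = spark.read.csv("/data/clients/", sep="\t")
    kept = df.filter(~df["_c0"].isin("X", "Y", "Z"))  # _c0 assumed to hold the client id
    kept.write.csv("/data/clients_cleaned/")          # then swap directories
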
1 vote · 1 answer · 620 views

hadoop BlockMissingException

I am getting the error below: Diagnostics: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-467931813-10.3.20.155-1514489559979:blk_1073741991_1167 file=/user/oozie/share/lib/lib_20171228193421/oozie/hadoop-auth-2.7.2-amzn-2.jar Failing this attempt. Failing the application. Alth...
Pooja Soni
1 vote · 0 answers · 851 views

Airflow HiveOperator Result Set

I'm new to both Airflow and Python, and I'm trying to configure a scheduled report. The report needs to pull data from Hive and email the results. My code thus far: from datetime import datetime, timedelta from airflow import DAG from airflow.operators.hive_operator import HiveOperator default_args...
Myles Wehr
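
One detail that commonly trips this up: HiveOperator executes HQL but does not hand back rows. A frequently used pattern, sketched here with an assumed connection id and query, is to fetch the result set with HiveServer2Hook inside a PythonOperator and pass it to the email step via XCom:

    from airflow.hooks.hive_hooks import HiveServer2Hook

    def fetch_report(**context):
        # Pull rows directly from HiveServer2; the conn id must exist in Airflow.
        hook = HiveServer2Hook(hiveserver2_conn_id="hiveserver2_default")
        return hook.get_records("SELECT * FROM my_report_table")  # pushed to XCom
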
1 vote · 1 answer · 215 views

Not able to use HBaseTestingUtility with CDH 5.7

I am trying to use HBaseTestingUtility with CDH 5.7, as described in the blog post and GitHub repo below: http://blog.cloudera.com/blog/2013/09/how-to-test-hbase-applications-using-popular-tools/ https://github.com/sitaula/HBaseTest I have modified my pom.xml for CDH 5.7 as below: 4.0.0 HBaseTest Test 0.0.1-SN...
tuk
1 vote · 0 answers · 250 views

How to convert a DataFrame to a JavaRDD with a distributed copy?

I'm new to Spark optimization. I'm trying to read Hive data into a DataFrame. Then I'm converting the DataFrame to a JavaRDD and running a map function on top of it. The problem I'm facing is that the transformation running on top of the JavaRDD runs as a single task. Also, the transformations running on...
Makubex
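
A single task almost always means a single partition; repartitioning before the map restores parallelism. A one-line sketch (partition count and function name are placeholders):

    # Spread the Hive read across more partitions before mapping over it.
    rdd = df.repartition(48).rdd.map(my_transform)
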
1 vote · 0 answers · 57 views

MarkLogic 9 with Hadoop?

I integrated MarkLogic 9 with Hadoop using the MarkLogic connector. I want to load data from my local machine into MarkLogic using Hadoop. The documentation mentions two ways to move data using Hadoop: importing data from HDFS to ML using MLCP, and exporting data from ML to HDFS using MLCP. What I w...
Private
1 vote · 0 answers · 166 views

Insert JSON file into HBase using Hive

I have a simple JSON file that I would like to insert into an HBase table. My JSON file has the following format: { 'word1':{ 'doc_01':4, 'doc_02':7 }, 'word2':{ 'doc_06':1, 'doc_02':3, 'doc_12':8 } } The HBase table is called inverted_index; it has one column family, matches. I would like to...
Achraf Oussidi
1 vote · 2 answers · 144 views

Nutch 1.14 deduplication failed

I have integrated Nutch 1.14 with Solr 6.6.0 on CentOS Linux release 7.3.1611. I gave about 10 URLs in the seed list, which is at /usr/local/apache-nutch-1.13/urls/seed.txt. I followed the tutorial: [[email protected] apache-nutch-1.14]# bin/nutch dedup http://ip:8983/solr/ DeduplicationJob: starting...
SMJ
1 vote · 0 answers · 1.4k views

GPU resources for Hadoop 3.0 / YARN

I am trying to use the Hadoop 3.0 GA release with GPUs, but when I execute the shell command below, I get an error and it does not work with the GPU. Please check the output and shell command below; I guess there is a misconfiguration on my side. 2018-01-09 15:04:49,256 INFO [main] distributedshe...
Kangrok Lee
1 vote · 0 answers · 162 views

Crystal Reports integration with Hadoop/Hive/HPLSQL

We are migrating data from Oracle to Hadoop, and there is a requirement to continue using the existing reporting tool (Crystal Reports) to generate reports from Hadoop (instead of Oracle). In the current scenario we use an Oracle stored procedure to do a few aggregations and some logic. Now, with the above requir...
Nina A
1 vote · 1 answer · 213 views

Hive on Tez in EMR schedules tasks very slowly

I'm trying to use Hive on Tez to query ORC-format data stored in S3. The Tez AM schedules tasks very slowly, and a lot of map tasks remain 'PENDING' for a long time. There were enough resources in the cluster (quite enough, I would say: more than 6 TB of memory and more than a thousand vcores avail...
Harper
1 vote · 1 answer · 483 views

Launch TDCH to load data from a Hive parquet table into Teradata

I need to load data from Hive tables stored as parquet files into a Teradata database using TDCH (Teradata Connector for Hadoop). I use TDCH 1.5.3, CDH 5.8.3, and Hive 1.1.0. I try to start TDCH using the hadoop jar command and get the Error: java.lang.ClassNotFoundException: org.apache.parquet.h...
Dobroff
1 vote · 0 answers · 45 views

Hadoop Streaming with Python doesn't work

I'm trying Edureka's tutorial about streaming with Python. Everything is fine, but when I run the script hadoop jar /home/carlos/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -file /home/carlos/mapper.py -mapper mapper.py -file /home/carlos/reducer.py -reducer reducer.py -inpu...
Carlos Arbeláez
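
Streaming failures of this shape often come down to the mapper/reducer not being directly executable: each script needs a shebang and must read stdin and write tab-separated stdout. A minimal word-count mapper as a sanity check (the reducer is analogous):

    #!/usr/bin/env python
    # mapper.py - emit "word<TAB>1" for every token on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)
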
1 vote · 0 answers · 184 views

Spark: Why do some executors have 0 active tasks and some 13 tasks?

I am trying to read from S3 and do a count on the DataFrame. I have a cluster of 76 r3.4xlarge nodes (1 master and 75 slaves). I set: spark.dynamicAllocation.enabled 'true' maximizeResourceAllocation 'true' When I check the Spark UI, I see just 25 executors, and of those only 7 have active ta...
user3407267
1 vote · 1 answer · 588 views

Spark Small ORC Stripes

We use Spark to flatten out clickstream data and then write it to S3 in ORC+zlib format. I have tried changing many settings in Spark, but the resultant stripe sizes of the ORC files being created are still very small (
Rajiv
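
Stripe size is an ORC writer setting rather than a Spark shuffle setting. One hedged approach is to put the ORC configuration key on the Hadoop configuration before writing (64 MB shown; whether the writer reaches that size also depends on writer memory and row-group content):

    # Key name comes from the ORC library; value is in bytes (64 MB here).
    spark.sparkContext._jsc.hadoopConfiguration().set("orc.stripe.size", "67108864")
    df.write.orc("s3://bucket/clickstream_flat/")  # illustrative output path
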
1 vote · 0 answers · 117 views

Unable to load data from multi-level directories into a Hive table

I created a table the following way: CREATE TABLE `default.tmptbl` (id int, name string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'escapeChar'='\\','quoteChar'='\'','separatorChar'=','); And I have data in HDFS that has been structured in the following way...
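
By default Hive reads only the files at the top level of a table's directory; nested directories need the recursive-input settings. A sketch of the usual pair, issued here through spark.sql (equivalently, SET them in the Hive CLI):

    spark.sql("SET hive.mapred.supports.subdirectories=true")
    spark.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
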
1 vote · 1 answer · 1.2k views

Cannot connect to ZooKeeper/Hive from host to Hortonworks HDP Sandbox VM

I downloaded the HDP Sandbox (in an Oracle VirtualBox VM) a while ago, never used it much, and I'm now trying to access data from the outside world using Hive JDBC. I use hive-jdbc 1.2.2 from Apache, which I got from mvnrepository, with all the dependencies on the classpath, or the Hortonworks JDBC driver got fr...
Sxilderik
1 vote · 1 answer · 297 views

Hive SQL struct mismatch

I have a table with columns like this: table field type(array) item cars(string) isRed(boolean) information(bigint) When I perform the following query: select myfield1.isRed from mytable where myfield1.isRed = true I get an error: Argument type mismatch '': The 1st argument of EQUAL is expected to...
jumpman8947
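
The error pattern suggests myfield1 is an array of structs, so myfield1.isRed projects to an array<boolean> rather than a boolean. A hedged rewrite that compares per element via LATERAL VIEW explode:

    spark.sql("""
        SELECT f.isRed
        FROM mytable LATERAL VIEW explode(myfield1) e AS f
        WHERE f.isRed = true
    """).show()
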
1 vote · 1 answer · 99 views

Hadoop datanode process listens on random port on localhost

Each datanode process in my Hadoop cluster listens on 4 ports. 3 of them are well known (50010, 50020, 50075), but the 4th one is chosen at random and bound to localhost. Can anyone shed some light on what is using this port, and is this a configurable parameter? Here are the relevant lines from...
boogie
1 vote · 0 answers · 293 views

Hadoop: Secondary NameNode Permission Denied

I'm attempting to run Hadoop in pseudo-distributed mode to learn how the system works. To install it, I downloaded Hadoop 3.0.0 from the site and untarred it. I've done my configuration as follows (leaving out the configuration tags for brevity): core-site.xml fs.defaultFS hdfs://localhost/ hdsf-sit...
Zach
1 vote · 1 answer · 352 views

Sample Pyspark program returns [WinError 2] The system cannot find the file

Here is the code I am trying to run. I have set the paths for Spark, Hadoop, Java, and Python. Using Java 8, Spark 2.2.1, and Hadoop 2.7.5. import random from pyspark import SparkContext, SparkConf conf = SparkConf().setAppName('MyFirstStandaloneApp') sc = SparkContext(conf=conf) NUM_SAMPLES = 20 def...
GLalor
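
[WinError 2] ("the system cannot find the file specified") on Windows usually means a binary is missing from the expected path, most often winutils.exe or the worker Python interpreter. A hedged pre-flight check before building the SparkContext (paths are illustrative):

    import os

    os.environ["HADOOP_HOME"] = r"C:\hadoop"                  # must contain bin\winutils.exe
    os.environ["PYSPARK_PYTHON"] = r"C:\Python36\python.exe"  # interpreter for the workers
    assert os.path.exists(
        os.path.join(os.environ["HADOOP_HOME"], "bin", "winutils.exe"))
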
1 vote · 0 answers · 257 views

Hadoop master host fails to connect to localhost: connection refused

I've set up HDFS with two nodes, on different hosts, in the same network. I'm using the HDFS C++ API. The HDFS name node and data nodes start normally, but when I try to read any data or open a file, I get the following error: Call From master/192.168.X.X to localhost:54310 failed on connection exception: c...
user1289
1 vote · 0 answers · 233 views

How to compare elements of an array with a string in Hive

I have created a table with the complex data type array in Hive. The query is: create table testivr ( mobNo string, callTime string, refNo int, callCat string, menus array , endType string, duration int, transferNode string ) row format delimited fields terminated by ',' collection items terminated by...
Previnkumar
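
For an exact match against any element of the array, Hive's array_contains covers this. A sketch against the table above (the menu value is illustrative):

    spark.sql("""
        SELECT mobNo, callTime
        FROM testivr
        WHERE array_contains(menus, 'mainMenu')
    """).show()
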
1 vote · 1 answer · 815 views

Get HDP version through Spark

We installed a new Spark version, so all folders are named like: ls /etc/hadoop/ 2.6.4.0-91 conf conf.backup and from spark-submit we get spark-submit --version Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.2.0.2.6.4...
enodmilvado
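
One hedged trick: HDP builds append the platform version to the Spark version string (e.g. 2.2.0.2.6.4.0-91), so splitting off the leading Spark triplet leaves the HDP build:

    full = spark.sparkContext.version  # e.g. '2.2.0.2.6.4.0-91' on HDP
    hdp_build = full.split(".", 3)[3]  # -> '2.6.4.0-91' (assumes the suffix is present)
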
1 vote · 1 answer · 491 views

Kafka Connect HDFS Sink with Azure Blob Storage

I want to connect to Azure Blob Storage with Kafka HDFS Sink Connector. So far I have done: Set kafka-connect properties: hdfs.url=wasbs:// hadoop.conf.dir={hadoop_3_home}/etc/hadoop/ hadoop.home={hadoop_3_home} And in core-site.xml added support for wasbs: fs.wasbs.impl org.apache.hadoop.fs.azure.N...
zhandosso
1 vote · 1 answer · 80 views

Unable to set up Hadoop on my local machine

I am trying to install Hadoop as a pseudo-distributed, standalone application on my MacBook, and I have been seeing errors. When I try to execute sbin/start-dfs.sh, I get the following error. $ sbin/start-dfs.sh Starting namenodes on [localhost] localhost: Connection closed by ::1 port 22 Starting datanodes loc...
Kartik
1 vote · 1 answer · 285 views

Flink write to S3 on EMR

I am trying to write some outputs to S3 using EMR with Flink. I am using Scala 2.11.7, Flink 1.3.2, and EMR 5.11. However, I got the following error: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V at com.amazon.ws.emr.hadoop.fs....
Chengzhi
1 vote · 2 answers · 735 views

Is COUNT() OVER possible using DISTINCT and windowing in Hive?

I want to calculate the number of distinct port numbers that exist between the current row and the X previous rows (sliding window), where X can be any integer. For instance, if the input is: ID PORT 1 21 2 22 3 23 4 25 5 25 6 21 The outpu...
alejo
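
Hive rejects COUNT(DISTINCT ...) OVER, but the usual workaround is size(collect_set(...)) over the same window. A sketch for the current row plus the 3 previous rows (X=3 and the table name are illustrative):

    spark.sql("""
        SELECT id, port,
               size(collect_set(port) OVER (
                   ORDER BY id
                   ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)) AS distinct_ports
        FROM ports_table
    """).show()
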
1 vote · 0 answers · 357 views

How can I find the latest partition in Impala tables?

I need to collect incremental stats frequently on a table; for that, I need to populate the latest partition for the variable below: compute incremental stats someSchema.someTable partition (partitionColName=${value}); I have a few options which I don't want to use for stability and perfo...
roh
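
A hedged sketch using impyla (assuming it is available and that the partition values sort correctly): read the newest partition value, then compute stats for just that one partition:

    from impala.dbapi import connect

    conn = connect(host="impalad-host", port=21050)  # illustrative coordinates
    cur = conn.cursor()
    cur.execute("SELECT max(partitionColName) FROM someSchema.someTable")
    latest = cur.fetchone()[0]
    cur.execute("COMPUTE INCREMENTAL STATS someSchema.someTable "
                "PARTITION (partitionColName='{0}')".format(latest))
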
