Questions tagged [hdfs]

0 votes · 0 answers · 7 views

Run shell script from local directory instead of HDFS via Oozie

I want to run a shell script from the local path (edge node) instead of an HDFS directory via Oozie. My local shell script contains ssh steps which I can't run from an HDFS directory. XYZ is the user ID and xxxx is the server (edge node). I used the below action in the workflow but it is not working. Please hel...
vjrock99
0 votes · 0 answers · 4 views

Can I use Python's watchdog to watch changes in HDFS directories? If not, how to do it using Python? (everything I found was in Java)

I want to listen to specific directories in HDFS using Python (everything I found was in Java). When a file gets uploaded or moved to a monitored directory I want my Python script to do different things with it and then delete all files in this directory. The problem is that I can't find a way to m...
MaxCS
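
watchdog hooks into local filesystem event APIs, so it cannot see changes that happen inside HDFS. A common workaround is to poll the directory listing over WebHDFS. Below is a minimal Python sketch of that idea, assuming the HdfsCLI package (pip install hdfs) and a hypothetical NameNode URL, user and directory:

    import time
    from hdfs import InsecureClient  # HdfsCLI client that talks to WebHDFS

    client = InsecureClient('http://namenode:9870', user='hadoop')  # hypothetical host/user
    WATCH_DIR = '/incoming'  # hypothetical directory to monitor

    seen = set(client.list(WATCH_DIR))
    while True:
        for name in set(client.list(WATCH_DIR)) - seen:
            path = WATCH_DIR + '/' + name
            print('new file detected:', path)
            # ... process the file here, then remove it ...
            client.delete(path)
        seen = set(client.list(WATCH_DIR))
        time.sleep(10)  # poll interval in seconds
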
1 vote · 2 answers · 6.9k views

Error: Could not find or load main class org.apache.hadoop.hdfs.server.datanode.DataNode

I am new to Apache Hadoop. I am installing a multi-node cluster but I am getting two errors. I am not aware what kind of errors these are or why they are generated. I have googled a lot about the errors but I was not able to find the reason behind them. Error: Could not fi...
1 vote · 1 answer · 2.4k views

Tensorflow Dataset API with HDFS

We have stored a list of *.tfrecord files in an HDFS directory. I'd like to use the new Dataset API, but the only example given uses the old file queue and string_input_producer (https://www.tensorflow.org/deploy/hadoop). These methods make it difficult to generate epochs amongst other things. I...
Lukeyb
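
For reference, the tf.data pipeline can consume hdfs:// URIs directly when TensorFlow is built with HDFS support (the same environment setup described on the deploy/hadoop page still applies), which avoids string_input_producer entirely. A rough TF 1.x-style sketch with hypothetical paths:

    import tensorflow as tf

    # Hypothetical TFRecord files; tf.data accepts hdfs:// URIs when HDFS support is available
    filenames = ["hdfs://namenode:8020/data/train-00000.tfrecord",
                 "hdfs://namenode:8020/data/train-00001.tfrecord"]

    dataset = (tf.data.TFRecordDataset(filenames)
               .shuffle(buffer_size=10000)
               .repeat(5)   # 5 epochs, no string_input_producer needed
               .batch(32))

    iterator = dataset.make_one_shot_iterator()  # TF 1.x graph-mode iterator
    next_batch = iterator.get_next()
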
1 vote · 3 answers · 45 views

Hive: What Happens if I Manually Copy Data Files into Location Folder of a Table?

I have tried copying data files into the location folder of a table (rather than using the load command), and it works in the sense that I can query the new data. However, all the sources I see always use the load command to do this; they never talk about copying data files directly to the locati...
user1888243
1 vote · 2 answers · 980 views

How can we delete specific rows from HDFS?

We have a huge number of text files containing information about clients. We have to delete specific rows from these HDFS files; for example, rows associated with the clients X, Y and Z, while keeping the others.
Youssef Mchich
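
Since HDFS files are effectively immutable, "deleting rows" usually means reading the files, filtering out the unwanted rows, and writing the result to a new location. A hedged PySpark sketch, with hypothetical paths and a hypothetical pattern for the client IDs:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("drop-client-rows").getOrCreate()

    # Hypothetical input path; spark.read.text yields a single string column named "value"
    lines = spark.read.text("hdfs:///data/clients/*.txt")

    # Keep every row that does not mention clients X, Y or Z (adjust the pattern to the real format)
    kept = lines.filter(~col("value").rlike("clientX|clientY|clientZ"))

    kept.write.mode("overwrite").text("hdfs:///data/clients_filtered")  # hypothetical output path
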
1 vote · 0 answers · 55 views

Pyspark error reading file. Flume HDFS sink imports file with user=flume and permissions 644

I'm using Cloudera Quickstart VM 5.12. I have a Flume agent moving CSV files from a spooldir source into an HDFS sink. The operation works OK but the imported files have: User=flume, Group=cloudera, Permissions=-rw-r--r--. The problem starts when I use PySpark and get: PriviledgedActionException as:cloude...
Taka
1 vote · 1 answer · 910 views

Save spark dataframe schema to hdfs

For a given data frame (df) we get the schema via df.schema, which is a StructType. Can I save just this schema onto HDFS while running from spark-shell? Also, what would be the best format in which to save the schema?
Ashwin
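
One straightforward option is JSON, since a StructType can serialize itself to JSON and be rebuilt from it. The question is about spark-shell (Scala), but the PySpark equivalent below sketches the idea; df and spark are assumed to already exist, and the path is hypothetical:

    import json
    from pyspark.sql.types import StructType

    schema_json = df.schema.json()  # JSON string describing the StructType

    # Write the schema to HDFS as a single-part text file
    spark.sparkContext.parallelize([schema_json], 1) \
         .saveAsTextFile("hdfs:///schemas/my_table_schema")

    # Later: read it back and reconstruct the StructType
    restored = StructType.fromJson(
        json.loads(spark.sparkContext.textFile("hdfs:///schemas/my_table_schema").first()))
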
1 vote · 0 answers · 117 views

Unable to load data from multiple level directories into Hive table

I created a table the following way: CREATE TABLE `default.tmptbl` (id int, name string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ('escapeChar'='\\', 'quoteChar'='\'', 'separatorChar'=','); And I have data in HDFS that has been structured in the following way...
1 vote · 1 answer · 99 views

Hadoop datanode process listens on random port on localhost

Each datanode process in my Hadoop cluster is listening on 4 ports. 3 of them are well known (50010, 50020, 50075), but the 4th one is chosen at random and bound to localhost. Can anyone shed some light on what is using this port and whether it is a configurable parameter? Here are the relevant lines from...
boogie
1 vote · 0 answers · 257 views

hadoop master host fails to connect to localhost: connection refused

I've set up HDFS with two nodes, on different hosts, in the same network. I'm using the HDFS C++ API. The HDFS name node and data nodes start normally, but when I try to read any data or open a file, I get the following error: Call From master/192.168.X.X to localhost:54310 failed on connection exception: c...
user1289
1 vote · 0 answers · 36 views

HDFS block size choice

3 machines (1 master), 44 MB file. If HDFS block size = 32 MB, the file will be divided into two blocks: 32 MB and 12 MB. Does this mean one slave can process 32 MB and the other can process 12 MB in parallel? If HDFS block size = 16 MB, the file would be divided into three blocks: 16 MB, 16 MB and 12...
Den
1 vote · 0 answers · 605 views

hadoop too many open files issue

What could be causing a "too many open files" issue? java.io.IOException: Got error, status message opReadBlock BP-493425312-172.20.178.11-1399995954120:blk_1449181880_375614544 received exception java.io.FileNotFoundException: /hd_data/disk4/hadoop/hdfs/data/current/BP-493425312-172.20.178.11-1399995954120...
blangulin
1 vote · 3 answers · 383 views

How to overwrite data from text file into hive table replacing for specific date or for specific value

I am using the Cloudera distribution with Hive version 'hive-common-1.1.0-cdh5.14.0', i.e. Hive 1.1.0. Below is my Hive table: hive> describe test; OK id int name string day...
Chaithu
1 vote · 1 answer · 14 views

Error in formatting the namenode in a single-node Hadoop cluster

I am trying to install and configure Hadoop on Ubuntu 16.04, as per the guidelines at https://data-flair.training/blogs/installation-of-hadoop-3-x-on-ubuntu/. All the steps ran successfully, but while trying to run the command hdfs namenode -format, I get a message
Shantha Anand
1 vote · 1 answer · 121 views

hdfs jmxget vs hdfs fsck

I have 2 namenodes with several datanodes, but today I noticed that I have some corrupt blocks. What is odd is that hdfs jmxget -server namenode02 -port 8006 | grep CorruptBlocks reports CorruptBlocks=27, but when I check with hdfs fsck /, I get: Total size: 734930879995888 B (Total op...
alexandru vintila
1 vote · 0 answers · 44 views

Getting garbled command line responses when querying webHDFS via CURL

I am getting the following output when trying to write a file into a kerberized HDFS: I get the same output when trying to read from the HDFS as well: Is it an error? If so, how do I fix it? The files I intend to read or write are not being read or written with these commands.
Kristada673
1 vote · 2 answers · 100 views

Why is my test cluster running in safe mode?

I'm testing some basic HDFS operations like creating directories. I have the following cluster configuration in my test: import org.apache.hadoop.fs._ import org.apache.hadoop.fs.permission.FsPermission import org.apache.hadoop.hdfs.{HdfsConfiguration, MiniDFSCluster} // ... private val baseDir = ne...
erip
1 vote · 1 answer · 250 views

Using flume to import data from kafka topic to hdfs folder

I am using Flume to load messages from a Kafka topic into an HDFS folder. So, I created a topic TT, sent messages to TT with a Kafka console producer, configured the Flume agent FF, and ran the Flume agent: flume-ng agent -n FF -c conf -f flume.conf -Dflume.root.logger=INFO,console. The code execution stops...
wwHh
1 vote · 1 answer · 71 views

Spark doesn't read the file properly

I run Flume to ingest Twitter data into HDFS (in JSON format) and run Spark to read that file. But somehow, it doesn't return the correct result: it seems the content of the file is not updated. Here's my Flume configuration: TwitterAgent01.sources = Twitter TwitterAgent01.channels = MemoryChannel01...
Yusata
1 vote · 1 answer · 188 views

WebHDFS Java client not handling Kerberos Tokens correctly

I'm trying to run a long-lived WebHDFS client (actually building a framework in front of HDFS), but my tokens are expiring after one day (default Kerberos configuration here). At first I tried running a thread which would call userLoginInformation.currentUser().checkTGTAndReloginFromKeytab(); howe...
BiS
1 vote · 0 answers · 67 views

the role of GENERATE_EEK and GET_METADATA in hdfs transparent encryption

I am a little unclear on what the ACLs for GENERATE_EEK and GET_METADATA allow. From a naive understanding of HDFS transparent encryption it seems that GENERATE_EEK would be a request to generate an EDEK for an encryption zone (EZ). So say, suppose we create a key using the keyadmin user called keyforuserA...
sunny
1 vote · 1 answer · 442 views

cannot configure HDFS address using gethue/hue docker image

I'm trying to use the Hue docker image from gethue/hue, but it seems to ignore the configuration I give it and always looks for HDFS on localhost instead of the docker container I point it to. Here is some context: I'm using the following docker compose file to launch an HDFS cluster: hdfs-namenod...
laurent exsteens
1 vote · 1 answer · 188 views

Expanding HDFS memory in Cloudera QuickStart on docker

I am trying to use the Cloudera QuickStart Docker image, but it seems that there is no free space on HDFS (0 bytes). After starting the container with docker run --hostname=$HOSTNAME -p 80:80 -p 7180:7180 -p 8032:8032 -p 8030:8030 -p 8888:8888 -p 8983:8983 -p 50070:50070 -p 50090:50090 -p 50075:50075 -p...
Alex
1 vote · 0 answers · 128 views

HDFS configuration error when reading ORC data in HDFS from Vertica

I am using Vertica 7.2 and am trying to access ORC data in HDFS. The directory location in HDFS is '/user//' with all the ORC files that underlie a Hive table that is stored in ORC format. The hadoopConfDir parameter in Vertica has been set to /etc/hadoop/conf. The hadoop conf directory from the sep...
Richard
1 vote · 0 answers · 58 views

What is the recommended value for the HDFS datanode cache?

How much memory should be set for HDFS DataNode caching? OS: CentOS Linux 7.4. dfs.datanode.max.locked.memory determines the maximum amount of memory a DataNode will use for caching.
jBee
1 vote · 0 answers · 116 views

HiBench wordcount job hangs on hadoop 2.9

I am using: HiBench 7.0, Hadoop 2.9, Java version 1.8.0_161, Scala code runner version 2.11.6, Apache Maven 3.5.2, all on a three-node Hadoop cluster of OpenStack VMs with details: Ubuntu 16.04.3 LTS, VCPUs: 8, RAM: 16GB, Size: 10GB. Each has a 100GB volume attached where the dfs storage is kept. Wh...
Diego Delgado
1 vote · 0 answers · 139 views

Accessing an HDFS filesystem configured for high availability from H2O

I'm trying to read data out of our Hadoop HDFS filesystem using the h2o.import_file Python function. I've set the HADOOP_CONF_DIR environment variable like so: import os; os.environ['HADOOP_CONF_DIR'] = '/etc/hadoop/conf'. When I try to read a file using the hdfs:///path/to/my/file.txt syntax, H2O gives...
Michael Allman
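
With an HA-enabled cluster, the usual approach is to address the logical nameservice defined in hdfs-site.xml rather than an individual NameNode host, and to make sure HADOOP_CONF_DIR is set before the H2O backend starts. A hedged Python sketch with a hypothetical nameservice name:

    import os
    os.environ['HADOOP_CONF_DIR'] = '/etc/hadoop/conf'  # set before h2o.init() starts the backend

    import h2o
    h2o.init()

    # "nameservice1" is hypothetical; use the dfs.nameservices value from hdfs-site.xml
    frame = h2o.import_file("hdfs://nameservice1/path/to/my/file.txt")
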
1 vote · 0 answers · 404 views

Getting Regionserver throwing InvalidToken exception in logs

I have noticed the following error in my RegionServer logs: org.apache.hadoop.security.token.SecretManager$InvalidToken: access control error while attempting to set up short-circuit access to /apps/hbase/data/data/default/my-table/eb512b4b9f9fa9cb2a1a3930d9c9f18b/r/df1694a4542f419992f86b219541fb6fBloc...
Saurabh
1 vote · 1 answer · 521 views

How to pass the awk variable in hdfs command [duplicate]

This question already has an answer here: "How do I use shell variables in an awk script?" (8 answers). I am listing the files/directories which are older than N days using the below commands: DATE=`date +%Y-%m-%d` dt=`date --date '$dt' +%Y%m%d` loop_dt=`date -I --date '$dt -1 day'` *** output of...
user8587005
1 vote · 0 answers · 166 views

Error in executing Pig script

I am trying to execute a Pig script which is placed in HDFS. I am getting an error. Pig Stack Trace ERROR 2999: Unexpected internal error. null java.lang.NullPointerException at org.apache.pig.impl.io.FileLocalizer.fetchFilesInternal(FileLocalizer.java:734) at org.apache.pig.impl.io.FileLocalizer.fe...
arunkindra
1 vote · 1 answer · 114 views

In Hadoop, is it a bad idea to partition table over date?

I was going through the answer given by Roberto in the following post: What is the difference between partitioning and bucketing a table in Hive? And it seems like partitioning data by date (if my data is coming in daily) is not a good idea, as it will end up creating many directories and files in...
Gaurang Shah
1 vote · 1 answer · 113 views

HDFS Showing 0 Blocks after cluster reboot

I've set up a small cluster for testing / academic purposes. I have 3 nodes, one of which is acting both as namenode and datanode (and secondarynamenode). I've uploaded 60GB of files (about 6.5 million files) and uploads started to get really slow, so I read on the internet that I could stop the seco...
Rodrigo Rodrigues
1 vote · 0 answers · 117 views

Appending to a file without closing it in HDFS

I am trying to append some text in HDFS, but it turns out that I have to close the output stream with fin_a.close() after every append. Is there any way I can append to a file without closing it? try{ Path filenamePath = new Path(appId + '_' + 'file.txt'); FSDataOutputStream fin_a = null; FSData...
Self
1 vote · 0 answers · 166 views

Storing small images in large quantities on HDFS for later processing

I am working on a project in which we have a billion images with their metadata in MongoDB. I want to store these images on HDFS for later image processing. The size of each image is between 500K and 4MB; thus, I have the small-files problem with Hadoop. I found 3 main possible solutions for this pro...
bob-cac
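
One of the standard answers to the small-files problem is to pack the images into a container format such as a SequenceFile keyed by the original file name. A rough PySpark sketch, with hypothetical paths, assuming the raw images are already reachable as files:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pack-images").getOrCreate()
    sc = spark.sparkContext

    # (path, raw bytes) pairs for every image under the hypothetical staging directory
    images = sc.binaryFiles("hdfs:///staging/images/*")

    # Pack the many small files into SequenceFile parts keyed by the original path
    images.saveAsSequenceFile("hdfs:///warehouse/images_seq")
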
1 vote · 0 answers · 91 views

Solr Index on Hive - Fails for more rows

I created a Solr index on a Hive table with the below steps. This worked for 25 rows, which were accessible from the Solr collection. But when I tried to load 1000 rows from the Hive internal table to the Hive external table, it failed. Please help. 1) CREATE TABLE ER_ENTITY1000(entityid INT,claimid_s INT,f...
Voila
1 vote · 0 answers · 140 views

Read data from HDFS

I'm using FSDataInputStream to access data from HDFS. The following is the snippet I'm using: val fs = FileSystem.get(new java.net.URI(#HDFS_URI), new Configuration()) val stream = fs.open(new Path(#PATH)) val reader = new BufferedReader(new InputStreamReader(stream)) val offset:S...
Vysh
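
The snippet above is Scala; as a point of comparison, the same read can be done from plain Python over WebHDFS with the HdfsCLI package (hypothetical NameNode URL, user and path):

    from hdfs import InsecureClient  # pip install hdfs (HdfsCLI)

    client = InsecureClient('http://namenode:9870', user='hadoop')  # hypothetical host/user

    # Read the whole file as text; for large files, stream in chunks instead
    with client.read('/data/input.txt', encoding='utf-8') as reader:
        content = reader.read()
    print(content[:200])
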
1 vote · 1 answer · 36 views

How can I create an external table in HIVE from HDFS file that contains a JSON array?

My JSON looks like this: [ { 'blocked': 1, 'object': { 'ip': 'abc', 'src_ip': 'abc', 'lan_initiated': true, 'detection': 'abc', 'src_port': , 'src_mac': 'abc', 'dst_mac': 'abc', 'dst_ip': 'abc', 'dst_port': 'abc' }, 'object_type': 'url', 'threat': '', 'threat_type': 'abc', 'device_id': 'abc', 'app_...
Lira
1 vote · 1 answer · 48 views

how to split a single file into multiple .sh files and execute one after the other

I have a file which consists of mv & cp commands, over 40000 lines. I want to split it into 20 or N shell files and run them one after another in sequence. For example, if a.sh completes then I want to execute b.sh, and so on. For example, the file has: hdfs dfs -mv /source/path/file.xt /destination/pat...
user8587005
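
Since the goal is simply to chunk the big command file and run the pieces strictly one after another, this can also be scripted in Python: split the lines into N shell files and launch each one with subprocess only after the previous one exits. A sketch with hypothetical file names:

    import subprocess

    N = 20
    with open("all_commands.sh") as f:        # hypothetical name of the 40000-line command file
        lines = [line for line in f if line.strip()]

    chunk = (len(lines) + N - 1) // N         # ceiling division so no command is dropped
    parts = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]

    for idx, part in enumerate(parts):
        name = "part_%02d.sh" % idx
        with open(name, "w") as out:
            out.write("#!/bin/bash\nset -e\n")
            out.writelines(part)
        # subprocess.run blocks, so part_01.sh starts only after part_00.sh has finished
        subprocess.run(["bash", name], check=True)
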
1 vote · 1 answer · 1.6k views

Copy files from local to hdfs

I'm trying to copy a file from my local machine into HDFS. I'm using this command: hadoop fs -put Desktop/unsed cubes.txt /user/file and I'm getting this exception: -put: java.net.UnknownHostException: sandbox.hortonworks Usage: hadoop fs [generic options] -put [-f] [-p] [-l] ... I've tried using th...
Teddy
