Questions tagged [amazon-emr]
623 questions
0
votes
0
answer
4
Views
Writing to DSE graph from EMR
We are trying to write to a DSE graph (Cassandra) from EMR and keep getting these errors. My JAR is a shaded JAR with the BYOS dependencies. Any help would be appreciated.
java.lang.UnsatisfiedLinkError: org.apache.cassandra.utils.NativeLibraryLinux.getpid()J
at org.apache.cassandra.utils.N...
1
votes
1
answer
336
Views
How to set livy.server.session.timeout on EMR cluster bootstrap?
I am creating an EMR cluster and using a Jupyter notebook to run some Spark tasks.
My tasks die after approximately 1 hour of execution, and the error is:
An error was encountered:
Invalid status code '400' from https://xxx.xx.x.xxx:18888/sessions/0/statements/20 with error payload: 'requirement fail...
1
votes
1
answer
183
Views
XGBoost does not use all available resources while running Spark in AWS EMR
I'm trying to perform binary classification on a big dataset (5 million rows x 450 features) using the XGBoost Spark lib in AWS EMR.
I've attempted setting many different configurations, such as:
Number of XGBoost workers, nthreads, spark.task.cpus, spark.executor.instances, spark.executor.cores.
Even though...
1
votes
1
answer
454
Views
Spark netty version issue
I am using a Redis client in a Spark job and getting an exception:
java.lang.NoSuchMethodError: io.netty.bootstrap.Bootstrap.config()Lio/netty/bootstrap/BootstrapConfig;
at org.redisson.client.RedisClient$1$1.operationComplete(RedisClient.java:234)
It's due to a netty version mismatch.
Spark used netty vers...
1
votes
0
answer
97
Views
Block missing due to resizing an Amazon EMR Cluster
We resize Amazon EMR cluster nodes from the console.
When we added core nodes to the cluster, a BlockMissingException occurred for a few /user/oozie/share/lib/ jars.
The replication factor for /user/oozie/share/lib/ is 3, while the default replication factor is 1.
Initially the cluster had 3 core nodes but when...
1
votes
0
answer
263
Views
log4j:ERROR Could not find value for key log4j.appender.CLA - EMR
I am trying to run Sqoop to pull data from MSSQL and load it to S3. I am trying to do this using the Hue UI in AWS EMR; my Sqoop version here is 1.4.6.
Here is the command:
import --connect jdbc:sqlserver://xxx.xxx.xx.xx/DataWarehouse --username **** --password *** --table ****_20120101 --target-dir s3...
1
votes
0
answer
224
Views
spark sql dataframe write to S3 failed with “Error closing multipart upload”
I have a PySpark job that runs on an AWS EMR cluster with EMR 5.11.0, Spark 2.2.1.
The Spark job needs to write a pretty big data frame (~100GB) to S3.
At first I tried writing directly to S3 as follows:
df = .... # calculate the data frame
df.write.mode('append').parquet('s3://...')
Then the writ...
1
votes
0
answer
209
Views
Equivalent EC2 instance for 1 DPU (in AWS Glue)
Can someone please point me to an equivalent EMR instance/pricing for one DPU in AWS Glue (which has 4 vCPUs & 16 GB memory)?
I need to do some cost comparisons for a scheduled ETL job. For us, cost per month is a criterion for deciding between EMR and AWS Glue.
I look forward to your input.
Th...
1
votes
0
answer
251
Views
Unable to run Presto LDAPS from SQL workbench
I am unable to execute any query from SQL Workbench/J against AWS EMR Presto, which is LDAPS (SSL/secure LDAP) enabled. Following are the details:
Connection String: jdbc:presto://hostname:8446/hive?SSL=true
username=admin
password=****
I can connect to it successfully, but while executing any query (l...
1
votes
1
answer
443
Views
saveAsTable for column with spaces failing
I have a piece of PySpark code that converts a dataframe into a physical table:
df.write.mode('overwrite').saveAsTable('sometablename')
If the dataframe, df, contains columns that have spaces in their names, it fails with the following error:
18/03/08 10:33:29 ERROR CreateDataSourceTableAsSelectC...
1
votes
0
answer
16
Views
AWS Data pipeline postStepCommand unable to access INPUT1_STAGING_DIR
In the EMR Activity of a Data Pipeline, I am trying to use postStepCommand (as documented here) to invoke a shell script. As part of it, I am trying to access the standard directory paths ${INPUT1_STAGING_DIR} and ${OUTPUT1_STAGING_DIR}.
But it does not seem to be able to access their values. Is this by design?
1
votes
2
answer
62
Views
AWS's own example of submitting Pig job does not work due to issue with piggybank.jar
I have been trying to test out submitting Pig jobs on AWS EMR following Amazon's guide. I made the change to the Pig script to ensure that it can find the piggybank.jar as instructed by Amazon. When I run the script I get an ERROR 1070 indicating that one of the functions available in piggybank cannot be...
1
votes
1
answer
541
Views
Convert JSON to ORC [AWS]
This is my situation:
I have an application that rotates JSON files to an S3 bucket. I need to convert those files to ORC format so they can be queried from Athena or EMR.
My first attempt was a Lambda function written in Node, but I didn't find any module for the conversion.
I think it can be done more...
1
votes
1
answer
589
Views
AWS EMR Spark is not loading MainClass using custom Jar
I'm trying to create an EMR Spark cluster with a single custom step.
The cluster is created successfully; however, the step is not correctly defined.
UPDATE
I tried to launch the same cluster via the web console and got the same results. While I specify the Jar location when I save the step the JAR lo...
1
votes
0
answer
318
Views
PySpark fails with exit code 52
I have an Amazon EMR cluster running, to which I submit jobs using the spark-submit shell command.
The way I call it:
spark-submit --master yarn --driver-memory 10g convert.py
The convert.py script is running using PySpark with Python 3.4.
After reading in a text file into an RDD, calling any method...
1
votes
1
answer
114
Views
Amazon EMR: How to add Amazon EMR MapReduce/Hive/Spark steps with inline shell script in the arguments?
For example, I have two Hive jobs, where the output of one job is used as an argument/variable in the second job. I can successfully run the following command in a terminal on the master node of the EMR cluster to get my result.
[[email protected] ~]$ hive -f s3://MyProjectXYZ/bin/GetNewJobDetails_...
1
votes
1
answer
298
Views
Spark no such field METASTORE_CLIENT_FACTORY_CLASS
I am trying to query a Hive table using Spark in Java. My Hive tables are in an EMR 5.12 cluster. The Spark version is 2.2.1 and Hive is 2.3.2.
When I ssh into the machine and I connect to the spark-shell I am able to query the hive tables with no issues.
But when I try to query using a custom jar then I...
1
votes
0
answer
94
Views
Can't reach expected download throughput from S3 to m4.16xlarge instance
I'm processing data with an m4.16xlarge instance using the Spark framework. I have to download many TB of data from S3, distributed across many files of ~128MB each. According to the EC2 instance's specs, I should have up to 20Gbps of bandwidth, which would translate to around 2GB/s. Howeve...
1
votes
0
answer
330
Views
EMR bootstrap installing python modules - bootstrap action 1 returned a non-zero return code
I am trying to install python modules through the EMR Console bootstrap actions by following these steps:
Uploading a file to: s3://mybucket/bootstrap/install_python_modules.sh containing the following script:
#!/bin/sh
set -e -x
sudo apt-get install python-setuptools
sudo easy_install pip
sudo pip...
1
votes
0
answer
101
Views
PredictionIO training on EMR Spark Cluster on demand
I want to reduce the load on a single server which currently hosts the EventServer, the PredictionServer, and training. Obviously, this will not be scalable.
I want to push the Spark training job to an EMR cluster that starts on demand, so that our main server is not consumed.
pio train -- --master=spark://ma...
1
votes
0
answer
230
Views
Running Parallel Threads in a PySpark Job
I'm trying to run parallel threads in a Spark job. This works without a hitch when I run the Python script from the CLI, but my understanding is that this does not really take advantage of the EMR cluster's parallel processing. It does not actually save the data when I run it as a Spark job. I'm not even...
1
votes
0
answer
223
Views
AWS EMR Cannot Find Mapper File From S3 Bucket - no such file or directory
I'm trying to run the following command
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://foobar/hadoop-samples/wordSplitter.py -input s3://foobar/hadoop-samples/input -output s3://foobar/wordcount/output/ -mapper wordSplitter.py -reducer aggregate -verbose
But Hadoop cannot find...
1
votes
1
answer
248
Views
Trouble accessing S3 from a Flink job on EMR
I have trouble accessing S3 from a Flink job.
If I submit the assembled jar for my job, I get an access denied error:
Caused by:
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Reques...
1
votes
0
answer
132
Views
SparkSession.read.csv from S3 gives java.lang.OutOfMemoryError: Java heap space (Command exiting with ret '137')
I have a Spark job that I stripped down completely to:
spark.read.option("delimiter", delimiter)
.schema(Encoders.product[MyData].schema)
.csv("s3://bucket/data/*/*.gz")
.as[MyData]
to isolate the error and it's still giving me a java.lang.OutOfMemoryError when running on AWS EMR on YARN. The total...
1
votes
0
answer
183
Views
Spark on EMR doesn't find my Python modules since EMR 5.11
I have run PySpark on AWS EMR since EMR 5.3 and never encountered this issue until I upgraded to EMR 5.11 or later. This is the full stack trace:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in sta...
1
votes
0
answer
293
Views
Read multiple JSON schemas with Spark
Software configuration:
Hadoop distribution: Amazon 2.8.3
Applications: Hive 2.3.2, Pig 0.17.0, Hue 4.1.0, Spark 2.3.0
I tried to read files with multiple JSON schemas:
val df = spark.read.option("mergeSchema", "true").json("s3a://s3bucket/2018/01/01/*")
It throws an error:
org.apache.spark.sql.AnalysisException...
1
votes
1
answer
391
Views
spark-submit from outside AWS EMR cluster
I have an AWS EMR cluster running spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode.
I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master no...
1
votes
0
answer
18
Views
Does EMR support HBase replication?
I am trying to test replication on two EMR clusters following the instructions mentioned here: http://blog.cloudera.com/blog/2012/07/hbase-replication-overview-2/
I could run all the steps just fine, but replication did not happen. Amazon's documentation is EXTREMELY UNCLEAR about whether they support repli...
1
votes
1
answer
275
Views
Copying files from HDFS to S3 on EMR cluster using S3DistCp
I am copying 800 Avro files, each around 136 MB, from HDFS to S3 on an EMR cluster, but I'm getting this exception:
18/06/26 10:53:14 INFO mapreduce.Job: map 100% reduce 91%
18/06/26 10:53:14 INFO mapreduce.Job: Task Id : attempt_1529995855123_0003_r_000006_0, Status : FAILED
Error: java.lang.RuntimeExc...
1
votes
0
answer
86
Views
Improving compute performance of Spark ML ALS
I have a spark job that performs Alternating Least Squares (ALS) on an implicit feedback ratings matrix. I create the ALS object as follows.
val als = new ALS()
.setCheckpointInterval(5)
.setRank(150)
.setAlpha(30.0)
.setMaxIter(25)
.setRegParam(0.001)
.setUserCol("userId")
.setItemCol("itemId")
.se...
1
votes
0
answer
364
Views
Error while querying Glue Catalog table using Athena
I have a table in the Glue catalog that was created by a Glue crawler after parsing JSON files in S3. When I query this table using Athena, I get the error below. A few things about this situation:
JSON files are in S3
The Glue crawler created tables in the Glue catalog using the JSON SerDe
table contai...
1
votes
0
answer
16
Views
dfs.FSnameSystem.BlockCapacity getting reduced eventually
I have a small application that I am running on an 'EMR' cluster with 3 nodes. I have a few gigabytes of CSV data split across multiple files. The application reads the CSV files and then converts them into '.orc' files. I have a small program that sequentially and synchronously sends limited (l...
1
votes
1
answer
282
Views
AWS-EMR error exit code 143
I'm running an analysis on AWS EMR, and I am getting an unexpected SIGTERM error.
Some background:
I'm running a script that reads in many csv files I have stored on S3, and then performs an analysis. My script is schematically:
analysis_script.py
import pandas as pd
from pyspark.sql import SQLCont...
1
votes
0
answer
204
Views
How to correctly launch an EMR cluster that uses Spark, via Java SDK (command-runner.jar vs directly referencing the JAR path)
If I'm trying to run a spark job on EMR using the SDK for Java, which approach is more correct? I've seen both approaches, but currently both break for me, so I'm not sure which one is the approach to take when it comes to creating a HadoopJarStepConfig.
Using command-runner.jar (as done in these tw...
1
votes
0
answer
57
Views
Where to find an Oozie Action's STDOUT, STDERR logs once AWS EMR is terminated?
When an EMR cluster is running, the Oozie STDOUT, STDERR and SYSLOG logs can be checked from Web UI, using the Hue application (if installed on the EMR clusters). Once the EMR cluster is terminated, we lose the option of viewing those logs from Hue.
I have read that AWS provides an option to store t...
1
votes
1
answer
39
Views
How do you point mrjob EMR to the right AWS account? I keep getting a ssh key invalid message
I have set .mrjob.conf like this (passwords changed):
runners:
  emr:
    aws_access_key_id: JKDJKAJSLKJAFKLJ
    aws_secret_access_key: RKLJDKAS/KLASJKFJKSJAKSALLKLKS
    ec2_key_pair: me-east
    ec2_key_pair_file: /Users/me/.ssh/me-east.pem
    ssh_tunnel: true
Then I run this on my local machine:
python my_script.py...
1
votes
0
answer
102
Views
Fair Scheduler without preemption on YARN
I have the following FairScheduler configuration for a M/R job on EMR 5.16.0 (Hadoop 2.8.3):
queue1 - weight 1.0
queue2 - weight 3.0
These 2 queues are under the root queue.
I start an application app1 on queue1 and given the fact there is nothing else running, the application will take 100% of the EMR c...
1
votes
1
answer
449
Views
Error on AWS EMR while exporting to S3: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
I'm trying to export data from the EMR master node to an S3 bucket, but it's failing.
While executing the line below from my PySpark code:
DF1
.coalesce(1)
.write
.format('csv')
.option('header','true')
.save('s3://fittech-bucket/emr/outputs/test_data')
the following error occurs:
An error occurred while calling o7...
1
votes
1
answer
422
Views
AWS EMR local disk encryption failed
Spinning up the EMR instance failed with an error:
Terminated with errors. On the master instance, local disk encryption failed due to internal error.
Any pointers would be helpful.
1
votes
0
answer
166
Views
Hive on AWS EMR does not work
I have a brand new EMR cluster (EMR version 5.16, Hadoop 2.8.4, and Hive 2.3.3). It's connected to the Glue Data Catalog. I can list the tables, describe, and even 'select *' successfully. But once I start a real query that has to run Tez or MR jobs on YARN, it fails. Even 'select count(*)' on a si...