Questions tagged [amazon-emr]

0 votes, 0 answers, 4 views

Writing to DSE graph from EMR

We are trying to write to a DSE graph (Cassandra) from EMR and keep getting these errors. My JAR is a shaded JAR with the BYOS dependencies. Any help would be appreciated. java.lang.UnsatisfiedLinkError: org.apache.cassandra.utils.NativeLibraryLinux.getpid()J at org.apache.cassandra.utils.N...
mat77
1 vote, 1 answer, 336 views

How to set livy.server.session.timeout on EMR cluster bootstrap?

I am creating an EMR cluster, and using jupyter notebook to run some spark tasks. My tasks die after approximately 1 hour of execution, and the error is: An error was encountered: Invalid status code '400' from https://xxx.xx.x.xxx:18888/sessions/0/statements/20 with error payload: 'requirement fail...
bill
1 vote, 1 answer, 183 views

XGBoost does not use all resources while running Spark in AWS EMR

I'm trying to do binary classification on a big dataset (5 million rows x 450 features) using the XGBoost Spark library in AWS EMR. I've tried many different configuration settings, such as the number of XGBoost workers, nthreads, spark.task.cpus, spark.executor.instances, and spark.executor.cores. Even though...
Bruno Brito
1 vote, 1 answer, 454 views

spark netty version issue

I am using a Redis client in a Spark job and getting an exception: java.lang.NoSuchMethodError: io.netty.bootstrap.Bootstrap.config()Lio/netty/bootstrap/BootstrapConfig; at org.redisson.client.RedisClient$1$1.operationComplete(RedisClient.java:234) It's due to a netty version mismatch. Spark used netty vers...
Shushant Arora
1 vote, 0 answers, 97 views

Block missing due to resizing an Amazon EMR Cluster

We resize Amazon EMR cluster nodes from the console. When we added core nodes to the cluster, a BlockMissingException occurred for a few /user/oozie/share/lib/ jars. The replication factor for /user/oozie/share/lib/ is 3, while the default replication factor is 1. Initially the cluster had 3 core nodes but when...
Pooja Soni
1 vote, 0 answers, 263 views

log4j:ERROR Could not find value for key log4j.appender.CLA - EMR

I am trying to run Sqoop to pull data from MSSQL and load it to S3. I am doing this using the Hue UI in AWS EMR; my Sqoop version is 1.4.6. Here is the command: import --connect jdbc:sqlserver://xxx.xxx.xx.xx/DataWarehouse --username **** --password *** --table ****_20120101 --target-dir s3...
ds_user
1 vote, 0 answers, 224 views

spark sql dataframe write to S3 failed with “Error closing multipart upload”

I have a PySpark job that runs on an AWS EMR cluster with EMR 5.11.0, Spark 2.2.1. The Spark job needs to write a fairly big data frame (~100GB) to S3. At first I tried writing directly to S3 as follows: df = .... # calculate the data frame df.write.mode('append').parquet('s3://...') Then the writ...
seiya
1 vote, 0 answers, 209 views

Equivalent EC2 instance for 1 DPU (in AWS Glue)

Can someone please point me to an equivalent EMR instance/pricing for one DPU in AWS Glue (which has 4 vCPUs & 16 GB memory)? I need to do some cost comparisons for a scheduled ETL job. For us, cost per month is a criterion for deciding between EMR and AWS Glue. Looking forward to some input. Th...
Yuva
1 vote, 0 answers, 251 views

Unable to run Presto LDAPS from SQL workbench

I am unable to execute any query from SQL Workbench/J against AWS EMR Presto, which is LDAPS (SSL/secure LDAP) enabled. The details are as follows: Connection String: jdbc:presto://hostname:8446/hive?SSL=true username=admin password=**** I can connect to it successfully, but while executing any query (l...
Aditya Tiwari
1 vote, 1 answer, 443 views

saveAsTable for column with spaces failing

I have a piece of PySpark code that converts a dataframe into a physical table: df.write.mode('overwrite').saveAsTable('sometablename') If the dataframe, df, contains columns which have spaces in their names, it fails with the following error: 18/03/08 10:33:29 ERROR CreateDataSourceTableAsSelectC...
Sid
1 vote, 0 answers, 16 views

AWS Data pipeline postStepCommand unable to access INPUT1_STAGING_DIR

In the EMR Activity of a Data Pipeline, I am trying to use postStepCommand (as documented here) to invoke a shell script. As part of it, I am trying to access the standard directory paths ${INPUT1_STAGING_DIR} and ${OUTPUT1_STAGING_DIR}, but it seems unable to access their values. Is this by design?
1 vote, 2 answers, 62 views

AWS's own example of submitting Pig job does not work due to issue with piggybank.jar

I have been trying to test out submitting Pig jobs on AWS EMR following Amazon's guide. I made the change to the Pig script to ensure that it can find the piggybank.jar as instructed by Amazon. When I run the script I get an ERROR 1070 indicating that one of the functions available in piggybank cannot be...
dtilson
1 vote, 1 answer, 541 views

Convert JSON to ORC [AWS]

This is my situation: I have an application that rotates JSON files to an S3 bucket. I need to convert those files to ORC format so they can be queried from Athena or EMR. My first attempt was a Lambda written in Node, but I didn't find any module for the conversion. I think it can be done more...
justMiLa
1 vote, 1 answer, 589 views

AWS EMR Spark is not loading MainClass using custom Jar

I'm trying to create an EMR Spark cluster with a single custom step. The cluster is created successfully; however, the step is not correctly defined. UPDATE: I tried to launch the same cluster via the web console and got the same results. While I specify the Jar location when I save the step the JAR lo...
Charles Green
1 vote, 0 answers, 318 views

PySpark fails with exit code 52

I have an Amazon EMR cluster running, to which I submit jobs using the spark-submit shell command. The way I call it: spark-submit --master yarn --driver-memory 10g convert.py The convert.py script runs using PySpark with Python 3.4. After reading a text file into an RDD, calling any method...
Vlad
1 vote, 1 answer, 114 views

Amazon EMR: How to add Amazon EMR MapReduce/Hive/Spark steps with inline shell script in the arguments?

For example, I have two Hive jobs, where the output of one job is used as an argument/variable in the second job. I can successfully run the following command in a terminal on the master node of the EMR cluster to get my result. [[email protected] ~]$ hive -f s3://MyProjectXYZ/bin/GetNewJobDetails_...
1 vote, 1 answer, 298 views

Spark no such field METASTORE_CLIENT_FACTORY_CLASS

I am trying to query a hive table using spark in Java. My hive tables are in an EMR cluster 5.12. Spark version is 2.2.1 and Hive 2.3.2. When I ssh into the machine and I connect to the spark-shell I am able to query the hive tables with no issues. But when I try to query using a custom jar then I...
ggeo
1 vote, 0 answers, 94 views

Can't reach expected download throughput from S3 to m4.16xlarge instance

I'm processing data on an m4.16xlarge instance using the Spark framework. I have to download many TB of data from S3, distributed across many files of ~128MB each. According to the EC2 instance's specs, I should have up to 20Gbps of network bandwidth, which would translate to ~2GB/s. Howeve...
enzo
1 vote, 0 answers, 330 views

EMR bootstrap installing python modules - bootstrap action 1 returned a non-zero return code

I am trying to install python modules through the EMR Console bootstrap actions by following these steps: Uploading a file to: s3://mybucket/bootstrap/install_python_modules.sh containing the following script: #!/bin/sh set -e -x sudo apt-get install python-setuptools sudo easy_install pip sudo pip...
Simonidas
1 vote, 0 answers, 101 views

PredictionIO training on EMR Spark Cluster on demand

I want to reduce load on a single server which is currently hosting the EventServer, PredictionServer and training. Obviously, this will not be scalable. I want to push the Spark training job to an EMR cluster which starts on demand, so that our main server is not consumed. pio train -- --master=spark://ma...
ANKIT HALDAR
1 vote, 0 answers, 230 views

Running Parallel Threads in a PySpark Job

I'm trying to run parallel threads in a Spark job. This works without a hitch when I run the Python script from the CLI, but my understanding is that this does not really capitalize on the EMR cluster's parallel processing benefits. It does not actually save the data when I run it as a Spark job. I'm not even...
Robin Tanner
1 vote, 0 answers, 223 views

AWS EMR Cannot Find Mapper File From S3 Bucket - no such file or directory

I'm trying to run the following command hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://foobar/hadoop-samples/wordSplitter.py -input s3://foobar/hadoop-samples/input -output s3://foobar/wordcount/output/ -mapper wordSplitter.py -reducer aggregate -verbose But Hadoop cannot find...
1 vote, 1 answer, 248 views

Trouble accessing S3 from Flink job on EMR

I have trouble accessing S3 from a Flink job. If I submit my assembled jar for my job, I get an access denied error: Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Reques...
Phil
1 vote, 0 answers, 132 views

SparkSession.read.csv from S3 gives java.lang.OutOfMemoryError: Java heap space (Command exiting with ret '137')

I have a spark job that I stripped down completely to: spark.read.option('delimiter', delimiter) .schema(Encoders.product[MyData].schema) .csv('s3://bucket/data/*/*.gz') .as[MyData] to isolate the error and it's still giving me a java.lang.OutOfMemoryError when running on AWS EMR on YARN. The total...
Eric
1 vote, 0 answers, 183 views

spark on EMR doesn't find my python modules since EMR 5.11

I have run PySpark on AWS EMR since EMR 5.3 and had never encountered this issue until I upgraded to EMR 5.11 or later. This is the full stacktrace: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in sta...
Dudu Lemberberg
1 vote, 0 answers, 293 views

Read multiple JSON schemas with Spark

Software Configuration: Hadoop distribution: Amazon 2.8.3 Applications: Hive 2.3.2, Pig 0.17.0, Hue 4.1.0, Spark 2.3.0 Tried to read files with multiple JSON schemas: val df = spark.read.option('mergeSchema', 'true').json('s3a://s3bucket/2018/01/01/*') It throws an error: org.apache.spark.sql.AnalysisException...
Kannaiyan
1 vote, 1 answer, 391 views

spark-submit from outside AWS EMR cluster

I have an AWS EMR cluster running spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode. I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master no...
mm_857
1 vote, 0 answers, 18 views

Does EMR support Hbase Replication

I am trying to test replication on two EMR clusters following the instructions mentioned here: http://blog.cloudera.com/blog/2012/07/hbase-replication-overview-2/ I could run all the steps just fine but replication did not happen. Amazon's documentation is EXTREMELY UNCLEAR about whether they support repli...
Alchemist
1 vote, 1 answer, 275 views

Copying files from HDFS to S3 on EMR cluster using S3DistCp

I am copying 800 Avro files, each around 136 MB, from HDFS to S3 on an EMR cluster, but I'm getting this exception: 18/06/26 10:53:14 INFO mapreduce.Job: map 100% reduce 91% 18/06/26 10:53:14 INFO mapreduce.Job: Task Id : attempt_1529995855123_0003_r_000006_0, Status : FAILED Error: java.lang.RuntimeExc...
Waqar Ahmed
1 vote, 0 answers, 86 views

Improving compute performance of Spark ML ALS

I have a spark job that performs Alternating Least Squares (ALS) on an implicit feedback ratings matrix. I create the ALS object as follows. val als = new ALS() .setCheckpointInterval(5) .setRank(150) .setAlpha(30.0) .setMaxIter(25) .setRegParam(0.001) .setUserCol('userId') .setItemCol('itemId') .se...
Nik
1 vote, 0 answers, 364 views

Error while querying Glue Catalog table using Athena

I have a table in the Glue catalog which was created by a Glue crawler after parsing JSON files in S3. Now when I query this table using Athena, I get the error below. A few things about this situation - JSON files are in S3 Glue crawler created tables in glue catalog using json serde table contai...
Aashish Ola
1 vote, 0 answers, 16 views

dfs.FSnameSystem.BlockCapacity getting reduced eventually

I have a small application that I am running on a 'EMR' Cluster with 3 nodes. I have a few gigabytes of csv files that are split across multiple files. The application reads the csv files and then converts into '.orc' files. I have a small program that sequentially and synchronously sends limited (l...
Sai Kumar
1 vote, 1 answer, 282 views

AWS-EMR error exit code 143

I'm running an analysis on AWS EMR, and I am getting an unexpected SIGTERM error. Some background: I'm running a script that reads in many csv files I have stored on S3, and then performs an analysis. My script is schematically: analysis_script.py import pandas as pd from pyspark.sql import SQLCont...
cracka31
1 vote, 0 answers, 204 views

How to correctly launch an EMR cluster that uses Spark, via Java SDK (command-runner.jar vs directly referencing the JAR path)

If I'm trying to run a spark job on EMR using the SDK for Java, which approach is more correct? I've seen both approaches, but currently both break for me, so I'm not sure which one is the approach to take when it comes to creating a HadoopJarStepConfig. Using command-runner.jar (as done in these tw...
Jeremy Lin
1 vote, 0 answers, 57 views

Where to find an Oozie Action's STDOUT, STDERR logs once AWS EMR is terminated?

When an EMR cluster is running, the Oozie STDOUT, STDERR and SYSLOG logs can be checked from Web UI, using the Hue application (if installed on the EMR clusters). Once the EMR cluster is terminated, we lose the option of viewing those logs from Hue. I have read that AWS provides an option to store t...
1 vote, 1 answer, 39 views

How do you point mrjob EMR to the right AWS account? I keep getting a ssh key invalid message

I have set .mrjob.conf like this (passwords changed): runners: emr: aws_access_key_id: JKDJKAJSLKJAFKLJ aws_secret_access_key: RKLJDKAS/KLASJKFJKSJAKSALLKLKS ec2_key_pair: me-east ec2_key_pair_file: /Users/me/.ssh/me-east.pem ssh_tunnel: true Then I run this on my local machine: python my_script.py...
TheSneak
1 vote, 0 answers, 102 views

Fair Scheduler without preemption on YARN

I have the following FairScheduler configuration for a M/R job on EMR 5.16.0 (Hadoop 2.8.3): queue1 - weight 1.0 queue2 - weight 3.0 These 2 queues are under the root queue. I start an application app1 on queue1 and, given that nothing else is running, the application will take 100% of the EMR c...
LaviniaS
1 vote, 1 answer, 449 views

Error on AWS EMR while exporting to S3: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found

I'm trying to export data from the EMR master node to an S3 bucket, but it's failing. While executing the line of code below from my PySpark code: DF1 .coalesce(1) .write .format('csv') .option('header','true') .save('s3://fittech-bucket/emr/outputs/test_data') the error below occurs: An error occurred while calling o7...
Sreeni
1 vote, 1 answer, 422 views

AWS EMR local disk encryption failed

The EMR instance failed to spin up, with the error: Terminated with errors. On the master instance, local disk encryption failed due to internal error. Any pointers would be helpful.
Manoj4068
1 vote, 0 answers, 166 views

Hive on AWS EMR does not work

I have a brand new EMR cluster (EMR version 5.16, Hadoop 2.8.4, and Hive 2.3.3). It's connected to the Glue Data Catalog. I can list the tables, describe, and even 'select * ' successfully. But once I start a real query that has to run a Tez or MR job on YARN, it fails. Even 'select count(*) ' on a si...
AWS_Newbie
