Questions tagged [spark-streaming]

1 vote · 1 answer · 92 views

Java Spark: com.mongodb.spark.config.writeconfig issue

I am trying to connect to MongoDB via the Java Spark connector, and I get a 'com.mongodb.spark.config.writeconfig' error when I submit the jar and run it in spark-shell. Here is the error screenshot. Could you please help me resolve this issue? I have tried this as well, but with no success. $...
Tom Swayer
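
A minimal sketch of the usual write path with the MongoDB Spark connector, assuming spark-shell (so `sc` is available) and placeholder collection settings; the error in the question typically means the connector jar was not put on the classpath at submit time:

    import com.mongodb.spark.MongoSpark
    import com.mongodb.spark.config.WriteConfig
    import org.bson.Document

    // Placeholder options; defaults come from spark.mongodb.output.uri
    // on the SparkContext's configuration.
    val writeConfig = WriteConfig(
      Map("collection" -> "coll", "writeConcern.w" -> "majority"),
      Some(WriteConfig(sc)))

    val docs = sc.parallelize((1 to 5).map(i => Document.parse(s"""{"value": $i}""")))

    // Fails at runtime if the mongo-spark-connector jar was not passed
    // to spark-shell/spark-submit (e.g. via --packages).
    MongoSpark.save(docs, writeConfig)
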
1 vote · 0 answers · 224 views

Spark UI storage tab shows more RDDs

I am running a spark streaming application. The app uses MapWithStateRDD to manage state across batches. App and setup details: nodes: 2; memory per executor: 3 GB; number of partitions for MapWithStateRDD: 200; standalone mode; batch size: 20 sec; timeout duration: 1 minute; checkpointing enabled; checkpoint...
scorpio
1 vote · 1 answer · 265 views

How is spark.streaming.blockInterval related to RDD partitions?

What is the difference between blocks in spark.streaming.blockInterval and RDD partitions in Spark Streaming? Quoting Spark Streaming 2.2.0 documentation: For most receivers, the received data is coalesced together into blocks of data before storing inside Spark’s memory. The number of blocks in...
Rasika Gayani
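
As a back-of-the-envelope illustration of what the quoted documentation implies for a single receiver-based stream (the numbers are illustrative, not from the question):

    // Each block created during a batch interval becomes one partition
    // of that batch's RDD, so for one receiver:
    val batchIntervalMs = 10000L // 10-second batches
    val blockIntervalMs = 200L   // spark.streaming.blockInterval
    val partitionsPerBatch = batchIntervalMs / blockIntervalMs // = 50
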
1 vote · 0 answers · 31 views

Importing .ser files (Serialized Java Object) into Apache Spark Streaming

I have a number of .ser files generated by a Java application of mine. I'd like them to be read by a Spark Streaming Java application, but I can't find the right way to get this done. I have read some of the guides in the Spark documentation. Can anyone help me? Thank you.
sirdan13
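
One plausible approach, sketched under the assumption that the .ser files hold objects written with java.io.ObjectOutputStream and that the serialized class is on the executor classpath (the input path is a placeholder):

    import java.io.ObjectInputStream
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ser-import").getOrCreate()

    // binaryFiles yields (path, PortableDataStream) pairs, one per file.
    val objects = spark.sparkContext
      .binaryFiles("/data/incoming/*.ser") // placeholder path
      .map { case (_, stream) =>
        val in = new ObjectInputStream(stream.open())
        try in.readObject() finally in.close()
      }
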
1 vote · 1 answer · 317 views

Spark | For a synchronous request/response use case

Spark newbie alert. I've been exploring ideas to design a solution for a requirement that involves the following: building a base predictive model for linear regression (a one-off activity); passing data points in to get the value of the response variable; doing something with the result. At regular intervals update...
user1189332
1 vote · 0 answers · 325 views

Spark CosmosDB Sink: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame

I am reading a data stream from Event Hub in Spark (using Databricks). My goal is to be able to write the streamed data to a CosmosDB. However, I get the following error: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame. Is this scenario not supported?...
fbeltrao
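
The error itself is about the API surface: .write is the batch writer, and a streaming Dataset has to go through .writeStream. A generic sketch of that shape, where streamingDf stands for the Event Hub stream and the actual Cosmos DB hand-off is left as a placeholder:

    import org.apache.spark.sql.{ForeachWriter, Row}

    val query = streamingDf.writeStream
      .foreach(new ForeachWriter[Row] {
        def open(partitionId: Long, version: Long): Boolean = true
        def process(row: Row): Unit = {
          // placeholder: pass the row to a Cosmos DB client here
        }
        def close(errorOrNull: Throwable): Unit = ()
      })
      .option("checkpointLocation", "/tmp/cosmos-checkpoint") // placeholder path
      .start()
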
1 vote · 1 answer · 244 views

Apache Spark Streaming Custom Receiver (Text File) using Java

I'm new to Apache Spark. I need to read log files from a local/mounted directory. An external source writes the files into the local/mounted directory; e.g., it writes logs into a combined_file.txt file, and once writing is complete it creates a new file with the prefix 0_,...
PVH
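
A minimal sketch of directory monitoring with textFileStream, assuming the external source finishes writing elsewhere and then moves/renames the file into the watched directory (paths and intervals are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("dir-monitor").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Only files that appear in the directory after the context starts
    // (atomically, e.g. via rename) are picked up.
    val lines = ssc.textFileStream("file:///mnt/logs/incoming")
    lines.print()

    ssc.start()
    ssc.awaitTermination()
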
1 vote · 0 answers · 775 views

Spark structured streaming with secured Kafka freezes on a log message [INFO StreamExecution: Starting new streaming query.]

In order to use structured streaming in my project, I am testing Spark 2.2.0 and Kafka 0.10.1 integration with Kerberos on my Hortonworks 2.6.3 environment. I am running the sample code below to check the integration. I am able to run the program in IntelliJ in Spark local mode with no issues, but...
nilesh1212
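
For reference, security-related consumer properties are passed to the Kafka source with the kafka. prefix; a sketch with placeholder broker and topic names (the JAAS/keytab configuration itself happens outside this code, via spark-submit options):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1.example.com:6667") // placeholder
      .option("subscribe", "test-topic")                             // placeholder
      .option("kafka.security.protocol", "SASL_PLAINTEXT")
      .load()
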
1 vote · 0 answers · 240 views

Resource not found exception using kinesis spark streaming

I am trying to use Spark structured streaming with Kinesis. The following is my code. val kinesis = spark.readStream.format("kinesis").option("streams", streamName).option("region", "us-east-1").option("initialPosition", "TRIM_HORIZON").option("endpointUrl", "kinesis.us-east-1.amazonaws.com").op...
mc29
1 vote · 1 answer · 360 views

auto scale spark cluster

I have a spark streaming job running on a cluster. The Spark job pulls messages from Kafka and does the required processing before dumping the processed data into a database. I have sized my cluster for the current load, but this load requirement may go up or down in the future. I want to know the techniques...
scorpio
1 vote · 0 answers · 264 views

Spark Streaming groupBy on RDD vs Structured Streaming groupBy on DF (Scala/Spark)

We have a high-volume streaming job (Spark/Kafka), and the data (Avro) needs to be grouped by a timestamp field inside the payload. We are doing groupBy on an RDD to achieve this: RDD[Timestamp, Iterator[Records]]. This works well for moderate record volumes, but for loads like 150k records every 10 seconds, the...
Shawn
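
For comparison, a sketch of the Structured Streaming version of the same grouping, assuming the Avro payload has already been parsed into a streaming DataFrame called records with an eventTime column (names are placeholders):

    import org.apache.spark.sql.functions.{col, window}

    val grouped = records
      .withWatermark("eventTime", "1 minute")      // bounds the kept state
      .groupBy(window(col("eventTime"), "10 seconds"))
      .count()
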
1 vote · 0 answers · 65 views

Spark Inactivity - Spark driver stops reading data over TCP streaming after a few minutes

I am facing a strange issue and have tried using a custom receiver as well. The issue: the Spark driver/executor stops receiving and displaying data on stdout after about 5 minutes of inactivity. It keeps working if we keep writing data to the server socket at the other end. No error is reported in the driver or executor log...
Tofeeq
1 vote · 1 answer · 149 views

Spark Kafka streaming job doesn't discover coordinator when deployed in DC/OS (Mesos)

I implemented a Spark streaming job in Java following the instructions at https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html. It works perfectly when the Kafka and Cassandra servers are standalone. Extract of the logs: ... value.deserializer = class org.apache.kafka.common.serializatio...
Daniel Gutierrez
1 vote · 0 answers · 63 views

CDH Spark Streaming consumer for Kerberos-secured Kafka

Has anyone tried to use Spark Streaming (PySpark) as a consumer for Kerberos-secured Kafka in CDH? I searched CDH and only found some examples in Scala. Does that mean CDH does not support this? Can anyone help with this?
znever
1 vote · 0 answers · 38 views

checkpoint variables used in Spark driver

I am streaming data from Kafka and also maintaining state in my application (using updateStateByKey), so I necessarily need to checkpoint my data. This is working well. In addition to data from Kafka, I am also using some local variables to keep information like total records, and some in...
Amanpreet Khurana
1 vote · 0 answers · 475 views

How to read data from multiple kafka topics and compare the data objects

I want to read data from two topics (one for the existing application, the second for the same application with a performance optimization) and measure whether the data coming in is the same after the performance optimization. My Kafka topics consist of data in the form of case class Json(id: String, event: String, ingestion_...
Shrey
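
A sketch of one way to consume both topics in a single Structured Streaming source, keeping the topic column so the two variants can be compared downstream (broker and topic names are placeholders):

    val both = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "app-topic,app-topic-optimized") // comma-separated list
      .load()
      .selectExpr("topic", "CAST(value AS STRING) AS json")
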
1 vote · 0 answers · 348 views

Which spark-sql-kafka to use for Apache Kafka 1.0.0?

Which spark-sql-kafka jar should I use as a dependency when running spark-submit for Apache Kafka 1.0.0? I use Scala 2.11 and Spark 2.2.1. I would like to run a Spark Structured Streaming application.
Erhan
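
For Scala 2.11 and Spark 2.2.1, the matching artifact is the 0-10 connector built for that Spark version (the 0.10.x Kafka client protocol also talks to 1.0.0 brokers); as an sbt sketch:

    // build.sbt; with spark-submit the same coordinates can go to --packages
    libraryDependencies +=
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.1"
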
1 vote · 0 answers · 137 views

Spark receiving Data from Kafka topic

I am really new to Spark and Kafka, so I wondered if there is an easy method to 'read', so to say, a Kafka topic with Spark Streaming in Java. All I could find was that I have to create a JavaInputDStream that subscribes to a certain topic. I did that and tried to test a few functions of the...
maxness
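
A sketch of the direct-stream setup (shown here in Scala; the Java version is structurally identical), assuming an existing StreamingContext ssc and placeholder broker, group, and topic names:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "auto.offset.reset"  -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("mytopic"), kafkaParams))

    stream.map(_.value).print() // each record is a ConsumerRecord
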
1 vote · 1 answer · 107 views

Spark Streaming on Kafka: print different cases for different values from Kafka

I am stating my scenario below: 10,000 servers are sending DF size data (every 5 seconds, 10,000 inputs arrive). If for any server the DF size used is more than 70%, print 'increase the ROM size by 20%'. If for any server the DF size used is less than 30%, print 'decrease the ROM size by 25%'. I am providi...
s.c.
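
The per-record decision itself is a simple threshold check; a sketch assuming each input carries a server id and a used-space percentage (the field names and record shape are invented):

    case class DfReport(server: String, usedPct: Double) // hypothetical shape

    def advise(r: DfReport): Option[String] = r.usedPct match {
      case p if p > 70 => Some(s"${r.server}: increase the ROM size by 20%")
      case p if p < 30 => Some(s"${r.server}: decrease the ROM size by 25%")
      case _           => None // between 30% and 70%: no action
    }
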
1 vote · 0 answers · 568 views

What should I do if an executor node suddenly dies in spark-streaming?

I am using version 1.6 of spark streaming. A few days ago, my spark streaming app (context) suddenly shut down. Looking at the log, one of the executors seems to have shut down. (The equipment was actually turned off.) What should I do in case this happens? (Note that the dynamic allocation option is not...
Dogil
1 vote · 0 answers · 90 views

Continuous join on two MySQL tables using Spark

Suppose there are two tables in a MySQL database and we are going to run a join on them based on a common column. The point is, the data in both tables is growing, so every second some rows are added to each table.

Table 1:
---ID-----SomeColumn------CommonColumn----
---1----- data row 1 ---------...
Hossein Bakhtiari
1 vote · 0 answers · 144 views

Spark Streaming checkpointing + mapWithState + async operations

In my Spark Streaming job I use mapWithState, which obligates me to enable checkpointing. A job is triggered in each batch by the operation stream.foreachRDD(rdd.foreachPartition()). In this situation the job was checkpointing every 10 minutes (batches of 1 minute). Now, I have changed the output o...
DLanza
1 vote · 1 answer · 58 views

Separate StreamingContext from Receiver in Spark Streaming

I'd like to generalize the reception in my Main. After setting up the SparkConf and the JavaStreamingContext, I'd like to receive an arbitrary object and then pass it to the analyzer. In the cases below I get an exception: Task not serializable. Main.java /** * **/ SparkConf conf = new SparkConf().setMa...
Francesco
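
The usual cure for Task not serializable in this shape of code is to construct the non-serializable helper on the executor side rather than in Main; a sketch, with Analyzer standing in for whatever class does the analysis:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Built per partition on the executor, so nothing from the
        // driver's Main has to be serialized into the closure.
        val analyzer = new Analyzer() // placeholder class
        records.foreach(analyzer.analyze)
      }
    }
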
1 vote · 0 answers · 503 views

Spark saveAsTextFile overwrites file after each batch

I am currently trying to use Spark streaming to get input from a Kafka topic and save that input to a JSON file. I have got as far as saving my InputDStream as a text file, but the problem is that after each batch the file gets overwritten, and it seems like I cannot do anything about thi...
maxness
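
saveAsTextFile always rewrites its target path, so the common workaround is to derive a fresh directory per batch from the batch time; a sketch, where dstream and the base output path are placeholders:

    dstream.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        // One directory per micro-batch, e.g. /output/batch-1518000000000
        rdd.saveAsTextFile(s"/output/batch-${time.milliseconds}")
      }
    }
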
1 vote · 1 answer · 296 views

Spark Stateful Streaming with DataFrame

Is it possible to use a DataFrame as a State / StateSpec for Spark Streaming? The current StateSpec implementation seems to allow only key-value pair data structures (mapWithState etc.). My objective is to keep a fixed-size FIFO buffer as a StateSpec that gets updated every time new data streams in. I...
user9395367
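
A DataFrame cannot serve as the state directly, but the FIFO idea can be approximated per key with an ordinary bounded collection; a sketch, with placeholder key/value types and capacity:

    import org.apache.spark.streaming.{State, StateSpec}

    val capacity = 100 // placeholder buffer size

    def update(key: String, value: Option[Double],
               state: State[Vector[Double]]): Unit = {
      val buf = state.getOption.getOrElse(Vector.empty[Double])
      // Append the new value and trim from the front: a fixed-size FIFO.
      state.update((buf ++ value).takeRight(capacity))
    }

    val spec = StateSpec.function(update _) // pass to mapWithState
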
1 vote · 0 answers · 96 views

Spark Streaming: in-memory aggregation - correct usage

I have a Spark 2.2 Structured Streaming flow from an on-premise system into a containerized cloud Spark cluster, where Kafka receives the data and SSS maintains a number of queries that flush to disk every ten seconds. A query console sink is not accessible to external sessions outside the streaming...
MarkTeehan
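
Within the same Spark session, a memory sink is one documented way to make a running aggregation queryable; a sketch where aggDf stands for the aggregating stream and the table name is a placeholder:

    val query = aggDf.writeStream
      .outputMode("complete")
      .format("memory")        // registers an in-memory table
      .queryName("agg_table")  // placeholder name
      .start()

    // Any code sharing this SparkSession can now poll the table:
    spark.sql("SELECT * FROM agg_table").show()
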
1 vote · 0 answers · 727 views

java.lang.OutOfMemoryError: Java heap space spark streaming job

I have a spark streaming job. It operates in batches of 10 minutes. The driver machine is an m4.4xlarge (64 GB) EC2 instance. The job stalled after 18 hours. It crashes with the following exception. As I read the other posts, it seems that the driver may have run out of memory. How can I check this? My pyspark c...
dvshekar
1 vote · 0 answers · 94 views

How to handle delayed events per group in spark

Spark's watermark feature comes in handy when it comes to delayed events. But I am not sure how to handle a scenario where the stream is generated from multiple devices in the field, and some devices may be reporting events a bit late. If we apply a watermark, the eventTime watermark is maintained in spark aga...
nids
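
For reference, the watermark is a single, global event-time threshold; a sketch of the per-device windowed count it gates, where the stream and column names are placeholders:

    import org.apache.spark.sql.functions.{col, window}

    val counts = events
      .withWatermark("eventTime", "10 minutes") // one global threshold
      .groupBy(col("deviceId"), window(col("eventTime"), "5 minutes"))
      .count()
    // A device lagging more than 10 minutes behind the global maximum
    // event time has its late rows dropped, which is the crux here.
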
1 vote · 0 answers · 57 views

Spark Structured Streaming - add a batch column with value from currentBatchId

Currently, we are trying to add a new column to the incoming streaming dataframe with the value of currentBatchId from StreamExecution. This is required by our use case. Is that possible?
Hiuming Wong
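
If upgrading is an option, Spark 2.4's foreachBatch exposes the batch id directly, which can be attached as a literal column; a sketch with a placeholder stream and output path:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.lit

    streamDf.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.withColumn("batchId", lit(batchId))
          .write.mode("append").parquet("/output/path") // placeholder
      }
      .start()
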
1 vote · 0 answers · 132 views

PySpark streaming with Kafka error

I am using Spark 2.1.0 with Kafka 0.9 in a MapR environment. I am trying to read from a Kafka topic into Spark streaming. However, I am facing the error below when I run the KafkaUtils createDirectStream command. py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.streami...
Sachin Shetty
1 vote · 1 answer · 995 views

Kerberos ticket renewal on Spark streaming job that communicates to Kafka

I have a long-running Spark streaming job that runs on a Kerberized Hadoop cluster. It fails every few days with the following error: Diagnostics: token (token for XXXXXXX: HDFS_DELEGATION_TOKEN [email protected], renewer=yarn, realUser=, issueDate=XXXXXXXXXXXXXXX, maxDate=XXXXXXXXXX, sequenceN...
David Chen
1 vote · 0 answers · 181 views

Unable to read Kinesis stream from SparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Milliseconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
import com.amazonaws.auth.AWSCredentials
import co...
grepIt
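
For orientation, the receiver-based reader from spark-streaming-kinesis-asl looks roughly like this, assuming an existing StreamingContext ssc (the app name, stream name, and region are placeholders):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Milliseconds
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val stream = KinesisUtils.createStream(
      ssc, "my-app", "my-stream",                       // placeholders
      "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
      InitialPositionInStream.TRIM_HORIZON,
      Milliseconds(2000), StorageLevel.MEMORY_AND_DISK_2)
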
1 vote · 0 answers · 98 views

How to put spark streaming data from kafka to JSONArray

Kafka is producing data, whereas Spark is consuming it. JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(1000)); Map kafkaParams = new HashMap(); kafkaParams.put("metadata.broker.list", "localhost:9092"); Set topics = Collections.singleton("mytopic"); JavaPairInputDStream dire...
Praveen Mandadi
1 vote · 0 answers · 272 views

How to create a KinesisSink for Spark Structured Streaming

I am using Spark 2.2 on Databricks and trying to implement a Kinesis sink to write from Spark to a Kinesis stream. I am using the following provided sample from https://docs.databricks.com/_static/notebooks/structured-streaming-kinesis-sink.html /** * A simple Sink that writes to the given Amazo...
Eoin Lane
1 vote · 0 answers · 23 views

Resource usage by Spark Receivers

As per the Spark Streaming guide, a DStream is associated with a single receiver. To attain read parallelism, multiple receivers, i.e. multiple DStreams, need to be created. A receiver runs within an executor and occupies one core. Ensure that there are enough cores for processing after receiver slo...
Saheb
1 vote · 1 answer · 455 views

Updating Data in MongoDB from Apache Spark Streaming

I am using the Scala API of Apache Spark Streaming to read from a Kafka server in a window with a size of one minute and a slide interval of one minute. The messages from Kafka contain a timestamp from the moment they were sent and an arbitrary value. Each of the values is supposed to be reducedByKeyAnd...
MichaelN
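
A sketch of the windowed reduction described, assuming the stream has already been mapped to (key, value) pairs called pairs (the combine function is a placeholder):

    import org.apache.spark.streaming.Minutes

    // One-minute tumbling window: window length equals slide interval.
    val reduced = pairs.reduceByKeyAndWindow(
      (a: Double, b: Double) => math.max(a, b), // placeholder combine
      Minutes(1),  // window length
      Minutes(1))  // slide interval
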
1 vote · 0 answers · 226 views

Using Spark fileStream with Avro Data Input

I'm trying to create a Spark Streaming application using fileStream(). The documentation specifies: streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory). I need to pass KeyClass, ValueClass, InputFormatClass. My main question is what I can use for these parameters for...
Nk.Pl
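
One combination known from the avro-mapred/avro-mapreduce packages, sketched with an existing StreamingContext ssc and a placeholder directory (the Avro jars must be on the classpath):

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    // Key = AvroKey[GenericRecord], Value = NullWritable,
    // InputFormat = AvroKeyInputFormat[GenericRecord]
    val avroStream = ssc.fileStream[
      AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]]("/data/avro-input")
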
1 vote · 3 answers · 570 views

Scala spark - Dealing with Hierarchy data tables

I have a data table with a hierarchical data model with tree structures. For example, here is a sample data row:

-------------------------------------------
Id | name    | parentId | path | depth
-------------------------------------------
55 | Canada  | null     | null | 0
77 | Ontario | 55...
1 vote · 1 answer · 111 views

Why are there so many MapWithStateRDDs in Spark

I have a Spark streaming application that uses stateful transformations quite a bit. In retrospect, Spark might not have been the best choice, but I'm still trying to make it work. My question is: why do my MapWithStateRDDs take up so much memory? As an example, I have a transformation where the...
qwe2
1 vote · 1 answer · 77 views

Understanding Spark UI for a streaming application

I am trying to understand what the entries in my Spark UI signify. Calling an action results in the creation of a job. I am finding it hard to understand: how many of these jobs get created, and is that proportional to the number of micro-batches? What does the Duration column signify? What is the effect...
vkr
