Questions tagged [spotify-scio]

1 vote · 1 answer · 39 views

Scio testing: counters that were not accessed

I'm building some tests around my pipeline. In particular, it has two branches (one where errors are handled, another where successes are); on the error side I have an incrementing counter (ScioMetrics.counter("MetricName").inc()), and when building the tests for the other branch I want to assert t...
Carlos
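
For the question above, here is a minimal sketch of asserting on a counter from a test. The job, its IO identifiers and the metric name are all hypothetical, and the `counter` assertion on the JobTest builder should be checked against your scio-test version.

    import com.spotify.scio._
    import com.spotify.scio.io.TextIO
    import com.spotify.scio.testing._

    // Hypothetical job with two branches; the error branch bumps a counter.
    object SplitJob {
      val errorCounter = ScioMetrics.counter("MetricName")

      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        val lines = sc.textFile(args("input"))
        lines.filter(_.startsWith("ERROR"))
          .map { e => errorCounter.inc(); e }          // error branch
          .saveAsTextFile(args("errors"))
        lines.filter(!_.startsWith("ERROR"))           // success branch
          .saveAsTextFile(args("successes"))
        sc.close()
      }
    }

    class SplitJobTest extends PipelineSpec {
      "SplitJob" should "increment the error counter" in {
        JobTest[SplitJob.type]
          .args("--input=in.txt", "--errors=err.txt", "--successes=ok.txt")
          .input(TextIO("in.txt"), Seq("ERROR boom", "all good"))
          .output(TextIO("err.txt")) { c => c should haveSize(1) }
          .output(TextIO("ok.txt")) { c => c should haveSize(1) }
          // Counter assertion on the JobTest builder; verify this method
          // exists in your scio-test version.
          .counter(SplitJob.errorCounter) { v => v shouldBe 1 }
          .run()
      }
    }
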
1 vote · 1 answer · 161 views

PubSub watermark not advancing

I've written an Apache Beam job using Scio with the purpose of generating session ids for incoming data records and then enriching them in some manner before outputting them to BigQuery. Here's the code: val measurements = sc.customInput("ReadFromPubsub", PubsubIO .readMessagesWithAttributes() .wit...
simpaj
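
For the question above, a hedged sketch of one thing that commonly affects the Pub/Sub watermark: deriving it from an event-time attribute on the messages rather than the publish time. The subscription name and the "ts" attribute are placeholders.

    import com.spotify.scio._
    import com.spotify.scio.values.SCollection
    import org.apache.beam.sdk.io.gcp.pubsub.{PubsubIO, PubsubMessage}
    import org.joda.time.Duration

    object SessionIngest {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        val measurements: SCollection[PubsubMessage] = sc.customInput(
          "ReadFromPubsub",
          PubsubIO
            .readMessagesWithAttributes()
            .fromSubscription("projects/my-project/subscriptions/my-sub")
            // The watermark is estimated from this attribute; without it the
            // publish time is used, and an idle or lagging publisher can
            // hold the watermark back.
            .withTimestampAttribute("ts"))

        measurements
          .withFixedWindows(Duration.standardMinutes(5))
          .count
          .debug()
        sc.close()
      }
    }
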
1 vote · 1 answer · 60 views

What to pass as arguments when creating a ScioContext using ContextAndArgs in Spotify Scio

I am new to Scio and was trying to learn more about it. I saw some examples in the Scio source code and wanted to run them, but they ask for some arguments which I am unaware of and which are not specified in the docs. val (sc, args) = ContextAndArgs(cmdlineArgs) For this part of the code, I need to pass some argume...
Ankit Agrahari
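
For the question above, a minimal sketch of what ContextAndArgs expects: it splits the command line into Beam PipelineOptions (runner, project, and so on) and your own key=value arguments, so the names read from `args` below ("input", "output") are whatever the job itself chooses.

    import com.spotify.scio._

    object MyFirstScioJob {
      def main(cmdlineArgs: Array[String]): Unit = {
        // Pipeline options and user arguments are parsed from the same array.
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        sc.textFile(args("input"))            // supplied as --input=<path>
          .map(_.toUpperCase)
          .saveAsTextFile(args("output"))     // supplied as --output=<path>
        sc.close()
      }
    }

A local run would then be invoked with something like --input=README.md --output=/tmp/out, while a Dataflow run adds runner options such as --runner=DataflowRunner, --project=<gcp-project> and a GCS temp location; the exact runner flags depend on your Beam version.
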
1 vote · 0 answers · 44 views

Beam pipeline does not produce any output after GroupByKey with windowing, and I get a memory error

Purpose: I want to load streaming data, then add a key and then count elements by key. Problem: the Apache Beam Dataflow pipeline gets a memory error when I try to load and group-by-key a large dataset using the streaming approach (unbounded data), because it seems that data is accumulated in the group-by and it does...
Saeed Mohtasham
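
For the question above, a hedged sketch of counting per key on an unbounded source without buffering every element: countByValue (a CombinePerKey underneath) aggregates incrementally, whereas a raw group-by-key has to hold all values for a window in memory. The topic name and key function are placeholders.

    import com.spotify.scio._
    import org.joda.time.Duration

    object KeyedCountJob {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        sc.pubsubTopic[String]("projects/my-project/topics/events")
          .withFixedWindows(Duration.standardMinutes(1))  // bound state per window
          .map(record => record.take(4))                  // hypothetical key extraction
          .countByValue                                   // combiner-based, not GroupByKey
          .debug()
        sc.close()
      }
    }
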
1 vote · 1 answer · 178 views

Establishing singleton connection with Google Cloud Bigtable in Scala similar to Cassandra

I am trying to implement a real-time recommendation system using Google Cloud services. I've already built the engine using Kafka, Apache Storm and Cassandra, but I want to create the same engine in Scala using Cloud Pub/Sub, Cloud Dataflow and Cloud Bigtable. So far in Cassandra, since I read a...
billiout
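
For the question above, a sketch of the usual "one client per worker JVM" pattern: keep the client behind a @transient lazy val so the connection itself is never serialized with the closure and is created at most once per JVM. FakeClient stands in for a real Bigtable client; swap in whatever client you actually use.

    import com.spotify.scio._

    // Placeholder for e.g. a Bigtable data client.
    class FakeClient {
      def write(key: String, value: String): Unit = println(s"$key -> $value")
    }

    // Serializable holder; the wrapped instance is created lazily per JVM.
    class LazyResource[T](factory: () => T) extends Serializable {
      @transient private lazy val instance: T = factory()
      def get: T = instance
    }

    object Clients {
      val bigtable = new LazyResource(() => new FakeClient)
    }

    object SingletonClientJob {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        sc.parallelize(Seq("a" -> "1", "b" -> "2"))
          .map { case (k, v) =>
            Clients.bigtable.get.write(k, v)  // same client instance per worker JVM
            (k, v)
          }
        sc.close()
      }
    }

Note that scio-bigtable also ships dedicated Bigtable read/write support, which can remove the need to manage a connection by hand.
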
0 votes · 0 answers · 3 views

Dataflow TextIO.write issues with scaling

I created a simple Dataflow pipeline that reads byte arrays from Pub/Sub, windows them, and writes to a text file in GCS. I found that with lower-traffic topics this worked perfectly; however, when I ran it on a topic that does about 2.4 GB per minute, some problems started to arise. When kicking off the...
Tuubeee
1 vote · 1 answer · 328 views

“GC overhead limit exceeded” for long running streaming dataflow job

Running my streaming Dataflow job for a longer period of time tends to end up in a "GC overhead limit exceeded" error which brings the job to a halt. How can I best proceed to debug this? java.lang.OutOfMemoryError: GC overhead limit exceeded at com.google.cloud.dataflow.worker.repackaged.com.google...
Brodin
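
For the question above, a hedged sketch of one way to get more signal on worker OOMs on Dataflow: asking the worker to dump the heap when it runs out of memory. The two flags below are Dataflow debug options whose exact names should be verified against your SDK version; the bucket path is a placeholder.

    import com.spotify.scio._

    object DebuggableJob {
      def main(cmdlineArgs: Array[String]): Unit = {
        // Assumed Dataflow debug flags -- confirm they exist in your SDK.
        val debugArgs = Array(
          "--dumpHeapOnOOM=true",
          "--saveHeapDumpsToGcsPath=gs://my-bucket/heap-dumps")
        val (sc, args) = ContextAndArgs(cmdlineArgs ++ debugArgs)
        // ... build the streaming pipeline here ...
        sc.close()
      }
    }

The resulting dumps can then be inspected with a heap analyzer to see which transform is accumulating state.
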
1 vote · 1 answer · 75 views

Scio/Apache Beam: how to map grouped results

I have a simple pipeline that reads from Pub/Sub within a fixed window, parses messages and groups them by a specific property. However, if I map after the groupBy, my function doesn't seem to get executed. Am I missing something? sc.pubsubSubscription[String](s"projects/$project/subscriptions/$subscri...
Fabio Epifani
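
For the question above, a minimal sketch of window → group → map, with placeholder names. In streaming mode the mapped results only materialize when a window's trigger fires, which can look like the map "not being executed" if no trigger ever fires or nothing consumes the output.

    import com.spotify.scio._
    import org.joda.time.Duration

    object GroupAndMapJob {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        sc.pubsubSubscription[String]("projects/my-project/subscriptions/my-sub")
          .withFixedWindows(Duration.standardMinutes(1))
          .groupBy(_.take(4))                                      // hypothetical grouping property
          .map { case (key, values) => s"$key: ${values.size}" }   // runs once per key and window
          .debug()
        sc.close()
      }
    }
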
4 votes · 1 answer · 593 views

How does dataflow trigger AfterProcessingTime.pastFirstElementInPane() work?

In the Dataflow streaming world, my understanding when I say: Window.into(FixedWindows.of(Duration.standardHours(1))) .triggering(AfterProcessingTime.pastFirstElementInPane() .plusDelayOf(Duration.standardMinutes(15))) is that for a fixed window of one hour, the trigger waits or batches the elements...
Anil Muppalla
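
For the question above, a hedged sketch of the same trigger expressed through Scio's WindowOptions; when a trigger is set, Beam also requires an accumulation mode and an allowed lateness. Parameter names follow WindowOptions and may differ slightly between Scio versions; the topic name and durations are placeholders.

    import com.spotify.scio._
    import com.spotify.scio.values.WindowOptions
    import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, Repeatedly}
    import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
    import org.joda.time.Duration

    object TriggeredWindowJob {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        sc.pubsubTopic[String]("projects/my-project/topics/events")
          .withFixedWindows(
            Duration.standardHours(1),
            options = WindowOptions(
              // Each pane fires 15 minutes of processing time after its
              // first element, repeatedly for the lifetime of the window.
              trigger = Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()
                  .plusDelayOf(Duration.standardMinutes(15))),
              accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
              allowedLateness = Duration.standardMinutes(30)))
          .countByValue
          .debug()
        sc.close()
      }
    }
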
1 vote · 1 answer · 115 views

Maintaining a global state within Apache Beam

We have a Pub/Sub topic with events sinking into BigQuery (though the particular DB is almost irrelevant here). Events can come with new, unknown properties that eventually should end up as separate BigQuery columns. So, basically, I have two questions here: what is the right way for maintaining a global s...
chuwy
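
For the question above, a sketch of one way to share a small piece of read-only "global" state: broadcast the set of columns the table is already known to have as a singleton side input and flag events carrying properties that are not covered yet. Everything here except asSingletonSideInput/withSideInputs is illustrative.

    import com.spotify.scio.values.{SCollection, SideInput}

    def propertiesNeedingColumns(
      events: SCollection[Map[String, String]],
      knownColumns: SCollection[Set[String]]): SCollection[Set[String]] = {

      val known: SideInput[Set[String]] = knownColumns.asSingletonSideInput

      events
        .withSideInputs(known)
        .map { (event, ctx) => event.keySet.diff(ctx(known)) }  // property names with no column yet
        .toSCollection
        .filter(_.nonEmpty)
    }

In a streaming job the side-input collection also needs its own windowing/refresh strategy; this only shows the wiring.
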
2 votes · 1 answer · 352 views

Inconsistent behavior in the functioning of Dataflow templates?

When I create a Dataflow template, the characteristics of runtime parameters are not persisted in the template file. At runtime, if I try to pass a value for this parameter, I get a 400 error. I'm using Scio 0.3.2 and Scala 2.11.11 with Apache Beam 0.6. My parameters are the following: trait XmlImp...
Damien GOUYETTE
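
For the question above, a sketch of how Beam expects template runtime parameters to be declared: as ValueProvider getters/setters on the options interface, so their values can be supplied when the template is launched rather than baked in when it is built. The option name is a placeholder, and whether Scio 0.3.x propagates ValueProvider parameters end to end is version dependent.

    import org.apache.beam.sdk.options.{Description, PipelineOptions, ValueProvider}

    trait XmlImportOptions extends PipelineOptions {
      @Description("GCS pattern of the XML files to import")
      def getInputPattern: ValueProvider[String]
      def setInputPattern(value: ValueProvider[String]): Unit
    }
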
2 votes · 1 answer · 278 views

How to set up labels on Google Dataflow jobs using Scio?

I want to set up labels for Google Dataflow jobs for cost allocation purposes. Here is an example of working Java code: private DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptionsImpl.class); options.setLabels(ImmutableMap.of("key", "value")); setLabels...
Pradeep
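
For the question above, a sketch of the same thing the Java snippet does, but going through the Scio context's underlying options via optionsAs; the label keys and values are placeholders.

    import com.spotify.scio._
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions

    import scala.collection.JavaConverters._

    object LabeledJob {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        // Labels must be set before the job is submitted.
        sc.optionsAs[DataflowPipelineOptions]
          .setLabels(Map("team" -> "data", "purpose" -> "cost-allocation").asJava)
        // ... build the pipeline here ...
        sc.close()
      }
    }
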
4 votes · 1 answer · 605 views

Error writing a Pub/Sub stream to Cloud Storage using Dataflow

Using Scio from Spotify to write a job for Dataflow, following two examples (e.g. 1 and e.g. 2) to write a Pub/Sub stream to GCS, but I get the following error for the code below. Error: Exception in thread "main" java.lang.IllegalArgumentException: Write can only be applied to a Bounded PCollection. Code: obj...
DAR
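
For the question above, a hedged sketch of writing an unbounded Pub/Sub stream to GCS: a plain bounded write rejects unbounded input, so the stream is windowed and handed to Beam's TextIO with windowed writes and an explicit shard count via saveAsCustomOutput. The subscription, path and shard count are placeholders.

    import com.spotify.scio._
    import org.apache.beam.sdk.io.TextIO
    import org.joda.time.Duration

    object PubsubToGcs {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        sc.pubsubSubscription[String]("projects/my-project/subscriptions/my-sub")
          .withFixedWindows(Duration.standardMinutes(10))
          .saveAsCustomOutput(
            "WriteToGcs",
            TextIO.write()
              .to("gs://my-bucket/output/part")
              .withWindowedWrites()       // required for unbounded input
              .withNumShards(10))         // windowed writes need explicit sharding
        sc.close()
      }
    }
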
1 vote · 2 answers · 467 views

Scio: groupByKey doesn't work when using Pub/Sub as collection source

I changed the source of the WindowedWordCount example program from a text file to Cloud Pub/Sub as shown below. I published the shakespeare file's data to Pub/Sub, which did get fetched properly, but none of the transformations after .groupByKey seem to work. sc.pubsubSubscription[String](psSubscription) .withFixedW...
Kakaji
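
For the question above, not necessarily the fix but a hedged sketch of two prerequisites worth checking when a groupByKey on a Pub/Sub source appears to emit nothing: the job must actually run in streaming mode, and elements must be windowed before grouping, since the default trigger on the global window never fires for unbounded input. The subscription name is a placeholder.

    import com.spotify.scio._
    import org.apache.beam.sdk.options.StreamingOptions
    import org.joda.time.Duration

    object StreamingWordCount {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        sc.optionsAs[StreamingOptions].setStreaming(true)   // or pass --streaming=true
        sc.pubsubSubscription[String]("projects/my-project/subscriptions/shakespeare")
          .flatMap(_.split("\\W+").filter(_.nonEmpty))
          .withFixedWindows(Duration.standardMinutes(1))
          .countByValue                                     // grouped per window
          .debug()
        sc.close()
      }
    }
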
2 votes · 1 answer · 110 views

Does Scio type-safe BigQuery support WITH clauses?

val query = s"""#standardsql | WITH A AS (SELECT * FROM `prefix.andrews_test_table` LIMIT 1000) | select * from A""" @BigQueryType.fromQuery(query) class Test is consistently giving me :40: error: Missing query. This query runs fine in BigQuery once I turn off legacySql mode. Should I not expect EVE...
Andrew Cassidy
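
For the question above, a hedged sketch of @BigQueryType.fromQuery with a standard-SQL WITH clause, passing the query as a plain string literal so the macro can see it at compile time; the query itself is a stand-in, not the asker's table.

    import com.spotify.scio.bigquery.types.BigQueryType

    object TestQuery {
      // "#standardsql" on the first line switches the macro to standard SQL.
      @BigQueryType.fromQuery(
        "#standardsql\nWITH a AS (SELECT 1 AS x, 'foo' AS y) SELECT x, y FROM a")
      class Row
    }

The annotation requires the macro-annotation compiler setup described in the Scio docs and BigQuery access at compile time, since it dry-runs the query to obtain the schema.
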
3 votes · 1 answer · 1.5k views

Google Pub/Sub to Dataflow, avoid duplicates with Record ID

I'm trying to build a streaming Dataflow job which reads events from Pub/Sub and writes them into BigQuery. According to the documentation, Dataflow can detect duplicate message delivery if a record ID is used (see: https://cloud.google.com/dataflow/model/pubsub-io#using-record-ids). But even using th...
Vincent Spiewak
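
For the question above, a hedged sketch of wiring the record ID through Scio's Pub/Sub source: the publisher sets the same attribute on every message, and the reader names it via idAttribute so Dataflow can de-duplicate on it. The attribute names and topic are placeholders, and the parameter names should be checked against your Scio version.

    import com.spotify.scio._

    object DedupedIngest {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        sc.pubsubTopic[String](
            "projects/my-project/topics/events",
            idAttribute = "uniqueId",        // Dataflow de-duplicates on this attribute
            timestampAttribute = "ts")       // optional event-time attribute
          .debug()
        sc.close()
      }
    }

Per the linked documentation, the de-duplication is best effort over a bounded interval, so duplicates far apart in time can still appear.
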
2 votes · 2 answers · 479 views

Streaming data from CloudSql into Dataflow

We are currently exploring how we can process a large amount of data stored in a Google Cloud SQL database (MySQL) using Apache Beam/Google Dataflow. The database stores about 200 GB of data in a single table. We successfully read rows from the database using JdbcIO, but so far this was only possible if...
Scarysize
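
For the question above, a hedged sketch of reading Cloud SQL (MySQL) through Beam's JdbcIO wrapped in sc.customInput; the connection details, query and row mapping are placeholders, and JdbcIO is a bounded source, so this reads a snapshot rather than a continuous stream.

    import com.spotify.scio._
    import org.apache.beam.sdk.coders.StringUtf8Coder
    import org.apache.beam.sdk.io.jdbc.JdbcIO

    object CloudSqlRead {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)

        val read = JdbcIO
          .read[String]()
          .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration
              .create("com.mysql.jdbc.Driver", "jdbc:mysql://10.0.0.1/mydb")
              .withUsername("user")
              .withPassword("secret"))
          .withQuery("SELECT id, payload FROM events")
          .withRowMapper(new JdbcIO.RowMapper[String] {
            override def mapRow(rs: java.sql.ResultSet): String =
              s"${rs.getLong("id")},${rs.getString("payload")}"
          })
          .withCoder(StringUtf8Coder.of())

        sc.customInput("ReadFromCloudSql", read).debug()
        sc.close()
      }
    }

Scio also ships a scio-jdbc module with a jdbcSelect helper that wraps the same IO, if you prefer not to use customInput directly.
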