Questions tagged [apache-beam]

1 vote · 1 answer · 47 views

Accessing information (Metadata) in the file name & type in a Beam pipeline

My filename contains information that I need in my pipeline; for example, the identifier for my data points is part of the filename and not a field in the data. Every wind turbine generates a file such as turbine-loc-001-007.csv, and I need the loc data within the pipeline.
Reza Rokni
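
For illustration, a minimal Java sketch of one way to surface the filename inside the pipeline, assuming the files are matched with FileIO; the bucket path, the loc parsing, and the output type are placeholders, not the asker's code:

    import java.io.IOException;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // For every matched file, pull the "loc" token out of the file name
    // (e.g. turbine-loc-001-007.csv) and pair it with each line of the file contents.
    class ExtractLocFromFilenameFn extends DoFn<FileIO.ReadableFile, KV<String, String>> {
      @ProcessElement
      public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();
        String filename = file.getMetadata().resourceId().getFilename();
        String loc = filename.replace("turbine-", "").replace(".csv", "");  // e.g. "loc-001-007"
        for (String line : file.readFullyAsUTF8String().split("\n")) {
          c.output(KV.of(loc, line));
        }
      }
    }

    // Usage sketch (pattern is a placeholder):
    // p.apply(FileIO.match().filepattern("gs://my-bucket/turbine-loc-*.csv"))
    //  .apply(FileIO.readMatches())
    //  .apply(ParDo.of(new ExtractLocFromFilenameFn()));
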
1 vote · 2 answers · 79 views

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

I am using Python SDK for Apache Beam to run a feature extraction pipeline on Google DataFlow. I need to run multiple transformations all of which expect items to be grouped by key. Based on the answer to this question, DataFlow is unable to automatically spot and reuse repeated transformations like...
kpax
0 votes · 0 answers · 8 views

How can I use a custom CombineFn with Combine.GroupedValues?

I've written a CombineFn that has input KV and output KV. I'd like to use Combine.GroupedValues, and the source code would seem to suggest that this is possible, but I'm getting the following error: Incorrect number of type arguments for generic method groupedValues(SerializableFunction,V>) of type...
SJAndersonLA
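
For reference, a small sketch of the shapes Combine.groupedValues expects (names and types are illustrative): the CombineFn operates on the grouped values only, not on the KV pairs, and is applied to the Iterable produced by a preceding GroupByKey.

    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.Combine.CombineFn;

    // Illustrative CombineFn over the grouped *values* (Double), not over KV.
    class SumDoubles extends CombineFn<Double, Double, Double> {
      @Override public Double createAccumulator() { return 0.0; }
      @Override public Double addInput(Double acc, Double input) { return acc + input; }
      @Override public Double mergeAccumulators(Iterable<Double> accs) {
        double total = 0.0;
        for (Double a : accs) { total += a; }
        return total;
      }
      @Override public Double extractOutput(Double acc) { return acc; }
    }

    // Usage sketch: grouped is a PCollection<KV<String, Iterable<Double>>> from GroupByKey.
    // PCollection<KV<String, Double>> summed =
    //     grouped.apply(Combine.<String, Double, Double>groupedValues(new SumDoubles()));
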
1 vote · 1 answer · 472 views

Difference between gcloud auth activate-service-account --key-file and GOOGLE_APPLICATION_CREDENTIALS

I'm creating a shell script to handle automation for some of our workflows. This workflow includes accessing Google Cloud Storage buckets via Apache Beam GCP. I'm using a .json file with my service account; in which situations do I need to use gcloud auth activate-service-account --key-file myfile.json vs export...
spicyramen
1 vote · 1 answer · 66 views

How to extract Google PubSub publish time in Apache Beam

My goal is to be able to access the PubSub message publish time as recorded and set by Google PubSub in Apache Beam (Dataflow). PCollection pubsubMsg = pipeline.apply("Read Messages From PubSub", PubsubIO.readMessagesWithAttributes() .fromSubscription(pocOptions.getInputSubscription())); does not seem t...
Evgeny Minkevich
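
For orientation, a hedged sketch: when PubsubIO is used without a custom timestamp attribute, the element timestamp typically already carries the publish time, so a downstream DoFn can read it from the context (illustrative only, not the asker's code):

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.joda.time.Instant;

    // With PubsubIO and no withTimestampAttribute(...), the element timestamp is
    // normally the Pub/Sub publish time, so it can be read from the ProcessContext.
    class ExtractPublishTimeFn extends DoFn<PubsubMessage, String> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Instant publishTime = c.timestamp();  // publish time under the default configuration
        c.output(publishTime.toString());
      }
    }
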
0 votes · 0 answers · 4 views

Is there a way to set the timestamp on an unbounded source PCollection?

I want to set the timestamp on an unbounded PCollection of strings. In my solution, each line of the PCollection is a row of a CSV; one field of this row holds a timestamp, and the other fields hold values such as the number of clicks. I want to process the collection based on its own timestamp (event time), not the d...
Eduardo Camargo
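
As an illustration (in Java, with an assumed field position and format), an event-time timestamp can be attached per element with outputWithTimestamp:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.joda.time.Instant;

    // Parse the timestamp field out of each CSV line and re-emit the element with
    // that instant as its event-time timestamp. Field position is an assumption.
    class AssignEventTimeFn extends DoFn<String, String> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String[] fields = c.element().split(",");
        Instant eventTime = Instant.parse(fields[0]);  // assumes the first field is ISO-8601
        // Note: emitting timestamps earlier than the incoming element timestamp
        // requires overriding getAllowedTimestampSkew().
        c.outputWithTimestamp(c.element(), eventTime);
      }
    }
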
0 votes · 0 answers · 13 views

How do I convert a string to JSON and write it to a file?

I am working with a pipeline that reads in files and calculates their checksums. I have it to the point where I can print a KV pair containing the file name and the checksum. I am having some issues turning this into JSON. I would like these KV pairs to be output to a JSON file. How would I do thi...
dmc94
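
A minimal Java sketch of one approach, assuming the pairs form a PCollection<KV<String, String>>; the output prefix is a placeholder, and a proper JSON library could replace the hand-built string:

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class WriteChecksumsAsJson {
      // checksums: (file name, checksum) pairs produced earlier in the pipeline.
      static void writeAsJson(PCollection<KV<String, String>> checksums, String outputPrefix) {
        checksums
            .apply("ToJson", MapElements.into(TypeDescriptors.strings())
                .via(kv -> String.format("{\"file\":\"%s\",\"checksum\":\"%s\"}",
                    kv.getKey(), kv.getValue())))
            // Produces shards of newline-delimited JSON records.
            .apply("WriteJson", TextIO.write().to(outputPrefix).withSuffix(".json"));
      }
    }
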
1 vote · 0 answers · 305 views

Dynamically write to tables in Dataflow

Working on a pipeline in Dataflow. I need to write values to multiple BigQuery tables, where the desired table names are values in a PCollection. For example, with class Data as: public class Data{ public List tableName; public String id; public String value; } I will have a PCollection and I would like...
shockawave123
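
One way this is commonly sketched is with BigQueryIO's per-element destination function; the project, dataset, field name, and dispositions below are assumptions for illustration, not the asker's setup:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.ValueInSingleWindow;

    public class DynamicBigQueryWrite {
      // Assumes each TableRow carries its destination table name in a "tableName" field.
      static void writeDynamically(PCollection<TableRow> rows) {
        SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination> router =
            value -> new TableDestination(
                "my-project:my_dataset." + value.getValue().get("tableName"),
                "table routed per element");
        rows.apply(BigQueryIO.writeTableRows()
            .to(router)
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
      }
    }
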
1 vote · 1 answer · 552 views

Streaming write to GCS using Apache Beam per element

The current Beam pipeline is reading files as a stream using FileIO.matchAll().continuously(). This returns a PCollection. I want to write these files back with the same names to another GCS bucket, i.e. each PCollection element is one file metadata/readableFile which should be written back to another bucket after s...
user179156
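
One possible per-element copy, sketched under the assumption that the elements are FileIO.ReadableFile; the destination bucket is a placeholder:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.WritableByteChannel;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.io.fs.ResourceId;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.util.MimeTypes;

    // Read each matched file fully and write it to a destination bucket under the same name.
    class CopyToOtherBucketFn extends DoFn<FileIO.ReadableFile, String> {
      @ProcessElement
      public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();
        String name = file.getMetadata().resourceId().getFilename();
        ResourceId dest = FileSystems.matchNewResource("gs://destination-bucket/" + name, false);
        byte[] content = file.readFullyAsBytes();
        try (WritableByteChannel out = FileSystems.create(dest, MimeTypes.BINARY)) {
          out.write(ByteBuffer.wrap(content));
        }
        c.output(name);  // emit the copied file name for downstream bookkeeping
      }
    }
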
1 vote · 0 answers · 112 views

Apache Beam retryable errors appear not to work for certain errors

Hi, I am using BigQueryIO to write data to tables using the dynamic destinations API. In my BigQuery transform I set the InsertRetryPolicy to never(). For certain classes of BigQuery errors, the ones I've noticed being schema errors, the policy is used in BigQueryServicesImpl, from lines 751 to 771....
Luke De Feo
1 vote · 1 answer · 392 views

Caching in Apache Beam: Static variable vs Stateful processing

I'm adding caching functionality to one of the DoFns inside a Dataflow pipeline in Java. The DoFn currently uses a REST client to send a request to an endpoint (which charges based on the number of requests, and whose response changes roughly every hour) for every input element, and what I want to...
xiu shi
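
For context, a minimal static-variable version of such a cache; callEndpoint is a placeholder for the real REST client, and a production version would also need time-based expiry (the response changes roughly hourly). The stateful-DoFn alternative instead keeps state per key and window.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import org.apache.beam.sdk.transforms.DoFn;

    // One map per worker JVM, shared by all instances of this DoFn on that worker.
    class CachingLookupFn extends DoFn<String, String> {
      private static final ConcurrentMap<String, String> CACHE = new ConcurrentHashMap<>();

      @ProcessElement
      public void processElement(ProcessContext c) {
        String key = c.element();
        String value = CACHE.computeIfAbsent(key, CachingLookupFn::callEndpoint);
        c.output(value);
      }

      private static String callEndpoint(String key) {
        // Placeholder for the real REST call; a real cache would also expire entries.
        return "response-for-" + key;
      }
    }
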
1 vote · 1 answer · 113 views

Apache Beam / Dataflow - Delays between steps in pipeline

I'm using Apache Beam (with the Dataflow runner) to download binary files (weather forecasts, roughly 300 files), then decode them, then store the results as CSV and then load the CSV into BigQuery: | Download | ---> | Decode | ---...
benjamin.d
1 vote · 0 answers · 214 views

IllegalArgumentException: Unable to convert url (jar:file:/app.jar!/BOOT-INF/classes!/) to file

Built a Spring Boot 2.0.0.RC application with Google Dataflow and other services and deployed it with the following Maven command: mvn appengine:deploy. The build deploys successfully to App Engine and an instance is created. The problem is that at the App Engine dashboard the following error is displayed: java.lang.reflect....
Yunus Einsteinium
1 vote · 0 answers · 109 views

Using a custom WindowFn on Google Dataflow which closes after an element value

I extended a custom windowing function based on the Apache Beam Sessions window which starts a new user session if a maximum time gap has passed or if a specific user event occurred, following the suggestions in this question. Basically I added this IntervalWindow class: class StoppingIntervalWindow(I...
volé
1 vote · 2 answers · 169 views

Cloud Dataflow Console Dashboard not updating

We are using apache-beam Python 2.3 with Google Cloud Dataflow. For about two weeks the Cloud Dataflow dashboard at https://console.cloud.google.com/dataflow has been heavily delayed for us (about 30 mins - 1 h). This comes in two flavours: newly started jobs do not show up in the Overview, and the status l...
Ralf Hein
1 vote · 1 answer · 320 views

DataFlow Worker Runtime Error

I was running a Dataflow job (job ID: 2018-03-13_13_21_09-13427670439683219454). The job stopped running after 1 hour with the following error message: (f38a2b0cb8c28493): Workflow failed. Causes: [...] (6bf57c531051aa32): A work item was attempted 4 times without success. Each time the worker even...
yiqing_hua
1 vote · 1 answer · 996 views

Dataflow BigQuery to BigQuery

I am trying to create a Dataflow script that goes from BigQuery back to BigQuery. Our main table is massive and has multiple nested fields, which breaks the extract capabilities. I'd like to create a simple table that can be extracted containing all the relevant information. The SQL query 'Select * f...
SpasticCamel
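
For orientation, a small end-to-end sketch of a BigQuery-to-BigQuery pipeline on a newer Java SDK; the query, project, dataset, table names, and dispositions are placeholders:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class BigQueryToBigQuery {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // Read a flattened projection of the large nested table via a query.
        PCollection<TableRow> rows = p.apply("ReadFlattened",
            BigQueryIO.readTableRows()
                .fromQuery("SELECT id, value FROM `my-project.my_dataset.huge_table`")
                .usingStandardSql());
        // Write the simplified rows to a new, easily extractable table.
        rows.apply("WriteSimpleTable",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.simple_table")
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));
        p.run();
      }
    }
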
1 vote · 0 answers · 127 views

Failure in Dataflow that isn't present in DirectRunner: assert isinstance(value, bytes), (value, type(value)), related to processing GroupByValues

I'm reading from a JSON file, then grouping by a key, and then applying a function to the list in Dataflow. from sklearn.metrics import precision_recall_curve def calculate_precision_recall(p_label_collection, label, true_col, pred_col, score_col): label_collection = list(p_label_collection) # Explic...
Andrew Cassidy
1 vote · 0 answers · 133 views

Dataflow Spanner API throws IllegalStateException when value of array type column is NULL

I have a Dataflow job that writes to Spanner. One of my columns is of type ARRAY and is also nullable. When I try to write a null value for this column from my Dataflow job, I get the following error - Caused by: java.lang.IllegalStateException: Illegal call to getter of null value. at com.google....
Kakaji
1 vote · 0 answers · 249 views

'Timely and stateful' processing possible with Apache Beam Java using Dataflow runner?

I'm trying to evaluate using Apache Beam (Java SDK) (specifically for Google Cloud's Dataflow runner) for a somewhat complex state-machine workflow. Specifically I want to take advantage of stateful processing and timers as explained in this blogpost: https://beam.apache.org/blog/2017/08/28/timely-...
mleonard
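
For orientation, a minimal stateful-plus-timer DoFn sketch in the spirit of the linked post; the key type, durations, and flush output below are illustrative assumptions, and whether a particular runner version supports this is a separate question:

    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    // Counts events per key and flushes the count 10 minutes after the last element.
    class CountWithFlushFn extends DoFn<KV<String, String>, KV<String, Integer>> {
      @StateId("count")
      private final StateSpec<ValueState<Integer>> countSpec = StateSpecs.value(VarIntCoder.of());

      @TimerId("flush")
      private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

      @ProcessElement
      public void process(ProcessContext c,
          @StateId("count") ValueState<Integer> count,
          @TimerId("flush") Timer flush) {
        int current = count.read() == null ? 0 : count.read();
        count.write(current + 1);
        flush.offset(Duration.standardMinutes(10)).setRelative();
      }

      @OnTimer("flush")
      public void onFlush(OnTimerContext c, @StateId("count") ValueState<Integer> count) {
        Integer current = count.read();
        if (current != null) {
          c.output(KV.of("bucket", current));  // key retrieval omitted for brevity
        }
        count.clear();
      }
    }
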
1 vote · 0 answers · 92 views

Apache Beam DynamicAvroDestinations DefaultFilenamePolicy with String instead of ResourceId

According to the write example at https://beam.apache.org/documentation/sdks/javadoc/2.4.0/org/apache/beam/sdk/io/AvroIO.html, the following code should work: public FilenamePolicy getFilenamePolicy(Integer userId) { return DefaultFilenamePolicy.fromParams(new Params().withBaseFilename(baseDir + "/us...
bjorndv
1 vote · 0 answers · 242 views

Apache Beam CombineFn coder error

I have a CombineFn of the following type - CombineFn[KV[ItemVectorKey, java.lang.Iterable[ItemVectorValue]], java.util.Map[ItemVectorKey, Double], java.util.Map[ItemVectorKey, Double]] When I try to do Combine.globally(nc) I get the following error - java.lang.IllegalStateException: Unable to retur...
Kakaji
1 vote · 1 answer · 206 views

GCP Dataflow processElement does not go to the next ParDo function

I fill the map with good data (no null values) but I am not able to get to the next ParDo function. I tried to debug but didn't understand why it is happening. If anyone knows what I am doing wrong, let me know. I am including the three ParDo functions. Thanks. .apply("Parse XML CarrierManifest ", Par...
hari ram
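
A short, generic reminder of the mechanics involved (the parsing logic below is a placeholder): elements only reach the next ParDo if the previous DoFn actually emits them.

    import org.apache.beam.sdk.transforms.DoFn;

    // A processElement that returns without calling c.output(...) produces an empty
    // output PCollection, so downstream ParDos never see any elements.
    class ParseManifestFn extends DoFn<String, String> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String parsed = c.element().trim();  // stand-in for the real XML parsing
        if (!parsed.isEmpty()) {
          c.output(parsed);                  // without this call, nothing flows downstream
        }
      }
    }
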
1 vote · 0 answers · 137 views

Preprocessing data for TensorFlow: InvalidArgumentError

When I run my TensorFlow model I am receiving this error: InvalidArgumentError: Field 4 in record 0 is not a valid float: latency [[Node: DecodeCSV = DecodeCSV[OUT_TYPE=[DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT_STRING, DT...
Prof. Falken
0 votes · 0 answers · 7 views

BigQueryReader temporary dataset/table

I am using the Apache Beam Python SDK (apache-beam==2.11.0) to run a Dataflow job with BigQuerySource. Even though, from the code I checked, BigQueryReader should delete the temporary dataset after the query is done, I still see some temporary datasets in my GCP console. Could you help me look into it? A...
WangCHX
1 vote · 1 answer · 229 views

What does an Apache Beam Dataflow job do locally?

I'm having some issues with an Apache Beam Python SDK defined Dataflow job. If I step through my code it reaches the pipeline.run() step, which I assume means the execution graph was successfully defined. However, the job never registers on the Dataflow monitoring tool, which makes me think...
hamdog
1 vote · 0 answers · 95 views

Google Dataflow template BigQuery parameter value exception

I am trying to run a Google Cloud Dataflow job created using the 'PubSub to BigQuery' template, but no matter how I specify the BigQuery output table parameter, it always throws back the following exception on the job step that tries to write to BigQuery: java.lang.IllegalArgumentException: Table r...
Sebastián Bania
1 vote · 0 answers · 206 views

How does Apache Beam handle multiple windows?

We are using the Beam 2.2 Java SDK with the Google Dataflow runner. We receive batches of 4 hours' data in PubSub (which does not have timestamps), which we need to throttle and process the resulting windows one by one, as each of these requires some state information. We can assign timestamps to this data...
Koushik
1 vote · 1 answer · 262 views

Unable to connect to Redis Server when using apache beam sdk

So I have a Dataflow job doing p.apply(RedisIO.read() .withEndpoint(, 6379) .withAuth() .withTimeout(60000) .withKeyPattern("UID*")) .apply(ParDo.of(new Format())) .apply(TextIO.write().to(options.getOutput())); The Redis endpoint is public, authenticated with a password, with no firewall settings. Wh...
Shubham Sharma
1 vote · 1 answer · 212 views

Apache Beam Python: get distinct values from a list of dictionaries

I have a list of dictionaries in Dataflow. I have converted the columns into a dictionary of key-value pairs. Now I want to get the distinct values of each column. How can I do this in Dataflow without sending the entire data as a side input? I can achieve this with the data as a side input.
rachit
1 vote · 1 answer · 88 views

How can I obtain thread dumps for Dataflow Python processes?

When a Dataflow worker becomes stuck, it would be helpful to be able to get the thread dumps of the Python process that the worker is having trouble with. How would I identify and obtain the threadz dump of a stuck Python process?
Traeger Meyer
1 vote · 1 answer · 115 views

BeamSql with a dynamic query

We have an Apache Beam pipeline and need to run multiple BeamSql queries. The queries are not known at the pipeline construction time, but will be known when the pipeline is running. The queries will be updated periodically. Is this possible with BeamSql? We are using the Google Dataflow runner.
Anna Kasikova
1 vote · 0 answers · 49 views

Controlling parallelism in ParDo Transform while writing to DB

I am currently in the process of developing a pipeline using Apache Beam with Flink as an execution engine. As a part of the process I read data from Kafka and perform a bunch of transformations that involve joins, aggregations as well as lookups to an external DB. The idea is that we want to have h...
user2277873
1 vote · 2 answers · 467 views

Apache Beam: How To Simultaneously Create Many PCollections That Undergo Same PTransform?

Thanks in advance! [+] Issue: I have a lot of files on Google Cloud; for every file I have to: get the file, make a bunch of Google Cloud Storage API calls on each file to index it (e.g. name = blob.name, size = blob.size), unzip it, search for stuff in there, put the indexing information + stuff found...
1 vote · 1 answer · 361 views

Overwrite some partitions of a partitioned BigQuery table

I am currently trying to develop a Dataflow pipeline in order to replace some partitions of a partitioned table. I have a custom partition field which is a date. The input of my pipeline is a file with potentially different dates. I developed a pipeline: PipelineOptionsFactory.register(BigQueryOp...
John David
1 vote · 0 answers · 72 views

How to control accumulation mode when using sequential GroupByKey? Or other idiomatic approach?

My task is equivalent to: given a stream of letters (each letter is associated with an author), build a low-latency histogram of ngrams within a fixed window, assuming that a correct ngram consists of a sequence of letters from the same author. My pipeline looks like: | 'Extract and assign timestamp...
Dimitry
1 vote · 1 answer · 240 views

Apache Beam pipeline IllegalArgumentException - Timestamp skew?

I have an existing Beam pipeline that is handling the data ingested (from a Google PubSub topic) via two routes. The 'hot' path does some basic transformation and stores the data in Datastore, while the 'cold' path performs fixed hourly windowing for deeper analysis before storage. So far the pipeline has b...
jlyh
1 vote · 1 answer · 55 views

How to assign IP ranges to Google Dataflow instances?

I need to move data from Google BigQuery to Elasticsearch instances. For that I have created a Python Dataflow job to copy a BigQuery table to Elasticsearch. The problem is that an IP-based restriction was recently added to the Elasticsearch instances, so that they only allow specific IP ranges....
MJK
1 vote · 1 answer · 408 views

Google Cloud Dataflow - Python Streaming JSON to PubSub - Differences between DirectRunner and DataflowRunner

Trying to do something conceptually simple but banging my head against the wall. I am trying to create a streaming Dataflow job in Python which consumes JSON messages from a PubSub topic/subscription, performs some basic manipulation on each message (in this case, converting the temperature from C t...
Gummy
1 vote · 3 answers · 162 views

Run dataflow job from Compute Engine

I am following the quickstart link to run a Dataflow job: https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven. It works fine when I run the mvn command from Google Cloud Shell: mvn compile exec:java \ -Dexec.mainClass=com.example.WordCount \ -Dexec.args='--project= \ --stagingLocation=g...
