Questions tagged [google-cloud-dataflow]

1 vote · 2 answers · 967 views

Network default is not accessible to Dataflow Service account

Having issues starting a Dataflow job (2018-07-16_04_25_02-6605099454046602382) in a project without a local VPC network. I get this error: "Workflow failed. Causes: Network default is not accessible to Dataflow Service account". There is a shared VPC connected to the project with a network called...
Brodin
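
A hedged sketch of the usual fix for this class of error: when the project uses a Shared VPC (or has no network literally named "default"), the Dataflow job has to be pointed at the host project's subnetwork explicitly. Project, region, and subnetwork names below are placeholders.

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    // Full URL of the Shared VPC subnetwork in the host project (placeholder values).
    options.setSubnetwork(
        "https://www.googleapis.com/compute/v1/projects/HOST_PROJECT/regions/europe-west1/subnetworks/SHARED_SUBNET");

The equivalent command-line flag is --subnetwork=...; --network=... is the flag for a non-Shared-VPC network name.
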
0 votes · 0 answers · 5 views

Stop Streaming pipeline when no more messages to consume

I have a streaming Dataflow pipeline job which reads messages from a given Pub/Sub topic. I understand there is an auto-ack once the bundles are committed. How can I make the pipeline stop when there are no more messages to consume?
user3483129
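
Pub/Sub is an unbounded source, so a streaming pipeline has no built-in notion of "no more messages". A hedged sketch of a common workaround: drain or cancel the job from the submitting program after a deadline (or after some external signal that the backlog is empty).

    import org.apache.beam.sdk.PipelineResult;
    import org.joda.time.Duration;

    PipelineResult result = pipeline.run();
    // Block for a bounded amount of time; returns a non-terminal state (or null) on timeout.
    PipelineResult.State state = result.waitUntilFinish(Duration.standardMinutes(30));
    if (state == null || !state.isTerminal()) {
      try {
        result.cancel();   // hard stop; "gcloud dataflow jobs drain JOB_ID" is the graceful alternative
      } catch (java.io.IOException e) {
        throw new RuntimeException(e);
      }
    }
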
1 vote · 1 answer · 47 views

Accessing information (Metadata) in the file name & type in a Beam pipeline

My filename contains information that I need in my pipeline; for example, the identifier for my data points is part of the filename and not a field in the data. E.g. every wind turbine generates a file turbine-loc-001-007.csv, and I need the loc data within the pipeline.
Reza Rokni
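
A sketch of the usual pattern (assuming the Java SDK and files small enough to read whole): match the files with FileIO so each element carries its own metadata, then key every row by the filename before parsing. The bucket pattern is a placeholder.

    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;

    p.apply(FileIO.match().filepattern("gs://my-bucket/turbine-loc-*.csv"))
     .apply(FileIO.readMatches())
     .apply("KeyRowsByFilename", ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
       @ProcessElement
       public void process(ProcessContext c) throws java.io.IOException {
         FileIO.ReadableFile file = c.element();
         // e.g. "turbine-loc-001-007.csv" - the filename itself carries the loc identifier.
         String fileName = file.getMetadata().resourceId().getFilename();
         for (String line : file.readFullyAsUTF8String().split("\n")) {
           c.output(KV.of(fileName, line));
         }
       }
     }));
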
1 vote · 1 answer · 37 views

Duplicate Filename in GCP Storage

I am streaming files to GCP Storage (bucket). Doing so results in a frequent error (roughly 2 million times a day) claiming that my filename policy must generate a unique name. I've tried multiple ways of guaranteeing a unique name such as using currentTimeMillis, currentThread, encrypting the file...
Scicrazed
1 vote · 2 answers · 79 views

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

I am using Python SDK for Apache Beam to run a feature extraction pipeline on Google DataFlow. I need to run multiple transformations all of which expect items to be grouped by key. Based on the answer to this question, DataFlow is unable to automatically spot and reuse repeated transformations like...
kpax
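
The usual answer is to apply the GroupByKey once, keep a reference to the resulting PCollection, and branch each feature transform off it; Dataflow will not merge two separate GroupByKeys for you. A minimal sketch of the branching, written in the Java SDK for consistency with the other examples here (the feature DoFns are hypothetical); in Python the same shape is obtained by assigning the grouped PCollection to a variable and applying each transform to it.

    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<KV<String, Iterable<Double>>> grouped =
        keyedValues.apply("GroupOnce", GroupByKey.create());

    // Both branches read the same grouped PCollection, so the expensive shuffle happens only once.
    PCollection<KV<String, Double>> featureA =
        grouped.apply("FeatureA", ParDo.of(new ComputeFeatureA()));   // hypothetical DoFn
    PCollection<KV<String, Double>> featureB =
        grouped.apply("FeatureB", ParDo.of(new ComputeFeatureB()));   // hypothetical DoFn
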
0 votes · 0 answers · 8 views

How can I use a custom CombineFn with Combine.GroupedValues?

I've written a CombineFn that has input KV and output KV. I'd like to use Combine.GroupedValues, and the source code would seem to suggest that this is possible, but I'm getting the following error: Incorrect number of type arguments for generic method groupedValues(SerializableFunction<Iterable<V>,V>) of type...
SJAndersonLA
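
A hedged sketch of how the CombineFn overload is normally wired up in the Java SDK (MySumFn is a placeholder CombineFn): the explicit type arguments go on the groupedValues call itself, in the order <K, InputT, OutputT>, which selects the CombineFn overload rather than the SerializableFunction one.

    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<KV<String, Iterable<Integer>>> grouped =
        keyedInput.apply(GroupByKey.create());

    // Explicit type arguments: key type, input value type, output value type.
    PCollection<KV<String, Integer>> combined =
        grouped.apply(Combine.<String, Integer, Integer>groupedValues(new MySumFn()));   // MySumFn: hypothetical CombineFn
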
1 vote · 3 answers · 162 views

Get a massive csv file from GCS to BQ

I have a very large CSV file (let's say 1TB) that I need to get from GCS onto BQ. While BQ does have a CSV-loader, the CSV files that I have are pretty non-standard and don't end up loading properly to BQ without formatting it. Normally I would download the csv file onto a server to 'process it' and...
David542
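
One common approach (sketched below with placeholder names) is to skip BigQuery's loader entirely: read the raw lines with TextIO, repair and parse each line in a DoFn, and let BigQueryIO write the result. Dataflow shards the file across workers, so nothing is downloaded to a single server.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    p.apply(TextIO.read().from("gs://my-bucket/huge-export.csv"))          // placeholder path
     .apply("CleanAndParse", ParDo.of(new DoFn<String, TableRow>() {
       @ProcessElement
       public void process(ProcessContext c) {
         String[] cols = c.element().split(",");                          // real cleanup of the non-standard CSV goes here
         c.output(new TableRow().set("col_a", cols[0]).set("col_b", cols[1]));
       }
     }))
     .apply(BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.my_table")                            // placeholder table
         .withSchema(schema)                                              // a TableSchema built elsewhere
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
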
1 vote · 1 answer · 66 views

How to extract Google PubSub publish time in Apache Beam

My goal is to be able to access the Pub/Sub message publish time, as recorded and set by Google Pub/Sub, in Apache Beam (Dataflow). PCollection<PubsubMessage> pubsubMsg = pipeline.apply("Read Messages From PubSub", PubsubIO.readMessagesWithAttributes().fromSubscription(pocOptions.getInputSubscription())); Does not seem t...
Evgeny Minkevich
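
Worth noting (hedged, but this matches the PubsubIO documentation): unless withTimestampAttribute is set, PubsubIO assigns the Pub/Sub publish time as each element's timestamp, so it can be read back inside a downstream DoFn.

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.joda.time.Instant;

    pipeline
        .apply("Read Messages From PubSub",
            PubsubIO.readMessagesWithAttributes().fromSubscription(pocOptions.getInputSubscription()))
        .apply("Extract Publish Time", ParDo.of(new DoFn<PubsubMessage, String>() {
          @ProcessElement
          public void process(ProcessContext c) {
            Instant publishTime = c.timestamp();   // element timestamp = publish time when no timestamp attribute is configured
            c.output(publishTime.toString());
          }
        }));
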
0 votes · 0 answers · 4 views

Is there a way to set the timestamp in an unbounded source PCollection?

I want to set the timestamp on an unbounded PCollection of strings. In my solution each line of the PCollection is a CSV row; one field of this row has a timestamp, and the other fields hold things like the number of clicks. I want to process the collection based on its own timestamp (event time), not the d...
Eduardo Camargo
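
A minimal sketch (Java SDK, hypothetical column layout) of re-stamping elements with the event time carried in the CSV row via outputWithTimestamp. Note that moving timestamps earlier than what the source assigned may require overriding getAllowedTimestampSkew on the DoFn, or configuring a timestamp attribute on the source itself.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Instant;

    PCollection<String> stamped = csvLines.apply("AssignEventTime",
        ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void process(ProcessContext c) {
            String[] fields = c.element().split(",");
            Instant eventTime = Instant.parse(fields[0]);   // hypothetical: first column is an ISO-8601 timestamp
            c.outputWithTimestamp(c.element(), eventTime);
          }
        }));
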
0 votes · 0 answers · 13 views

How do I convert a string to JSON and write it to a file?

I am working with a pipeline that reads in files and calculates their checksum. I have it to the point where I can print a KV pair containing the file name and the checksum. I am having some issues in turning this into JSON. I would like these KV Pairs to be output to a JSON file. How would I do thi...
dmc94
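
A hedged sketch of one way to do it in the Java SDK: format each KV as a JSON object (a real JSON library such as Jackson or Gson would handle escaping more safely than String.format) and let TextIO write newline-delimited JSON. The input collection and output path are assumptions.

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    fileChecksums   // assumed PCollection<KV<String, String>> of file name -> checksum
        .apply("ToJson", MapElements.into(TypeDescriptors.strings())
            .via((KV<String, String> kv) ->
                String.format("{\"file\": \"%s\", \"checksum\": \"%s\"}", kv.getKey(), kv.getValue())))
        .apply(TextIO.write().to("gs://my-bucket/checksums").withSuffix(".json"));
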
1 vote · 0 answers · 319 views

Dataflow appears to be stuck. Works with DirectRunner, stuck with DataflowRunner

I have been trying to get the wordcount quickstart tutorial to work as per the instructions here (using Java SDK 2.2.0): https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven When I run the example pipeline locally, I get the expected results. Great! When I run the example pipelin...
Adam Taylor
1 vote · 1 answer · 552 views

Streaming write to GCS per element using Apache Beam

The current Beam pipeline is reading files as a stream using FileIO.matchAll().continuously(). This returns a PCollection. I want to write these files back with the same names to another GCS bucket, i.e. each element is one file metadata/ReadableFile which should be written back to another bucket after s...
user179156
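
A hedged sketch of one way to do the per-file copy (the destination bucket is a placeholder): since each element already identifies a matched file, a plain DoFn can copy it byte-for-byte to the other bucket with FileSystems, keeping the original filename, instead of going through FileIO.write.

    import java.util.Collections;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.io.fs.ResourceId;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    files.apply("CopyToOtherBucket", ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
      @ProcessElement
      public void process(ProcessContext c) throws java.io.IOException {
        ResourceId source = c.element().getMetadata().resourceId();
        ResourceId dest = FileSystems.matchNewResource(
            "gs://other-bucket/" + source.getFilename(), /* isDirectory */ false);
        FileSystems.copy(Collections.singletonList(source), Collections.singletonList(dest));
        c.output(dest.toString());   // emit the destination path for logging/debugging
      }
    }));
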
1 vote · 0 answers · 112 views

Apache Beam retryable errors appears to not work for certain errors

Hi, I am using BigQuery IO to write data to tables using the dynamic destinations API. In my BigQuery transform I set the InsertRetryPolicy to never(). For certain classes of BigQuery errors (the ones I've noticed are schema errors), the policy is used in BigQueryServicesImpl, from lines 751 to 771....
Luke De Feo
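
For reference, a hedged sketch of how the retry policy and the failed-row side output are normally wired together on the BigQuery sink (streaming inserts only); whether a given error class is treated as retryable is decided inside BigQueryServicesImpl, as the question notes.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
    import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
    import org.apache.beam.sdk.values.PCollection;

    WriteResult result = rows.apply(BigQueryIO.writeTableRows()
        .to(dynamicDestinations)                                  // the DynamicDestinations instance from the question
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry()));

    // Rows rejected by BigQuery (e.g. schema mismatches) come back here instead of being retried.
    PCollection<TableRow> failedRows = result.getFailedInserts();
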
1 vote · 1 answer · 392 views

Caching in Apache Beam: Static variable vs Stateful processing

I'm adding a caching functionality to one of the DoFns inside a Dataflow pipeline in Java. The DoFn is currently using a REST client to send a request to an endpoint (that charges based on the number of requests, and whose response will change roughly every hour) for every input element, and what I want to...
xiu shi
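
A hedged sketch of the instance/static-cache option (hypothetical names): a per-worker in-memory map initialized in @Setup, so repeated elements within the same worker JVM skip the REST call. Stateful processing with a @StateId would instead scope the cache per key and window; an expiring cache (e.g. Guava's CacheBuilder with expireAfterWrite) would handle the hourly-changing responses.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.beam.sdk.transforms.DoFn;

    class CachedLookupFn extends DoFn<String, String> {
      // Not serialized with the DoFn; rebuilt on each worker JVM in setup().
      private transient Map<String, String> cache;

      @Setup
      public void setup() {
        cache = new ConcurrentHashMap<>();
      }

      @ProcessElement
      public void process(ProcessContext c) {
        // Only call the billed endpoint on a cache miss.
        String response = cache.computeIfAbsent(c.element(), key -> callEndpoint(key));
        c.output(response);
      }

      private String callEndpoint(String key) {
        return "response-for-" + key;   // placeholder for the real REST request
      }
    }
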
1 vote · 1 answer · 113 views

Apache Beam / Dataflow - Delays between steps in pipeline

I'm using Apache Beam (with the Dataflow runner) to download binary files (weather forecasts, roughly 300 files), then decode them, then store the results as CSV, and then load the CSV into BigQuery: | Download | ---> | Decode | ---...
benjamin.d
1 vote · 1 answer · 54 views

Stackdriver Alerting is not available for dataflow jobs run in the region europe-west1

I'm trying to create an alerting policy for my google-cloud-dataflow job and noticed that it's available only when the job is running in the us-central1 region. The jobs running in europe-west1 are not listed in the Resources section in Stackdriver. How can I create alerting policies for my job...
mmziyad
1 vote · 0 answers · 109 views

Using a custom WindowFn on google dataflow which closes after an element value

I extended a custom windowing function based on the Apache Beam Sessions window, which starts a new user session if a maximum time gap has passed or if a specific user event occurred, following the suggestions in this question. Basically I added this IntervalWindow class: class StoppingIntervalWindow(I...
volé
1 vote · 2 answers · 169 views

Cloud Dataflow Console Dashboard not updating

We are using apache-beam Python 2.3 with Google Cloud Dataflow. For about 2 weeks the Cloud Dataflow dashboard at https://console.cloud.google.com/dataflow has been heavily delayed for us (about 30 mins - 1 h). This comes in 2 flavours: newly started jobs do not show up in the overview, and the status l...
Ralf Hein
1 vote · 1 answer · 320 views

DataFlow Worker Runtime Error

I was running a dataflow job (jobid: 2018-03-13_13_21_09-13427670439683219454). The job stopped running after 1 hour with the following error message: (f38a2b0cb8c28493): Workflow failed. Causes: [...] (6bf57c531051aa32): A work item was attempted 4 times without success. Each time the worker even...
yiqing_hua
1 vote · 1 answer · 996 views

Dataflow BigQuery to BigQuery

I am trying to create a dataflow script that goes from BigQuery back to BigQuery. Our main table is massive and has multiple nested fields which breaks the extract capabilities. I'd like to create a simple table that can be extracted containing all the relevant information. The SQL Query 'Select * f...
SpasticCamel
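
A hedged outline of the read-query-then-write shape this usually takes in the Java SDK (the table names, flattening SQL, and schema are placeholders):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

    p.apply("ReadFlattened", BigQueryIO.readTableRows()
            .fromQuery("SELECT id, name, nested.field AS nested_field FROM `my-project.my_dataset.huge_table`")
            .usingStandardSql())
     .apply("WriteSimpleTable", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.simple_table")
            .withSchema(simpleSchema)   // TableSchema describing the flattened columns, built elsewhere
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
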
1 vote · 0 answers · 146 views

SpannerIO Dataflow 2.3.0 stuck in CreateDataflowView

In the pipeline, I'm reading from Pub/Sub and I'm attempting to write to spanner. Writing to BigTable works, but spanner is a better option for my needs. In the image below I've expanded the relevant steps. In the top right corner is the 'Debug spanner' step, which shows the proper messages via LOG....
Flavius
1 vote · 0 answers · 127 views

Failure in DataFlow that isn't present in DirectRunner assert isinstance(value, bytes), (value, type(value)) related to processing GroupByValues

I'm reading from a Json file then grouping by a key and then applying a function to the list in Dataflow. from sklearn.metrics import precision_recall_curve def calculate_precision_recall(p_label_collection, label, true_col, pred_col, score_col): label_collection = list(p_label_collection) # Explic...
Andrew Cassidy
1 vote · 0 answers · 208 views

The tempLocation of Google Cloud Storage in self-executing DataFlow jar

I exported a Dataflow application to a runnable JAR file from Eclipse. I get the exception below when I run the command Desktop$ java -jar mariadevconn.jar. Why is the current working directory of the command line added to the Google Storage path? Thank you for your help. org.apache.beam.runners.dataflow.options...
Henry Chen
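
A hedged sketch of the usual cause: the temp/staging location is given as a relative path (or left to default), so the runner resolves it against the current working directory. Supplying absolute gs:// paths, either in code or on the command line, avoids that. Bucket names below are placeholders.

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    // Must be full gs:// URIs; a bare bucket or relative path gets the local working directory prepended.
    options.setGcpTempLocation("gs://my-bucket/temp");
    options.setStagingLocation("gs://my-bucket/staging");

The same can be passed on the command line, e.g. java -jar mariadevconn.jar --tempLocation=gs://my-bucket/temp --stagingLocation=gs://my-bucket/staging.
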
1 vote · 0 answers · 133 views

Dataflow Spanner API throws IllegalStateException when value of array type column is NULL

I have a dataflow job that writes to Spanner. One of my columns is of type ARRAY which is also nullable. When I try to write a null value for this column from my Dataflow job, I get the following error - Caused by: java.lang.IllegalStateException: Illegal call to getter of null value. at com.google....
Kakaji
1 vote · 0 answers · 249 views

'Timely and stateful' processing possible with Apache Beam Java using Dataflow runner?

I'm trying to evaluate using Apache Beam (Java SDK) (specifically for Google Cloud's Dataflow runner) for a somewhat complex state-machine workflow. Specifically I want to take advantage of stateful processing and timers as explained in this blogpost: https://beam.apache.org/blog/2017/08/28/timely-...
mleonard
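
State and event-time timers from that blog post are supported by the Java SDK on the Dataflow runner; a hedged minimal sketch of the shape, with hypothetical element types and a fixed ten-minute expiry:

    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.state.*;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    class SessionMachineFn extends DoFn<KV<String, String>, String> {
      @StateId("count")
      private final StateSpec<ValueState<Integer>> countSpec = StateSpecs.value(VarIntCoder.of());

      @TimerId("expiry")
      private final TimerSpec expirySpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

      @ProcessElement
      public void process(ProcessContext c,
                          @StateId("count") ValueState<Integer> count,
                          @TimerId("expiry") Timer expiry) {
        Integer current = count.read();
        count.write(current == null ? 1 : current + 1);
        // Fire ten minutes (event time) after the latest element seen for this key.
        expiry.set(c.timestamp().plus(Duration.standardMinutes(10)));
      }

      @OnTimer("expiry")
      public void onExpiry(OnTimerContext c, @StateId("count") ValueState<Integer> count) {
        c.output("final count: " + count.read());
        count.clear();
      }
    }
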
1 vote · 0 answers · 92 views

Apache Beam DynamicAvroDestinations DefaultFilenamePolicy with String instead of ResourceId

According to the write example on https://beam.apache.org/documentation/sdks/javadoc/2.4.0/org/apache/beam/sdk/io/AvroIO.html, the following code should work: public FilenamePolicy getFilenamePolicy(Integer userId) { return DefaultFilenamePolicy.fromParams(new Params().withBaseFilename(baseDir + "/us...
bjorndv
1 vote · 1 answer · 115 views

Dataflow abnormality in time to complete the job and total CPU hours with reshuffle via random key

I have created a dataflow which takes input from Datastore and performs a transform to convert it to a BigQuery TableRow. I am attaching a timestamp to each element in a transform. Then a window of one day is applied to the PCollection. The windowed output is written to a partition in a BigQuery table using...
Ashley Thomas
1 vote · 0 answers · 242 views

Apache Beam CombineFn coder error

I have a CombineFn of the following type - CombineFn[KV[ItemVectorKey, java.lang.Iterable[ItemVectorValue]], java.util.Map[ItemVectorKey, Double], java.util.Map[ItemVectorKey, Double]] When I try to do Combine.globally(nc) I get the following error - java.lang.IllegalStateException: Unable to retur...
Kakaji
1 vote · 1 answer · 206 views

GCP Dataflow process element does not go to the next ParDo function

I fill the map with good data (no null values) but I am not able to get to the next ParDo function. I tried to debug but didn't understand why it is happening. If anyone knows what I am doing wrong, let me know. I am including the three ParDo functions. Thanks. .apply("Parse XML CarrierManifest ", Par...
hari ram
1 vote · 2 answers · 235 views

Google Cloud - What products for time series data cleaning?

I have around 20TB of time series data stored in big query. The current pipeline I have is: raw data in big query => joins in big query to create more big query datasets => store them in buckets Then I download a subset of the files in the bucket: Work on interpolation/resampling of data using Pytho...
user1157751
1 vote · 1 answer · 229 views

What does an Apache Beam Dataflow job do locally?

I'm having some issues with an Apache Beam Python SDK-defined Dataflow pipeline. If I step through my code it reaches the pipeline.run() step, which I assume means the execution graph was successfully defined. However, the job never registers on the Dataflow monitoring tool, which from this makes me think...
hamdog
1 vote · 0 answers · 361 views

Dataflow: read from one BigQuery table and write to another

I'm trying to run queries to select data from one table in BigQuery and write the data to another BigQuery table. However, I'm running into a typing issue, where BigQueryIO (2.x) says it can't subsequently write the data because the type returned is TypedRead instead of PCollection. Here's my code:...
DrTomCatan
1 vote · 0 answers · 95 views

Google Dataflow template BigQuery parameter value exception

I am trying to run a google cloud dataflow job created using the 'pubsub to bigquery' template, but no matter how I specify the BigQuery output table parameter, it always throws back the following exception on the job step that tries to write to big query: java.lang.IllegalArgumentException: Table r...
Sebastián Bania
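
A hedged sketch of how the table parameter is normally declared for a template: it has to be a runtime ValueProvider, and its value must be a full table spec of the form project:dataset.table (or dataset.table), which is what the IllegalArgumentException about the table reference usually complains about. The option name below is a placeholder.

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.ValueProvider;

    public interface TemplateOptions extends DataflowPipelineOptions {
      @Description("BigQuery output table, as project:dataset.table")
      ValueProvider<String> getOutputTable();
      void setOutputTable(ValueProvider<String> value);
    }

    // At template-construction time the value is still unknown; BigQueryIO accepts the provider directly.
    rows.apply(BigQueryIO.writeTableRows().to(options.getOutputTable()));

When launching the template, the parameter would then be passed as, for example, outputTable=my-project:my_dataset.events.
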
1 vote · 1 answer · 57 views

Can we run multiple google cloud pipelines on the same data simultaneously?

Let's say I have a pipeline that loads data from a Storage file and loads it into a Big Query Table. Before this pipeline is completed, can I run another pipeline that does the same operation on the same file & table? My assumption is that it should fail. Also how would we be able to trigger the sec...
Sammy
1 vote · 0 answers · 206 views

How does Apache Beam handle multiple windows

We are using the Beam 2.2 Java SDK with the Google Dataflow runner. We receive batches of 4 hours' data in Pub/Sub (which does not have timestamps), which we need to throttle and process the resulting windows one by one, as each of these requires some state information. We can assign timestamps to this data...
Koushik
1 vote · 1 answer · 262 views

Unable to connect to Redis Server when using apache beam sdk

So I have a dataflow job doing p.apply(RedisIO.read() .withEndpoint(, 6379) .withAuth() .withTimeout(60000) .withKeyPattern("UID*")) .apply(ParDo.of(new Format())) .apply(TextIO.write().to(options.getOutput())); The Redis endpoint is public, authenticated with a password, and has no firewall settings. Wh...
Shubham Sharma
1 vote · 1 answer · 245 views

Dataflow can't read from BigQuery dataset in region “asia-northeast1”

I have a BigQuery dataset located in the new 'asia-northeast1' region. I'm trying to run a Dataflow templated pipeline (running in Australia region) to read a table from it. It chucks the following error, even though the dataset/table does indeed exist: Caused by: com.google.api.client.googleapis.js...
Graham Polley
1 vote · 1 answer · 212 views

Apache beam python get distinct values from list of dictionary

I have a list of dictionaries in Dataflow. I have converted the columns into a dictionary of key-value pairs. Now I want to get the distinct values of each column. How can I do this in Dataflow without sending the entire data as a side input? I can achieve this with the data as a side input.
rachit
1 vote · 1 answer · 339 views

Dataflow job failing

The Dataflow job is failing with the exception below; I am passing the staging, temp & output GCS bucket locations as parameters. Java code: final String[] used = Arrays.copyOf(args, args.length + 1); used[used.length - 1] = "--project=OVERWRITTEN"; final T options = PipelineOptionsFactory.fromArgs(used).withVal...
Mohammed Niaz
1 vote · 1 answer · 88 views

How can I obtain thread dumps for Dataflow Python processes?

When a Dataflow worker becomes stuck, it would be helpful to be able to get the thread dumps of the Python process that the worker is having trouble with. How would I identify and obtain the threadz dump of a stuck Python process?
Traeger Meyer
