Questions tagged [google-cloud-dataflow]

0 votes · 0 answers · 2 views

How to specify a startup script for a Java Dataflow job that will execute on each Dataflow VM worker

I have a requirement to modify ~/.ssh/authorized_keys to add custom public keys for login. I found this article, which is for Python jobs: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/ How can we do the same for a Java Dataflow job?
Sanjay Setia
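
As far as I know, the Java SDK has no direct equivalent of the Python setup.py hook linked above. A minimal sketch of one common workaround, assuming the Beam Java SDK 2.x: do the per-worker setup in a DoFn's @Setup method, which runs on whichever worker executes that DoFn (the key value and class name are placeholders, and Dataflow workers may restrict writes to the harness user's home directory).

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import org.apache.beam.sdk.transforms.DoFn;

    // Sketch only: @Setup runs once per DoFn instance on the worker that hosts it,
    // so this is a per-worker hook rather than a true VM startup script.
    public class AddAuthorizedKeyFn extends DoFn<String, String> {
      private static final String PUBLIC_KEY = "ssh-rsa AAAA... user@host"; // placeholder key

      @Setup
      public void setup() throws Exception {
        Path authorizedKeys =
            Paths.get(System.getProperty("user.home"), ".ssh", "authorized_keys");
        Files.createDirectories(authorizedKeys.getParent());
        Files.write(authorizedKeys, (PUBLIC_KEY + "\n").getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element()); // pass elements through unchanged
      }
    }
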
0 votes · 0 answers · 3 views

How to design a pipeline that reads from both Kafka and Pub/Sub in a single step using Apache Beam

I have a scenario where I need to run a single Apache Beam Java job in two different environments (one on Google Cloud Dataflow, which reads from Pub/Sub; the other on Flink, which reads from Kafka). How can I code an abstraction layer over PipelineOptions in order to achieve this?
code tutorial
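
A minimal sketch of one possible abstraction, assuming Beam 2.x with the beam-sdks-java-io-kafka and GCP IO modules on the classpath; the option names (sourceType, kafkaTopic, and so on) are invented for illustration:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Values;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class DualSourcePipeline {
      // Hypothetical options interface: each environment picks the source via a flag.
      public interface SourceOptions extends PipelineOptions {
        @Description("Either 'pubsub' or 'kafka'")
        @Default.String("pubsub")
        String getSourceType();
        void setSourceType(String value);

        String getPubsubSubscription();
        void setPubsubSubscription(String value);

        String getKafkaBootstrapServers();
        void setKafkaBootstrapServers(String value);

        String getKafkaTopic();
        void setKafkaTopic(String value);
      }

      public static void main(String[] args) {
        SourceOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(SourceOptions.class);
        Pipeline p = Pipeline.create(options);

        // Branch once at graph-construction time; the rest of the pipeline only
        // sees a PCollection<String>, regardless of where the data came from.
        PCollection<String> messages;
        if ("kafka".equals(options.getSourceType())) {
          messages = p.apply(KafkaIO.<String, String>read()
                  .withBootstrapServers(options.getKafkaBootstrapServers())
                  .withTopic(options.getKafkaTopic())
                  .withKeyDeserializer(StringDeserializer.class)
                  .withValueDeserializer(StringDeserializer.class)
                  .withoutMetadata())
              .apply(Values.create());
        } else {
          messages = p.apply(
              PubsubIO.readStrings().fromSubscription(options.getPubsubSubscription()));
        }

        // ... shared transforms on `messages` go here ...
        p.run();
      }
    }

Both branches converge on the same PCollection<String>, so the shared transforms stay identical across environments; only the runner and the source flag change.
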
0 votes · 0 answers · 3 views

Pipeline is not fully parallel in Cloud Dataflow using Go SDK

I have an Apache Beam implementation on the Go SDK as described below. The pipeline has three steps: textio.Read, CountLines, and ProcessLines. The ProcessLines step takes around 10 seconds. After I implemented the code below, I found that during pipeline execution, ea...
mdtp
0 votes · 0 answers · 3 views

Dataflow TextIO.write issues with scaling

I created a simple Dataflow pipeline that reads byte arrays from Pub/Sub, windows them, and writes to a text file in GCS. I found that with lower-traffic topics this worked perfectly; however, when I ran it on a topic that does about 2.4 GB per minute, some problems started to arise. When kicking off the...
Tuubeee
1 vote · 1 answer · 685 views

Pass TupleTag to DoFn method

I am trying to have two outputs from a DoFn, following the example in the Apache Beam programming guide. Basically, in the example you pass a TupleTag and then specify where to send the output. This works for me; the problem is that I call an external method inside the ParDo, and don't know how to pass...
Matias
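
A minimal sketch of one way to do this, assuming the Beam Java SDK 2.x: declare the TupleTags as shared constants and pass the ProcessContext to the external method, which then calls output(tag, value) itself (tag names and routing logic are placeholders).

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;
    import org.apache.beam.sdk.values.TupleTag;
    import org.apache.beam.sdk.values.TupleTagList;

    public class MultiOutputExample {
      // Tags declared once, visible both to the DoFn and to the helper method.
      static final TupleTag<String> VALID = new TupleTag<String>() {};
      static final TupleTag<String> INVALID = new TupleTag<String>() {};

      // The external method only needs the context; it decides which tag to emit to.
      static void route(DoFn<String, String>.ProcessContext c, String element) {
        if (element.isEmpty()) {
          c.output(INVALID, element);
        } else {
          c.output(VALID, element);
        }
      }

      static PCollectionTuple split(PCollection<String> input) {
        return input.apply(ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                route(c, c.element()); // delegate the tagged output to the helper
              }
            }).withOutputTags(VALID, TupleTagList.of(INVALID)));
      }
    }
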
1 vote · 2 answers · 296 views

Setting table expiration time using Dataflow BigQuery sink

Is there a way to set the expiration time on a BigQuery table when using Dataflow's BigQueryIO.Write sink? For example, I'd like something like this (see the last line): PCollection mainResults... mainResults.apply(BigQueryIO.Write .named('my-bq-table') .to('PROJECT:dataset.table') .withSchema(getBigQuery...
Graham Polley
1 vote · 1 answer · 259 views

Query cannot have any inequality filters

When running a Dataflow with the DirectPipelineRunner DatastoreV1.Query q = DatastoreV1.Query.newBuilder() .addKind(DatastoreV1.KindExpression.newBuilder().setName('KIND').build()) .setFilter(DatastoreHelper.makeFilter( DatastoreHelper.makeFilter( 'date', DatastoreV1.PropertyFilter.Operator.GREATER...
Jim Keener
1 vote · 1 answer · 190 views

Unable to find DEFAULT_INSTANCE when querying Datastore in Dataflow

I am basically just following the word count example to pull data from Datastore in Dataflow, like: DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder(); q.addKindBuilder().setName([entity kind]); DatastoreV1.Query query = q.build(); DatastoreIO.Source source = DatastoreIO.source() .withDataset([...
Xinwei Liu
1 vote · 1 answer · 429 views

Can Google Cloud Dataflow (Apache Beam) use ffmpeg to process video or image data?

Can a Dataflow process use ffmpeg to process video or images, and if so, what would a sample workflow look like?
mobcdi
1 vote · 1 answer · 165 views

Any easier way to flush an Aggregator to GCS at the end of a Google Dataflow pipeline

I am using an Aggregator to log some runtime stats of a Dataflow job, and I want to flush them to either GCS or BQ when the pipeline completes (or when each transform completes). Currently, beyond using the Aggregator, I am also creating a side output by utilizing a TupleTag at the same time and flush the...
Xinwei Liu
1 vote · 1 answer · 976 views

Dynamic table name when writing to BQ from Dataflow pipelines

As a follow-up question to the following question and answer: https://stackoverflow.com/questions/31156774/about-key-grouping-with-groupbykey I'd like to confirm with the Google Dataflow engineering team (@jkff) whether the third option proposed by Eugene is at all possible with Google Dataflow: 'have a ParDo t...
Alan
1 vote · 2 answers · 159 views

Kicking off Dataflow jobs from App Engine errors with a SecurityException on addShutdownHook for BigQueryTableInserter

I'm attempting to kick off a Dataflow job via an (already existing) App Engine application. The Dataflow job reads data generated by the GAE application and stored in Datastore, and writes the processed data to BigQuery. I'm receiving the following error: java.lang.SecurityException: Google App Engine does...
Jim Keener
1 vote · 3 answers · 238 views

How to add column names as a header when using Dataflow to export data to CSV

I am exporting some data to CSV with Dataflow, but beyond the data I want to add the column names as the first line of the output file, such as: col_name1, col_name2, col_name3, col_name4 ... data1.1, data1.2, data1.3, data1.4 ... data2.1 ... Is there any way to do this with the current API? (I searched around TextIO.W...
Xinwei Liu
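
A minimal sketch, assuming a Beam 2.x release where TextIO.Write has withHeader (the bucket path and column names are placeholders); with a single shard, the header appears exactly once at the top of the output file.

    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;

    public class CsvWithHeader {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(Create.of(Arrays.asList("data1.1,data1.2,data1.3,data1.4",
                                        "data2.1,data2.2,data2.3,data2.4")))
            // withHeader prepends the given line to every output shard, so with one
            // shard there is exactly one header row at the top of the file.
            .apply(TextIO.write()
                .to("gs://my-bucket/export/output")   // hypothetical output prefix
                .withSuffix(".csv")
                .withNumShards(1)
                .withHeader("col_name1,col_name2,col_name3,col_name4"));
        p.run();
      }
    }
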
1 vote · 2 answers · 179 views

Bad Request while running the WordCount example

I am new to Google Cloud Dataflow. I set everything up on my Windows machine, and when I tried to run the WordCount example using the command below: mvn compile exec:java -Dexec.mainClass=com.nyt.dataflowPoc.WordCount -Dexec.args='--project=cdfpoc-1264 --stagingLocation=gs://poc-location/staging --runner=Bloc...
user2846616
1 vote · 1 answer · 294 views

How to implement a custom file parser in Google Dataflow for a Google Cloud Storage file

I have a custom file format in Google Cloud Storage and I want to read it from Google Dataflow. I've implemented a Source and a Reader by subclassing FileBasedReader, but then I realized it didn't support reading from Google Cloud Storage (while FileBasedSink actually does...), so I'm not sure what'...
nembleton
1 vote · 1 answer · 231 views

Bigtable data trigger/watch

I am looking to get data from Bigtable into Dataflow in an unbounded fashion, such that the processing is triggered based on continuous inserts into the table. The document (https://cloud.google.com/bigtable/docs/dataflow-hbase) only talks about bounded reads using scans. Does the connector or big tabl...
Sushant Kumar
1 vote · 2 answers · 1k views

Google Cloud Dataflow error: “The Application Default Credentials are not available”

We have a Google Cloud Dataflow job, which writes to Bigtable (via HBase API). Unfortunately, it fails due to: java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CR...
ravwojdyla
1 vote · 1 answer · 238 views

Is there a Dataflow source ack API?

As per 'Shutdown and update job in Google Dataflow with PubSubIO + message guarantees', the Pub/Sub source for Dataflow does not ack messages until they have been reliably persisted. Is there any possibility of manual control over this? We're persisting rows as a side effect in a ParDo, as there is cur...
bfabry
1 vote · 1 answer · 65 views

What happens if I manually delete one of the VMs that Dataflow created?

I see the GCE instances that Dataflow created for my job in the GCE console. What happens if I delete them?
Malo Denielou
1 vote · 1 answer · 40 views

Run a single operation in Google Dataflow

How can I write a single row into BigQuery with the time the job started and which options (arguments) were used? In other words, how can I convert a List into a PCollection with one entry so it can be output as a single row in BigQuery?
cahen
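
A minimal sketch, assuming the Beam Java SDK 2.x and a pre-existing destination table (the table name and field names are placeholders): Create.of turns a single in-memory TableRow into a one-element PCollection.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;

    public class WriteJobMetadata {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Create.of turns one in-memory value into a one-element PCollection,
        // which BigQueryIO then writes as exactly one row.
        TableRow row = new TableRow()
            .set("started_at", System.currentTimeMillis())
            .set("job_name", options.getJobName());

        p.apply(Create.of(row).withCoder(TableRowJsonCoder.of()))
            .apply(BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.job_runs") // hypothetical table
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
        p.run();
      }
    }
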
1 vote · 1 answer · 670 views

How to produce one output PCollection from multiple input PCollections in Google Dataflow pipelines?

I'm trying to produce a PCollection (with elements of type C) from two inputs using a transform: one PCollection (with elements of type A), and the second one a PCollection. Basically, the transform takes into account the elements stored in the PCollection and does some computation with the elemen...
Saulo Ricci
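
One common approach is to materialize the second collection as a side input. A minimal sketch, assuming the Beam Java SDK 2.x and using String/Integer as stand-ins for types A and B:

    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    public class CombineTwoInputs {
      // Produces an output PCollection from a main input plus a side input
      // materialized from the second PCollection.
      static PCollection<String> join(PCollection<String> mainInput,
                                      PCollection<Integer> otherInput) {
        final PCollectionView<List<Integer>> sideView = otherInput.apply(View.asList());

        return mainInput.apply(ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                List<Integer> others = c.sideInput(sideView); // whole second collection
                c.output(c.element() + ":" + others.size());  // placeholder computation
              }
            }).withSideInputs(sideView));
      }
    }
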
1 vote · 1 answer · 185 views

Why is “group by” so much slower than my custom CombineFn or asList()?

Mistakenly, many months ago, I used the following logic to essentially implement the same functionality as the PCollectionView's asList() method: I assigned a dummy key of “000” to every element in my collection, and I then did a groupBy on this dummy key so that I would essentially get a list of all...
1 vote · 1 answer · 2k views

Inserting repeated records into BigQuery with the Java API/Dataflow - “Repeated field must be imported as a JSON array”

I have data with repeated key-value (String, String) record pairs as one of the fields in a BigQuery table schema. I am trying to add these repeated records using the approach here: http://sookocheff.com/post/bigquery/creating-a-big-query-table-java-api/ The table schema created for the repeated rec...
LeeW
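
One frequent cause of that error is populating the repeated RECORD field with a single nested TableRow instead of a List (a JSON array). A minimal sketch with hypothetical field names:

    import java.util.ArrayList;
    import java.util.List;
    import com.google.api.services.bigquery.model.TableRow;

    public class RepeatedRecordRow {
      // A repeated RECORD column is set to a List of TableRow values
      // (a JSON array), not a single nested TableRow.
      static TableRow buildRow() {
        List<TableRow> pairs = new ArrayList<>();
        pairs.add(new TableRow().set("key", "colour").set("value", "blue"));
        pairs.add(new TableRow().set("key", "size").set("value", "large"));
        return new TableRow()
            .set("id", "item-1")
            .set("attributes", pairs); // hypothetical repeated RECORD field name
      }
    }
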
1 vote · 1 answer · 73 views

Dataflow process has not recovered from failure

Following the recent incidents where a whole AZ could have been lost to an outage, I would like to better understand the Dataflow failover procedures. When I manually deleted the worker nodes for a Dataflow job (streaming, Pub/Sub to BigQuery), they were successfully recreated/restarted, yet the...
1 vote · 1 answer · 184 views

Creating/Writing to Sharded (Dated) BigQuery table via Google Cloud Dataflow

Is there an easy-to-follow example of how to configure a streaming-mode Dataflow pipeline to write each window into a separate BigQuery table (and create one, if necessary)? I.e. table_20160701, table_20160702, etc.
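
A minimal sketch using the Beam 2.x BigQueryIO API, where to() can take a per-element function (the older Dataflow SDK had a similar window-based overload); the project, dataset, and schema are placeholders:

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Collections;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
    import org.apache.beam.sdk.values.ValueInSingleWindow;
    import org.joda.time.format.DateTimeFormat;
    import org.joda.time.format.DateTimeFormatter;

    public class DatedTables {
      private static final DateTimeFormatter DAY = DateTimeFormat.forPattern("yyyyMMdd");

      // to() accepts a function from the windowed element to a TableDestination,
      // so each window's rows land in a table named after the window's start day.
      static BigQueryIO.Write<TableRow> datedWrite() {
        TableSchema schema = new TableSchema().setFields(Collections.singletonList(
            new TableFieldSchema().setName("value").setType("STRING"))); // placeholder schema
        return BigQueryIO.writeTableRows()
            .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
              @Override
              public TableDestination apply(ValueInSingleWindow<TableRow> input) {
                IntervalWindow window = (IntervalWindow) input.getWindow();
                String day = DAY.print(window.start());
                return new TableDestination("my-project:my_dataset.table_" + day, null);
              }
            })
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);
      }
    }
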
1 vote · 1 answer · 354 views

How do I resolve java.lang.NoSuchMethodError: com.google.api.services.dataflow.model.Environment.setSdkPipelineOptions with Google Cloud Dataflow?

I copied the MinimalWordCount example. I also copied all the dependencies from pom.xml. When I run it with mvn compile exec:java -Dexec.mainClass=com.example.MyExample it compiles, but I get java.lang.NoSuchMethodError: com.google.api.services.dataflow.model.Environment.setSdkPipelineOptions with th...
Tim Swast
1 vote · 1 answer · 226 views

Limiting the number of values per key

Currently we have a Dataflow process with a GroupByKey, but the ParDo after the group-by is getting too many values per key, and we wanted to know if there is a good solution for this. From what I can tell, there is no way to set a maximum number of values per window. Right now we are exploring...
Narek
1 vote · 1 answer · 515 views

Delete range of rows using a prefix key

I'm using the package 'org.apache.hadoop.hbase.client' with Dataflow to manage Google's Bigtable data. Example of deleting a row: key = 'PROV|CLI|800|20160714|8|30302.30301|ES'; byte[] byteKey = Bytes.toBytes(key); Delete delete = new Delete(byteKey); This works fine, but I need a way to delete all rows...
Javier Roda
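
The HBase client (and the Bigtable HBase adapter) has no single prefix-delete call as far as I know, so one common approach is to scan the prefix range and issue a batch of deletes. A minimal sketch assuming the HBase 1.x client API, with connection setup omitted:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrefixDelete {
      // Scan the rows whose key starts with the prefix, then delete them in a batch.
      static void deleteByPrefix(Connection connection, String tableName, String prefix)
          throws IOException {
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
          Scan scan = new Scan();
          scan.setRowPrefixFilter(Bytes.toBytes(prefix)); // e.g. "PROV|CLI|800|"
          scan.setFilter(new KeyOnlyFilter());            // only the row keys are needed
          List<Delete> deletes = new ArrayList<>();
          try (ResultScanner results = table.getScanner(scan)) {
            for (Result result : results) {
              deletes.add(new Delete(result.getRow()));
            }
          }
          table.delete(deletes);
        }
      }
    }
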
1 vote · 1 answer · 560 views

Unable to write nullable integer values to BigQuery using Cloud Dataflow

I am trying to write to a BigQuery table using Cloud Dataflow. This BigQuery table has an integer column which is set to nullable. For null values, it gives the following error: Could not convert value to integer. Field: ITM_QT; Value: But when I converted the data type of the same column to String, it i...
abhishek jha
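
A minimal sketch of the usual fix when rows are built as TableRow objects: leave the NULLABLE INTEGER field unset (or set it to null) rather than writing an empty string (the other field names are placeholders):

    import com.google.api.services.bigquery.model.TableRow;

    public class NullableIntRow {
      // For a NULLABLE INTEGER column, skip the field entirely or set it to null;
      // an empty string "" fails with "Could not convert value to integer".
      static TableRow toRow(String id, Integer itmQt) {
        TableRow row = new TableRow().set("ID", id);
        if (itmQt != null) {
          row.set("ITM_QT", itmQt);
        }
        return row;
      }
    }
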
1 vote · 1 answer · 95 views

Dataflow Display Data missing from composite transform

I'm trying the new display data functionality in Dataflow to make additional details show up in the Google Cloud Dataflow UI. However, the display data for custom PTransforms doesn't show up. In my Dataflow pipeline, I have a transform like: Pipeline p = // .. p.apply(new PTransform() { @Override p...
Scott Wegner
1 vote · 1 answer · 470 views

Is skipping leading rows when reading files in Google Dataflow possible?

I want to skip leading rows when reading files while using Google Dataflow. Is that feature available in the latest version? The files are kept in Google Storage. I will be writing these files to BigQuery. The bq load command has the option --skip_leading_rows. This option skips the leading rows when rea...
abhishek jha
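
TextIO has no direct --skip_leading_rows equivalent as far as I know; a common workaround is to drop the header after reading by filtering on its known content. A minimal sketch assuming the Beam Java SDK 2.x, with a hypothetical header and path:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.values.PCollection;

    public class SkipHeader {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Drop the header line after reading by filtering on its known content.
        final String header = "col_name1,col_name2,col_name3,col_name4"; // hypothetical header
        PCollection<String> rows =
            p.apply(TextIO.read().from("gs://my-bucket/input/*.csv"))
             .apply(Filter.by((String line) -> !line.equals(header)));

        // ... parse `rows` and write to BigQuery ...
        p.run();
      }
    }
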
1 vote · 1 answer · 303 views

Read files from a local computer and write to BigQuery or Google Storage using Google Dataflow

Is there a way to read CSV files from a local computer and write them to BigQuery or Cloud Storage using Google Dataflow? If so, which runner should be used? All the Google Dataflow examples read from the cloud and write to either Cloud Storage or BigQuery. I use DirectPipelineRunner for readi...
abhishek jha
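
A minimal sketch assuming the Beam Java SDK 2.x, where the local runner is called DirectRunner (the older SDK's DirectPipelineRunner behaves similarly): local paths work for reading, and the output can still be a gs:// location if credentials are configured.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class LocalToGcs {
      public static void main(String[] args) {
        // With the direct (local) runner, TextIO can read from the local disk;
        // the destination may be a gs:// path if credentials are available.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(TextIO.read().from("/tmp/input/*.csv"))           // local file pattern
         .apply(TextIO.write().to("gs://my-bucket/output/part")); // hypothetical bucket
        p.run().waitUntilFinish();
      }
    }
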
1 vote · 1 answer · 267 views

Google Dataflow not reading more than 3 input compressed files at once when there are multiple sources

Background: I have 30 days of data in 30 separate compressed files stored in Google Storage. I have to write them to 30 different partitions of the same BigQuery table. Each compressed file was around 750 MB. I ran 2 experiments on the same data set on Google Dataflow today. Experiment 1...
abhishek jha
1 vote · 1 answer · 349 views

Unable to catch exceptions when writing to BigQuery using Google Dataflow

I am trying to write to BigQuery using Google Dataflow, but the data is corrupt, so the value I am trying to write into a column of a BigQuery table does not match the data type of that column. The job logs show errors like the one given below: BigQuery job 'dataflow_job_615455482681...
abhishek jha
1 vote · 1 answer · 119 views

Transform data to Pub/Sub events

I have a Dataflow pipeline that collects user data like navigation, purchases, CRUD actions, etc. I have a requirement to be able to identify patterns in real time and then dispatch Pub/Sub events that other services can listen to in order to provide the user real-time tips, offers, or promotions. I'm...
chchrist
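
A minimal sketch of the dispatch step, assuming the Beam Java SDK 2.x with the GCP IO module; the pattern check, event payload, and topic name are placeholders:

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    public class DispatchEvents {
      // Detect a pattern in a ParDo, emit only the matching events, and publish
      // them to a topic that other services subscribe to.
      static void dispatch(PCollection<String> userActions) {
        userActions
            .apply(ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                if (c.element().contains("purchase")) {   // placeholder pattern check
                  c.output("{\"event\":\"offer\",\"user_action\":\"" + c.element() + "\"}");
                }
              }
            }))
            .apply(PubsubIO.writeStrings().to("projects/my-project/topics/user-tips"));
      }
    }
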
1 vote · 1 answer · 671 views

How to count the total number of rows in a file using Google Dataflow

I would like to know if there is a way to find out the total number of rows in a file using Google Dataflow. Any code sample or pointer would be a great help. Basically, I have a method such as int getCount(String fileName) {}, so the above method will return the total count of rows, and its implementation will be Dataflow c...
Programmer
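
A minimal sketch assuming the Beam Java SDK 2.x: Count.globally() yields the line count, but note that it arrives as a PCollection at run time, not as a plain int returned from a method.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.values.PCollection;

    public class LineCount {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Count.globally collapses the whole PCollection into a single Long,
        // i.e. the number of lines TextIO read from the file.
        PCollection<Long> total =
            p.apply(TextIO.read().from("gs://my-bucket/input/file.txt")) // hypothetical path
             .apply(Count.<String>globally());

        // `total` only exists at run time, so write it out (e.g. via TextIO or
        // BigQueryIO) rather than returning it from a method.
        p.run().waitUntilFinish();
      }
    }
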
1 vote · 1 answer · 698 views

Dataflow Batch Job Stuck in GroupByKey.create()

I have a batch Dataflow pipeline that I have run without issue many times on a subset of our data, with approximately 150k rows of input. I have now attempted to run it on our full dataset of approximately 300M rows. A crucial part of the pipeline performs a GroupByKey of the input records, resulting...
coryfoo
1 vote · 1 answer · 423 views

Dataflow + Datastore = DatastoreException: I/O error

I'm trying to write to Datastore from Dataflow using com.google.cloud.datastore. My code looks like this (inspired by the examples in [1]): public void processElement(ProcessContext c) { LocalDatastoreHelper HELPER = LocalDatastoreHelper.create(1.0); Datastore datastore = HELPER.options().toBuilder(...
JoseKilo
1 vote · 1 answer · 78 views

How do I add Java dependencies to a Google Dataflow project?

My Java project has lots of jars from third-party libraries as well as my own code. How do I deploy these so that Google Cloud Dataflow can use them? There is documentation on how to do this in Python, but not Java.
Joshua Fox
1 vote · 2 answers · 75 views

Can SlidingWindows have half-second periods in Python Apache Beam?

SlidingWindows seems to be rounding my periods. If I set the period to 1.5 and the size to 3.5 seconds, it creates 3.5-second windows every 1.0 second. I expected to have 3.5-second windows every 1.5 seconds. Is it possible to have a period that is a fraction of a second?
agsolid
