Questions tagged [apache-beam]

1 vote · 1 answer · 230 views

Conditional statement Python Apache Beam pipeline

Current situation: The purpose of this pipeline is to read the payload with geodata from Pub/Sub; this data is then transformed and analyzed, and finally the pipeline returns whether a condition is true or false. with beam.Pipeline(options=pipeline_options) as p: raw_data = (p | 'Read from PubSub' >> beam.io.ReadFromPu...
IoT user
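A minimal Python sketch of one way to branch on a condition like the one described above. The topic name is a placeholder and is_inside_area() is a hypothetical stand-in for the real geodata analysis; beam.Partition splits the stream into a "true" and a "false" branch.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def is_inside_area(record):
    # Hypothetical condition; replace with the real geodata analysis.
    return record.get('lat', 0) > 0

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    raw_data = (p
                | 'Read from PubSub' >> beam.io.ReadFromPubSub(
                    topic='projects/my-project/topics/geodata')
                | 'Parse JSON' >> beam.Map(json.loads))

    # Partition routes each element to branch 0 (condition true) or 1 (false).
    matched, unmatched = (raw_data
                          | 'Split on condition' >> beam.Partition(
                              lambda record, _: 0 if is_inside_area(record) else 1, 2))

    matched | 'Handle true' >> beam.Map(lambda r: ('inside', r))
    unmatched | 'Handle false' >> beam.Map(lambda r: ('outside', r))
```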
3 votes · 0 answers · 39 views

Including BEAM preprocessing graph in Keras models at serving

Short Question: Since TensorFlow is moving towards Keras and away from Estimators, how can we incorporate our preprocessing pipelines, e.g. using tf.Transform and build_serving_input_fn() (which are used for Estimators), into our tf.keras models? From my understanding, the only way to incorporate thi...
GRS
0 votes · 0 answers · 2 views

How to specify a startup script in a Dataflow job for Java that will execute on each Dataflow VM worker

I have a requirement to modify ~/.ssh/authorized_keys to add custom public keys for login. I found this article, which is for Python jobs: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/ How can we do the same for a Java Dataflow job?
Sanjay Setia
1 vote · 1 answer · 30 views

Pipeline fails when adding ReadAllFromText transform

I am trying to run a very simple program in Apache Beam to see how it works. import apache_beam as beam class Split(beam.DoFn): def process(self, element): return element with beam.Pipeline() as p: rows = (p | beam.io.ReadAllFromText( 'input.csv') | beam.ParDo(Split())) While running this, I get...
Raheel Khan
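If the failure is what it looks like, the mismatch may simply be that ReadAllFromText consumes a PCollection of file patterns, whereas ReadFromText is the root transform that takes a path directly. A small sketch contrasting the two, assuming a local input.csv:

```python
import apache_beam as beam

class Split(beam.DoFn):
    def process(self, element):
        # Yield the parsed fields rather than returning the raw line.
        yield element.split(',')

with beam.Pipeline() as p:
    # ReadFromText is a root transform: it takes the file pattern directly.
    rows = (p
            | 'Read CSV' >> beam.io.ReadFromText('input.csv')
            | 'Split rows' >> beam.ParDo(Split()))

    # ReadAllFromText instead expects a PCollection of file patterns,
    # so it must follow something that produces those names.
    rows_all = (p
                | 'File names' >> beam.Create(['input.csv'])
                | 'Read listed files' >> beam.io.ReadAllFromText()
                | 'Split rows (all)' >> beam.ParDo(Split()))
```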
0 votes · 0 answers · 3 views

How to design a pipeline which reads from both Kafka and Pub/Sub in a single step using Apache Beam

I have a scenario where I need to run a single Apache Beam Java job in two different environments (one on Google Cloud, which reads from Pub/Sub; the other on Flink, which reads from Kafka). How can I code the abstraction layer over PipelineOptions in order to achieve this?
code tutorial
0 votes · 0 answers · 3 views

Pipeline is not fully parallel in Cloud Dataflow using Go SDK

I have an Apache Beam implementation on the Go SDK as described below. The pipeline has 3 steps: textio.Read, CountLines, and ProcessLines. The ProcessLines step takes around 10 seconds. After I implemented the code below, I found that during pipeline execution, ea...
mdtp
0 votes · 0 answers · 3 views

Dataflow TextIO.write issues with scaling

I created a simple Dataflow pipeline that reads byte arrays from PubSub, windows them, and writes to a text file in GCS. I found that with lower-traffic topics this worked perfectly; however, when I ran it on a topic that does about 2.4 GB per minute, some problems started to arise. When kicking off the...
Tuubeee
1 vote · 1 answer · 685 views

Pass TupleTag to DoFn method

I am trying to have two outputs from a DoFn method, following the example in the Apache Beam programming guide. Basically, in the example you pass a TupleTag and then specify where to send the output; this works for me. The problem is that I call an external method inside the ParDo, and don't know how to pass...
Matias
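The question is about the Java SDK, where the TupleTag objects have to be visible both to the DoFn and to the external method. As a rough Python-SDK analogue (an assumption, not the asker's code), tags are plain strings, so they can live in shared constants that both the DoFn and a helper function reference:

```python
import apache_beam as beam
from apache_beam import pvalue

# Shared tag names, visible to the DoFn and to any external helper.
VALID_TAG = 'valid'
INVALID_TAG = 'invalid'

def classify(element):
    # External helper: decides which tagged output an element belongs to.
    return VALID_TAG if element >= 0 else INVALID_TAG

class SplitFn(beam.DoFn):
    def process(self, element):
        yield pvalue.TaggedOutput(classify(element), element)

with beam.Pipeline() as p:
    results = (p
               | beam.Create([3, -1, 7, -5])
               | beam.ParDo(SplitFn()).with_outputs(VALID_TAG, INVALID_TAG))

    results[VALID_TAG] | 'Valid branch' >> beam.Map(lambda x: ('valid', x))
    results[INVALID_TAG] | 'Invalid branch' >> beam.Map(lambda x: ('invalid', x))
```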
1 vote · 2 answers · 680 views

Connect Google Cloud SQL Postgres instance from Beam pipeline

I want to connect to a Google Cloud SQL Postgres instance from an Apache Beam pipeline running on Google Dataflow, using the Python SDK. I am not able to find proper documentation for this. In the Cloud SQL how-to guide I don't see any documentation for Dataflow. https://cloud.google.com/sql/docs/...
s28
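There is no dedicated Cloud SQL connector in the Beam Python SDK referenced here, so a common workaround is a plain database driver inside a DoFn, opening the connection once per DoFn instance rather than per element. The sketch below assumes psycopg2, a reachable host/IP, and placeholder credentials and table; setup()/teardown() require a reasonably recent SDK (older ones would use start_bundle/finish_bundle).

```python
import apache_beam as beam

class WriteToPostgres(beam.DoFn):
    """Writes elements to a Cloud SQL Postgres instance via psycopg2."""

    def __init__(self, host, dbname, user, password):
        self.host = host
        self.dbname = dbname
        self.user = user
        self.password = password
        self.conn = None

    def setup(self):
        # One connection per DoFn instance, not one per element.
        import psycopg2  # must be shipped with the job (requirements.txt / setup.py)
        self.conn = psycopg2.connect(
            host=self.host, dbname=self.dbname,
            user=self.user, password=self.password)

    def process(self, element):
        with self.conn.cursor() as cur:
            cur.execute('INSERT INTO events (payload) VALUES (%s)', (element,))
        self.conn.commit()

    def teardown(self):
        if self.conn is not None:
            self.conn.close()
```

Usage would be something like `records | beam.ParDo(WriteToPostgres(host, db, user, pw))`; on Dataflow the instance is usually reached over its IP or via the Cloud SQL proxy.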
1 vote · 1 answer · 300 views

Apache Beam: What is the difference between DoFn and SimpleFunction?

While reading about processing streaming elements in apache beam using Java, I came across DoFn and then across SimpleFunction. Both of these look similar to me and I find it difficult to understand the difference. Can someone explain the difference in layman terms?
kaxil
1 vote · 1 answer · 73 views

Dataflow process has not recovered on failure

Following the recent incidents where a whole AZ could have been lost to an outage, I would like to better understand the Dataflow failover procedures. When I manually deleted the worker nodes for a Dataflow job (streaming, PubSub to BigQuery), they were successfully recreated/restarted, yet the...
1 vote · 1 answer · 184 views

Creating/Writing to Sharded (Dated) BigQuery table via Google Cloud Dataflow

Is there an easy-to-follow example of how to configure a streaming-mode Dataflow pipeline to write each window into a separate BigQuery table (and create one, if necessary)? I.e. table_20160701, table_20160702, etc.
1 vote · 1 answer · 226 views

Limiting the number of values per-key

Currently we have a Dataflow process with a GroupByKey, but the ParDo after the group-by is getting too many values per key, and we wanted to know if there is a good solution for this. From what I can tell there is no way to set a maximum number of values per window. Right now we are exploring...
Narek
1 vote · 1 answer · 95 views

Dataflow Display Data missing from composite transform

I'm trying the new display data functionality in Dataflow to make additional details show up in the Google Cloud Dataflow UI. However, the display data for custom PTransforms doesn't show up. In my Dataflow pipeline, I have a transform like: Pipeline p = // .. p.apply(new PTransform() { @Override p...
Scott Wegner
1 vote · 2 answers · 75 views

Can SlidingWindows have half second periods in python apache beam?

SlidingWindows seems to be rounding my periods. If I set the period to 1.5 and the size to 3.5 seconds, it creates 3.5-second windows every 1.0 second. I expected to have 3.5-second windows every 1.5 seconds. Is it possible to have a period that is a fraction of a second?
agsolid
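For reference, this is roughly the setup the question describes, in the Python SDK: SlidingWindows takes its size and period in seconds, so a fractional period is expressed directly as 1.5 (whether it is honored without rounding depends on the SDK version, which is the point of the question). The element values and timestamps below are made up.

```python
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    (p
     | beam.Create([('k', 0.0), ('k', 1.0), ('k', 2.0), ('k', 4.0)])
     # Use the second field as the element timestamp (in seconds).
     | 'Add timestamps' >> beam.Map(
         lambda kv: window.TimestampedValue(kv, kv[1]))
     # size and period are in seconds; 1.5 is the fractional period asked about.
     | 'Sliding windows' >> beam.WindowInto(
         window.SlidingWindows(size=3.5, period=1.5))
     | beam.GroupByKey()
     | beam.Map(print))
```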
1 vote · 2 answers · 727 views

How to read BigQuery nested tables using Dataflow Python SDK

How can I read nested structures using the Apache Beam Python SDK? lines = p | io.Read(io.BigQuerySource('project:test.beam_in')) results in 'reason': 'invalidQuery', 'message': 'Cannot output multiple independently repeated fields at the same time. Found classification_item_distribution and category_ca...
Evgeny Minkevich
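The error quoted above comes from BigQuery flattening repeated fields when a table with nested/repeated columns is read directly. One way around it in that generation of the Python SDK is to read through a query with result flattening disabled; the table name is taken from the excerpt, everything else is a placeholder.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    rows = (p
            | 'Read nested rows' >> beam.io.Read(beam.io.BigQuerySource(
                query='SELECT * FROM [project:test.beam_in]',
                # Keep nested/repeated fields instead of letting BigQuery
                # flatten the query result.
                flatten_results=False))
            | 'Inspect' >> beam.Map(print))
```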
1 vote · 1 answer · 666 views

BigqueryIO Unable to Write to Date-Partitioned Table

I am following the instructions in the following post to write to a date-partitioned table in BigQuery. I am using a serializable function to map the window to a partition location using the $ syntax, and I get the following error: Invalid table ID 'table$19700822'. Table IDs must be alphanumer...
Narek
1 vote · 1 answer · 352 views

IllegalMutationException from Beam PTransform

This is the Apache Beam PTransform that I wrote: public class NormalizeTransform extends PTransform { @Override public PCollection expand(PCollection lines) { ExtractFields extract_i = new ExtractFields(); PCollection table = lines .apply('Extracting data model fields from lines', ParDo.of(extract_i...
bignano
1 vote · 1 answer · 106 views

Beam GroupByKey in Spark Streaming on Yarn

I am currently trying to run a Beam pipeline with windowing and GroupByKey on the Spark runner. Locally it works fully, but in YARN mode it does not seem to trigger panes after GroupByKey.create() downstream at all (no final HBase mutations). All ParDos before grouping successfully log the mess...
azagrebin
1 vote · 1 answer · 308 views

Initializing external service connections in Beam

I am writing a Dataflow streaming pipeline. In one of the transformations (a DoFn) I want to access an external service, in this case Datastore. Is there any best practice for this sort of initialization step? I don't want to create the Datastore connection object for every processElement metho...
ganesh_patil
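The question is about the Java SDK, where @Setup and @StartBundle are the usual hooks for this. Below is a Python-SDK sketch of the same lifecycle idea; make_client() and _FakeClient are stand-ins for building whatever real client is needed (Datastore in the question).

```python
import apache_beam as beam

class _FakeClient:
    # Stand-in for a real service client such as Datastore.
    def lookup(self, key):
        return ('looked-up', key)

    def close(self):
        pass

def make_client():
    return _FakeClient()

class LookupFn(beam.DoFn):
    def setup(self):
        # Runs once per DoFn instance, before any bundle: build the client here.
        self.client = make_client()

    def start_bundle(self):
        # Per-bundle work, e.g. resetting a small local cache.
        self.cache = {}

    def process(self, element):
        # The client is reused across elements instead of being recreated.
        yield self.client.lookup(element)

    def teardown(self):
        self.client.close()
```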
1 vote · 1 answer · 310 views

Element value based writing to Google Cloud Storage using Dataflow

I'm trying to build a dataflow process to help archive data by storing data into Google Cloud Storage. I have a PubSub stream of Event data which contains the client_id and some metadata. This process should archive all incoming events, so this needs to be a streaming pipeline. I'd like to be able t...
Samuel Wu
1 vote · 2 answers · 365 views

Forcing an empty pane/window in streaming in Apache Beam

I am trying to implement a pipeline that takes in a stream of data and every minute outputs True if there is any element in the minute interval or False if there is none. The pane (with a forever time trigger) or window (fixed window) does not seem to trigger if there is no element for the duration....
yihuaf
1 vote · 2 answers · 554 views

Apache Beam ETL dimension table loading, any example?

I am thinking of loading a file into one dimension table. My solution is: Beam.read the file; create the side input from the DB about existing data; in a ParDo, filter the records which are already in the side input; BigQuerySink into the DB. I want to inquire whether someone has implemented this, and can you gi...
EmmaYang
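A hedged Python sketch of the flow listed above (read the file, build a side input of existing keys, filter in a ParDo, append to BigQuery). All table, bucket, and field names are placeholders; for a large dimension table a dict/set-shaped side input would be a better fit than AsList.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Keys already present in the dimension table.
    existing_keys = (p
                     | 'Read existing' >> beam.io.Read(beam.io.BigQuerySource(
                         query='SELECT dim_key FROM [project:dataset.dim_table]'))
                     | 'Key only' >> beam.Map(lambda row: row['dim_key']))

    new_rows = (p
                | 'Read file' >> beam.io.ReadFromText('gs://my-bucket/dim_load.csv')
                | 'To row' >> beam.Map(lambda line: dict(
                    zip(['dim_key', 'name'], line.split(','))))
                # Drop rows whose key already exists, using the side input.
                | 'Filter existing' >> beam.Filter(
                    lambda row, existing: row['dim_key'] not in existing,
                    existing=beam.pvalue.AsList(existing_keys)))

    new_rows | 'Append to BigQuery' >> beam.io.WriteToBigQuery(
        'project:dataset.dim_table',
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
```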
1 vote · 1 answer · 323 views

How do you determine how many resources to provision in a Google Dataflow streaming pipeline?

I'm curious how to decide on provisioning resources for Apache Beam pipelines running on Google's Dataflow platform. I've built a streaming pipeline (Beam Java 2.0.0) that takes a PubSub JSON string, transforms it to a BQ TableRow, then routes it to the correct tables. There are also two transf...
SJAndersonLA
1 vote · 2 answers · 526 views

How to use pubsub notifications for cloud storage to trigger dataflow pipeline

I'm trying to integrate a Google Cloud Dataflow pipeline with Google Cloud Pub/Sub Notifications for Google Cloud Storage. The idea is to start processing a file as soon as it is created. The messages are being published, and with the PubsubIO.readMessagesWithAttributes() source I manage to extract the file UR...
Fábio Uechi
1 vote · 1 answer · 666 views

How to read files as byte[] in Apache Beam?

We are currently working on a proof-of-concept Apache Beam pipeline on Cloud Dataflow. We put some files (not text; a custom binary format) into Google Cloud Storage buckets and would like to read these files as byte[] and deserialize them in the flow. However, we cannot find a Beam source that is able to re...
Simon
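In newer Python SDKs, the fileio transforms cover exactly this: MatchFiles/ReadMatches yield readable file handles whose read() returns the raw bytes (the question targets Java, where FileIO.readMatches() plays the same role). The bucket path and the deserialize() body are placeholders.

```python
import apache_beam as beam
from apache_beam.io import fileio

def deserialize(raw_bytes):
    # Placeholder for the custom binary deserialization.
    return len(raw_bytes)

with beam.Pipeline() as p:
    records = (p
               | 'Match files' >> fileio.MatchFiles('gs://my-bucket/binary/*.bin')
               | 'Open files' >> fileio.ReadMatches()
               # ReadableFile.read() returns the whole file content as bytes.
               | 'Read bytes' >> beam.Map(lambda f: f.read())
               | 'Deserialize' >> beam.Map(deserialize))
```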
1 vote · 2 answers · 582 views

Dataflow DynamicDestinations unable to serialize org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite

I am trying to use DynamicDestinations to write to a partitioned table in BigQuery where the partition name is mytable$yyyyMMdd. If I bypass DynamicDestinations and supply a hardcoded table name in .to(), it works; however, with DynamicDestinations I get the following exception: java.lang.IllegalAr...
Andrew Turner
1 vote · 1 answer · 257 views

Processing with State and Timers

Are there any guidelines or limitations for using stateful processing and timers with the Beam Dataflow runner (as of v2.1.0)? Things such as limitations on the size of state or frequency of updates etc.? The candidate streaming pipeline would use state and timers extensively for user session state,...
Thomas
1 vote · 2 answers · 380 views

Google Cloud dataflow : Read from a file with dynamic filename

I am trying to build a pipeline on Google Cloud Dataflow that would do the following: listen to events on a Pub/Sub subscription; extract the filename from the event text; read the file (from a Google Cloud Storage bucket); store the records in BigQuery. Following is the code: Pipeline pipeline = //create pipeli...
Darshan Mehta
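The question's code is Java, but the shape of the pipeline translates directly. A hedged Python sketch, assuming each Pub/Sub message carries the GCS path of the new file and using placeholder subscription/table names:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | 'Read events' >> beam.io.ReadFromPubSub(
         subscription='projects/my-project/subscriptions/file-events')
     # Assumes the message body is the GCS path of the newly created file.
     | 'To filename' >> beam.Map(lambda msg: msg.decode('utf-8').strip())
     # ReadAllFromText takes a PCollection of file patterns, so the names
     # only need to be known at runtime.
     | 'Read files' >> beam.io.ReadAllFromText()
     | 'To row' >> beam.Map(lambda line: {'raw_line': line})
     | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
         'my-project:archive.lines', schema='raw_line:STRING'))
```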
1 vote · 2 answers · 266 views

Convert EBCDIC to ASCII in Apache Beam

I am trying to convert EBCDIC file to ASCII using CobolIoProvider class from JRecord in Apache Beam. Code that I'm using: CobolIoProvider ioProvider = CobolIoProvider.getInstance(); AbstractLineReader reader = ioProvider.getLineReader(Constants.IO_FIXED_LENGTH, Convert.FMT_MAINFRAME,CopybookLoader....
rish0097
1 vote · 1 answer · 150 views

Modifying Single BigQuery column and writing to new table

I would like to modify a single column in BigQuery and write the updated data to a new table without having to keep all of the other columns by hand. I can accomplish what I would like to do with the following code: row = p | 'ReadFromBigQuery' >> beam.io.Read(beam.io.BigQuerySource(query=qu...
reese0106
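In the Python SDK, rows read from BigQuery arrive as dictionaries, so a single Map can copy the row and overwrite just one field while every other column passes through untouched. A sketch with hypothetical table and column names:

```python
import apache_beam as beam

def update_column(row):
    new_row = dict(row)                        # keep all other columns as-is
    new_row['name'] = new_row['name'].upper()  # modify just this one field
    return new_row

with beam.Pipeline() as p:
    (p
     | 'ReadFromBigQuery' >> beam.io.Read(beam.io.BigQuerySource(
         query='SELECT * FROM [project:dataset.source_table]'))
     | 'UpdateColumn' >> beam.Map(update_column)
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
         'project:dataset.updated_table',
         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```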
1 vote · 1 answer · 70 views

Apache Beam Pipeline (Dataflow) - Interpreting Execution Time for Unbounded Data

In the Dataflow Monitoring Interface for Beam Pipeline Executions, there is a time duration specified in each of the Transformation boxes (see https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf). For bounded data, I understood this is the estimated time it would take for the transf...
jlyh
1 vote · 1 answer · 241 views

Registered Coder does not work on Dataflow

With the Apache Beam SDK, my registered coder doesn't work. I would like to use a SimpleFunction with BigQuery's TableSchema, but it needs to be serialized. I added TableSchemaCoder to the CoderRegistry, but it doesn't seem to be used. How can I solve this? // Coder import com.google.api.services.bigquery.model.Table...
nownabe
1 vote · 1 answer · 367 views

Running external library with Cloud Dataflow

I'm trying to run some external shared library functions with Cloud Dataflow, similar to what is described here: Running external libraries with Cloud Dataflow for grid-computing workloads. I have a couple of questions about the approach. There is the following passage in the article mentioned earlier...
Petr Shevtsov
1 vote · 1 answer · 274 views

Apache Beam: Read from multiple subscriptions

I want to create a dataflow that listens to multiple subscriptions and writes to BigQuery. As per Google's documentation, I can read multiple PCollection objects and combine them together. However, looking at PubsubIO.Read's javadoc here, it seems the subscription method accepts only one String. So, do...
Darshan Mehta
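Since each Pub/Sub read is bound to a single subscription, the usual pattern is one read per subscription followed by a Flatten. The question is about the Java SDK; here is a Python sketch of the same idea with placeholder subscription and table names:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

SUBSCRIPTIONS = [
    'projects/my-project/subscriptions/sub-a',
    'projects/my-project/subscriptions/sub-b',
]

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # One read transform per subscription, merged into a single PCollection.
    per_subscription = [
        p | f'Read {sub}' >> beam.io.ReadFromPubSub(subscription=sub)
        for sub in SUBSCRIPTIONS
    ]
    merged = per_subscription | 'Merge' >> beam.Flatten()

    (merged
     | 'To row' >> beam.Map(lambda msg: {'payload': msg.decode('utf-8')})
     | 'Write' >> beam.io.WriteToBigQuery(
         'my-project:events.all_messages', schema='payload:STRING'))
```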
1 vote · 1 answer · 227 views

Specifying schema update options for BigQueryIO in Apache Beam

Is it possible to specify the schema update options for BigQueryIO in Apache Beam? AFAIK one is normally able to use com.google.api.services.bigquery.model.JobConfigurationLoad to specify that the schema should be updated when new fields are inserted into a table, using: JobConfigurationLoad loadConfig = new J...
Johan
1 vote · 1 answer · 426 views

Google Cloud Dataflow Worker Threading

Say we have one worker with 4 CPU cores. How is parallelism configured on Dataflow worker machines? Do we parallelize beyond the number of cores? Where would this type of information be available?
datauser
1 vote · 3 answers · 1.1k views

java.lang.ClassCastException: com.google.gson.internal.LinkedTreeMap cannot be cast to java.util.LinkedHashMap

I apologize for opening another question about this general issue, but none of the questions I've found on SO seem to relate closely to my issue. I've got an existing, working dataflow pipeline that accepts objects of KV and outputs TableRow objects. This code is in our production environment, runni...
Max
1 vote · 1 answer · 629 views

Executing a SQL query inside a Dataflow using Python on a PCollection

I am trying to implement an SQL query as a transform in Dataflow. I loaded a table from BigQuery as a PCollection. I want to aggregate my data like the query below: SELECT name, user_id, place, SUM(amount) AS some_amount, SUM(cost) AS sum_cost FROM [project:test.day_0_test] GROUP BY 1,2,3 How can I imp...
geek
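One way to express that GROUP BY/SUM as a Beam transform is to key each row by the three grouping columns and combine the two numeric fields per key. A sketch using the field names from the query; the read source is assumed to yield dict rows:

```python
import apache_beam as beam

def to_kv(row):
    # Composite key mirrors GROUP BY 1,2,3; the value holds the inputs to both sums.
    return (row['name'], row['user_id'], row['place']), (row['amount'], row['cost'])

def sum_pairs(pairs):
    amounts, costs = zip(*pairs)
    return sum(amounts), sum(costs)

def to_row(kv):
    (name, user_id, place), (some_amount, sum_cost) = kv
    return {'name': name, 'user_id': user_id, 'place': place,
            'some_amount': some_amount, 'sum_cost': sum_cost}

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.Read(beam.io.BigQuerySource(
         query='SELECT name, user_id, place, amount, cost '
               'FROM [project:test.day_0_test]'))
     | 'To KV' >> beam.Map(to_kv)
     | 'Sum per key' >> beam.CombinePerKey(sum_pairs)
     | 'To output row' >> beam.Map(to_row)
     | 'Inspect' >> beam.Map(print))
```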
1 vote · 1 answer · 334 views

Exactly once from Kafka source in Apache Beam

Is it possible to do exactly-once processing using a Kafka source, with KafkaIO in Beam? There is a flag that can be set called AUTO_COMMIT, but it seems that it commits back to Kafka straight after the consumer processed the data, rather than after the pipeline completed processing the message.
jamborta
