Dataflow TextIO.write issues with scaling


March 2019


3 time


I created a simple dataflow pipeline that reads byte arrays from pubsub, windows them, and writes to a text file in GCS. I found that with lower traffic topics this worked perfectly, however I ran it on a topic that does about 2.4GB per minute and some problems started to arise.

When kicking off the pipeline I hadn't set the number of workers (as I'd imagined that it would auto-scale as necessary). When ingesting this volume of data the number of workers stayed at 1, but the TextIO.write() was taking 15+ minutes to write a 2 minute window. This would continue to be backed up till it ran out of memory. Is there a good reason why Dataflow doesn't auto scale when this step gets so backed up?

When I increased the the number of workers to 6, the time to write the files started at around 4 mins for a 5 minute window, then moved down to as little as 20 seconds.

Also, when using 6 workers, it seems like there might be an issue for calculating wall time? Mine never seems to go down even when the dataflow has caught up and after running for 4 hours my summary for the write step looked like this:

Step summary
Step name: Write to output
System lag: 3 min 30 sec
Data watermark: Max watermark
Wall time: 1 day 6 hr 26 min 22 sec
Input collections: PT5M Windows/Window.Assign.out0
Elements added: 860,893
Estimated size: 582.11 GB

Job ID: 2019-03-13_19_22_25-14107024023503564121

0 answers