I created a simple Dataflow pipeline that reads byte arrays from Pub/Sub, windows them, and writes them to a text file in GCS. I found that this worked perfectly on lower-traffic topics, but when I ran it on a topic that does about 2.4 GB per minute, some problems started to arise.
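For context, the pipeline looks roughly like this (a sketch, not my exact code: the topic, bucket, and shard count are placeholders, and I've used `readStrings()` here where my real job reads byte arrays):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class PubsubToGcs {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("Read from PubSub",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
     .apply("Window",
            Window.into(FixedWindows.of(Duration.standardMinutes(5))))
     .apply("Write to output",
            TextIO.write()
                .to("gs://my-bucket/output")
                // Unbounded (streaming) input requires windowed writes
                // and an explicit shard count.
                .withWindowedWrites()
                .withNumShards(10));
    p.run();
  }
}
```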
When kicking off the pipeline I hadn't set the number of workers, as I'd imagined it would auto-scale as necessary. When ingesting this volume of data the number of workers stayed at 1, and TextIO.write() was taking 15+ minutes to write a 2-minute window. The pipeline would continue to back up until it ran out of memory. Is there a good reason why Dataflow doesn't auto-scale when this step gets so backed up?
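For what it's worth, my understanding is that streaming autoscaling on the Dataflow runner has to be requested explicitly, which may be relevant here. A sketch of the options involved (assuming the Beam Java SDK with the Dataflow runner; the values are illustrative, not what my job used):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LaunchOptions {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setStreaming(true);
    // Autoscaling needs an explicit ceiling on worker count:
    options.setMaxNumWorkers(6);
    options.setAutoscalingAlgorithm(
        DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.THROUGHPUT_BASED);
    Pipeline p = Pipeline.create(options);
    // ... build and run the pipeline ...
  }
}
```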
When I increased the number of workers to 6, the time to write the files started at around 4 minutes for a 5-minute window, then dropped to as little as 20 seconds.
Also, when using 6 workers, it seems like there might be an issue with how wall time is calculated: mine never goes down, even after the pipeline has caught up. After running for 4 hours, my summary for the write step looked like this:
Step summary
  Step name: Write to output
  System lag: 3 min 30 sec
  Data watermark: Max watermark
  Wall time: 1 day 6 hr 26 min 22 sec
  Input collections: PT5M Windows/Window.Assign.out0
    Elements added: 860,893
    Estimated size: 582.11 GB
Job ID: 2019-03-13_19_22_25-14107024023503564121