Is Keras fit_generator only useful for data augmentation, or also for reading from disk (or network)?


April 2019




Data might not fit in GPU memory (together with activations and gradients), which is why one trains on mini-batches; it might also not fit in RAM, which is where fit_generator comes in. At least, that latter part is the hypothesis I would like to validate here.

Is it true that Keras applies a producer-consumer strategy: it first loads the elements yielded by the generator into RAM until the queue of size max_queue_size is filled, and then keeps refilling the queue as batches are popped off to train the network? The documentation mentions that this is useful for doing data augmentation on the CPU while the GPU trains. But is the use case where this producer-consumer parallelism is used to load the data from disk into RAM, because it doesn't fit in RAM all at once, also valid? My dataset consists of 100k CT scans, which obviously do not fit in RAM.

Summarized: is fit_generator only meant to parallelize data pre-processing and training, or can it also be sensibly used to parallelize data loading (into RAM) and training? Or would the latter be like using a hammer to drive a screw into the wall?
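For concreteness, the kind of generator I have in mind looks roughly like the sketch below. Here `load_scan` is a placeholder for whatever actually reads one CT volume from disk (its shape and the file layout are assumptions for illustration); the point is that only one batch lives in RAM at a time, plus whatever Keras buffers in its internal queue.

```python
import numpy as np

# Placeholder for the real disk-reading routine (e.g. via SimpleITK or
# nibabel); the fixed shape here is an assumption for illustration only.
def load_scan(path):
    return np.zeros((64, 64, 64), dtype=np.float32)

def scan_batch_generator(scan_paths, labels, batch_size):
    """Yield (inputs, targets) batches, loading scans from disk on demand.

    Only one batch is materialized per yield, so the full 100k-scan
    dataset never needs to fit in RAM at once."""
    while True:  # fit_generator expects the generator to loop indefinitely
        for start in range(0, len(scan_paths), batch_size):
            paths = scan_paths[start:start + batch_size]
            x = np.stack([load_scan(p) for p in paths])  # disk -> RAM, one batch
            y = np.asarray(labels[start:start + batch_size])
            yield x, y
```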

1 answer


Yes, you can use it to parallelize data loading; that is one of the common use cases, for example when training on ImageNet-sized datasets.

But note that if you parallelize too much (too many workers), I/O usually becomes the bottleneck.
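The producer-consumer mechanics can be illustrated with a stdlib-only sketch. This is an assumption about the general idea behind Keras's internal enqueuer, not its actual implementation: worker threads fill a bounded queue (compare max_queue_size and workers in fit_generator) while the training loop pops batches from it, so loading and training overlap.

```python
import queue
import threading
import time

def run_pipeline(n_batches, max_queue_size=10, workers=2, load_time=0.01):
    """Toy producer-consumer pipeline: workers 'load' batches into a
    bounded queue; the consumer (the training loop) pops them off."""
    q = queue.Queue(maxsize=max_queue_size)  # analogous to max_queue_size
    indices = iter(range(n_batches))
    lock = threading.Lock()

    def producer():
        while True:
            with lock:                 # hand out batch indices one at a time
                i = next(indices, None)
            if i is None:
                return
            time.sleep(load_time)      # simulated disk I/O
            q.put(f"batch_{i}")        # blocks when the queue is full

    threads = [threading.Thread(target=producer) for _ in range(workers)]
    for t in threads:
        t.start()

    consumed = []
    for _ in range(n_batches):
        consumed.append(q.get())       # each "training step" pops one batch
    for t in threads:
        t.join()
    return consumed
```

With more workers the queue fills faster, but past the point where the disk is saturated, adding workers no longer helps: the producers all wait on I/O, which is the bottleneck mentioned above.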