Allow users to specify whether to only shuffle partition order or data within partitions #460

MaxiBoether · 2024-06-03T07:57:41Z

In #456, we introduce a `shuffle` pipeline parameter that shuffles both the order of partitions within workers and the data within each partition for more randomness. However, shuffling within a partition requires fetching the entire partition before yielding anything to the data loader. We need to investigate the performance overhead of this when running Criteo and CLOC. Instead of a boolean that only switches between shuffling as much as possible and not shuffling at all, we should let users choose between different shuffling variants.
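To make the proposal concrete, here is a minimal sketch of what a multi-valued shuffle option could look like. The names `ShuffleMode` and `iterate_samples` are hypothetical and not part of the existing code base, and partitions are modelled as plain in-memory lists of keys rather than real storage requests.

```python
import random
from enum import Enum
from typing import Iterable, Iterator, Sequence


class ShuffleMode(Enum):
    """Hypothetical replacement for the boolean `shuffle` flag."""

    NONE = "none"                        # keep partition order and sample order
    PARTITION_ORDER = "partition_order"  # shuffle only the order of partitions
    FULL = "full"                        # shuffle partition order and samples within partitions


def iterate_samples(
    partitions: Sequence[Iterable[int]], mode: ShuffleMode, seed: int
) -> Iterator[int]:
    """Yield sample keys from all partitions according to the chosen mode."""
    rng = random.Random(seed)
    order = list(range(len(partitions)))
    if mode is not ShuffleMode.NONE:
        rng.shuffle(order)

    for idx in order:
        if mode is ShuffleMode.FULL:
            # The whole partition has to be materialized before the first
            # sample can be yielded; this is the overhead discussed above.
            samples = list(partitions[idx])
            rng.shuffle(samples)
            yield from samples
        else:
            # Samples can be passed on as they arrive (early yield).
            yield from partitions[idx]
```

With `PARTITION_ORDER`, the samples of each partition stay contiguous and in their original order, so nothing needs to be buffered; only `FULL` pays the cost of fetching a partition completely before yielding.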

When implementing more lightweight shuffling, we could think about supporting shuffling at the file level in the storage. We could order by file ID but then process the files in random order. For single-sample files, this would have the same effect as buffering the entire partition, but it would still allow the early-yield logic.
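The file-level idea could be sketched roughly as follows. This is an illustration under stated assumptions, not the storage implementation: `yield_by_shuffled_files` is a hypothetical helper, and the input is assumed to be a list of `(file_id, key)` rows already sorted by file ID.

```python
import random
from itertools import groupby
from typing import Iterator, List, Sequence, Tuple


def yield_by_shuffled_files(
    rows: Sequence[Tuple[int, int]], seed: int
) -> Iterator[int]:
    """Yield keys file by file, visiting the files in random order.

    `rows` holds (file_id, key) pairs sorted by file_id (assumed input format).
    """
    rng = random.Random(seed)

    # Group the keys per file; only this lightweight metadata is held up front.
    files: List[Tuple[int, List[int]]] = [
        (file_id, [key for _, key in group])
        for file_id, group in groupby(rows, key=lambda row: row[0])
    ]
    rng.shuffle(files)

    for _file_id, keys in files:
        # In a real storage component the file would be read here, and its
        # samples could be yielded before any other file is touched.
        yield from keys
```

If every file contains exactly one sample, the output is a full permutation of the partition, which matches the observation above. For multi-sample files, samples from the same file stay adjacent, so the randomness is coarser than a full in-partition shuffle.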

This PR introduces a `shuffle` option for training: if `True`, we shuffle the order of the partitions and the keys within the partitions between epochs. Note that, as described in #460, we might need to make this more fine-grained for workloads like Criteo to optimize performance.

MaxiBoether mentioned this issue Jun 3, 2024

feat: Shuffle between epochs #456

Merged
