In #456, we introduce a `shuffle` pipeline parameter that shuffles both the order of partitions within workers and the data within partitions for more randomness. However, shuffling within a partition requires fetching the entire partition before yielding anything to the data loader. We need to investigate the performance overhead of this when running Criteo and CLOC. Instead of a boolean that only switches between shuffling as much as possible and not shuffling at all, we should offer an option to select among different shuffling variants.
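To make the trade-off concrete, here is a minimal sketch of what a non-boolean shuffle option could look like. `ShuffleMode`, `iterate_samples`, and the mode names are hypothetical and not taken from #456; partitions are modeled as plain in-memory lists rather than data fetched from storage.

```python
import random
from enum import Enum
from typing import Iterator, List, Sequence


class ShuffleMode(Enum):
    """Hypothetical shuffle variants; the names are illustrative only."""
    NONE = "none"                        # keep partition and sample order
    PARTITIONS_ONLY = "partitions_only"  # shuffle partition order, stream samples
    FULL = "full"                        # shuffle partition order and samples


def iterate_samples(
    partitions: Sequence[Sequence[int]], mode: ShuffleMode, seed: int
) -> Iterator[int]:
    rng = random.Random(seed)
    order: List[int] = list(range(len(partitions)))
    if mode is not ShuffleMode.NONE:
        rng.shuffle(order)
    for idx in order:
        partition = partitions[idx]
        if mode is ShuffleMode.FULL:
            # The whole partition must be materialized before the first yield.
            buffered = list(partition)
            rng.shuffle(buffered)
            yield from buffered
        else:
            # Samples can be yielded as soon as they become available.
            yield from partition
```

Only the `FULL` branch has to buffer a complete partition before yielding, which is the overhead this issue asks to measure.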
When implementing more lightweight shuffling, we could consider supporting file-level shuffling at the storage. We can order by file ID but then process the files in random order. For single-sample files, this would have the same effect as buffering the entire partition while still allowing the early-yield logic.
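The file-level idea could be sketched roughly as follows. This is an in-memory illustration under assumptions, not the storage implementation: `iterate_by_shuffled_files` and the `(key, file_id)` pair layout are hypothetical.

```python
import random
from typing import Dict, Iterator, List, Tuple


def iterate_by_shuffled_files(
    keys_with_files: List[Tuple[int, int]], seed: int
) -> Iterator[int]:
    """Yield keys file by file, visiting the files in random order.

    keys_with_files holds hypothetical (key, file_id) pairs.
    """
    rng = random.Random(seed)
    by_file: Dict[int, List[int]] = {}
    # Order by file ID so that all keys of one file are grouped together.
    for key, file_id in sorted(keys_with_files, key=lambda pair: pair[1]):
        by_file.setdefault(file_id, []).append(key)
    file_ids = list(by_file)
    rng.shuffle(file_ids)  # randomize the order in which files are handled
    for file_id in file_ids:
        # Keys of one file are yielded as soon as that file is handled,
        # without buffering the rest of the partition first.
        yield from by_file[file_id]
```

When every file holds exactly one sample, the output is a random permutation of the keys, which matches the effect of shuffling a fully buffered partition.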
This PR introduces a `shuffle` option for training: if `True`, we
shuffle the order of the partitions and the keys within the partitions
between epochs.
Note that, as described in #460, we might need to make this more
fine-grained for workloads like Criteo to optimize performance.