Allow users to specify whether to only shuffle partition order or data within partitions #460

Open
MaxiBoether opened this issue Jun 3, 2024 · 0 comments

Comments

@MaxiBoether
Contributor

In #456, we introduce a `shuffle` pipeline parameter that shuffles both the order of partitions within each worker and the data within each partition, for more randomness. However, shuffling within a partition requires fetching the entire partition before yielding anything to the data loader. We need to measure the performance overhead of this when running Criteo and CLOC. Instead of a boolean that only switches between shuffling as much as possible and not shuffling at all, we should let users choose between different shuffling variants.
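
As a rough illustration of what such variants could look like, here is a minimal sketch. The names `ShuffleMode` and `iter_keys` are hypothetical and not taken from the codebase, and partitions are modeled as plain in-memory lists rather than being fetched from storage.

```python
import random
from enum import Enum
from typing import Iterator, Sequence


class ShuffleMode(Enum):  # hypothetical name, for illustration only
    NONE = "none"                        # keep partition and key order
    PARTITIONS_ONLY = "partitions_only"  # shuffle partition order only
    FULL = "full"                        # also shuffle keys inside each partition


def iter_keys(
    partitions: Sequence[Sequence[int]], mode: ShuffleMode, seed: int
) -> Iterator[int]:
    """Yield keys partition by partition according to the chosen mode."""
    rng = random.Random(seed)
    order = list(range(len(partitions)))
    if mode is not ShuffleMode.NONE:
        rng.shuffle(order)
    for idx in order:
        if mode is ShuffleMode.FULL:
            # Needs the whole partition before the first key can be yielded.
            keys = list(partitions[idx])
            rng.shuffle(keys)
            yield from keys
        else:
            # Keys can be passed on as they arrive (early yield).
            yield from partitions[idx]
```

With `PARTITIONS_ONLY`, keys of a partition stay contiguous and in their original order, so only `FULL` pays the buffering cost described above.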

When implementing more lightweight shuffling, we could consider supporting file-level shuffling at the storage: we can order by file id but then process the files in random order. For datasets with one sample per file, this would have the same effect as buffering the entire partition while still allowing the early-yield logic.
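
The file-level idea could be sketched roughly as follows. This is an assumption-laden illustration: `stream_files_shuffled` and the `files` mapping (file id to sample keys) are hypothetical and do not reflect the actual storage interface.

```python
import random
from typing import Dict, Iterator, List, Tuple


def stream_files_shuffled(
    files: Dict[int, List[int]], seed: int
) -> Iterator[Tuple[int, int]]:
    """Yield (file_id, key) pairs, visiting files in random order.

    `files` is a hypothetical mapping from file id to the sample keys
    stored in that file.
    """
    rng = random.Random(seed)
    file_ids = sorted(files)  # deterministic base order by file id
    rng.shuffle(file_ids)     # then handle the files in random order
    for file_id in file_ids:
        # Samples of one file are emitted as soon as that file is read,
        # so nothing beyond a single file has to be buffered.
        for key in files[file_id]:
            yield file_id, key
```

If every file holds exactly one sample, the resulting key order is a full random permutation, which matches the single-sample-file case described above.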

MaxiBoether added a commit that referenced this issue Jun 3, 2024
This PR introduces a `shuffle` option for training: If `True`, then we
shuffle the order of the partitions and the keys within the partitions
between each epoch.

Note that, as described in #460, we might need to make this more
fine-grained for workloads like Criteo to optimize performance.
robinholzi pushed a commit that referenced this issue Jun 4, 2024