-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Shuffle between epochs #456
Conversation
✅ Result of Pytest Coverage---------- coverage: platform linux, python 3.12.3-final-0 -----------
|
This touches a few files but mostly propagating the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't review everything (the comments now are just nits), but I think I conceptually understand the logic:
We shuffle inter-partition in online_dataset.py
by shuffle the partition ids, and shuffle intra-partition on selector side by shuffling data within partition, or directly shuffling data if it is the local key store. Correct?
Right now I have two questions:
- For
SelectorKeySource
, why do we have to shuffle on server (sender) side? Can we shuffle on receiver/client side, like what we did inLocalKeySource
, to reduce the radius of changes? - In
online_dataset.py
, there are two pathsprefetched_partition_generator
and_fetch_partition_noprefetch
to yield data, why is it enough to only apply shuffling in the second path?
re your questions in the main comment:
|
@XianzheMa: I will have to remove
probably I will have to change the logic such that if shuffling is enabled, we wait for the entire partition to be transferred and then shuffle it. i hope this will not affect throughput too much. at some point i can measure this, but it will mostly affect things like criteo. i will implement this tomorrow morning such that we can then merge this PR and you can use the shuffling functionality. sorry i did not realize that shuffling is not straightforward due to the storage implementation |
I see. So even if we shuffle the keys, the storage component still return samples not in the key order, but in timestamp order, is this correct? Do you remember why we want to sort samples in the timestamp order in the first place (instead of naturally returning samples in key order)? |
No, we don't sort by timestamp:
That function implements the sending of results. For performance reasons, we iterate over files (not by timestamp!) in a request. This is because if one file contains many samples, we just want to open and load it into memory once and then read the data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! My comments are non-critical. The most concern is, shouldn't we add unit test coverage to the online_dataset.py
file?
Because, the integration test does not cover everything, for example as you commented, even when shuffling == False, the order of samples can be slightly different.
modyn/tests/trainer_server/internal/data/test_online_dataset.py
Outdated
Show resolved
Hide resolved
This PR introduces a `shuffle` option for training: If `True`, then we shuffle the order of the partitions and the keys within the partitions between each epoch. Note that as described in #460, we might need to have this a bit more finegrained for things like Criteo to optimize performance.
This PR introduces a
shuffle
option for training: IfTrue
, then we shuffle the order of the partitions and the keys within the partitions between each epoch.Note that as described in #460, we might need to have this a bit more finegrained for things like Criteo to optimize performance.