Dataset weightage bug #185
For context, I started looking into this code because I consistently got a periodic, sine-like validation loss when combining two datasets, but the issue disappeared when I manually shuffled and combined them into a single dataset. The two datasets have a 1:3 size ratio. I am not sure whether L74 is a bug, and if it is, I am not sure whether correcting it would solve the validation loss issue. It could also be that my two datasets are simply very different, which makes me wonder whether there is a sampling strategy that is universally acceptable. Anyway, I thought it would be good to have a discussion here 🙂
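To make the suspicion concrete, here is a minimal, self-contained sketch (not litdata's actual sampler; the sizes 1,000 and 3,000 are hypothetical stand-ins for the 1:3 ratio). With uniform per-dataset weights, each dataset contributes about half of the combined stream, so the smaller dataset gets cycled through roughly three times as often per pass, which could plausibly show up as a periodic pattern in the validation loss.

```python
import random

# Hypothetical sizes standing in for the 1:3 ratio mentioned above.
small_dataset = list(range(1_000))
large_dataset = list(range(3_000))
datasets = [small_dataset, large_dataset]

# Equal weightage per dataset, i.e. the behaviour being questioned.
equal_weights = [1 / len(datasets)] * len(datasets)

random.seed(0)
picks = random.choices(
    range(len(datasets)),
    weights=equal_weights,
    k=len(small_dataset) + len(large_dataset),
)
counts = [picks.count(i) for i in range(len(datasets))]

# Roughly [2000, 2000]: the small dataset supplies ~50% of the samples
# even though it holds only 25% of the data, so it is revisited ~3x more often.
print(counts)
```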
Oh yes, good catch! We should take the length of each dataset, normalized by the total:

dataset_lens = [len(d) for d in datasets]
total = sum(dataset_lens)
self._weights = [l / total for l in dataset_lens]
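As a quick sanity check of that normalization (again only a sketch, with plain lists standing in for the actual streaming dataset objects), the draw proportions then track the 1:3 size ratio instead of 1:1:

```python
import random

datasets = [list(range(1_000)), list(range(3_000))]

# Suggested fix: weight each dataset by its length, normalized by the total.
dataset_lens = [len(d) for d in datasets]
total = sum(dataset_lens)
weights = [l / total for l in dataset_lens]
print(weights)  # [0.25, 0.75]

random.seed(0)
picks = random.choices(range(len(datasets)), weights=weights, k=total)
print([picks.count(i) for i in range(len(datasets))])  # approximately [1000, 3000]
```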
It doesn’t look like this line is doing what’s intended based on the comment: all datasets are given equal weightage here.
litdata/src/litdata/streaming/combined.py
Line 74 in d5eff39
cc: @tchaton