
Dataset weightage bug #185

Closed
yhl48 opened this issue Jun 26, 2024 · 2 comments · Fixed by #188
Comments

@yhl48
Contributor

yhl48 commented Jun 26, 2024

It doesn’t look like this line is doing what’s intended based on the comment; all datasets are given equal weightage here:

self._weights = [1 / float(num_datasets)] * num_datasets
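To make the effect concrete, here is a minimal, self-contained sketch (the dataset sizes below are hypothetical, chosen to match the 1:3 ratio mentioned in the next comment; this is not the library's actual code):

import random

# Hypothetical dataset sizes with a 1:3 ratio.
dataset_lens = [1000, 3000]
num_datasets = len(dataset_lens)

# Current behaviour: every dataset gets the same weight, regardless of size.
equal_weights = [1 / float(num_datasets)] * num_datasets  # [0.5, 0.5]

# Half of the draws come from the small dataset, so each of its samples is
# seen roughly three times as often as a sample from the large dataset.
draws = random.choices(range(num_datasets), weights=equal_weights, k=4000)
print([draws.count(i) for i in range(num_datasets)])  # roughly [2000, 2000]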

cc: @tchaton

@yhl48
Contributor Author

yhl48 commented Jun 26, 2024

For context, I started looking into this code because I consistently got a periodic, sine-like validation loss when combining two datasets; the issue disappeared when I manually shuffled and merged them into a single dataset. The two datasets have a 1:3 size ratio.

I am not sure whether L74 is a bug, and even if it is, I am not sure whether correcting it would solve the validation loss issue. The problem could also stem from my two datasets being very different, which makes me wonder whether there is a sampling strategy that is universally acceptable. Either way, I thought it would be good to have a discussion here 🙂.

@tchaton
Collaborator

tchaton commented Jun 26, 2024

Oh yes, good catch. We should take the length of each dataset normalized by the total:

# Weight each dataset by its share of the total number of samples.
dataset_lens = [len(d) for d in datasets]
total = sum(dataset_lens)
self._weights = [dataset_len / total for dataset_len in dataset_lens]
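With the 1:3 size ratio from the earlier comment, that would give weights of 0.25 and 0.75, so every individual sample ends up with roughly the same probability of being drawn. A minimal sketch, again with hypothetical sizes:

# Hypothetical sizes matching the 1:3 ratio discussed above.
dataset_lens = [1000, 3000]
total = sum(dataset_lens)

# Proposed fix: weight each dataset by its share of the total samples.
weights = [dataset_len / total for dataset_len in dataset_lens]
print(weights)  # [0.25, 0.75]

# Per-sample probability is now uniform across both datasets:
# 0.25 / 1000 == 0.75 / 3000 == 2.5e-4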
