
Dataset weightage bug #185

Closed
yhl48 opened this issue Jun 26, 2024 · 2 comments · Fixed by #188
Comments

@yhl48
Contributor

yhl48 commented Jun 26, 2024

It doesn’t look like this line is doing what’s intended based on the comment; all datasets are given equal weightage here:

self._weights = [1 / float(num_datasets)] * num_datasets
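To make the effect concrete, here is a minimal, self-contained sketch (the dataset sizes below are hypothetical, chosen to match the 1:3 ratio mentioned in the next comment; this is not the library's actual code):

import random

# Hypothetical dataset sizes with a 1:3 ratio.
dataset_lens = [1000, 3000]
num_datasets = len(dataset_lens)

# Current behaviour: every dataset gets the same weight, regardless of size.
equal_weights = [1 / float(num_datasets)] * num_datasets  # [0.5, 0.5]

# Half of the draws come from the small dataset, so each of its samples is
# seen roughly three times as often as a sample from the large dataset.
draws = random.choices(range(num_datasets), weights=equal_weights, k=4000)
print([draws.count(i) for i in range(num_datasets)])  # roughly [2000, 2000]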

cc: @tchaton

@yhl48
Contributor Author

yhl48 commented Jun 26, 2024

For context, I started looking into this code because I consistently got a periodic, sine-like validation loss when combining two datasets; the issue disappeared when I manually shuffled and merged them into a single dataset. The two datasets have a 1:3 size ratio.

I am not sure whether L74 is a bug, and even if it is, I am not sure whether correcting it would solve the validation loss issue. The problem could also stem from my two datasets being very different, which makes me wonder whether there is a sampling strategy that is universally acceptable. Either way, I thought it would be good to have a discussion here 🙂.

@tchaton
Collaborator

tchaton commented Jun 26, 2024

Oh yes, good catch. We should take the length of each dataset normalized by the total:

# Weight each dataset by its share of the total number of samples.
dataset_lens = [len(d) for d in datasets]
total = sum(dataset_lens)
self._weights = [dataset_len / total for dataset_len in dataset_lens]
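With the 1:3 size ratio from the earlier comment, that would give weights of 0.25 and 0.75, so every individual sample ends up with roughly the same probability of being drawn. A minimal sketch, again with hypothetical sizes:

# Hypothetical sizes matching the 1:3 ratio discussed above.
dataset_lens = [1000, 3000]
total = sum(dataset_lens)

# Proposed fix: weight each dataset by its share of the total samples.
weights = [dataset_len / total for dataset_len in dataset_lens]
print(weights)  # [0.25, 0.75]

# Per-sample probability is now uniform across both datasets:
# 0.25 / 1000 == 0.75 / 3000 == 2.5e-4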
