I noticed that you trained the NLP emulator on the first 30 chunks of the Pile dataset. How large are these 30 chunks? In other words, how many chunks does the Pile consist of in total? The original Pile dataset is over 800 GB, which is too large for most labs to work with...
Besides, did you try smaller datasets such as Wikitext? How does the method perform when trained on these smaller datasets?
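For reference, here is a minimal sketch of how a smaller corpus like Wikitext-103 could be dropped in place of the Pile chunks, assuming the emulator training loop consumes tokenized text via HuggingFace `datasets`/`transformers` (the tokenizer name and sequence length below are just placeholders, not your actual setup):

```python
# Sketch only: swap Wikitext-103 (~500 MB raw text) in for the Pile chunks,
# assuming the distillation dataloader takes tokenized text blocks.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

# Wikitext-103 is orders of magnitude smaller than the Pile's 800+ GB.
raw = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def tokenize(batch):
    # Truncate to a fixed block size; 1024 is a placeholder, not your config.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
# `tokenized` could then be fed to the same emulator-distillation dataloader
# that currently consumes the Pile chunks.
```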
Thanks
May I ask whether you were able to train on a smaller dataset for emulator distillation? If so, how did the method perform when distilled on smaller datasets? Any insights would be helpful for better understanding the proposed algorithm.