
Usage of Pile dataset to train the emulator #10

Open
ziqi-zhang opened this issue Dec 14, 2023 · 2 comments

Comments

@ziqi-zhang

Hi,

I noticed that you trained the NLP emulator on the first 30 chunks of the Pile dataset. How large are these 30 chunks? Or, put another way, how many chunks does the Pile have in total? The original Pile dataset is over 800 GB, which is too large for most labs...

Besides, did you try using smaller datasets, such as WikiText? How does the method perform with these smaller datasets?
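For context on working around the full download, here is a minimal sketch of taking only a small slice of a streamed dataset rather than downloading all 800+ GB. This assumes the Hugging Face `datasets` library and a hosted Pile mirror such as `monology/pile-uncopyrighted` (the mirror name and example counts are my assumptions, not something stated in this issue):

```python
# Sketch (assumption): stream a small slice of the Pile instead of
# downloading the full dataset to disk.
from itertools import islice

def take_first(stream, n):
    """Materialize the first n examples from any iterable, e.g. a
    streaming Hugging Face dataset, without consuming the rest."""
    return list(islice(stream, n))

# Usage (requires network; dataset name is an assumption):
# from datasets import load_dataset
# pile = load_dataset("monology/pile-uncopyrighted",
#                     split="train", streaming=True)
# subset = take_first(pile, 10_000)  # first 10k documents only
```

With `streaming=True`, examples are fetched lazily, so only the slice you actually iterate over is downloaded.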

Thanks

@krishnakanthnakkav2

Hello @ziqi-zhang,

May I ask whether you were able to train on a smaller dataset for emulator distillation? If so, how did the method perform when distilled on smaller datasets? Any insights would be helpful for understanding the proposed algorithm better.

Thanks

@zhaocaibei123

I have the same question.
