What year is the Common Crawl data from? Was sentence-level cleaning performed? #15

Open

guang11644331 opened this issue Sep 12, 2023 · 1 comment

guang11644331 commented Sep 12, 2023

How was the text extracted from the HTML, and how were the sentences within each document cleaned? Thanks!

guang11644331 changed the title ~~What year is the Common Crawl data from? Sentence-level cleaning performed?~~ What year is the Common Crawl data from? Was sentence-level cleaning performed?

ZhouqyCH commented Nov 28, 2023

I have the same question. Is the Common Crawl data from the same year as the Common Crawl data used in LLaMA pretraining?
