What year is the Common Crawl data from? Was sentence-level cleaning performed? #15

Open
guang11644331 opened this issue Sep 12, 2023 · 1 comment

Comments

@guang11644331

How was the text extracted from the HTML, and how were sentences within each document cleaned? Thanks!

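@guang11644331 guang11644331 changed the title "What year is the Common Crawl data from? Sentence-level cleaning performed?" to "What year is the Common Crawl data from? Was sentence-level cleaning performed?" Sep 12, 2023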
@ZhouqyCH

I have the same question. Is the Common Crawl data from the same years as the Common Crawl data used in LLaMA pretraining?
