Export categorized text corpus #13

Open
turicas opened this issue May 8, 2013 · 0 comments

turicas commented May 8, 2013

NLTK supports categorized corpora: the corpus object provides a categories method, and the other methods (sents, paras, etc.) accept a categories argument to filter by category.
We should export this corpus as a categorized corpus in which the categories are Wikipedia Portals. This task includes:

  • Retrieving portal information from Portuguese Wikipedia (which page belongs to which portal?);
  • Creating a custom corpus reader (or modifying an existing one) to export the corpus.
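
One possible shape for the export, as a rough sketch rather than a settled design: write one text file per page plus a cats.txt file that maps each file to its portals, which NLTK's existing CategorizedPlaintextCorpusReader can read through its cat_file option. The input format (a dict of title to text and portals), the articles/ directory, the file naming scheme, and both function names are assumptions made for illustration; retrieving the portal data from Wikipedia is not covered here.

```python
import re
from pathlib import Path


def export_categorized_corpus(pages, root):
    """Write one text file per page plus a cats.txt category map.

    `pages` maps a page title to a (text, portals) pair. This input
    shape is an assumption made for the sketch.
    """
    root = Path(root)
    (root / "articles").mkdir(parents=True, exist_ok=True)

    lines = []
    for title, (text, portals) in sorted(pages.items()):
        if not portals:
            # Pages without any portal are left out of this sketch.
            continue
        # Hypothetical naming scheme: collapse non-word characters to "_".
        name = re.sub(r"\W+", "_", title).strip("_")
        fileid = "articles/" + name + ".txt"
        (root / fileid).write_text(text, encoding="utf-8")
        # One line per file: file id, then its portals, tab-separated.
        lines.append("\t".join([fileid, *sorted(portals)]))

    (root / "cats.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def load_categorized_corpus(root):
    """Open the exported directory with NLTK (needs nltk installed)."""
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader

    return CategorizedPlaintextCorpusReader(
        str(root),
        r"articles/.*\.txt",   # only the article files, not cats.txt
        cat_file="cats.txt",
        cat_delimiter="\t",    # portal names may contain spaces
        encoding="utf-8",
    )
```

With a corpus loaded this way, corpus.categories() should list the portals and corpus.words(categories=["Geografia"]) should return only words from pages in that portal. Calling sents may additionally require NLTK's Punkt tokenizer data. If this layout is good enough, a custom reader might not be needed at all.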