All scrapers should set document IDs in the form [datetime] - [parser_name] - [unique key] #58

Open
onyxfish opened this issue Aug 22, 2009 · 5 comments

Comments

@onyxfish
Owner

Here, [unique key] is whatever is appropriate to a given scraper. For Roll Call Vote scrapers this would be the roll number. For other scrapers it may be the title, or whatever else makes a given event unique.
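
To make the proposed format concrete, here is a minimal sketch of how a scraper might assemble such an ID. The function name `make_document_id`, the ISO 8601 timestamp format, and the parser name `HouseRollCallVotes` are assumptions for illustration only; the thread does not specify any of them.

```python
from datetime import datetime

# Separator taken from the format in the issue title.
SEPARATOR = " - "


def make_document_id(event_time, parser_name, unique_key):
    """Build an ID of the form [datetime] - [parser_name] - [unique key]."""
    # Assumption: an ISO 8601 style timestamp. The issue does not say
    # which datetime format the scrapers should use.
    timestamp = event_time.strftime("%Y-%m-%dT%H:%M:%SZ")
    # The unique key may be a roll number (int) or a title (str).
    return SEPARATOR.join([timestamp, parser_name, str(unique_key)])


# Hypothetical roll call vote: the roll number serves as the unique key.
doc_id = make_document_id(datetime(2009, 8, 22, 14, 30), "HouseRollCallVotes", 682)
print(doc_id)
```

A fixed-width timestamp at the front has the side benefit that IDs sort chronologically as plain strings.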

@onyxfish
Owner Author

This has now been documented in the Database Planning section of the wiki:
http://wiki.github.com/bouvard/votersdaily/database-planning

@onyxfish
Owner Author

Fixed for Python scrapers. This is definitely a much better way of identifying each document.

@chaunceyt
Collaborator

Fixed; closing.

@onyxfish
Owner Author

It looks like the scrapers are still pulling in branch and entity names, in the format: [datetime] - [parser_name] - [branch] - [entity] - [unique key]. Now that we are including the parser name, I think we should remove [branch] and [entity]. They only make the IDs longer, and I'm already a bit concerned that some of our URLs are going to be overly lengthy.

Also, for the Roll Call Votes scrapers, where there is a unique Vote Number, I really think we want to use that as the [unique key] portion rather than the title.
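
As a rough illustration of the change being proposed, the sketch below drops the [branch] and [entity] segments from an ID in the longer format. The helper name `shorten_document_id` and the sample values are hypothetical, and the code assumes the datetime, parser name, branch, and entity segments never contain the " - " separator themselves.

```python
SEPARATOR = " - "


def shorten_document_id(old_id):
    """Convert [datetime] - [parser_name] - [branch] - [entity] - [unique key]
    into [datetime] - [parser_name] - [unique key]."""
    # Split at most 4 times so a unique key that itself contains " - "
    # (for example a title) stays in one piece.
    parts = old_id.split(SEPARATOR, 4)
    if len(parts) != 5:
        raise ValueError("not a five-part document id: %r" % old_id)
    timestamp, parser_name, _branch, _entity, unique_key = parts
    return SEPARATOR.join([timestamp, parser_name, unique_key])


# Hypothetical long-form ID with a vote number as the unique key.
old_id = "2009-08-22T14:30:00Z - HouseRollCallVotes - Legislative - House - 682"
print(shorten_document_id(old_id))
```

Using the vote number instead of the title as the unique key also avoids the separator ambiguity entirely, since a number cannot contain " - ".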

Going to reopen this ticket, pending discussion.

@chaunceyt
Collaborator

Will work on this this week.
