Database Planning

Votersdaily uses CouchDB and includes two databases: vd_events, which contains the parsed events, and vd_logs, which contains records of scraping attempts. The following tables show the defined keys (the de facto schema) for these two databases. Any data relevant to an event that is not captured by a defined key may be stored in additional ad hoc keys. Unless otherwise specified, a value may not be null.

vd_events

IDs for events should be generated in the format datetime – parser_name – unique_key, where unique_key is whichever field in the document makes it unique among all others scraped from the same source. For pages where every event has a unique ID, such as a vote number, that field should be used. For many pages the best fit may be the title field. Whatever value is used, it should also be available as a field in the document body, either as one of the required fields or as an ad hoc field. Combining fields to generate a unique key is allowed, but discouraged.
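The ID format above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function name make_event_id, the unique_field parameter, and the plain " - " separator (a hyphen with spaces standing in for the dash shown above) are all assumptions.

```python
def make_event_id(event, unique_field="title", separator=" - "):
    """Build an ID of the form 'datetime - parser_name - unique_key'.

    `unique_field` names the document field that is unique per source;
    the separator is an assumed plain-hyphen rendering of the format.
    """
    unique_key = event[unique_field]  # must also exist in the document body
    return separator.join([event["datetime"], event["parser_name"], str(unique_key)])


# Hypothetical event that carries a vote number as an ad hoc field.
example_event = {
    "datetime": "2009-08-19T13:43:21Z",
    "parser_name": "house_votes",
    "title": "On Passage",
    "vote_number": 123,
}
example_id = make_event_id(example_event, unique_field="vote_number")
```

Because the unique key is read from the document itself, the rule that the value must also be available in the document body is satisfied by construction.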

Key	Type	Value
datetime	string ¹	the start date/time of the event
title	string	the title of the event
description	string or null ²	a description of the event, may include attendees, location, etc. (or null)
end_datetime	string ¹ or null ²	the date/time that the event ends (or null, see note below)
branch	string ³	the branch of government producing the event (e.g. “Legislative”)
entity	string ⁴	the entity producing the event (e.g. “House of Representatives”)
source_url	string	the url this event was scraped from
source_text	string	the block of text/HTML/XML from which this event was scraped (e.g. the innerHTML property of a div, tr, etc. which encloses all the event’s details)
access_datetime	string ¹	the date/time that the source was scraped
parser_name	string	the name of the parser that scraped this event
parser_version	string	the version of the parser that scraped this event
event_url	string or null ²	the url to a page uniquely associated with this event, if available (a “link back”)
source_timezone	string	the timezone of the event locale in “America/New_York” format
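As a rough sketch of how a parser might check a document against the table above before inserting it, the following treats every defined key as required and allows null only where the table says so. The names EVENT_KEYS and validate_event are hypothetical, and the check is deliberately shallow (it does not verify datetime formats or branch/entity names).

```python
# Defined keys for vd_events, mapped to whether null is allowed.
EVENT_KEYS = {
    "datetime": False,
    "title": False,
    "description": True,
    "end_datetime": True,
    "branch": False,
    "entity": False,
    "source_url": False,
    "source_text": False,
    "access_datetime": False,
    "parser_name": False,
    "parser_version": False,
    "event_url": True,
    "source_timezone": False,
}


def validate_event(doc):
    """Return a list of problems; an empty list means the document passes."""
    problems = []
    for key, nullable in EVENT_KEYS.items():
        if key not in doc:
            problems.append("missing key: " + key)
        elif doc[key] is None:
            if not nullable:
                problems.append("null not allowed: " + key)
        elif not isinstance(doc[key], str):
            problems.append("not a string: " + key)
    return problems  # ad hoc keys are permitted, so extra keys are ignored
```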

vd_logs

IDs for logs should be generated in the format access_datetime – parser_name – result.

Key	Type	Value
parser_name	string	the name of the parser that was run
parser_version	string	the version of the parser that was run
parser_runtime	float	the amount of time it took the parser to run, in seconds
source_url	string	the url that was accessed
source_text	string	the complete text/HTML/XML/whatever that was retrieved from that URL
access_datetime	string ¹	the date/time that the source was accessed
insert_count	int	the number of (new) documents inserted into the database on this run
result	string	the result of the parser run (either “success” or the name of the exception that ended the process)
traceback	string	the traceback of any exception that was thrown (on error only)
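One way a log record with these keys could be assembled around a parser run is sketched below. It assumes a hypothetical scrape callable that returns a (source_text, insert_count) pair, a plain " - " separator in the ID, and an empty source_text when the run fails before anything is retrieved; none of these details come from the project itself. The _id key is the standard CouchDB document ID field.

```python
import time
import traceback
from datetime import datetime, timezone


def run_and_log(parser_name, parser_version, source_url, scrape):
    """Run `scrape(source_url)` and return a vd_logs-style record."""
    access_datetime = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    started = time.perf_counter()
    log = {
        "parser_name": parser_name,
        "parser_version": parser_version,
        "source_url": source_url,
        "source_text": "",   # assumption: empty if nothing was retrieved
        "access_datetime": access_datetime,
        "insert_count": 0,
    }
    try:
        log["source_text"], log["insert_count"] = scrape(source_url)
        log["result"] = "success"
    except Exception as exc:
        log["result"] = type(exc).__name__          # name of the exception
        log["traceback"] = traceback.format_exc()   # recorded on error only
    log["parser_runtime"] = time.perf_counter() - started  # seconds, float
    log["_id"] = " - ".join([access_datetime, parser_name, log["result"]])
    return log
```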

Footnotes

¹ All datetimes should be encoded as ISO 8601 in UTC, accurate to the second, e.g. “2009-08-19T13:43:21Z”. If an event has no time component (e.g. “on December 13th”), the time fields should be set to 0, as in “2009-08-19T00:00:00Z”. (Be sure to convert your source’s timezone to UTC!)
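The conversion described in this footnote might look like the following sketch, which uses the standard-library zoneinfo module (Python 3.9+, with timezone data available on the system). The helper names are hypothetical. Leaving date-only values at midnight without any timezone shift is one reading of the example above, not a confirmed rule.

```python
from datetime import date, datetime, timezone
from zoneinfo import ZoneInfo

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def to_utc_iso(naive_local, tz_name):
    """Interpret a naive datetime in the source's timezone and render it in UTC."""
    local = naive_local.replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc).strftime(ISO_FORMAT)


def date_only_iso(day):
    """Render a date with no time component, zeroing the time fields."""
    return day.strftime("%Y-%m-%d") + "T00:00:00Z"


# 9:43:21 AM Eastern (daylight time, UTC-4) on 19 August 2009.
stamp = to_utc_iso(datetime(2009, 8, 19, 9, 43, 21), "America/New_York")
```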

² For any field where null is allowed, null indicates that the source provides no data appropriate to that field. An empty string indicates that the source normally provides data for that field but that, for whatever reason (data entry error, malformed source, etc.), this individual item has none.
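The null versus empty-string distinction can be made explicit in a parser with a small helper such as the one below. The name optional_field and its two arguments are illustrative assumptions; the point is only that None (stored as null) and "" carry different meanings.

```python
def optional_field(source_provides_field, raw_value):
    """Return the value to store in a nullable field.

    None -> the source never supplies this kind of data.
    ""   -> the source normally supplies it, but this item has none.
    """
    if not source_provides_field:
        return None
    return (raw_value or "").strip()
```

For example, a parser for a source whose pages have no description column would pass False for every event, while a parser for a source that has the column would pass True and let blank cells come through as empty strings.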

³ See Standard Branch Names.

⁴ See Standard Entity Names.
