Database Planning
Votersdaily uses CouchDB and includes two databases: vd_events, which contains the parsed events, and vd_log, which contains records of scraping attempts. The following tables show the defined keys (de facto schema) for these two databases, vd_events first and vd_log second. Any data relevant to an event that is not captured by a defined key may be stored in additional ad hoc keys. Unless otherwise specified, a value may not be null.
vd_events

Key | Type | Value |
---|---|---|
datetime | string 1 | the start date/time of the event |
title | string | the title of the event |
description | string or null 2 | a description of the event, which may include attendees, location, etc. (or null) |
end_datetime | string 1 or null 2 | the date/time that the event ends (or null, see note below) |
branch | string 3 | the branch of government producing the event (e.g. “Legislative”) |
entity | string 4 | the entity producing the event (e.g. “House of Representatives”) |
source_url | string | the URL this event was scraped from |
source_text | string | the block of text/HTML/XML from which this event was scraped (e.g. the innerHTML property of a div, tr, etc. which encloses all the event’s details) |
access_datetime | string 1 | the date/time that the source was scraped |
parser_name | string | the name of the parser that scraped this event |
parser_version | string | the version of the parser that scraped this event |
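
The rules in the vd_events table can be checked mechanically before a document is saved. The following is a minimal sketch, not part of Votersdaily itself: the function name `validate_event` and the sample values are hypothetical, and it assumes that every defined key must be present (the source only states that values may not be null unless otherwise specified). Ad hoc keys are ignored.

```python
import re

# ISO 8601 UTC with accuracy to the second, e.g. "2009-08-19T13:43:21Z" (note 1).
ISO_UTC = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")

# Defined keys from the vd_events table, grouped by the rule that applies.
DATETIME_KEYS = {"datetime", "access_datetime", "end_datetime"}
NULLABLE_KEYS = {"description", "end_datetime"}
ALL_KEYS = DATETIME_KEYS | NULLABLE_KEYS | {
    "title", "branch", "entity", "source_url", "source_text",
    "parser_name", "parser_version",
}


def validate_event(doc):
    """Return a list of problems found in an event document (empty if none)."""
    problems = []
    for key in sorted(ALL_KEYS):
        if key not in doc:
            # Assumption: every defined key is required.
            problems.append(f"missing key: {key}")
            continue
        value = doc[key]
        nullable = key in NULLABLE_KEYS
        if value is None:
            if not nullable:
                problems.append(f"{key} may not be null")
            continue
        if not isinstance(value, str):
            problems.append(f"{key} must be a string")
            continue
        if value == "" and nullable:
            continue  # an empty string is meaningful in nullable fields (note 2)
        if key in DATETIME_KEYS and not ISO_UTC.fullmatch(value):
            problems.append(f"{key} is not an ISO 8601 UTC datetime")
    return problems


# Hypothetical sample document; all values are made up for illustration.
event = {
    "datetime": "2009-08-19T13:43:21Z",
    "title": "Example hearing",
    "description": None,
    "end_datetime": None,
    "branch": "Legislative",
    "entity": "House of Representatives",
    "source_url": "http://example.com/calendar",
    "source_text": "<tr><td>Example hearing</td></tr>",
    "access_datetime": "2009-08-18T09:00:00Z",
    "parser_name": "example_parser",
    "parser_version": "0.1",
}
print(validate_event(event))
```

A parser could call a check like this just before writing to vd_events and record any problems instead of storing a malformed document.
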
vd_log

Key | Type | Value |
---|---|---|
parser_name | string | the name of the parser that was run |
parser_version | string | the version of the parser that was run |
parser_runtime | float | the amount of time it took the parser to run, in seconds |
source_url | string | the URL that was accessed |
source_text | string | the complete text (HTML, XML, or any other format) that was retrieved from that URL |
access_datetime | string 1 | the date/time that the source was accessed |
result | string | the result of the parser run (either “success” or the name of the exception that ended the process) |
traceback | string | the traceback of any exception that was thrown (on error only) |
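
One way to produce a record with the vd_log keys is to wrap each parser run in a small helper that times it and captures any exception. This is only a sketch under stated assumptions: `run_and_log`, `fetch`, and `parse` are hypothetical names rather than Votersdaily functions, and leaving `source_text` as an empty string when the fetch itself fails is a choice made here, not something the schema specifies.

```python
import time
import traceback
from datetime import datetime, timezone


def run_and_log(parser_name, parser_version, source_url, fetch, parse):
    """Run one scrape and return a log document using the vd_log keys.

    `fetch(url)` returns the source text and `parse(text)` extracts events;
    both are hypothetical callables supplied by the caller.
    """
    log = {
        "parser_name": parser_name,
        "parser_version": parser_version,
        "source_url": source_url,
        "source_text": "",  # assumption: stays empty if the fetch fails
        "access_datetime": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "result": "success",
    }
    started = time.perf_counter()
    try:
        log["source_text"] = fetch(source_url)
        parse(log["source_text"])
    except Exception as exc:
        # On error, "result" holds the exception name and "traceback" is added.
        log["result"] = type(exc).__name__
        log["traceback"] = traceback.format_exc()
    log["parser_runtime"] = time.perf_counter() - started  # seconds, as a float
    return log
```

The `traceback` key is set only when an exception occurs, matching the "on error only" wording in the table.
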
1 All datetimes should be encoded as ISO 8601 UTC with accuracy to the second, e.g. “2009-08-19T13:43:21Z”. If an event does not have a time component (e.g. “on December 13th”), the time fields should be zero, as in this example: “2009-08-19T00:00:00Z”. (Be sure to convert your source's time zone to UTC!)
2 For any field where null is allowed, null indicates that the source provides no data appropriate to that field. An empty string in these fields indicates that the source normally provides data for that field, but for whatever reason (data entry error, malformed source, etc.) that individual item has none.
3 See Standard Branch Names.
4 See Standard Entity Names.
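
The datetime encoding described in note 1 can be sketched with Python's standard library. The helper names `to_iso_utc` and `date_only_to_iso` are hypothetical, and the sketch assumes the caller already knows the source's offset from UTC as a fixed number of hours.

```python
from datetime import date, datetime, time, timedelta, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def to_iso_utc(local_dt, utc_offset_hours):
    """Convert a naive local datetime to an ISO 8601 UTC string.

    `utc_offset_hours` is the source's offset from UTC, e.g. -4 for
    US Eastern Daylight Time (an assumption the caller must supply).
    """
    source_tz = timezone(timedelta(hours=utc_offset_hours))
    aware = local_dt.replace(tzinfo=source_tz)
    return aware.astimezone(timezone.utc).strftime(ISO_FORMAT)


def date_only_to_iso(day):
    """Encode a date with no time component, setting the time fields to zero."""
    return datetime.combine(day, time(0, 0, 0)).strftime(ISO_FORMAT)


print(to_iso_utc(datetime(2009, 8, 19, 9, 43, 21), -4))  # 2009-08-19T13:43:21Z
print(date_only_to_iso(date(2009, 8, 19)))               # 2009-08-19T00:00:00Z
```

A fixed offset does not follow daylight saving changes. For sources that publish local times year-round, the standard `zoneinfo` module (Python 3.9+) can supply a named time zone in place of the fixed offset.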