bouvard edited this page Sep 13, 2010 · 25 revisions

Votersdaily uses CouchDB and includes two databases: vd_events, which contains the parsed events, and vd_log, which contains records of scraping attempts. The following tables show the defined keys (the de facto schema) for these two databases. Any data relevant to an event that is not captured by a defined key may be stored in additional ad hoc keys. Unless otherwise specified, a value may not be null.

vd_events

| Key | Value |
| --- | --- |
| datetime | the start date/time of the event |
| title | the title of the event |
| description | a description of the event; may include attendees, location, etc. (or null) |
| end_datetime | the date/time that the event ends (or null, see note below) |
| branch | the branch of government producing the event (e.g. “Legislative”) |
| entity | the entity producing the event (e.g. “House of Representatives”) |
| source_url | the URL this event was scraped from |
| source_text | the block of text/HTML/XML from which this event was scraped (e.g. the innerHTML property of a div, tr, etc. which encloses all the event’s details) |
| access_datetime | the date/time that the source was scraped |
| parser_name | the name of the parser that scraped this event |
| parser_version | the version of the parser that scraped this event |
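
As a rough illustration of the vd_events keys, the sketch below checks a document for the defined keys and for nulls in fields that do not allow them. The function name `validate_event`, the example values, and the ISO 8601 date strings are assumptions made for this example; the schema above does not specify a date/time format or a validation step.

```python
# Keys defined for vd_events. Only the keys in NULLABLE_KEYS may be null.
EVENT_KEYS = (
    "datetime", "title", "description", "end_datetime", "branch", "entity",
    "source_url", "source_text", "access_datetime", "parser_name",
    "parser_version",
)
NULLABLE_KEYS = {"description", "end_datetime"}


def validate_event(doc):
    """Return a list of problems found in an event document (empty if none).

    Hypothetical helper: extra ad hoc keys are allowed and ignored.
    """
    problems = []
    for key in EVENT_KEYS:
        if key not in doc:
            problems.append("missing key: " + key)
        elif doc[key] is None and key not in NULLABLE_KEYS:
            problems.append("null not allowed: " + key)
    return problems


# Invented example document; every value here is illustrative only,
# and the date/time string format is an assumption.
example_event = {
    "datetime": "2010-09-13T10:00:00",
    "title": "Example committee hearing",
    "description": None,
    "end_datetime": None,
    "branch": "Legislative",
    "entity": "House of Representatives",
    "source_url": "http://example.com/schedule",
    "source_text": "<tr><td>Example committee hearing</td></tr>",
    "access_datetime": "2010-09-12T23:15:00",
    "parser_name": "example_parser",
    "parser_version": "0.1",
    "room": "Example ad hoc key",
}
```

Calling `validate_event(example_event)` returns an empty list, since every defined key is present and only nullable keys are null.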

For end_datetime, description, and any other field where null is allowed, null indicates that the source does not provide data appropriate to that field. An empty string indicates that the source normally provides data for that field, but the individual item has none for whatever reason (data entry error, incomplete source data, etc.).
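
The distinction between null and an empty string could be handled by a consumer along the following lines. This is a minimal sketch; the function name `describe_optional_field` and its return strings are hypothetical and not part of Votersdaily.

```python
def describe_optional_field(doc, key):
    """Classify the value of a nullable field in an event document."""
    value = doc[key]
    if value is None:
        # null: the source does not provide this kind of data at all.
        return "not provided by source"
    if value == "":
        # Empty string: the source normally provides it, but not for this item.
        return "missing for this item"
    return "present"


# Invented fragments of event documents, for illustration only.
no_end_times = {"end_datetime": None}
blank_end_time = {"end_datetime": ""}
```

Here `describe_optional_field(no_end_times, "end_datetime")` reports that the source never lists end times, while the same call on `blank_end_time` reports a gap in one particular item.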

vd_log

| Key | Value |
| --- | --- |
| parser_name | the name of the parser that was run |
| parser_version | the version of the parser that was run |
| parser_runtime | the amount of time it took the parser to run, in seconds |
| source_url | the URL that was accessed |
| source_text | the complete text/HTML/XML/etc. that was retrieved from that URL |
| access_datetime | the date/time that the source was accessed |
| result | the result of the parser run (either “success” or the name of the exception that ended the process) |
| traceback | the traceback of any exception that was thrown (on error only) |
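
One way a vd_log record might be assembled is sketched below, assuming the parsers are Python callables that raise an exception on failure. The function `run_and_log` and its parameters are hypothetical, the source is assumed to have been fetched already, and the result is a plain dict rather than a document saved to CouchDB.

```python
import time
import traceback


def run_and_log(parser, parser_name, parser_version,
                source_url, source_text, access_datetime):
    """Run parser(source_text) and return a dict shaped like a vd_log record."""
    record = {
        "parser_name": parser_name,
        "parser_version": parser_version,
        "source_url": source_url,
        "source_text": source_text,
        "access_datetime": access_datetime,
    }
    start = time.time()
    try:
        parser(source_text)
        record["result"] = "success"
    except Exception as exc:
        # On error, store the exception name and its traceback.
        record["result"] = type(exc).__name__
        record["traceback"] = traceback.format_exc()
    # Runtime in seconds, recorded whether or not the parser succeeded.
    record["parser_runtime"] = time.time() - start
    return record
```

The traceback key is only set when an exception is caught, which matches the "on error only" note in the table.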