Extract, transform, and load data from court calendars #1
Comments
example pdf content:
observations:
A feature probably worth looking into would be diffing the PDFs since, as you've noted, they can be and often are replaced by newer versions. Here are a few libraries worth considering:
Thanks @todrobbins, I will consider the diff strategy. I imagine it could reduce processing time, if that becomes an issue.
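As a rough illustration of how a change check could cut processing time, here is a minimal sketch that fingerprints the raw bytes of each downloaded PDF and skips files whose content has not changed since the last run. This is a cheaper stand-in for a true diff, not one of the libraries mentioned above. The function names and the in-memory `seen` mapping are hypothetical; a real service would presumably keep the fingerprint alongside the calendar record.

```python
import hashlib


def pdf_fingerprint(pdf_bytes: bytes) -> str:
    """Return a SHA-256 hex digest of the raw PDF bytes."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def needs_processing(url: str, pdf_bytes: bytes, seen: dict) -> bool:
    """Return True if this URL is new or its bytes changed since last seen.

    `seen` (hypothetical) maps URL -> last fingerprint and is updated in place.
    """
    fingerprint = pdf_fingerprint(pdf_bytes)
    if seen.get(url) == fingerprint:
        return False  # same bytes at the same URL: nothing to re-parse
    seen[url] = fingerprint
    return True


seen = {}
print(needs_processing("https://example.com/calendar.pdf", b"version 1", seen))
print(needs_processing("https://example.com/calendar.pdf", b"version 1", seen))
print(needs_processing("https://example.com/calendar.pdf", b"version 2", seen))
```

One caveat: this compares bytes, not hearings. A PDF that is regenerated with identical hearings but different embedded metadata would still count as changed, so the check can only skip work, never hide an update.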
more observations:
debugging the extraction process:
SELECT
c.id AS court_id
,c.type AS court_type
,c.name AS court_name
,cal.id AS calendar_id
-- ,cal.url AS calendar_url
,cal.created_at::DATE AS upload_date
-- ,cal.modified_at::DATE AS calendar_modified_date
-- ,cal.requested_at::DATE AS calendar_requested_date
,cal.page_count
,count(DISTINCT p.id) AS persisted_page_count
,count(DISTINCT p.jurisdiction) AS jurisdiction_count
,max(p.number) AS max_page_number
,count(DISTINCT court_day) AS day_count
,count(DISTINCT court_date) AS date_count
,count(DISTINCT judge_name) AS judge_count
,count(DISTINCT court_room) AS room_count
FROM utah_courts c
LEFT JOIN utah_court_calendars cal ON cal.utah_court_id = c.id
LEFT JOIN utah_court_calendar_pages p ON p.utah_court_calendar_id = cal.id
-- WHERE cal.created_at <> cal.modified_at -- zero rows
GROUP BY 1,2,3,4,5,6
HAVING cal.page_count <> count(DISTINCT p.id) -- IDENTIFIES PDF FILES WHICH FAIL THE TEST
ORDER BY cal.id;
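To make the check in the query above concrete, here is a small sketch that runs a pared-down version of it against an in-memory SQLite database. The original uses a `::DATE` cast, which suggests PostgreSQL, so this is a simplified stand-in rather than the same statement. The table and column names come from the query; the sample rows and the `mismatched_calendars` helper are made up for illustration.

```python
import sqlite3


def mismatched_calendars(conn):
    """Return (calendar_id, page_count, persisted_page_count) for calendars
    whose declared page count differs from the number of stored pages."""
    return conn.execute(
        """
        SELECT cal.id, cal.page_count, count(DISTINCT p.id) AS persisted_page_count
        FROM utah_court_calendars cal
        LEFT JOIN utah_court_calendar_pages p ON p.utah_court_calendar_id = cal.id
        GROUP BY cal.id, cal.page_count
        HAVING cal.page_count <> count(DISTINCT p.id)
        ORDER BY cal.id
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE utah_court_calendars (id INTEGER PRIMARY KEY, page_count INTEGER);
    CREATE TABLE utah_court_calendar_pages (
        id INTEGER PRIMARY KEY, utah_court_calendar_id INTEGER, number INTEGER);
    -- Illustrative data: calendar 1 is complete, calendar 2 is missing pages.
    INSERT INTO utah_court_calendars VALUES (1, 2), (2, 3);
    INSERT INTO utah_court_calendar_pages VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1);
    """
)
print(mismatched_calendars(conn))  # [(2, 3, 1)]
```

Calendar 2 declares three pages but only one was persisted, so it is the only row returned. A check like this could run after each extraction pass to flag PDFs that need another look.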
s2t2 pushed a commit that referenced this issue on Feb 22, 2016:
…oses #1 with the promise for more data validation
To provide a comprehensive list of court hearings without expedient access to the underlying systems that generate the statewide court calendar .pdf documents, the service should parse/scrape each .pdf document and store the resulting data in the database.
These PDFs are updated daily, sometimes replacing previous versions at the same URL and sometimes appearing at new URLs. Example .pdfs include:
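The parse-and-store step described above might look roughly like the following sketch. It assumes a pluggable `extract_pages` callable (hypothetical) that turns PDF bytes into a list of page texts, and it uses SQLite in place of the project's actual database. The `utah_court_id`, `url`, `page_count`, and `number` columns mirror names used elsewhere in this thread; the `body` column and the rest of the schema are assumptions.

```python
import sqlite3


def load_calendar(conn, court_id, url, pdf_bytes, extract_pages):
    """Store one calendar row plus one row per extracted page; return the calendar id."""
    pages = extract_pages(pdf_bytes)  # hypothetical extractor: bytes -> list of page texts
    with conn:  # commits on success, rolls back if any insert raises
        cur = conn.execute(
            "INSERT INTO utah_court_calendars (utah_court_id, url, page_count) "
            "VALUES (?, ?, ?)",
            (court_id, url, len(pages)),
        )
        calendar_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO utah_court_calendar_pages "
            "(utah_court_calendar_id, number, body) VALUES (?, ?, ?)",
            [(calendar_id, n, text) for n, text in enumerate(pages, start=1)],
        )
    return calendar_id


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE utah_court_calendars (
        id INTEGER PRIMARY KEY, utah_court_id INTEGER, url TEXT, page_count INTEGER);
    CREATE TABLE utah_court_calendar_pages (
        id INTEGER PRIMARY KEY, utah_court_calendar_id INTEGER, number INTEGER, body TEXT);
    """
)


def fake_extract(pdf_bytes):
    # Stand-in extractor for the demo: treats form feeds as page breaks.
    return pdf_bytes.decode().split("\f")


calendar_id = load_calendar(
    conn, 1, "https://example.com/calendar.pdf", b"page one\fpage two", fake_extract
)
```

Keeping the extractor as a parameter means the storage logic can be exercised without a real PDF library. In practice `page_count` would more likely come from the PDF itself, so that a later comparison against stored pages can catch extraction failures.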