Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract, transform, and load data from court calendars #1

Closed
s2t2 opened this issue Feb 18, 2016 · 5 comments
Closed

Extract, transform, and load data from court calendars #1

s2t2 opened this issue Feb 18, 2016 · 5 comments
Assignees

Comments

@s2t2
Copy link
Member

s2t2 commented Feb 18, 2016

In order to access a comprehensive list of court hearings, given lack of expedient access to underlying systems which generate these statewide court calendar .pdf documents, the service should parse/scrape each .pdf document and store resulting data in the database.

These pdfs are updated daily, sometimes replacing previous versions at the same URL, and sometimes creating new urls. Example .pdfs include:

@s2t2
Copy link
Member Author

s2t2 commented Feb 18, 2016

example pdf content:

Salt Lake County Justice Court

 SHAUNA GRAVES-ROBERTSON February 17, 2016
 Judge Robertson Wednesday
 08:30 AM PRETRIAL CONFERENCE S18 151601829 Other Misdemeanor
 STATE OF UTAH ATTY:
 VS.
 AVILA, FLORENCIO ELVIS ATTY: FOWLER, AMY N
 OTN: DOB: 02/28/1978

 MB - RETAIL THEFT (SHOPLIFTING) - 12/12/15
 CITATION #: E10805208 SHERIFF #: LEA #: CO2015E10805208 E
 > NO OTN NUMBER <
 > CASE EFILED <
 ------------------------------------------------------------------------------
 PRETRIAL CONFERENCE S18 151601686 Other Misdemeanor
 STATE OF UTAH ATTY:
 VS.
 BEE, CHRISTOPHER ALLEN ATTY: FOWLER, AMY N
 OTN: 49028467 DOB: 12/02/1983

 MB - ASSAULT - 10/13/15
 IN - DISORDERLY CONDUCT - 10/13/15
 CITATION #: E10799580 SHERIFF #: LEA #: CO2015E10799580 E
 > CASE EFILED <
 ------------------------------------------------------------------------------
 ********* S18 031601507 Other Misdemeanor
 *********, ********* ATTY:
 VS.
 *********, ********* ATTY: CORUM, PATRICK W
 > CASE INVOLVES DOMESTIC VIOLENCE <
 CITATION #: J5533062 SHERIFF #: LEA #:
 > NO OTN NUMBER <
 ------------------------------------------------------------------------------
 ********* S18 051602452 Other Misdemeanor
 *********, ********* ATTY:
 VS.
 *********, ********* ATTY: SMITH, GREGORY B
 CITATION #: SHERIFF #: LEA #:
 > NO OTN NUMBER <
 ------------------------------------------------------------------------------
 BENCH WARRANT U00 155601786 Misdemeanor DUI
 STATE OF UTAH ATTY:
 VS.
 CHRISTENSEN, SHADEAU SCOTT ATTY:
 OTN: 48832117 DOB: 09/29/1993

 MB - IMPAIRED DRIVING - 05/02/15
 CITATION #: D122824774 SHERIFF #: LEA #: 15SL02833
 > CASE EFILED <
 > WARRANT OUTSTANDING <
 > FTA ISSUED <
 ------------------------------------------------------------------------------
Page 1 of 216
 Salt Lake County Justice Court

 SHAUNA GRAVES-ROBERTSON February 17, 2016
 Judge Robertson Wednesday
 08:30 AM PRETRIAL CONFERENCE B/W S18 151600992 Other Misdemeanor
 STATE OF UTAH ATTY:
 VS.
 CHRISTIANSEN, MIRRYAN JO ATTY: FOWLER, AMY N
 OTN: DOB: 06/19/1994

 MB - POSS <1 OZ MARIJUANA, SPICE, CONTROLLED
 SUBSTANCE - 07/05/15
 MB - USE OR POSSESSION OF DRUG PARAPHERNALIA - 07/05/15
 CITATION #: E10792181 SHERIFF #: LEA #: CO2015E10792181 E
 > NO OTN NUMBER <
 > CASE EFILED <
 > WARRANT OUTSTANDING <
 ------------------------------------------------------------------------------
 BENCH WARRANT U13 115600799 Misdemeanor DUI
 STATE OF UTAH ATTY:
 VS.
 FABER, ANDREW G ATTY: ORIFICI, JOSEPH F
 CAINE, NICHOLAS E
 OTN: 36417475 DOB: 03/04/1979

 MB - DRIVING UNDER THE INFLUENCE OF ALC/DRUGS - 12/30/10
 CITATION #: 11000497 SHERIFF #: 259765 LEA #: 10DI1455
 > WARRANT OUTSTANDING <
 > FTA ISSUED <
 ------------------------------------------------------------------------------
 BENCH WARRANT U13 105607280 Misdemeanor DUI
 STATE OF UTAH ATTY:
 VS.
 FINLAYSON, RYAN MALCOM ATTY: SINGLETON, LACEY C
 OTN: 34426353 DOB: 11/16/1988

 MB - IMPAIRED DRIVING - 07/18/10
 CITATION #: 10028792 SHERIFF #: 323261 LEA #: 10DI00965
 ------------------------------------------------------------------------------
 PRETRIAL CONFERENCE S18 155603938 Traffic Court Case
 STATE OF UTAH ATTY:
 VS.
 GHULAM-SARWAR, TOFIQ ATTY: FOWLER, AMY N
 OTN: 49056989 DOB: 05/31/1994

 MC - IGNITION INTERLOCK VIOLATION - 10/06/15
 MC - SPEEDING IN A SCHOOL ZONE - 32/20 - 10/06/15
 CITATION #: E10798943 SHERIFF #: LEA #: CO2015E10798943 E
 > CASE EFILED <
 > WARRANT OUTSTANDING <
 > FTA ISSUED <
 ------------------------------------------------------------------------------
Page 2 of 216

observations:

  • documents may contain schedule information for multiple days
  • the hearing date appears at the top-right corner of each page
  • the judge name appears at the top-left corner of each page
  • the court name appears at the top-center of each page
  • each hearing looks to be separated by a line of -------------
  • not all hearings have citation numbers
  • not all hearings associate the defendant with an offender tracking number. such occurances are marked with a flag/note of > NO OTN NUMBER <
  • our primary metric of interest is Failure to Appear (FTA). flags/notes about previous defendant FTAs appear as > FTA ISSUED <

@todrobbins
Copy link
Member

Probably a feature worth looking into would be diffs of the PDFs as you've noted they can and are often replaced by newer versions.

Here's a few libraries worth considering:

@s2t2
Copy link
Member Author

s2t2 commented Feb 18, 2016

thanks @todrobbins i will consider the diff strategy. i imagine it may have the potential to reduce processing time, if that becomes an issue.

@s2t2 s2t2 self-assigned this Feb 18, 2016
@s2t2 s2t2 added this to the Upcoming Hearings milestone Feb 18, 2016
@s2t2
Copy link
Member Author

s2t2 commented Feb 20, 2016

more observations:

  • some pages represent multiple court dates with no activity
  • most pages include the session start time for the first event on the page, but not for subsequent events (see below)

screenshot 2016-02-20 15 23 36

  • some pages have more than one session start time (see below)

screenshot 2016-02-20 15 17 41

  • some cases/events span across multiple pages (see below)

screenshot 2016-02-20 13 50 39

@s2t2
Copy link
Member Author

s2t2 commented Feb 20, 2016

debugging the extraction process:

SELECT 
  c.id AS court_id
  ,c.type AS court_type
  ,c.name AS court_name
  ,cal.id AS calendar_id
  -- ,cal.url AS calendar_url
  ,cal.created_at::DATE AS upload_date
  -- ,cal.modified_at::DATE AS calendar_modified_date
  -- ,cal.requested_at::DATE AS calendar_requested_date
  ,cal.page_count

  ,count(DISTINCT p.id) AS persisted_page_count
  ,count(DISTINCT p.jurisdiction) AS jurisdiction_count
  ,max(p.number) AS max_page_number

  ,count(DISTINCT court_day) AS day_count
  ,count(DISTINCT court_date) AS date_count

  ,count(DISTINCT judge_name) AS judge_count
  ,count(DISTINCT court_room) AS room_count

FROM utah_courts c
LEFT JOIN utah_court_calendars cal ON cal.utah_court_id = c.id
LEFT JOIN utah_court_calendar_pages p ON p.utah_court_calendar_id = cal.id
-- WHERE cal.created_at <> cal.modified_at -- zero rows
GROUP BY 1,2,3,4,5,6
HAVING cal.page_count <> count(DISTINCT p.id) -- IDENTIFIES PDF FILES WHICH FAIL THE TEST
ORDER BY cal.id

=>

court_id court_type court_name calendar_id upload_date page_count persisted_page_count jurisdiction_count max_page_number day_count date_count judge_count room_count
29 DistrictCourt Salt Lake City 15 2016-02-19 1372 1370 1 1372 5 11 40 35
38 JusticeCourt Aurora 20 2016-02-17 1 0 0 NULL 0 0 0 0
37 JusticeCourt Alta 74 2016-02-17 1 0 0 NULL 0 0 0 0
53 JusticeCourt Enterprise 82 2016-02-17 1 0 0 NULL 0 0 0 0
55 JusticeCourt Escalante 83 2016-02-17 1 0 0 NULL 0 0 0 0
112 JusticeCourt Payson 88 2016-02-17 52 51 1 52 3 10 1 1
137 JusticeCourt Spring City 101 2016-02-17 1 0 0 NULL 0 0 0 0
139 JusticeCourt Stockton 102 2016-02-17 1 0 0 NULL 0 0 0 0
73 JusticeCourt Hildale 117 2016-02-17 1 0 0 NULL 0 0 0 0
154 JusticeCourt Wayne County 157 2016-02-17 1 0 0 NULL 0 0 0 0

see also #1 and #2 for resolution

s2t2 pushed a commit that referenced this issue Feb 22, 2016
…oses #1 with the promise for more data validation
@s2t2 s2t2 closed this as completed in def0b58 Feb 24, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants
@todrobbins @s2t2 and others