05 participants #26

Open · tgj505 wants to merge 20 commits into main
Conversation

@tgj505 (Collaborator) commented Jul 20, 2023

A draft for the task in #4. Three main places need attention:

  • task.py is not currently using the error table implemented in #22 (Move error logging out of cases table).
  • the parsing itself is challenging since the data are so messy; most notably, determining the name and organization for a given case is often ambiguous.
  • testing could be more robust.

@tgj505 requested review from akgerber, neverett, and eenblam on August 3, 2023
@neverett (Collaborator) left a comment:

Left some feedback, and I assume you'll be adding to & cleaning up the testing code a bit more, but overall this looks great, ty!

AND c.participants_raw <> ''
AND e.participants_parse_error IS NULL
OR e.participants_parse_error = true
limit 1000;
Collaborator commented on the snippet above:

was this limit in place for testing? if so, you'll want to remove it before merging
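
If the LIMIT is still handy for local runs, one option (not something in the PR; the flag name below is made up) is to gate it behind a debug setting so it can't slip into a full run:

```python
# Hypothetical sketch: keep a small LIMIT available for local testing without
# hard-coding it into the query. DEBUG_ROW_LIMIT is not a real setting in this
# repo, just an illustration.
DEBUG_ROW_LIMIT = None  # set to e.g. 1000 while testing locally

query = participants_query
if DEBUG_ROW_LIMIT is not None:
    query += f" LIMIT {DEBUG_ROW_LIMIT}"
c.execute(query)
```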

c = cnx.cursor()
c.execute("select count(*) from pages;")
t = time.time() - t1
part_rate = round((n - c.rowcount) / t, 2)
Collaborator commented on the snippet above:

sqlite3 connection objects don't have a rowcount attribute, so you'll need to check the db type here, similar to line 33 above
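
The comment above suggests branching on db type; as an alternative, here is a sketch (not the PR's code) that sidesteps rowcount entirely by reading the count(*) result directly, which works with both sqlite3 and MySQL cursors:

```python
# Sketch only: count(*) always returns exactly one row, so fetchone() gives
# the value on either driver without relying on the cursor's rowcount.
c = cnx.cursor()
c.execute("select count(*) from pages;")
(pages_count,) = c.fetchone()
```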

def html_raw_participants(html_str: str) -> list:
"""
Reads in an html string from the `raw_text` column in the `pages` table,
finds the participants table, collects the rows in the table,
Collaborator commented on the snippet above:

very minor suggestion, but you may want to distinguish between database and HTML tables in comments for clarity, e.g. "Reads in an html string from the raw_text column in the pages database table, finds the participants HTML table in the string, collects the rows in the HTML table...."
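
Applied to the docstring, that suggestion might read roughly like this (wording adapted from the comment above, not the PR's final text):

```python
def html_raw_participants(html_str: str) -> list:
    """
    Reads in an HTML string from the `raw_text` column of the `pages`
    database table, finds the participants HTML table in that string, and
    collects the rows of that HTML table.
    """
```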

def add_participant_row(case_id: int, r: list):
# insert relevant info to participants table in the db
try:
if db_config.db_type == "sqlite":
Collaborator commented on the snippet above:

this if/elif block that sets the query can be moved above and outside of the try/except block, since it's unlikely to cause an exception
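
A rough sketch of that restructuring (column names and placeholders are stand-ins, not the PR's actual schema):

```python
def add_participant_row(case_id: int, r: list) -> None:
    # Build the INSERT statement first; choosing a string can't realistically
    # raise, so it doesn't need to sit inside the try/except.
    if db_config.db_type == "sqlite":
        query = "INSERT INTO participants (case_id, name, role) VALUES (?, ?, ?)"
    elif db_config.db_type == "mysql":
        query = "INSERT INTO participants (case_id, name, role) VALUES (%s, %s, %s)"
    else:
        raise ValueError(f"unsupported db_type: {db_config.db_type}")

    try:
        # only the database call stays wrapped
        c.execute(query, (case_id, *r))
        cnx.commit()
    except Exception:
        # keep whatever error handling the PR already does here
        raise
```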


def main():
participants_query = """
SELECT c.id as case_id, c.case_number, c.participants_raw, e.participants_parse_error, p.raw_text
Collaborator commented on the snippet above:

putting a note in after our conversation today: we probably only need the raw_text column from the pages table, so references to the cases table and the participants_raw column can be removed

@tgj505 (Collaborator Author) replied:

I ended up changing this query. Instead of referencing the cases table, it now selects only from the pages table where the html in the raw_text column contains a participants table element.
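
For context, that revision amounts to something along these lines (a paraphrase; the exact way the PR detects the participants `<table>` element may differ):

```python
# Paraphrased sketch of the revised query: pull raw_text straight from pages
# and keep only pages whose HTML appears to contain a participants table.
# The LIKE patterns are a guess at the matching condition, not the PR's code.
participants_query = """
    SELECT p.id, p.raw_text
    FROM pages p
    WHERE p.raw_text LIKE '%<table%'
      AND p.raw_text LIKE '%participants%'
"""
```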

@tgj505 (Collaborator Author) commented Sep 30, 2023

I've made some revisions to most aspects of the task, especially in the testing process. I ran this on the full pages table and it seemed to work as expected. A few issues:

  • it's still slow! I'm clocking in at ~30-70 iterations/second for parses, so it takes an hour and a half to run over the full collection of 400,000+ cases.
  • it's not really writing to the error_log table. I'm not sure what would be erroring out that we'd want to catch (one possibility is sketched after this list).
  • the testing is very basic.
  • related to the basic testing, the actual html parsing step could potentially be clarified, at least somewhat.
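
On the error_log point, one possible pattern (purely illustrative; the table and column names below are assumptions, loosely modeled on the participants_parse_error column referenced earlier in this PR) is to catch failures from the HTML-parsing step and record them per page:

```python
# Illustrative only -- error_log's shape here is an assumption, not this repo's schema.
def parse_page_participants(cur, page_id: int, html_str: str) -> list:
    try:
        return html_raw_participants(html_str)
    except Exception:
        # record the failure so problem pages can be revisited later
        cur.execute(
            "INSERT INTO error_log (page_id, participants_parse_error) VALUES (?, ?)",
            (page_id, True),
        )
        return []
```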

Here's a subsample of the pages table that you can test this branch on.
pages_dump.zip

@tgj505 marked this pull request as ready for review on September 30, 2023
@tgj505 changed the title from "[WIP] 05 participants" to "05 participants" on September 30, 2023
@tgj505 requested a review from neverett on September 30, 2023