05 participants #26

Open · tgj505 wants to merge 20 commits into main
Conversation

@tgj505 (Collaborator) commented Jul 20, 2023

A draft for the task in #4. Three main places need attention:

  • task.py is not currently using the error table implemented in #22 (Move error logging out of cases table).
  • the parsing itself is challenging since the data are so messy; most notably, determining the name and organization for a given case is often ambiguous.
  • testing could be more robust.

@tgj505 requested review from akgerber, neverett, and eenblam on August 3, 2023
@neverett (Collaborator) left a comment:

Left some feedback, and I assume you'll be adding to & cleaning up the testing code a bit more, but overall this looks great, ty!

AND c.participants_raw <> ''
AND e.participants_parse_error IS NULL
OR e.participants_parse_error = true
limit 1000;
Collaborator commented on the snippet above:

was this limit in place for testing? if so, you'll want to remove it before merging
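
If the LIMIT is still handy for local runs, one option (not something in the PR; the flag name below is made up) is to gate it behind a debug setting so it can't slip into a full run:

```python
# Hypothetical sketch: keep a small LIMIT available for local testing without
# hard-coding it into the query. DEBUG_ROW_LIMIT is not a real setting in this
# repo, just an illustration.
DEBUG_ROW_LIMIT = None  # set to e.g. 1000 while testing locally

query = participants_query
if DEBUG_ROW_LIMIT is not None:
    query += f" LIMIT {DEBUG_ROW_LIMIT}"
c.execute(query)
```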

c = cnx.cursor()
c.execute("select count(*) from pages;")
t = time.time() - t1
part_rate = round((n - c.rowcount) / t, 2)
Collaborator commented on the snippet above:

sqlite3 connection objects don't have a rowcount attribute, so you'll need to check the db type here, similar to line 33 above
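
The comment above suggests branching on db type; as an alternative, here is a sketch (not the PR's code) that sidesteps rowcount entirely by reading the count(*) result directly, which works with both sqlite3 and MySQL cursors:

```python
# Sketch only: count(*) always returns exactly one row, so fetchone() gives
# the value on either driver without relying on the cursor's rowcount.
c = cnx.cursor()
c.execute("select count(*) from pages;")
(pages_count,) = c.fetchone()
```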

def html_raw_participants(html_str: str) -> list:
"""
Reads in an html string from the `raw_text` column in the `pages` table,
finds the participants table, collects the rows in the table,
Collaborator commented on the snippet above:

very minor suggestion, but you may want to distinguish between database and HTML tables in comments for clarity, e.g. "Reads in an html string from the raw_text column in the pages database table, finds the participants HTML table in the string, collects the rows in the HTML table...."
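
Applied to the docstring, that suggestion might read roughly like this (wording adapted from the comment above, not the PR's final text):

```python
def html_raw_participants(html_str: str) -> list:
    """
    Reads in an HTML string from the `raw_text` column of the `pages`
    database table, finds the participants HTML table in that string, and
    collects the rows of that HTML table.
    """
```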

def add_participant_row(case_id: int, r: list):
# insert relevant info to participants table in the db
try:
if db_config.db_type == "sqlite":
Collaborator commented on the snippet above:

this if/elif block that sets the query can be moved above and outside of the try/except block, since it's unlikely to cause an exception
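
A rough sketch of that restructuring (column names and placeholders are stand-ins, not the PR's actual schema):

```python
def add_participant_row(case_id: int, r: list) -> None:
    # Build the INSERT statement first; choosing a string can't realistically
    # raise, so it doesn't need to sit inside the try/except.
    if db_config.db_type == "sqlite":
        query = "INSERT INTO participants (case_id, name, role) VALUES (?, ?, ?)"
    elif db_config.db_type == "mysql":
        query = "INSERT INTO participants (case_id, name, role) VALUES (%s, %s, %s)"
    else:
        raise ValueError(f"unsupported db_type: {db_config.db_type}")

    try:
        # only the database call stays wrapped
        c.execute(query, (case_id, *r))
        cnx.commit()
    except Exception:
        # keep whatever error handling the PR already does here
        raise
```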


def main():
participants_query = """
SELECT c.id as case_id, c.case_number, c.participants_raw, e.participants_parse_error, p.raw_text
Collaborator commented on the snippet above:

putting a note in after our conversation today: we probably only need the raw_text column from the pages table, so references to the cases table and the participants_raw column can be removed

@tgj505 (Collaborator Author) replied:

I ended up changing this query. Instead of referencing the cases table, it now selects only from the pages table where the html in the raw_text column contains a participants table element.
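
For context, that revision amounts to something along these lines (a paraphrase; the exact way the PR detects the participants `<table>` element may differ):

```python
# Paraphrased sketch of the revised query: pull raw_text straight from pages
# and keep only pages whose HTML appears to contain a participants table.
# The LIKE patterns are a guess at the matching condition, not the PR's code.
participants_query = """
    SELECT p.id, p.raw_text
    FROM pages p
    WHERE p.raw_text LIKE '%<table%'
      AND p.raw_text LIKE '%participants%'
"""
```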

@tgj505 (Collaborator Author) commented Sep 30, 2023

I've made some revisions to most aspects of the task, especially in the testing process. I ran this on the full pages table and it seemed to work as expected. A few issues:

  • it's still slow! I'm clocking in at ~30-70 iterations/second for parses, so it takes an hour and a half to run over the full collection of 400,000+ cases.
  • it's not really writing to the error_log table. I'm not sure what would be erroring out that we'd want to catch (one possibility is sketched after this list).
  • the testing is very basic.
  • related to the basic testing, the actual html parsing step could potentially be clarified, at least somewhat.
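
On the error_log point, one possible pattern (purely illustrative; the table and column names below are assumptions, loosely modeled on the participants_parse_error column referenced earlier in this PR) is to catch failures from the HTML-parsing step and record them per page:

```python
# Illustrative only -- error_log's shape here is an assumption, not this repo's schema.
def parse_page_participants(cur, page_id: int, html_str: str) -> list:
    try:
        return html_raw_participants(html_str)
    except Exception:
        # record the failure so problem pages can be revisited later
        cur.execute(
            "INSERT INTO error_log (page_id, participants_parse_error) VALUES (?, ?)",
            (page_id, True),
        )
        return []
```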

Here's a subsample of the pages table that you can test this branch on.
pages_dump.zip

@tgj505 marked this pull request as ready for review on September 30, 2023
@tgj505 changed the title from "[WIP] 05 participants" to "05 participants" on September 30, 2023
@tgj505 requested a review from neverett on September 30, 2023