Make orgs classification script into more well-defined pipeline #92

trevorspreadbury · 2024-04-11T14:20:58Z

https://github.com/dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker/blob/9db429d8f209843075b787b08dc4dded5f71a787/src/utils/orgs_classification_data_pipeline.py#L2

This seems like a good direction, but the exact purpose of this file is unclear and undocumented right now. My guess is that the eventual idea is to combine a collection of raw data files containing company information into a single CSV with a well-defined schema that holds details and classifications for companies. Convert this into functions and define that output schema (it would be a good idea to do this with record linkage in mind).
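As a rough illustration of what "functions plus a defined output schema" could look like, here is a minimal sketch using only the standard library. Everything named in it is hypothetical: the column names in `OUTPUT_COLUMNS`, the `standardize_rows` and `write_output` helpers, and the two toy raw sources are assumptions for illustration, not the actual contents of `orgs_classification_data_pipeline.py` or of the raw files. The `id` and `name_normalized` columns are included only because a stable identifier and a normalized name tend to be useful for record linkage.

```python
import csv
import io
import uuid

# Hypothetical output schema; these column names are assumptions.
OUTPUT_COLUMNS = ("id", "name", "name_normalized", "state", "classification", "source")


def normalize_name(name):
    """Lowercase, trim, and collapse whitespace so names compare consistently."""
    return " ".join(name.lower().split())


def standardize_rows(rows, column_map, source):
    """Map one raw source's rows onto OUTPUT_COLUMNS.

    column_map maps an output column to the raw column name in this source.
    Only "name" is required; missing optional columns become empty strings.
    """
    for row in rows:
        name = row[column_map["name"]].strip()
        yield {
            "id": str(uuid.uuid4()),  # stable key for later record linkage
            "name": name,
            "name_normalized": normalize_name(name),
            "state": row.get(column_map.get("state", ""), "").strip().upper(),
            "classification": row.get(column_map.get("classification", ""), "").strip().lower(),
            "source": source,
        }


def write_output(rows, handle):
    """Write standardized rows as CSV with a fixed header; return the row count."""
    writer = csv.DictWriter(handle, fieldnames=OUTPUT_COLUMNS)
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Toy stand-ins for two raw files with different column layouts.
raw_a = [{"Company": "  Acme  Solar LLC", "ST": "il", "Type": "Clean Energy"}]
raw_b = [{"org_name": "Big Oil Co", "sector": "Fossil Fuel"}]

combined = list(
    standardize_rows(raw_a, {"name": "Company", "state": "ST", "classification": "Type"}, "source_a")
) + list(
    standardize_rows(raw_b, {"name": "org_name", "classification": "sector"}, "source_b")
)

buffer = io.StringIO()  # stands in for the real output file
n = write_output(combined, buffer)
```

Keeping a per-source `column_map` means each raw file only needs a small mapping rather than its own processing script, and every source ends up in the same schema before any linkage step runs.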

Additionally, work on these raw files is currently split between this file and the EDA folder. The EDA folder is fine for EDA for now, but it shouldn't be part of the final production pipeline. Move all the code for processing the raw data files into this pipeline, then document in the data README where you retrieved each of these files.

Since the InfoGroup/DataAxel data is copyrighted, we can't make it publicly available in large chunks. In any case, these CSVs will grow quite large, so we don't want them in the repository. Add a link to the pipeline's output file on Google Drive.

trevorspreadbury assigned klee2024 Apr 11, 2024
