Make orgs classification script into more well-defined pipeline #92

Open
trevorspreadbury opened this issue Apr 11, 2024 · 0 comments

https://github.com/dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker/blob/9db429d8f209843075b787b08dc4dded5f71a787/src/utils/orgs_classification_data_pipeline.py#L2

This seems like a good direction, but the purpose of this file is currently unclear and undocumented. My guess is that the eventual goal is to combine a collection of raw data files containing company information into a single CSV with a well-defined schema that holds details and classifications for companies. Convert this script into functions and define that output schema, ideally with record linkage in mind.
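As a minimal sketch of that shape (not the project's actual design), the script could be split into one function per step with the output schema declared up front. Everything named here is hypothetical: the column names in `OUTPUT_COLUMNS`, the helper functions, and the `fossil_fuel` label are assumptions for illustration only. It uses only the standard library so it runs anywhere; the real pipeline may well prefer pandas.

```python
import csv
import io
from typing import Iterable, Iterator, TextIO

# Hypothetical output schema. The real column names still need to be defined.
OUTPUT_COLUMNS = ["company_name", "normalized_name", "state", "classification", "source"]


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace.

    A consistent name key like this is one simple aid for later record linkage.
    """
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name)
    return " ".join(cleaned.lower().split())


def read_source(
    handle: TextIO, column_map: dict[str, str], source: str, classification: str
) -> Iterator[dict[str, str]]:
    """Yield rows from one raw file, renamed to the output schema.

    column_map maps an output column name to the raw file's column name.
    """
    for raw in csv.DictReader(handle):
        row = {out: (raw.get(src) or "").strip() for out, src in column_map.items()}
        row["normalized_name"] = normalize_name(row.get("company_name", ""))
        row["classification"] = classification
        row["source"] = source
        yield row


def write_output(rows: Iterable[dict[str, str]], handle: TextIO) -> int:
    """Write rows using exactly OUTPUT_COLUMNS and return the row count."""
    writer = csv.DictWriter(handle, fieldnames=OUTPUT_COLUMNS)
    writer.writeheader()
    count = 0
    for row in rows:
        # Columns missing from a source are written as empty strings.
        writer.writerow({col: row.get(col, "") for col in OUTPUT_COLUMNS})
        count += 1
    return count


if __name__ == "__main__":
    # Made-up in-memory data standing in for one raw file.
    raw_file = io.StringIO("Company,ST\nExxon Mobil Corp.,TX\n")
    rows = read_source(
        raw_file, {"company_name": "Company", "state": "ST"}, "example_source", "fossil_fuel"
    )
    out = io.StringIO()
    write_output(rows, out)
    print(out.getvalue())
```

Each raw file then needs only its own column mapping, and every source ends up in the same schema regardless of how its original columns were named.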

Additionally, work on these raw files is currently split between this file and the EDA folder. The EDA folder is fine for EDA, but it shouldn't be part of the final production pipeline. Move all the code for processing the raw data files into this pipeline. Then document in the data README where each of these files was retrieved from.

Since the InfoGroup/Data Axle data is copyrighted, we can't make it publicly available in large chunks. In any case, these CSVs will grow quite large, so we don't want them in the repository. Add a link to the pipeline's output file in Google Drive.
