Import data into new format #13
- update reasons field
- update alternatives fields
- add alternatives data
- use unidecode to remove accents
- add countries (default of global)
- remove TBDs
Minor tweaks to import script
@THM222 thank you for your PR to this PR! I think I can resolve the description formatting so will try and push that soon as well.
…ks with | rather than in double quotes
@THM222 in my original changes I didn't overwrite any of the pre-existing files, but your PR to this PR did. Just wanted to confirm whether that was intentional, or whether we want to restore the handful of files that will be overwritten by this PR before merging?
Good catch! That was an accident.
Nice! When I was working on it I had a quick look, and I think it might be to do with the apostrophe.
@THM222 I have just pushed another commit that restores and slightly updates some pre-existing files that were lost.
@greencloudysky looks good, but the build is failing :(
YAML validation seems to be failing for the kiehls logo URL. Easy enough to add one ourselves.
The export script (exports data to CSV and JSON) is also failing. I couldn't find the exact error in the CI log, but it may be due to the multiline fields. Will take a look tomorrow. Apologies for more delays!
Ah damn, looks like more than 100 files are missing a logo. I just pushed a change to help with compliance of alternative/stakeholder names at least.
…ncloudysky/boycott-israeli-consumer-goods-dataset into import-data-into-new-format
Addressing #8.
I have made the assumption that entries in the input dataset which mention owners are brands, while entries that don't mention an owner are companies.
The script only reads the following fields from the input data, while the rest are left as `TBD`:
- `name` (from input `name` field)
- `description` (from input `proof` field)
- `stakeholders` (with `id` extracted from `proof` field using a regex, and `type` defaulting to `owner`)
- `logo_url` (from input `imageUrl` field)

The script can be run from the scripts directory like: `python3 import_new_schema.py ../raw/boycott_list_formatted.json`.

It will not overwrite existing files, since the generated files don't contain much information.
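For reviewers, here is a rough sketch of the mapping and the no-overwrite behaviour described above. It is not the code in `import_new_schema.py`: the `OWNER_RE` pattern, the `slugify`, `convert_entry`, and `write_if_missing` helpers, and the choice of `reasons` and `alternatives` as the remaining fields are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Hypothetical pattern: the actual regex used by the import script is not shown in this PR text.
OWNER_RE = re.compile(r"owned by ([A-Z][\w&'\-]*(?: [A-Z][\w&'\-]*)*)")

# Assumed examples of fields the script does not populate.
OTHER_FIELDS = ("reasons", "alternatives")


def slugify(name):
    """Lowercase a name and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def convert_entry(entry):
    """Map one input entry to (kind, record) using the field mapping above."""
    proof = entry.get("proof", "")
    record = {
        "name": entry["name"],
        "description": proof,
        "logo_url": entry.get("imageUrl", "TBD"),
    }
    for field in OTHER_FIELDS:
        record[field] = "TBD"

    match = OWNER_RE.search(proof)
    if match:
        # An owner is mentioned, so the entry is treated as a brand.
        kind = "brand"
        record["stakeholders"] = [{"id": slugify(match.group(1)), "type": "owner"}]
    else:
        # No owner mentioned, so the entry is treated as a company.
        kind = "company"
        record["stakeholders"] = "TBD"
    return kind, record


def write_if_missing(path, text):
    """Write text to path unless the file already exists; return True if written."""
    path = Path(path)
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return True
```

Skipping existing files keeps any hand-edited entries intact, at the cost of needing to delete a generated file manually before re-importing it.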