New ROR data dump #213

Open
fenekku opened this issue Mar 23, 2022 · 4 comments

@fenekku
Contributor

fenekku commented Mar 23, 2022

Is your feature request related to a problem? Please describe.

ROR has released a new data dump (amusingly, on Zenodo): https://zenodo.org/record/6347575. This probably means the ROR vocabulary dump here should be updated (or at least reviewed).

As was mentioned in a telecon, the ROR list in this module is a filtered one. Perhaps the filtering process can be shared too.

Describe the solution you'd like

An updated affiliations_ror.yaml.

@tmorrell
Contributor

tmorrell commented Jul 8, 2022

I'm planning on trying to update this. I think `invenio vocabularies convert -v funders -o "/path/to/ror-data-dump.json.zip" -t affiliations_ror.yaml` is probably close, but I'm happy to use another script if one is available.

@fenekku
Contributor Author

fenekku commented Jul 11, 2022

In recent imports at NU, we've noticed that YAML loads very slowly. If you can use a .jsonl file instead, it would be drastically faster to load. Example: loading our 72MB+ worth of MeSH terms with YAML took ~240s, while the same data in .jsonl format took 1s (we saw a 149x speedup for lcsh terms too). That loading happens when invenio-cli services setup executes, so the switch greatly improves the installation flow.
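
For anyone who wants to try the .jsonl route, here is a minimal sketch of writing and reading that format with only the Python standard library. It is not the loader InvenioRDM uses; the function names `write_jsonl` and `iter_jsonl` are hypothetical, and the record layout is whatever dictionaries you pass in. The point is only that each line is a small, independent JSON document, so it can be parsed one record at a time.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def write_jsonl(records: Iterable[dict], path: Path) -> int:
    """Write one compact JSON object per line; return the number of lines written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield records one at a time, so the whole file is never held in memory."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

Timings will depend on the data and the YAML loader in use, so it is worth measuring on your own vocabulary before switching.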

@tmorrell
Contributor

The invenio vocabularies result is close, but it's in a different order, so it will make a mess of the diff. ROR is also moving to monthly releases, so updating a static file in the cookiecutter is not going to be sustainable. It makes more sense to transfer the affiliation vocabulary to a datastream. I've been able to get it partially working, but I'm still having issues getting the writers registered. I will update as I get more time to work on it.
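
On the ordering problem specifically, one possible workaround is to reorder the freshly converted entries so they follow the order of the existing file before writing them out. The sketch below assumes each entry is a dictionary with an `id` key; `order_like_existing` is a hypothetical helper, not part of invenio-vocabularies.

```python
def order_like_existing(existing_ids, new_entries):
    """Put entries whose id is already known first, in their existing order,
    followed by previously unseen entries sorted by id."""
    position = {entry_id: index for index, entry_id in enumerate(existing_ids)}
    known = [entry for entry in new_entries if entry["id"] in position]
    unseen = [entry for entry in new_entries if entry["id"] not in position]
    known.sort(key=lambda entry: position[entry["id"]])
    unseen.sort(key=lambda entry: entry["id"])
    return known + unseen
```

With this, unchanged entries stay where they were and new ones are appended at the end, which should keep the diff limited to real changes. It does not address the monthly release cadence, which is the stronger argument for a datastream.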

@karkraeg
Member

karkraeg commented May 8, 2024

Hi, I don't know if this is still relevant for you, but we hacked together a script that formats the ROR dump into the YAML InvenioRDM wants. If someone is interested, I could clean it up and share it. It uses pandas, so filtering would be fairly easy to do.
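
Until that script is shared, here is a rough standard-library sketch of the kind of conversion and filtering described. It is not the pandas script mentioned above. The input field names (`id`, `name`, `acronyms`, `status`, `country.country_code`) are assumptions about the ROR dump layout, and the output keys are assumptions about what affiliations_ror.yaml expects, so check both against the real files. `ror_to_affiliation` and `filter_and_convert` are hypothetical names.

```python
from typing import Iterable, Iterator, Optional, Set


def ror_to_affiliation(record: dict) -> dict:
    """Map one ROR record to an affiliation entry (assumed output layout)."""
    # Assumes the id is a URL such as "https://ror.org/<short-id>"; keep the last segment.
    ror_id = record["id"].rsplit("/", 1)[-1]
    entry = {
        "id": ror_id,
        "name": record["name"],
        "title": {"en": record["name"]},
        "identifiers": [{"identifier": ror_id, "scheme": "ror"}],
    }
    acronyms = record.get("acronyms") or []
    if acronyms:
        entry["acronym"] = acronyms[0]
    return entry


def filter_and_convert(
    records: Iterable[dict], country_codes: Optional[Set[str]] = None
) -> Iterator[dict]:
    """Skip non-active records and, optionally, records outside the given countries."""
    for record in records:
        # Assumption: a missing status is treated as active.
        if record.get("status", "active") != "active":
            continue
        code = (record.get("country") or {}).get("country_code")
        if country_codes is not None and code not in country_codes:
            continue
        yield ror_to_affiliation(record)
```

The resulting list of dictionaries could then be written out with PyYAML's `yaml.safe_dump` or as .jsonl, depending on which format the loader ends up using.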
