software.json files created but no mentions (and no metadata)? #6
Comments
When no software mentions are found in an article, the But I never paid attention to the fact that the metadata field is also empty! Without annotations, I never used this metadata. In principle, the JSON file containing the metadata is in the same location as the software mention annotation file and the PDF and/or nxml? |
Thanks. Hmmm, so there should be a software.json file for every PDF processed? I'm only seeing 608 of those files, compared to 1,476 pdf files. I figured that meant they were only created when mentions were found? When you run this code, how many do you see? |
I ran the code, and I got:
Normally we want only the So before running the software mention extraction, I run, for example, this to remove the PDFs:
lopez@work:~/tools/screenit-softcite$ find data/ -name "*.pdf" -exec /bin/rm {} \;
Then, in this case, this gives 1500 Maybe you can re-run the software mention client like this to complete the run:
python3 -m software_mentions_client.client --repo-in /home/lopez/tools/screenit-softcite/data/ --config my_config.json --reprocess
The When running only on the 1476 PDFs and skipping the TEI files, I similarly obtain 1475 |
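One way to cross-check the counts discussed above is to list which input files have no matching output. The following is a minimal Python sketch, not part of software_mentions_client: the function name `missing_outputs` is hypothetical, and it assumes that each output file sits next to its input and is named by swapping the input extension for `.software.json`.

```python
from pathlib import Path


def missing_outputs(root, input_suffix=".pdf", output_suffix=".software.json"):
    """Return the input files under `root` that have no sibling output file.

    Assumption (not confirmed by the client code): the output for
    `name.pdf` is written next to it as `name.software.json`.
    """
    missing = []
    for src in sorted(Path(root).rglob("*" + input_suffix)):
        # Build the expected output name by replacing the input suffix.
        base = src.name[: -len(input_suffix)]
        expected = src.with_name(base + output_suffix)
        if not expected.exists():
            missing.append(src)
    return missing
```

Calling `missing_outputs("data")` and comparing the length of the result with the number of PDFs would show whether outputs are skipped for some inputs or written for all of them.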
Until the issue softcite/software_mentions_client#4 is implemented to prioritize TEI files from Pub2TEI (or from LaTeXML) origins, you have to manually remove the PDF files under |
Ok, I created the .tei.xml files, then I renamed all the PDF to .ignore (so I don't have to redownload them). I ran
but that gave me a Reading the code at https://github.com/softcite/software_mentions_client/blob/7bec98952f7e0a20e64c057320ea6fed1426fe55/software_mentions_client/client.py#L101 it seems that message appears if you haven't specified one of the "load_mongo and not full_diagnostic_mongo and not full_diagnostic_files" options? So I added Output is this, though:
which doesn't quite seem right? |
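For reference, the renaming step described at the start of this comment (PDFs renamed to .ignore so they are skipped but not lost) could be scripted roughly as follows. This is a sketch under assumptions: the function name `sideline_pdfs` is hypothetical, and it only renames a PDF when a `.tei.xml` file with the same base name exists next to it, which is a guess about how the files are laid out.

```python
from pathlib import Path


def sideline_pdfs(root, tei_suffix=".tei.xml", new_suffix=".ignore"):
    """Rename `name.pdf` to `name.pdf.ignore` when `name.tei.xml` exists.

    Assumption: the TEI file sits in the same directory as the PDF and
    shares its base name. PDFs without a TEI sibling are left untouched.
    """
    renamed = []
    # sorted() materialises the list first, so renaming is safe while looping.
    for pdf in sorted(Path(root).rglob("*.pdf")):
        tei = pdf.with_name(pdf.stem + tei_suffix)
        if tei.exists():
            target = pdf.with_name(pdf.name + new_suffix)
            pdf.rename(target)
            renamed.append(target)
    return renamed
```

Renaming instead of deleting keeps the downloads on disk, and stripping the `.ignore` suffix restores the original state.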
The client.log shows errors, I don't think anything processed successfully?
|
Ah, those errors seem related to leaving Now I'm seeing different errors (always a sign of progress). These I'm seeing in the server log (in the window I ran
I guess I could try building the server image for mac, although I'm unsure that the GPU stuff will work. Gah, Docker was meant to save us from this nonsense :) I will try with the scienceminer server (although eventually I do hope this pipeline can work on its own). |
This seems related to running TensorFlow under emulation on the M1 architecture. There seems to be progress here: tensorflow/tensorflow#52845 (comment). I think this might require rebuilding the server image with an updated TensorFlow; then it might work with the x86 emulation. Or we could build an Apple silicon Docker image as well? I'll see if I can get that to build. |
Yes in your config
Normally this is just a warning: it indicates that your TensorFlow library could run faster on your particular CPU if it were built specifically for it.
Ahh yes, the image is built for x86, not for ARM... We should try to build a multi-architecture Docker image, but two images might be easier. A bit disappointing for the M1 emulation! |
I'm noticing some blank lines in the output. These seem to be created by .software.json files that don't have any metadata or mentions?
I added the .software.json names to the output file for cross-checking. Some are a bit weird: they don't have metadata fields in the JSON and their mention list is empty ("mentions": []).
@kermitt2 any idea what might be causing those files to get written?
These PMCIDs seem to have this issue, I checked and the PDF and nxml seem ok?
8430429
8900187
8867340 (this PDF seems to be supplemental material and not the main article PDF?)
8381458
and a few others (look in the mentions_one_per_row.csv file for missing PMCID/name/context) in 5ebd7b1
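To find every affected file rather than spotting them through blank rows in the CSV, the .software.json files can be scanned directly. A minimal sketch, assuming the metadata lives under a top-level "metadata" key (that key name is an assumption; "mentions" is the key quoted above) and using a hypothetical function name `find_empty_results`:

```python
import json
from pathlib import Path


def find_empty_results(root, suffix=".software.json"):
    """List result files that carry neither metadata nor mentions.

    Returns (path, reason) tuples. The "metadata" key is assumed;
    "mentions" is the key seen in the affected files.
    """
    flagged = []
    for path in sorted(Path(root).rglob("*" + suffix)):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # Covers unreadable files, bad encodings and invalid JSON.
            flagged.append((path, "unreadable"))
            continue
        if not isinstance(doc, dict):
            flagged.append((path, "unreadable"))
            continue
        # A missing key and an empty value are treated the same way.
        if not doc.get("metadata") and not doc.get("mentions"):
            flagged.append((path, "no metadata, no mentions"))
    return flagged
```

Files with metadata but an empty mention list are not flagged, since that is the expected result for an article in which no software is mentioned.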