software.json files created but no mentions (and no metadata)? #6

Open
jameshowison opened this issue Jul 26, 2023 · 10 comments

@jameshowison
Contributor

I'm noticing some blank lines in the output. These seem to be created by .software.json files that don't have any metadata or mentions?

I added the .software.json names to the output file for cross-checking. Some are a bit weird: they don't have metadata fields in the JSON, and their mention list is empty ("mentions": []).

@kermitt2 any idea what might be causing those files to get written?

These PMCIDs seem to have this issue, I checked and the PDF and nxml seem ok?

8430429
8900187
8867340 (this PDF seems to be supplemental material and not the main article PDF?)
8381458

and a few others (look in the mentions_one_per_row.csv file for missing PMCID/name/context) in 5ebd7b1

@kermitt2
Member

When no software mentions are found in an article, the .software.json file has an empty mention field: "mentions": []. So it's just there to report runtime and date. You can skip these entries?

But I never noticed that the metadata field is also empty! Without annotations, I never used this metadata. In principle, the JSON file containing the metadata is in the same location as the software mention annotation file and the PDF and/or nxml?
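
A minimal sketch of skipping those empty entries before exporting to CSV. It assumes each .software.json file holds a top-level JSON object with a "mentions" list, as described above; the function names are hypothetical and not part of the client.

```python
import json
from pathlib import Path


def has_mentions(json_path):
    """Return True if the .software.json file lists at least one mention."""
    with open(json_path, encoding="utf-8") as handle:
        record = json.load(handle)
    # A missing key and an empty list are treated the same way.
    return bool(record.get("mentions"))


def files_with_mentions(root):
    """Yield the .software.json files under root that are worth exporting."""
    for path in sorted(Path(root).rglob("*.software.json")):
        if has_mentions(path):
            yield path
```

Filtering at this stage would keep the blank rows out of mentions_one_per_row.csv without touching the files on disk.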

@jameshowison
Contributor Author

Thanks. Hmmm, so there should be a software.json file for every PDF processed? I'm only seeing 608 of those files, compared to 1,476 pdf files. I figured that meant they were only created when mentions were found? When you run this code, how many do you see?

@kermitt2
Member

I ran the code, and I got:

  • 1476 PDF
  • 1500 nxml files
  • 1500 .pub2tei.tei.xml files

Normally we want only the .pub2tei.tei.xml file to be processed for better quality, not the PDF.

So before running the software mention extraction, I run something like this to remove the PDFs:

lopez@work:~/tools/screenit-softcite$ find data/ -name "*.pdf" -exec /bin/rm {} \;

Then, in this case, this gives 1500 .software.json files, so one per .pub2tei.tei.xml file. Getting only 608 of those files is not what is expected!

Maybe you can re-run the software mention client like this to complete the run:

 python3 -m software_mentions_client.client --repo-in /home/lopez/tools/screenit-softcite/data/ --config my_config.json --reprocess

The --reprocess argument means to re-process the failing files.

When running only on the 1476 PDFs and skipping the TEI files, I similarly obtain 1475 .software.json files (apparently there is one parsing failure). So running on the .pub2tei.tei.xml files rather than the PDFs is also better for completeness (not all PMC articles have a PDF, but all should have an nxml file).
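
To compare counts like these on another machine, a short sketch along the following lines could tally the files under data/ by suffix. The suffix list is taken from the file names mentioned in this thread; the function name is hypothetical.

```python
from collections import Counter
from pathlib import Path

# Longest, most specific suffixes are matched by name ending, not Path.suffix,
# because names such as "x.pub2tei.tei.xml" carry several dots.
SUFFIXES = (".pdf", ".nxml", ".pub2tei.tei.xml", ".software.json")


def count_by_suffix(root):
    """Count the files under root for each suffix of interest."""
    counts = Counter({suffix: 0 for suffix in SUFFIXES})
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        for suffix in SUFFIXES:
            if path.name.endswith(suffix):
                counts[suffix] += 1
                break
    return counts
```

If the .software.json count is lower than the .pub2tei.tei.xml count, the run is incomplete and the CSV export would be missing articles.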

@kermitt2
Member

Until softcite/software_mentions_client#4 is implemented to prioritize TEI files from Pub2TEI (or LaTeXML), you have to manually remove the PDF files under data/. In any case, it's important to get 1500 .software.json files before exporting to CSV.
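
If deleting the PDFs is undesirable (they would have to be downloaded again), one non-destructive alternative is to rename them so they no longer end in .pdf. The sketch below is only an illustration: the ".ignore" marker and the function name are assumptions, and it presumes the client selects its inputs by file extension.

```python
from pathlib import Path


def set_pdfs_aside(root, marker=".ignore"):
    """Rename every *.pdf under root to *.pdf<marker> and return the new paths."""
    renamed = []
    # sorted() materializes the listing first, so renaming does not
    # interfere with the directory walk.
    for pdf in sorted(Path(root).rglob("*.pdf")):
        target = pdf.with_name(pdf.name + marker)
        if target.exists():
            continue  # never overwrite an earlier rename
        pdf.rename(target)
        renamed.append(target)
    return renamed
```

Stripping the marker again restores the original names, so the PDFs remain available for a later PDF-only run.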

@jameshowison
Contributor Author

Ok, I created the .tei.xml files, then I renamed all the PDF to .ignore (so I don't have to redownload them).

I ran

python3 -m software_mentions_client.client --repo-in /home/lopez/tools/screenit-softcite/data/ --config my_config.json --reprocess

but that gave me a "Softcite software mention service not available, leaving..." error. I checked all the networking with the Docker containers and that seemed correct.

Reading the code at https://github.com/softcite/software_mentions_client/blob/7bec98952f7e0a20e64c057320ea6fed1426fe55/software_mentions_client/client.py#L101 it seems that message comes if you haven't specified one of the "load_mongo and not full_diagnostic_mongo and not full_diagnostic_files " options? So I added --diagnostic-files and it seemed to run.

Output is this, though:

logs are written in client.log
total reprocess: 1500 - accumulated runtime: 26.576 s - 56.44 files/s  

---
total entries: 2975
---
total successfully processed: 0
---
total failed: 2975
---
Files visited: 600

--- SOFTWARE MENTIONS ---
JSON files - number of documents:  605
JSON files - number of documents with at least one software mention:  394
JSON files - number of software mentions:  1671
	     -> subtype standalone: 1276
	     -> subtype environment: 353
	     -> subtype component: 21
	     -> subtype implicit: 21
	     * with software name: 1671
	     * with version: 603
	     * with publisher: 400
	     * with url: 230
	     * with programming language: 11
	     * mentions with at least one reference 282
---
JSON files - number of bibliographical reference markers:  341
JSON files - number of bibliographical references:  153
	      * with DOI: 46
	      * with PMID: 0
	      * with PMC ID: 0
---

which doesn't quite seem right?

@jameshowison
Contributor Author

The client.log shows errors, I don't think anything processed successfully?

INFO:root:blacklist size: 533
ERROR:root:The request failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/software_mentions_client/client.py", line 489, in annotate
    response = requests.post(url, files=the_file, data = {'disambiguate': 1}, timeout=self.config["timeout"])
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 697, in send
    adapter = self.get_adapter(url=request.url)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 794, in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for 'server_software_mentions:8060/service/annotateSoftwareTEI'
ERROR:root:The request failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/software_mentions_client/client.py", line 489, in annotate
    response = requests.post(url, files=the_file, data = {'disambiguate': 1}, timeout=self.config["timeout"])
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 697, in send
    adapter = self.get_adapter(url=request.url)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 794, in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for 'server_software_mentions:8060/service/annotateSoftwareTEI'
ERROR:root:The request failed

@jameshowison
Contributor Author

Ah, those errors seem related to leaving http:// off the config/software_mentions_url.
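
For reference, the InvalidSchema error in the log can be caught before any request is sent by checking the configured URL for an explicit scheme. This is a sketch using only the standard library; the function name is hypothetical and the check is not part of the client.

```python
from urllib.parse import urlsplit


def has_http_scheme(url):
    """Return True only if url starts with http:// or https:// and names a host."""
    parts = urlsplit(url)
    # Without a scheme, "host:8060/path" is not parsed as host plus port,
    # so both conditions are needed.
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

With this check, "server_software_mentions:8060/service/annotateSoftwareTEI" is rejected, while the same address prefixed with http:// passes.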

Now I'm seeing different errors (always a sign of progress). These I'm seeing in the server log (in the window where I ran docker-compose up). These seem likely related to Mac M1/Docker incompatibility.

INFO  [2023-07-27 17:00:09,124] org.grobid.core.jni.JEPThreadPool: Creating JEP instance for thread 37
screenit-softcite-server_software_mentions-1  | The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
screenit-softcite-server_software_mentions-1  | 
screenit-softcite-server_software_mentions-1  | 
screenit-softcite-server_software_mentions-1  | qemu: uncaught target signal 6 (Aborted) - core dumped

I guess I could try building the server image for mac, although I'm unsure that the GPU stuff will work.

Gah, Docker was meant to save us from this nonsense :)

I will try with the scienceminer server (although eventually I do hope this pipeline can work on its own).

@jameshowison
Contributor Author

Seems like this is related to TensorFlow running under x86 emulation on the M1 architecture. There seems to be some progress here: tensorflow/tensorflow#52845 (comment)

I think this might require rebuilding the server image with an updated TensorFlow, then it might work with the x86 emulation. Or we could build an Apple silicon Docker image as well? I'll see if I can get that to build.

@kermitt2
Member

> Ah, those errors seem related to leaving http:// off the config/software_mentions_url.

Yes, in your config server_software_mentions:8060 is likely localhost:8060.

> screenit-softcite-server_software_mentions-1 | The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.

Normally this is just a warning; it indicates that your TensorFlow library could run faster on your particular CPU if it were built specifically for it.

> Gah, Docker was meant to save us from this nonsense :)

Ahh yes, the image is built for x86, not for ARM... We should try to build a multi-architecture Docker image, but two images might be easier. A bit disappointing for the M1 emulation!
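
A quick way to tell whether a host will run the x86 image under emulation is to compare the host's reported architecture with the image's. The sketch below relies on platform.machine() from the standard library; the alias table and function names are illustrative assumptions, and "amd64" as the image architecture follows from the x86 build mentioned above.

```python
import platform

# Common spellings reported by different operating systems,
# mapped to Docker-style architecture names.
ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def normalize_arch(machine):
    """Map a platform.machine() value to a Docker-style architecture name."""
    return ARCH_ALIASES.get(machine.lower(), machine.lower())


def needs_emulation(host_machine, image_arch="amd64"):
    """Return True if the image architecture differs from the host's."""
    return normalize_arch(host_machine) != normalize_arch(image_arch)


if __name__ == "__main__":
    print("emulation needed:", needs_emulation(platform.machine()))
```

On an M1 Mac the host reports an ARM architecture, so this would flag the x86 image as emulated, which is the setup where the qemu abort above appeared.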

@jameshowison
Contributor Author

Yeah, I'm trying to build the image from source (following instructions from the software-mentions repo). The docker build is getting to layer 38

[stage-1 38/38] RUN ./gradlew clean assemble install --no-daemon --stacktrace --info -x test  

and then hanging, seemingly forever (the buildx build bit was a second attempt, me trying to improve things).
[Screenshot: Screen Shot 2023-07-27 at 2 36 22 PM]

No idea why. Docker has loads of resources (shows free space and RAM).

If you have a way of building an image for the mac silicon M1 that would be great. Might be as simple as adding --platform linux/aarch64 (but probably not :)
