software.json files created but no mentions (and no metadata)? #6

jameshowison · 2023-07-26T21:48:45Z

I'm noticing some blank lines in the output. These seem to be created by .software.json files that don't have any metadata or mentions?

I added the .software.json names to the output file for cross-checking. Some are a bit weird: they have no metadata fields in the JSON, and their mention list is empty ("mentions": []).

@kermitt2 any idea what might be causing those files to get written?

These PMCIDs seem to have this issue, I checked and the PDF and nxml seem ok?

8430429
8900187
8867340 (this PDF seems to be supplemental material and not the main article PDF?)
8381458

and a few others (look in the mentions_one_per_row.csv file for missing PMCID/name/context) in 5ebd7b1


kermitt2 · 2023-07-26T22:31:13Z

When no software mentions are found in an article, the .software.json file has an empty mentions field: "mentions": []. In that case the file is only there to report runtime and date. You can skip these entries?

But I never paid attention to the fact that the metadata field is also empty! Without annotations, I never used this metadata information. In principle the JSON file containing the metadata is in the same location as the software mention annotation file and the PDF and/or nxml?
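A minimal sketch of how the empty entries could be skipped before exporting. It assumes only what is described above: each .software.json file is a JSON object whose "mentions" key holds a list. The function name `split_by_mentions` and the `data` directory argument are hypothetical, not part of the client.

```python
import json
from pathlib import Path


def split_by_mentions(data_dir):
    """Return (with_mentions, without_mentions) lists of .software.json paths."""
    with_mentions, without_mentions = [], []
    for path in sorted(Path(data_dir).rglob("*.software.json")):
        with path.open(encoding="utf-8") as handle:
            doc = json.load(handle)
        # A missing or empty "mentions" list means nothing was annotated.
        if doc.get("mentions"):
            with_mentions.append(path)
        else:
            without_mentions.append(path)
    return with_mentions, without_mentions


if __name__ == "__main__":
    found, empty = split_by_mentions("data")
    print(f"{len(found)} files with mentions, {len(empty)} without")
```

The second list is the set of files that would produce blank rows in a one-mention-per-row export, so it can be filtered out or logged separately.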

jameshowison · 2023-07-26T22:36:46Z

Thanks. Hmmm, so there should be a software.json file for every PDF processed? I'm only seeing 608 of those files, compared to 1,476 pdf files. I figured that meant they were only created when mentions were found? When you run this code, how many do you see?

kermitt2 · 2023-07-27T01:18:34Z

I ran the code, and I got:

1476 PDF
1500 nxml files
1500 .pub2tei.tei.xml files

Normally we want only the .pub2tei.tei.xml file to be processed for better quality, not the PDF.

So before running the software mention extraction, I run something like this to remove the PDFs:

lopez@work:~/tools/screenit-softcite$ find data/ -name "*.pdf" -exec /bin/rm {} \;

In this case, this gives 1500 .software.json files, one per .pub2tei.tei.xml file. Getting only 608 of those files is not what is expected!

Maybe you can re-run the software mention client like this to complete the run:

 python3 -m software_mentions_client.client --repo-in /home/lopez/tools/screenit-softcite/data/ --config my_config.json --reprocess

The --reprocess argument means to re-process the failing files.

When running only on the 1476 PDFs and skipping the TEI files, I similarly obtain 1475 .software.json files (there's apparently one parsing failure). So, for completeness, it's also better to run on the .pub2tei.tei.xml files than on the PDFs (not all PMC articles have a PDF, but all should have an nxml file).

kermitt2 · 2023-07-27T10:42:22Z

Until the issue softcite/software_mentions_client#4 is implemented to prioritize TEI files from Pub2TEI (or from LaTeXML) origins, you have to manually remove the PDF files under data/. In any case, it's important to get 1500 .software.json files before exporting to CSV.
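As a quick sanity check before the CSV export, the file counts discussed above could be tallied with a short script. This is only a sketch: the suffix list comes from the file types named in this thread, and the function name `count_by_suffix` and the `data` path are assumptions.

```python
from pathlib import Path

# File types mentioned in this thread; longer suffixes are matched by name ending.
SUFFIXES = (".pdf", ".nxml", ".pub2tei.tei.xml", ".software.json")


def count_by_suffix(data_dir):
    """Count files under data_dir for each suffix in SUFFIXES."""
    counts = {suffix: 0 for suffix in SUFFIXES}
    for path in Path(data_dir).rglob("*"):
        if not path.is_file():
            continue
        for suffix in SUFFIXES:
            if path.name.endswith(suffix):
                counts[suffix] += 1
                break
    return counts


if __name__ == "__main__":
    for suffix, total in count_by_suffix("data").items():
        print(f"{total:6d}  *{suffix}")
```

If the .software.json count is lower than the .pub2tei.tei.xml count, the run is incomplete and re-running with --reprocess is worth trying first.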

jameshowison · 2023-07-27T16:53:37Z

Ok, I created the .tei.xml files, then I renamed all the PDFs to .ignore (so I don't have to redownload them).
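For anyone wanting to do the same rename, a sketch in Python follows. It assumes the rename means appending ".ignore" to each PDF file name (so `x.pdf` becomes `x.pdf.ignore`), which makes it easy to undo; the function name `hide_pdfs` is hypothetical.

```python
from pathlib import Path


def hide_pdfs(data_dir, extra_suffix=".ignore"):
    """Rename every *.pdf under data_dir to *.pdf.ignore and return the new paths."""
    renamed = []
    # sorted() collects all matches first, so renaming does not disturb the walk.
    for pdf in sorted(Path(data_dir).rglob("*.pdf")):
        if not pdf.is_file():
            continue
        target = pdf.with_name(pdf.name + extra_suffix)
        pdf.rename(target)
        renamed.append(target)
    return renamed


if __name__ == "__main__":
    print(f"renamed {len(hide_pdfs('data'))} PDF files")
```

Unlike deleting the PDFs, this keeps them on disk, so they can be restored by stripping the extra suffix.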

I ran

python3 -m software_mentions_client.client --repo-in /home/lopez/tools/screenit-softcite/data/ --config my_config.json --reprocess

but that gave me a "Softcite software mention service not available, leaving..." error. I checked all the networking with the Docker containers and that seemed correct.

Reading the code at https://github.com/softcite/software_mentions_client/blob/7bec98952f7e0a20e64c057320ea6fed1426fe55/software_mentions_client/client.py#L101 it seems that message comes if you haven't specified one of the "load_mongo and not full_diagnostic_mongo and not full_diagnostic_files " options? So I added --diagnostic-files and it seemed to run.

Output is this, though:

logs are written in client.log
total reprocess: 1500 - accumulated runtime: 26.576 s - 56.44 files/s  

---
total entries: 2975
---
total successfully processed: 0
---
total failed: 2975
---
Files visited: 600

--- SOFTWARE MENTIONS ---
JSON files - number of documents:  605
JSON files - number of documents with at least one software mention:  394
JSON files - number of software mentions:  1671
	     -> subtype standalone: 1276
	     -> subtype environment: 353
	     -> subtype component: 21
	     -> subtype implicit: 21
	     * with software name: 1671
	     * with version: 603
	     * with publisher: 400
	     * with url: 230
	     * with programming language: 11
	     * mentions with at least one reference 282
---
JSON files - number of bibliographical reference markers:  341
JSON files - number of bibliographical references:  153
	      * with DOI: 46
	      * with PMID: 0
	      * with PMC ID: 0
---

which doesn't quite seem right?

jameshowison · 2023-07-27T16:55:19Z

The client.log shows errors, I don't think anything processed successfully?

INFO:root:blacklist size: 533
ERROR:root:The request failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/software_mentions_client/client.py", line 489, in annotate
    response = requests.post(url, files=the_file, data = {'disambiguate': 1}, timeout=self.config["timeout"])
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 697, in send
    adapter = self.get_adapter(url=request.url)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 794, in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for 'server_software_mentions:8060/service/annotateSoftwareTEI'
ERROR:root:The request failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/software_mentions_client/client.py", line 489, in annotate
    response = requests.post(url, files=the_file, data = {'disambiguate': 1}, timeout=self.config["timeout"])
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 697, in send
    adapter = self.get_adapter(url=request.url)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 794, in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for 'server_software_mentions:8060/service/annotateSoftwareTEI'
ERROR:root:The request failed

jameshowison · 2023-07-27T17:02:46Z

Ah, those errors seem related to leaving http:// off the config/software_mentions_url.
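The traceback above shows requests being handed 'server_software_mentions:8060/service/annotateSoftwareTEI' with no scheme, which is why no connection adapter matched. A small guard like the following could catch that in a config value before any request is sent. This is an illustrative sketch, not part of the client; the helper name `with_http_scheme` is hypothetical.

```python
def with_http_scheme(url, default_scheme="http"):
    """Return url unchanged if it already has a scheme, else prepend one."""
    url = url.strip()
    if "://" in url:
        return url
    return f"{default_scheme}://{url}"


if __name__ == "__main__":
    print(with_http_scheme("server_software_mentions:8060"))
```

Checking for "://" rather than parsing the URL is deliberate here: a bare host:port string is easy to misread as scheme:path by a general-purpose URL parser.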

Now I'm seeing different errors (always a sign of progress). I'm seeing these in the server log (in the window where I ran docker-compose up). They seem likely related to Mac M1/Docker incompatibility.

INFO  [2023-07-27 17:00:09,124] org.grobid.core.jni.JEPThreadPool: Creating JEP instance for thread 37
screenit-softcite-server_software_mentions-1  | The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
screenit-softcite-server_software_mentions-1  | 
screenit-softcite-server_software_mentions-1  | 
screenit-softcite-server_software_mentions-1  | qemu: uncaught target signal 6 (Aborted) - core dumped

I guess I could try building the server image for mac, although I'm unsure that the GPU stuff will work.

Gah, Docker was meant to save us from this nonsense :)

I will try with the scienceminer server (although eventually I do hope this pipeline can work on its own).

jameshowison · 2023-07-27T17:14:36Z

Seems like this is related to TensorFlow under M1 architecture emulation. There seems to be likely progress here: tensorflow/tensorflow#52845 (comment)

I think this might require rebuilding the server image with updated tensorflow, then it might work with the X86 emulation. Or we could build mac silicon docker image as well? I'll see if I can get that to build.

kermitt2 · 2023-07-27T19:09:04Z

> Ah, those errors seem related to leaving http:// off the config/software_mentions_url.

Yes, in your config server_software_mentions:8060 is likely localhost:8060.

> screenit-softcite-server_software_mentions-1 | The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.

Normally this is just a warning, it indicates that your TensorFlow library could run faster on your particular CPU if it was specifically built for it.

> Gah, Docker was meant to save us from this nonsense :)

Ahh yes, the image is built for x86, not for ARM... We should try to build a multi-architecture Docker image, but two images might be easier. A bit disappointing for the M1 emulation!

jameshowison · 2023-07-27T19:39:40Z

Yeah, I'm trying to build the image from source (following instructions from the software-mentions repo). The docker build is getting to layer 38

[stage-1 38/38] RUN ./gradlew clean assemble install --no-daemon --stacktrace --info -x test

and then hanging, seemingly forever (the buildx build bit was a second attempt, me trying to improve things)

No idea why. Docker has loads of resources (shows free space and RAM).

If you have a way of building an image for the mac silicon M1 that would be great. Might be as simple as adding --platform linux/aarch64 (but probably not :)
