Fix Elsevier onion's ring API wrapping #18

Draft: wants to merge 32 commits into master

lfoppiano
Contributor

@lfoppiano lfoppiano commented Sep 6, 2024

This PR fixes the onion ring of flavourless JATS added by the Elsevier API

Can be tested on

curl --location 'https://lfoppiano-pub2tei-dev.hf.space/service/processXML'  --form 'input=@"elsevier_file"'

Credits to: @laurentromary

laurentromary and others added 2 commits September 6, 2024 16:44
Intercepts the envelope given by the API and extracts the article element for further processing.
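
To illustrate the unwrapping idea described in the commit message, here is a minimal Python sketch. It is not the code from this PR, which does the work inside the conversion pipeline. The function name `extract_article` is hypothetical, and the sketch assumes the article is a `ja:article` element somewhere inside the `full-text-retrieval-response` envelope, using the `ja` namespace URI declared in the API response.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:ja declaration in the API response.
JA_NS = "http://www.elsevier.com/xml/ja/dtd"


def extract_article(envelope_xml):
    """Return the first ja:article element in the envelope, or None.

    Assumption: the article sits at an unknown depth inside the envelope,
    so the whole tree is searched instead of following a fixed path.
    """
    root = ET.fromstring(envelope_xml)
    # iter() walks the root and all descendants in document order.
    return next(root.iter(f"{{{JA_NS}}}article"), None)
```

Searching by qualified name avoids hard-coding the wrapper elements around the article, which may differ between response types.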
@avsm

avsm commented Dec 3, 2024

Thanks for this fix! I'm trying it with the Elsevier TDM responses and have a prefix like this

<?xml version="1.0" encoding="UTF-8"?>
<full-text-retrieval-response xmlns="http://www.elsevier.com/xml/svapi/article/dtd" xmlns:bk="http://www.elsevier.com/xml/bk/dtd" xmlns:cals="http://www.elsevier.com/xml/common/cals/dtd" xmlns:ce="http://www.elsevier.com/xml/common/dtd" xmlns:ja="http://www.elsevier.com/xml/ja/dtd" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:sa="http://www.elsevier.com/xml/common/struct-aff/dtd" xmlns:sb="http://www.elsevier.com/xml/common/struct-bib/dtd" xmlns:tb="http://www.elsevier.com/xml/common/table/dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <coredata>
    <prism:url>
      https://api.elsevier.com/content/article/pii/S0378112700004047
    </prism:url>
    <dc:identifier>
      doi:10.1016/S0378-1127(00)00404-7
    </dc:identifier>
    <eid>
      1-s2.0-S0378112700004047
    </eid>
    <prism:doi>
      10.1016/S0378-1127(00)00404-7
    </prism:doi>
    <pii>
      S0378-1127(00)00404-7
    </pii>
    <dc:title>
      Winter activity of mammals in riparian zones and adjacent forests prior to and following clear-cutting at Copper Lake, Newfoundland, Canada
    </dc:title>
    <prism:publicationName>
      Forest Ecology and Management
    </prism:publicationName>

It doesn't quite work with the huggingface instance you set up though; I'm getting back

> POST /service/processXML HTTP/2
> Host: lfoppiano-pub2tei-dev.hf.space
> User-Agent: curl/8.9.1
> Accept: */*
> Content-Length: 149481
> Content-Type: multipart/form-data; boundary=------------------------RSn3trg7kFu3aIiFmROBNi
> 
* upload completely sent off: 149481 bytes
< HTTP/2 200 
< date: Tue, 03 Dec 2024 12:09:32 GMT
< content-type: application/xml; charset=UTF-8
< content-length: 39
< vary: Accept-Encoding
< vary: origin, access-control-request-method, access-control-request-headers
< x-proxied-host: http://10.27.71.227
< x-proxied-path: /service/processXML
< link: <https://huggingface.co/spaces/lfoppiano/pub2tei-dev>;rel="canonical"
< x-request-id: T4BqgE
< access-control-allow-credentials: true
< 
<?xml version="1.0" encoding="UTF-8"?>

And then the connection is closed. I'll try to set up a Docker instance of your PR locally later, but thought I'd check with you first on this.
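
The `content-length: 39` above is consistent with a body that holds only the XML declaration plus a newline, even though the status is 200. A client can catch that case before handing the result to a parser. The following is a small sketch under that assumption; the helper name `is_declaration_only` is hypothetical and not part of pub2tei.

```python
import re

# Optional UTF-8 BOM, leading whitespace, then an XML declaration.
_XML_DECL = re.compile(rb"^(?:\xef\xbb\xbf)?\s*<\?xml[^>]*\?>")


def is_declaration_only(body):
    """True if the body is empty or holds nothing beyond an XML declaration."""
    # Remove one leading declaration, then check that only whitespace remains.
    return _XML_DECL.sub(b"", body, count=1).strip() == b""
```

A batch client could use this to flag such responses as failed conversions instead of trusting the HTTP status code.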

@lfoppiano
Contributor Author

@avsm The huggingface instance has been redeployed with a different image. It was set up for testing this feature, but then I needed it back, so it's expected that it no longer works.

@avsm

avsm commented Dec 4, 2024

Thanks, I set up my own Docker instance to test it out. The patch does work on some files, but for a lot of them I'm now getting this backtrace.

172.17.0.1 - - [04/Dec/2024:08:46:59 +0000] "POST /service/processXML HTTP/1.1" 200 39 "-" "curl/8.9.1" 11                                                            
Warning                                                                                                                                                               
  XTMM9000: Converting an Elsevier article obtained from the Elsevier API                                                                                             
[Fatal Error] :-1:-1: Premature end of file.                                                                                                                          
ERROR [2024-12-04 08:47:02,742] org.pub2tei.document.DocumentProcessor: An error occured while processing the tei document                                            
! org.xml.sax.SAXParseException: Premature end of file.                                                                                                               
! at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)                                                                                                        
! at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)                                                                                                 
! at org.pub2tei.document.DocumentProcessor.processTEI(DocumentProcessor.java:113)                                                                                    
! at org.pub2tei.document.DocumentProcessor.processXML(DocumentProcessor.java:190)                                                                                    
! at org.pub2tei.service.ProcessFile.processXML(ProcessFile.java:54)                                                                                                  
! at org.pub2tei.service.ServiceController.processXML(ServiceController.java:123)                                                                                     
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)                                                                                   
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)                                                                 
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)                                                         
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)                                                                                                       
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)                
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)                             
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)                            
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)     
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)                           
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)                                                                   
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)                                                                    
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)                                                                     
! at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256)                                                                                          
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)                                                                                                     
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)                                                                                                     
! at org.glassfish.jersey.internal.Errors.process(Errors.java:292)                                                                                                    
! at org.glassfish.jersey.internal.Errors.process(Errors.java:274)                                                                                                    
! at org.glassfish.jersey.internal.Errors.process(Errors.java:244)                                                                                                    
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)                                                                             
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)                                                                                        

The file itself does pass XML validation (and has the fulltext response header, since that is logged by the xslt sheet). Can do some more debugging on this later on tomorrow.

@lfoppiano
Contributor Author

Can you please send me one file that works and one that does not at luca AT sciencialab.com?

I'll try to have a look this week

@avsm

avsm commented Dec 4, 2024

Thanks @lfoppiano -- email sent.

@lfoppiano
Contributor Author

@avsm I checked the files you sent. The ones that fail don't have any body; they seem to contain only the bibliographical information. This particular (right side) article does not look like a scientific article, though:

[image]

@avsm

avsm commented Dec 9, 2024

Quite right, I thought I'd filtered on article type, but in some cases the Elsevier API also seems to return a blank body where the PDF article isn't OCRed (for old papers). I'll try it tomorrow on clean papers after refining the query, but this PR is clearly an improvement already and good to merge from my perspective. Thanks for looking at those example files so promptly!
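
Since the failing files turned out to have no body, one option is to filter them out before submitting them for conversion. The sketch below is only an illustration under an assumption: that the full text, when present, lives in a `ja:body` element (the `ja` namespace URI is the one declared in the API response). The function name `has_article_body` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:ja declaration in the API response.
JA_NS = "http://www.elsevier.com/xml/ja/dtd"


def has_article_body(envelope_xml):
    """True if some ja:body element in the response contains non-blank text.

    Assumption: a response without usable full text either has no ja:body
    element at all or has one with no text in it or its descendants.
    """
    root = ET.fromstring(envelope_xml)
    for body in root.iter(f"{{{JA_NS}}}body"):
        # itertext() yields the text of the element and of all descendants.
        if any(chunk.strip() for chunk in body.itertext()):
            return True
    return False
```

Responses for which this returns `False` could be skipped or logged as metadata-only records.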

3 participants