diff --git a/CHANGELOG.md b/CHANGELOG.md index b52694ca49..1bd8938507 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,34 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/). +## [0.8.0] - 2023-11-19 + +### Added + ++ Extraction of funder and funding information with a specific new model, see https://github.com/kermitt2/grobid/pull/1046 for details ++ Optional consolidation of funder with CrossRef Funder Registry ++ Identification of acknowledged entities in the acknowledgement section ++ Optional coordinates in title elements + +### Changed + ++ Dropwizard upgrade to 4.0 ++ Minimum JDK/JVM requirement for building/running the project is now 1.11 ++ Logging now with Logback, removal of Log4j2, optional logs in json format ++ General review of logs ++ Enable Github actions / Disable circleci + +### Fixed + ++ Set dynamic memory limit in pdfalto_server #1038 ++ Logging in files when training models work now as expected ++ Various dependency upgrades ++ Fix #1051 with possible problematic PDF ++ Fix #1036 for pdfalto memory limit ++ fix readthedocs build #1040 ++ fix for null equation #1030 ++ Other minor fixes + ## [0.7.3] – 2023-05-13 ### Added diff --git a/Dockerfile.delft b/Dockerfile.delft index 77629d6060..8728448841 100644 --- a/Dockerfile.delft +++ b/Dockerfile.delft @@ -2,14 +2,14 @@ ## See https://grobid.readthedocs.io/en/latest/Grobid-docker/ -## usage example with version 0.7.3: -## docker build -t grobid/grobid:0.7.3 --build-arg GROBID_VERSION=0.7.3 --file Dockerfile.delft . +## usage example with version 0.8.0: +## docker build -t grobid/grobid:0.8.0 --build-arg GROBID_VERSION=0.8.0 --file Dockerfile.delft . ## no GPU: -## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.3 +## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.8.0 ## allocate all available GPUs (only Linux with proper nvidia driver installed on host machine): -## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.3 +## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.8.0 # ------------------- # build builder image diff --git a/build.gradle b/build.gradle index d7c160b3f0..bde1f124db 100644 --- a/build.gradle +++ b/build.gradle @@ -46,8 +46,8 @@ subprojects { publishing { publications { mavenJava(MavenPublication) { - //from components.java - artifact jar + from components.java + //artifact jar } } repositories { diff --git a/doc/Deep-Learning-models.md b/doc/Deep-Learning-models.md index 9b1c6acaf4..a10e824706 100644 --- a/doc/Deep-Learning-models.md +++ b/doc/Deep-Learning-models.md @@ -8,7 +8,7 @@ These architectures have been tested on Linux 64bit and macOS 64bit. The support Integration is realized via Java Embedded Python [JEP](https://github.com/ninia/jep), which uses a JNI of CPython. 
This integration is two times faster than the Tensorflow Java API and significantly faster than RPC serving (see [here](https://www.slideshare.net/FlinkForward/flink-forward-berlin-2017-dongwon-kim-predictive-maintenance-with-apache-flink)). Additionally, it does not require modifying DeLFT, as would be the case with a Py4J gateway (socket-based). -There are currently no neural model for the segmentation and the fulltext models, because the input sequences for these models are too large for the current supported Deep Learning architectures. The problem would need to be formulated differently for these tasks or to use alternative DL architectures (with sliding window, etc.). +There is currently no neural model for the fulltext model, because the input sequences for this model are too large for the currently supported Deep Learning architectures. The problem would need to be formulated differently for this task, or alternative DL architectures (with sliding window, etc.) would be required. Low-level models not using layout features (author names, dates, affiliations...) usually perform better than CRF and do not require a feature channel. When layout features are involved, neural models with an additional feature channel should be preferred (e.g. `BidLSTM_CRF_FEATURES` in DeLFT) to those without a feature channel. @@ -20,7 +20,7 @@ Current neural models can be up to 50 times slower than CRF, depending on the ar By default, only CRF models are used by Grobid. You need to select the Deep Learning models you would like to use in the GROBID configuration yaml file (`grobid/grobid-home/config/grobid.yaml`). See [here](https://grobid.readthedocs.io/en/latest/Configuration/#configuring-the-models) for more details on how to select these models. The most convenient way to use the Deep Learning models is to use the full GROBID Docker image and pass a configuration file at launch of the container describing the selected models to be used instead of the default CRF ones. Note that the full GROBID Docker image is already configured to use Deep Learning models for bibliographical reference and affiliation-address parsing. -For current GROBID version 0.7.3, we recommend considering the usage of the following Deep Learning models: +For the current GROBID version 0.8.0, we recommend considering the use of the following Deep Learning models: - `citation` model: for bibliographical parsing, the `BidLSTM_CRF_FEATURES` architecture currently provides the best accuracy, significantly better than CRF (+3 to +5 points in F1-Score). With a GPU, there is normally no runtime impact by selecting this model. The fine-tuned SciBERT model currently performs at lower accuracy. @@ -30,6 +30,8 @@ For current GROBID version 0.7.3, we recommend considering the usage of the foll - `header` model: this model extracts the header metadata; the `BidLSTM_CRF_FEATURES` or `BidLSTM_ChainCRF_FEATURES` (a faster variant) provides slightly better results than CRF, especially for less mainstream domains and publishers. With a GPU, there is normally almost no runtime impact by selecting this DL model. +- `funding-acknowledgement` model: this is a typical NER model that extracts funder names, funding information, acknowledged persons and organizations, etc. The `BidLSTM_CRF_FEATURES` architecture provides more accurate results than CRF. + Other Deep Learning models do not show better accuracy than old-school CRF according to our benchmarks, so we do not recommend using them in general at this stage.
However, some of them tend to be more portable and can be more reliable than CRF for document layouts and scientific domains far from what is available in the training data. Finally, the model `fulltext` (structuring the content body of a document) is currently only based on CRF, due to the long input sequences to be processed. diff --git a/doc/Frequently-asked-questions.md b/doc/Frequently-asked-questions.md index 5eb1a1ca76..b24a1a78b4 100644 --- a/doc/Frequently-asked-questions.md +++ b/doc/Frequently-asked-questions.md @@ -11,26 +11,52 @@ Exploiting the `503` mechanism is already implemented in the different GROBID cl ## Could we have some guidance for server configuration in production? -The exact server configuration will depend on the service you want to call. We present here the configuration used to process with `processFulltextDocument` around 10.6 PDF per second (around 915,000 PDF per day, around 20M pages per day) with the node.js client listed above during one week on a 16 CPU machine (16 threads, 32GB RAM, no SDD). It ran without any crash during 7 days at this rate. We processed 11.3M PDF in a bit less than 7 days with two 16-CPU servers like that in one of our projects. +The exact server configuration will depend on the service you want to call, the models selected in the Grobid configuration file (`grobid-home/config/grobid.yaml`) and the availability of a GPU. We consider here the complete full-text processing of PDFs (`processFulltextDocument`). -- if your server has 8-10 threads available, you can use the default settings of the docker image, otherwise you would rather need to build and start the service yourself to tune the parameters +1) Using CRF models only, for example via the lightweight Docker image (https://hub.docker.com/r/lfoppiano/grobid/tags) -- keep the concurrency at the client (number of simultaneous calls) slightly higher than the available number of threads at the server side, for instance if the server has 16 threads, use a concurrency between 20 and 24 (it's the option `n` in the above mentioned clients, in my case I used 24) +- in `grobid/grobid-home/config/grobid.yaml` set the parameter `concurrency` to your number of available threads at the server side or slightly higher (e.g. 16 to 20 for a 16-thread machine) -- in `grobid/grobid-home/config/grobid.yaml` set the parameter `concurrency` to your number of available thread at server side or slightly higher (e.g. 16 to 20 for a 16 threads-machine, in my case I used 20) +- keep the concurrency at the client (number of simultaneous calls) slightly higher than the available number of threads at the server side, for instance if the server has 16 threads, use a concurrency between 20 and 24 (it's the option `n` in the above-mentioned clients) -- set `modelPreload` to `true`in `grobid/grobid-home/config/grobid.yaml`, it will avoid some strange behavior at launch +These settings will ensure that the CPUs are fully used when processing a large set of PDFs. -- in the query, `consolidateHeader` can be `1` or `2` if you are using the biblio-glutton or CrossRef consolidation. It significantly improves the accuracy and add useful metadata. +For example, with these settings, we processed with `processFulltextDocument` around 10.6 PDF per second (around 915,000 PDF per day, around 20M pages per day) with the node.js client during one week on a 16-CPU machine (16 threads, 32GB RAM, no SSD). It ran without any crash during 7 days at this rate.
We processed 11.3M PDF in a bit less than 7 days with two 16-CPU servers in one of our projects. -- If you want to consolidate all the bibliographical references and use `consolidateCitations` as `1` or `2`, CrossRef query rate limit will avoid scale to more than 1 document per second... For scaling the bibliographical reference resolution, you will need to use a local consolidation service, [biblio-glutton](https://github.com/kermitt2/biblio-glutton). The overall capacity will depend on the biblio-glutton service then, and the number of elasticsearch nodes you can exploit. From experience, it is difficult to go beyond 300K PDF per day when using consolidation for every extracted bibliographical references. +Note: if your server has 8-10 threads available, you can use the default settings of the docker image; otherwise you will need to modify the configuration file to tune the parameters, as [documented](Configuration.md). + +2) Using Deep Learning models, for example via the full Docker image () + +2.1) If the server has a GPU + +In case the server has a GPU, which has its own memory, the Deep Learning inferences are automatically parallelized on this GPU, without impacting the CPU and RAM memory. The settings given above in 1) can normally be used similarly. + +2.2) If the server has CPU only + +When Deep Learning models run on CPU as a fallback, the CPUs are used more intensively (DL models push CPU computation quite a lot), more irregularly (Deep Learning models are called at certain points in the overall process, but not continuously), and additional RAM is used to load those larger models. For the DL inference on CPU, an additional thread is created, allocating its own memory. CPU usage at peaks can be up to 2 times higher, with approximately up to 50% more memory. + +The settings should thus be adjusted as follows: + +- in `grobid/grobid-home/config/grobid.yaml` set the parameter `concurrency` to your number of available threads at the server side divided by 2 (e.g. with 8 threads available, set `concurrency` to `4`) + +- keep the concurrency at the client (number of simultaneous calls) at the same level as the `concurrency` parameter at the server side; for instance, if the server has 16 threads, use a server `concurrency` of `8` and a client concurrency of `8` (it's the option `n` in the clients) + +In addition, plan for more RAM when running Deep Learning models on CPU, e.g. 24-32GB of memory with `concurrency` at `8` instead of 16GB. + +3) In general, also consider these settings: + +- Set `modelPreload` to `true` in `grobid/grobid-home/config/grobid.yaml`; it avoids some strange behavior at launch (this is the default setting). + +- Regarding the query parameters, `consolidateHeader` can be `1` or `2` if you are using the biblio-glutton or CrossRef consolidation. It significantly improves the accuracy and adds useful metadata. + +- If you want to consolidate all the bibliographical references and use `consolidateCitations` as `1` or `2`, the CrossRef query rate limit will make it impossible to scale beyond 1 document per second (Grobid would typically spend 90% or more of its time waiting for CrossRef API responses). For scaling bibliographical reference resolution, you will need to use a local consolidation service, [biblio-glutton](https://github.com/kermitt2/biblio-glutton). The overall capacity will then depend on the biblio-glutton service and the number of elasticsearch nodes you can exploit.
From experience, it is difficult to go beyond 300K PDF per day when using consolidation for every extracted bibliographical reference. ## I would also like to extract images from PDFs You will get the embedded images converted into `.png` by using the normal batch command. For instance: ```console -java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -dIn ~/test/in/ -dOut ~/test/out -exe processFullText +java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn ~/test/in/ -dOut ~/test/out -exe processFullText ``` There is a web service doing the same, returning everything in a big zip file, `processFulltextAssetDocument`, still usable but deprecated. diff --git a/doc/Grobid-batch.md b/doc/Grobid-batch.md index 7022bc7800..bd936935ea 100644 --- a/doc/Grobid-batch.md +++ b/doc/Grobid-batch.md @@ -20,7 +20,7 @@ The following command display some help for the batch commands: Be sure to replace `` with the current version of GROBID that you have installed and built. For example: ```bash -> java -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -h +> java -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -h ``` The available batch commands are listed below. For those commands, at least `-Xmx1G` is used to set the JVM memory to avoid *OutOfMemoryException* given the current size of the Grobid models and the craziness of some PDFs. For complete fulltext processing, which involves all the GROBID models, `-Xmx4G` is recommended (although allocating less memory is usually fine).
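The same complete fulltext processing can also be invoked programmatically when GROBID is used as a Java library rather than through the one-jar batch command. The following is a minimal sketch only: it assumes the standard GROBID library entry points (`GrobidHomeFinder`, `GrobidFactory`, `Engine.fullTextToTEI` and the `GrobidAnalysisConfig` builder), and the grobid-home and PDF paths are illustrative.

```java
import java.io.File;
import java.util.Arrays;

import org.grobid.core.engines.Engine;
import org.grobid.core.engines.config.GrobidAnalysisConfig;
import org.grobid.core.factory.GrobidFactory;
import org.grobid.core.main.GrobidHomeFinder;
import org.grobid.core.utilities.GrobidProperties;

public class FullTextExample {
    public static void main(String[] args) throws Exception {
        // point GROBID to its grobid-home directory (illustrative path)
        GrobidHomeFinder grobidHomeFinder = new GrobidHomeFinder(Arrays.asList("/opt/grobid/grobid-home"));
        GrobidProperties.getInstance(grobidHomeFinder);

        Engine engine = GrobidFactory.getInstance().createEngine();

        // equivalent of the -exe processFullText batch command, with header consolidation enabled
        // and citation consolidation disabled (see the consolidation discussion above)
        GrobidAnalysisConfig config = new GrobidAnalysisConfig.GrobidAnalysisConfigBuilder()
                .consolidateHeader(1)
                .consolidateCitations(0)
                .build();

        String tei = engine.fullTextToTEI(new File("/path/to/article.pdf"), config);
        System.out.println(tei);
    }
}
```

As with the batch command, allocate enough JVM memory (e.g. `-Xmx4G`) when running complete document processing this way.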
@@ -42,7 +42,7 @@ The needed parameters for that command are: Example: ```bash -> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader +> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader ``` WARNING: the expected extension of the PDF files to be processed is .pdf @@ -68,7 +68,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf Example: ```bash -> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText +> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText ``` WARNING: the expected extension of the PDF files to be processed is .pdf @@ -82,7 +82,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf Example: ```bash -> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format" +> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format" ``` ### processAuthorsHeader @@ -94,7 +94,7 @@ Example: Example: ```bash -> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors" +> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors" ``` ### processAuthorsCitation @@ -106,7 +106,7 @@ Example: Example: ```bash -> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors" +> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors" ``` ### processAffiliation @@ -118,7 +118,7 @@ Example: Example: ```bash -> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation" +> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation" ``` ### processRawReference @@ -130,7 +130,7 @@ Example: Example: ```bash -> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -exe processRawReference -s "a reference 
string" +> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -exe processRawReference -s "a reference string" ``` ### processReferences @@ -146,7 +146,7 @@ Example: Example: ```bash -> java -Xmx2G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences +> java -Xmx2G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences ``` WARNING: the expected extension of the PDF files to be processed is `.pdf` @@ -162,7 +162,7 @@ WARNING: the expected extension of the PDF files to be processed is `.pdf` Example: ```bash -> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentST36 +> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentST36 ``` WARNING: extension of the ST.36 files to be processed must be `.xml` @@ -178,7 +178,7 @@ WARNING: extension of the ST.36 files to be processed must be `.xml` Example: ``` -> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT +> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT ``` WARNING: extension of the text files to be processed must be `.txt`, and expected encoding is `UTF-8` @@ -194,7 +194,7 @@ WARNING: extension of the text files to be processed must be `.txt`, and expecte Example: ``` -> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF +> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF ``` WARNING: extension of the text files to be processed must be `.pdf` @@ -210,7 +210,7 @@ WARNING: extension of the text files to be processed must be `.pdf` Example: ```bash -> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining +> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining ``` WARNING: the expected extension 
of the PDF files to be processed is `.pdf` @@ -226,7 +226,7 @@ WARNING: the expected extension of the PDF files to be processed is `.pdf` Example: ```bash -> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank +> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank ``` WARNING: the expected extension of the PDF files to be processed is `.pdf` @@ -244,7 +244,7 @@ The needed parameters for that command are: Example: ```bash -> java -Xmx2G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation +> java -Xmx2G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation ``` WARNING: extension of the text files to be processed must be `.pdf` diff --git a/doc/Grobid-docker.md b/doc/Grobid-docker.md index 6517d1cf60..fd839dc6fb 100644 --- a/doc/Grobid-docker.md +++ b/doc/Grobid-docker.md @@ -26,13 +26,13 @@ The process for retrieving and running the image is as follow: Current latest version: ```bash -> docker pull grobid/grobid:0.7.3 +> docker pull grobid/grobid:0.8.0 ``` - Run the container: ```bash -> docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.7.3 +> docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.0 ``` The image will automatically uses the GPU and CUDA version available on your host machine, but only on Linux. GPU usage via a container on Windows and MacOS machine is currently not supported by Docker. If no GPU are available, CPU will be used. @@ -88,7 +88,7 @@ The process for retrieving and running the image is as follow: Latest version: ```bash -> docker pull lfoppiano/grobid:0.7.3 +> docker pull lfoppiano/grobid:0.8.0 ``` - Run the container: @@ -100,7 +100,7 @@ Latest version: Latest version: ```bash -> docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.7.3 +> docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0 ``` Note the default version is running on port `8070`, however it can be mapped on the more traditional port `8080` of your host with the following command: @@ -121,7 +121,7 @@ Grobid web services are then available as described in the [service documentatio The simplest way to pass a modified configuration to the docker image is to mount the yaml GROBID config file `grobid.yaml` when running the image. 
Modify the config file `grobid/grobid-home/config/grobid.yaml` according to your requirements on the host machine and mount it when running the image as follow: ```bash -docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.7.3 +docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.8.0 ``` You need to use an absolute path to specify your modified `grobid.yaml` file. @@ -222,25 +222,25 @@ Without this requirement, the image might default to CPU, even if GPU are availa For being able to use both CRF and Deep Learningmodels, use the dockerfile `./Dockerfile.delft`. The only important information then is the version which will be checked out from the tags. ```bash -> docker build -t grobid/grobid:0.7.3 --build-arg GROBID_VERSION=0.7.3 --file Dockerfile.delft . +> docker build -t grobid/grobid:0.8.0 --build-arg GROBID_VERSION=0.8.0 --file Dockerfile.delft . ``` Similarly, if you want to create a docker image from the current master, development version: ```bash -docker build -t grobid/grobid:0.8.0-SNAPSHOT --build-arg GROBID_VERSION=0.8.0-SNAPSHOT --file Dockerfile.delft . +docker build -t grobid/grobid:0.8.1-SNAPSHOT --build-arg GROBID_VERSION=0.8.1-SNAPSHOT --file Dockerfile.delft . ``` -In order to run the container of the newly created image, for example for the development version `0.8.0-SNAPSHOT`, using all GPU available: +In order to run the container of the newly created image, for example for the development version `0.8.1-SNAPSHOT`, using all GPU available: ```bash -> docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 grobid/grobid:0.8.0-SNAPSHOT +> docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 grobid/grobid:0.8.1-SNAPSHOT ``` In practice, you need to indicate which models should use a Deep Learning model implementation and which ones can remain with a faster CRF model implementation, which is done currently in the `grobid.yaml` file. Modify the config file `grobid/grobid-home/config/grobid.yaml` accordingly on the host machine and mount it when running the image as follow: ```bash -docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.8.0-SNAPSHOT +docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.8.1-SNAPSHOT ``` You need to use an absolute path to specify your modified `grobid.yaml` file. @@ -262,19 +262,19 @@ The container name is given by the command: For building a CRF-only image, the dockerfile to be used is `./Dockerfile.crf`. The only important information then is the version which will be checked out from the tags. ```bash -> docker build -t grobid/grobid:0.7.3 --build-arg GROBID_VERSION=0.7.3 --file Dockerfile.crf . +> docker build -t grobid/grobid:0.8.0 --build-arg GROBID_VERSION=0.8.0 --file Dockerfile.crf . ``` Similarly, if you want to create a docker image from the current master, development version: ```bash -> docker build -t grobid/grobid:0.8.0-SNAPSHOT --build-arg GROBID_VERSION=0.8.0-SNAPSHOT --file Dockerfile.crf . 
+> docker build -t grobid/grobid:0.8.1-SNAPSHOT --build-arg GROBID_VERSION=0.8.1-SNAPSHOT --file Dockerfile.crf . ``` -In order to run the container of the newly created image, for example for version `0.7.3`: +In order to run the container of the newly created image, for example for version `0.8.1`: ```bash -> docker run --rm --init --ulimit core=0 -p 8080:8070 -p 8081:8071 grobid/grobid:0.7.3 +> docker run --rm --init --ulimit core=0 -p 8080:8070 -p 8081:8071 grobid/grobid:0.8.1 ``` For testing or debugging purposes, you can connect to the container with a bash shell (logs are under `/opt/grobid/logs/`): diff --git a/doc/Grobid-java-library.md b/doc/Grobid-java-library.md index ab8695630f..7c2e99f535 100644 --- a/doc/Grobid-java-library.md +++ b/doc/Grobid-java-library.md @@ -9,7 +9,7 @@ The second option is of course to build yourself Grobid and to use the generated ## Using maven -The Java artefacts of the latest GROBID release (0.7.3) are uploaded on a DIY repository. +The Java artefacts of the latest GROBID release (0.8.0) are uploaded on a DIY repository. You need to add the following snippet in your `pom.xml` in order to configure it: @@ -29,19 +29,19 @@ Here an example of `grobid-core` dependency: org.grobid grobid-core - 0.7.3 + 0.8.0 ``` -If you want to work on a SNAPSHOT development version, you need to download and build the current master yourself, and include in your pom file the path to the local snapshot Grobid jar file, for instance as follow (if necessary replace `0.8.0-SNAPSHOT` by the valid ``): +If you want to work on a SNAPSHOT development version, you need to download and build the current master yourself, and include in your pom file the path to the local snapshot Grobid jar file, for instance as follow (if necessary replace `0.8.1-SNAPSHOT` by the valid ``): ```xml org.grobid grobid-core - 0.8.0-SNAPSHOT + 0.8.1-SNAPSHOT system - ${project.basedir}/lib/grobid-core-0.8.0-SNAPSHOT.jar + ${project.basedir}/lib/grobid-core-0.8.1-SNAPSHOT.jar ``` @@ -59,8 +59,8 @@ Add the following snippet in your gradle.build file: and add the Grobid dependency as well: ``` - implement 'org.grobid:grobid-core:0.7.3' - implement 'org.grobid:grobid-trainer:0.7.3' + implement 'org.grobid:grobid-core:0.8.0' + implement 'org.grobid:grobid-trainer:0.8.0' ``` ## API call diff --git a/doc/Grobid-service.md b/doc/Grobid-service.md index f51957d4eb..ccd333a571 100644 --- a/doc/Grobid-service.md +++ b/doc/Grobid-service.md @@ -27,9 +27,9 @@ From a development installation, you can also build and install the service as a cd .. 
mkdir grobid-installation cd grobid-installation -unzip ../grobid/grobid-service/build/distributions/grobid-service-0.7.3.zip -mv grobid-service-0.7.3 grobid-service -unzip ../grobid/grobid-home/build/distributions/grobid-home-0.7.3.zip +unzip ../grobid/grobid-service/build/distributions/grobid-service-0.8.0.zip +mv grobid-service-0.8.0 grobid-service +unzip ../grobid/grobid-home/build/distributions/grobid-home-0.8.0.zip ./grobid-service/bin/grobid-service ``` diff --git a/doc/Install-Grobid.md b/doc/Install-Grobid.md index f207aff4c2..01d2da98ef 100644 --- a/doc/Install-Grobid.md +++ b/doc/Install-Grobid.md @@ -8,17 +8,17 @@ Note: Java/JDK 8 is not supported anymore from Grobid version `0.8.0` and the mi ### Latest stable release -The [latest stable release](https://github.com/kermitt2/grobid#latest-version) of GROBID is version ```0.7.3``` which can be downloaded as follow: +The [latest stable release](https://github.com/kermitt2/grobid#latest-version) of GROBID is version ```0.8.0``` which can be downloaded as follow: ```bash -> wget https://github.com/kermitt2/grobid/archive/0.7.3.zip -> unzip 0.7.3.zip +> wget https://github.com/kermitt2/grobid/archive/0.8.0.zip +> unzip 0.8.0.zip ``` or using the [docker](Grobid-docker.md) container. ### Current development version -The current development version is ```0.8.0-SNAPSHOT```, which can be downloaded from GitHub and built as follow: +The current development version is ```0.8.1-SNAPSHOT```, which can be downloaded from GitHub and built as follow: Clone source code from github: ```bash diff --git a/doc/Notes-grobid-developers.md b/doc/Notes-grobid-developers.md index 2c916fc1d3..151d17ffd5 100644 --- a/doc/Notes-grobid-developers.md +++ b/doc/Notes-grobid-developers.md @@ -9,11 +9,11 @@ The idea anyway is that people will use Grobid with the Docker image, the servic In order to make a new release: -+ tag the project branch to be releases, for instance a version `0.7.3`: ++ tag the project branch to be releases, for instance a version `0.8.0`: ``` -> git tag 0.7.3 -> git push origin 0.7.3 +> git tag 0.8.0 +> git push origin 0.8.0 ``` + create a github release: the easiest is to use the GitHub web interface @@ -55,7 +55,7 @@ for maven projects: org.grobid grobid-core - 0.7.3 + 0.8.0 ``` diff --git a/doc/Run-Grobid.md b/doc/Run-Grobid.md index 2771bdea80..673a2c20c5 100644 --- a/doc/Run-Grobid.md +++ b/doc/Run-Grobid.md @@ -9,13 +9,13 @@ For convenience, we provide two docker images: - the **full** image provides the best accuracy, because it includes all the required python and TensorFlow libraries, GPU support and all Deep Learning model resources. However it requires more resources, ideally a GPU (it will be automatically detected on Linux). If you have a limited amount of PDF, a good machine, and prioritize accuracy, use this Grobid flavor. To run this version of Grobid, the command is: ```console -docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.7.3 +docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.0 ``` - the **lightweight** image offers best runtime performance, memory usage and Docker image size. However, it does not use some of the best performing models in term of accuracy. 
If you have a lot of PDF to process, a low resource system, and accuracy is not so important, use this flavor: ```console -docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.7.3 +docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0 ``` More documentation on the Docker images can be found [here](Grobid-docker.md). diff --git a/gradle.properties b/gradle.properties index 3877ccff7b..71e51bf250 100644 --- a/gradle.properties +++ b/gradle.properties @@ -1,4 +1,4 @@ -version=0.8.0-SNAPSHOT +version=0.8.0 # Set workers to 1 that even for parallel builds it works. (I guess the shadow plugin makes some trouble) org.gradle.workers.max=1 org.gradle.caching = true diff --git a/grobid-core/src/main/java/org/grobid/core/engines/Engine.java b/grobid-core/src/main/java/org/grobid/core/engines/Engine.java index 8a73f1911c..4ea16f6796 100755 --- a/grobid-core/src/main/java/org/grobid/core/engines/Engine.java +++ b/grobid-core/src/main/java/org/grobid/core/engines/Engine.java @@ -141,7 +141,7 @@ public List processDate(String dateBlock) throws IOEx }*/ /** - * Apply a parsing model for a given single raw reference string based on CRF + * Apply a parsing model for a given single raw reference string * * @param reference the reference string to be processed * @param consolidate the consolidation option allows GROBID to exploit Crossref web services for improving header @@ -157,7 +157,7 @@ public BiblioItem processRawReference(String reference, int consolidate) { } /** - * Apply a parsing model for a set of raw reference text based on CRF + * Apply a parsing model for a set of raw reference text * * @param references the list of raw reference strings to be processed * @param consolidate the consolidation option allows GROBID to exploit Crossref web services for improving header @@ -230,7 +230,7 @@ public Engine(boolean loadModels) { } /** - * Apply a parsing model to the reference block of a PDF file based on CRF + * Apply a parsing model to the reference block of a PDF file * * @param inputFile the path of the PDF file to be processed * @param consolidate the consolidation option allows GROBID to exploit Crossref web services for improving header @@ -245,7 +245,7 @@ public List processReferences(File inputFile, int consolidate) { } /** - * Apply a parsing model to the reference block of a PDF file based on CRF + * Apply a parsing model to the reference block of a PDF file * * @param inputFile the path of the PDF file to be processed * @param md5Str MD5 digest of the PDF file to be processed @@ -335,7 +335,7 @@ public Language runLanguageId(String filePath) { } /** - * Apply a parsing model for the header of a PDF file based on CRF, using + * Apply a parsing model for the header of a PDF file, using * first three pages of the PDF * * @param inputFile the path of the PDF file to be processed @@ -362,7 +362,36 @@ public String processHeader( } /** - * Apply a parsing model for the header of a PDF file based on CRF, using + * Apply a parsing model for the header of a PDF file combined with an extraction and parsing of + * funding information (outside the header possibly) + * + * @param inputFile the path of the PDF file to be processed + * @param consolidateHeader the consolidation option allows GROBID to exploit Crossref web services for improving header + * information. 
0 (no consolidation, default value), 1 (consolidate the citation and inject extra + * metadata) or 2 (consolidate the citation and inject DOI only) + * @param consolidateFunder the consolidation option allows GROBID to exploit Crossref Funder Registry web services for improving header + * information. 0 (no consolidation, default value), 1 (consolidate the citation and inject extra + * metadata) or 2 (consolidate the citation and inject DOI only) + * @param result bib result + * @return the TEI representation of the extracted bibliographical + * information + */ + public String processHeaderFunding( + File inputFile, + int consolidateHeader, + int consolidateFunders, + boolean includeRawAffiliations + ) throws Exception { + GrobidAnalysisConfig config = new GrobidAnalysisConfig.GrobidAnalysisConfigBuilder() + .consolidateHeader(consolidateHeader) + .consolidateFunders(consolidateFunders) + .includeRawAffiliations(includeRawAffiliations) + .build(); + return processHeaderFunding(inputFile, null, config); + } + + /** + * Apply a parsing model for the header of a PDF file, using * first three pages of the PDF * * @param inputFile the path of the PDF file to be processed @@ -391,7 +420,38 @@ public String processHeader( } /** - * Apply a parsing model for the header of a PDF file based on CRF, using + * Apply a parsing model for the header of a PDF file combined with an extraction and parsing of + * funding information (outside the header possibly) + * + * @param inputFile the path of the PDF file to be processed + * @param md5Str MD5 digest of the processed file + * @param consolidateHeader the consolidation option allows GROBID to exploit Crossref web services for improving header + * information. 0 (no consolidation, default value), 1 (consolidate the citation and inject extra + * metadata) or 2 (consolidate the citation and inject DOI only) + * @param consolidateFunder the consolidation option allows GROBID to exploit Crossref Funder Registry web services for improving header + * information. 
0 (no consolidation, default value), 1 (consolidate the citation and inject extra + * metadata) or 2 (consolidate the citation and inject DOI only) + * @param result bib result + * @return the TEI representation of the extracted bibliographical + * information + */ + public String processHeaderFunding( + File inputFile, + String md5Str, + int consolidateHeader, + int consolidateFunders, + boolean includeRawAffiliations + ) throws Exception { + GrobidAnalysisConfig config = new GrobidAnalysisConfig.GrobidAnalysisConfigBuilder() + .consolidateHeader(consolidateHeader) + .consolidateFunders(consolidateFunders) + .includeRawAffiliations(includeRawAffiliations) + .build(); + return processHeaderFunding(inputFile, md5Str, config); + } + + /** + * Apply a parsing model for the header of a PDF file, using * dynamic range of pages as header * * @param inputFile : the path of the PDF file to be processed @@ -411,6 +471,10 @@ public String processHeader(String inputFile, GrobidAnalysisConfig config, Bibli return processHeader(inputFile, null, config, result); } + public String processHeaderFunding(File inputFile, GrobidAnalysisConfig config) throws Exception { + return processHeaderFunding(inputFile, null, config); + } + public String processHeader(String inputFile, String md5Str, GrobidAnalysisConfig config, BiblioItem result) { // normally the BiblioItem reference must not be null, but if it is the // case, we still continue @@ -423,12 +487,23 @@ public String processHeader(String inputFile, String md5Str, GrobidAnalysisConfi return resultTEI.getLeft(); } + public String processHeaderFunding(File inputFile, String md5Str, GrobidAnalysisConfig config) throws Exception { + FullTextParser fullTextParser = parsers.getFullTextParser(); + Document resultDoc; + LOGGER.debug("Starting processing fullTextToTEI on " + inputFile); + long time = System.currentTimeMillis(); + resultDoc = fullTextParser.processingHeaderFunding(inputFile, md5Str, config); + LOGGER.debug("Ending processing fullTextToTEI on " + inputFile + ". Time to process: " + + (System.currentTimeMillis() - time) + "ms"); + return resultDoc.getTei(); + } + /** * Create training data for the monograph model based on the application of * the current monograph text model on a new PDF * * @param inputFile : the path of the PDF file to be processed - * @param pathRaw : the path where to put the CRF feature file + * @param pathRaw : the path where to put the sequence labeling feature file * @param pathTEI : the path where to put the annotated TEI representation (the * file to be corrected for gold-level training data) * @param id : an optional ID to be used in the TEI file and the full text @@ -443,7 +518,7 @@ public void createTrainingMonograph(File inputFile, String pathRaw, String pathT * without tags. This can be used to start from scratch any new model. 
* * @param inputFile : the path of the PDF file to be processed - * @param pathRaw : the path where to put the CRF feature file + * @param pathRaw : the path where to put the sequence labeling feature file * @param pathTEI : the path where to put the annotated TEI representation (the * file to be annotated for "from scratch" training data) * @param id : an optional ID to be used in the TEI file and the full text @@ -458,7 +533,7 @@ public void createTrainingBlank(File inputFile, String pathRaw, String pathTEI, * the current full text model on a new PDF * * @param inputFile : the path of the PDF file to be processed - * @param pathRaw : the path where to put the CRF feature file + * @param pathRaw : the path where to put the sequence labeling feature file * @param pathTEI : the path where to put the annotated TEI representation (the * file to be corrected for gold-level training data) * @param id : an optional ID to be used in the TEI file, -1 if not used @@ -592,7 +667,7 @@ public boolean accept(File dir, String name) { * * @param directoryPath - the path to the directory containing PDF to be processed. * @param resultPath - the path to the directory where the results as XML files - * and CRF feature files shall be written. + * and the sequence labeling feature files shall be written. * @param ind - identifier integer to be included in the resulting files to * identify the training case. This is optional: no identifier * will be included if ind = -1 @@ -643,7 +718,7 @@ public boolean accept(File dir, String name) { * * @param directoryPath - the path to the directory containing PDF to be processed. * @param resultPath - the path to the directory where the results as XML files - * and default CRF feature files shall be written. + * and default sequence labeling feature files shall be written. * @param ind - identifier integer to be included in the resulting files to * identify the training case. This is optional: no identifier * will be included if ind = -1 diff --git a/grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java b/grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java index 200c54061e..08fabc0540 100755 --- a/grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java +++ b/grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java @@ -101,6 +101,14 @@ public Document processing(File inputPdf, return processing(documentSource, config); } + public Document processingHeaderFunding(File inputPdf, + GrobidAnalysisConfig config) throws Exception { + DocumentSource documentSource = + DocumentSource.fromPdf(inputPdf, config.getStartPage(), config.getEndPage(), + config.getPdfAssetPath() != null, true, false); + return processingHeaderFunding(documentSource, config); + } + public Document processing(File inputPdf, String md5Str, GrobidAnalysisConfig config) throws Exception { @@ -111,6 +119,16 @@ public Document processing(File inputPdf, return processing(documentSource, config); } + public Document processingHeaderFunding(File inputPdf, + String md5Str, + GrobidAnalysisConfig config) throws Exception { + DocumentSource documentSource = + DocumentSource.fromPdf(inputPdf, config.getStartPage(), config.getEndPage(), + config.getPdfAssetPath() != null, true, false); + documentSource.setMD5(md5Str); + return processingHeaderFunding(documentSource, config); + } + /** * Machine-learning recognition of the complete full text structures. 
* @@ -313,6 +331,78 @@ else if (config.getConsolidateCitations() == 2) } } + + /** + * Machine-learning recognition of full text structures limted to header and funding information. + * This requires however to look at the complete document, but some parts will be skipped + * + * @param documentSource input + * @param config config + * @return the document object with built TEI + */ + public Document processingHeaderFunding(DocumentSource documentSource, + GrobidAnalysisConfig config) { + if (tmpPath == null) { + throw new GrobidResourceException("Cannot process pdf file, because temp path is null."); + } + if (!tmpPath.exists()) { + throw new GrobidResourceException("Cannot process pdf file, because temp path '" + + tmpPath.getAbsolutePath() + "' does not exists."); + } + try { + // general segmentation + Document doc = parsers.getSegmentationParser().processing(documentSource, config); + SortedSet documentBodyParts = doc.getDocumentPart(SegmentationLabels.BODY); + + // header processing + BiblioItem resHeader = new BiblioItem(); + Pair featSeg = null; + + // using the segmentation model to identify the header zones + parsers.getHeaderParser().processingHeaderSection(config, doc, resHeader, false); + + // structure the abstract using the fulltext model + if (isNotBlank(resHeader.getAbstract())) { + //List abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT); + List abstractTokens = resHeader.getAbstractTokensWorkingCopy(); + if (CollectionUtils.isNotEmpty(abstractTokens)) { + abstractTokens = BiblioItem.cleanAbstractLayoutTokens(abstractTokens); + Pair> abstractProcessed = processShort(abstractTokens, doc); + if (abstractProcessed != null) { + // neutralize figure and table annotations (will be considered as paragraphs) + String labeledAbstract = abstractProcessed.getLeft(); + labeledAbstract = postProcessFullTextLabeledText(labeledAbstract); + resHeader.setLabeledAbstract(labeledAbstract); + resHeader.setLayoutTokensForLabel(abstractProcessed.getRight(), TaggingLabels.HEADER_ABSTRACT); + } + } + } + + // possible annexes (view as a piece of full text similar to the body) + /*documentBodyParts = doc.getDocumentPart(SegmentationLabels.ANNEX); + featSeg = getBodyTextFeatured(doc, documentBodyParts); + String resultAnnex = null; + List tokenizationsBody2 = null; + if (featSeg != null && isNotEmpty(trim(featSeg.getLeft()))) { + // if featSeg is null, it usually means that no body segment is found in the + // document segmentation + String bodytext = featSeg.getLeft(); + tokenizationsBody2 = featSeg.getRight().getTokenization(); + resultAnnex = label(bodytext); + }*/ + + // final combination + toTEIHeaderFunding(doc, // document + resHeader, // header + config); + return doc; + } catch (GrobidException e) { + throw e; + } catch (Exception e) { + throw new GrobidException("An exception occurred while running Grobid.", e); + } + } + /** * Process a simple segment of layout tokens with the full text model. * Return null if provided Layout Tokens is empty or if structuring failed. @@ -2627,6 +2717,137 @@ private void toTEI(Document doc, // ); } + /** + * Create the TEI representation for a document based on the parsed header and funding only. 
+ */ + private void toTEIHeaderFunding(Document doc, + BiblioItem resHeader, + GrobidAnalysisConfig config) { + if (doc.getBlocks() == null) { + return; + } + TEIFormatter teiFormatter = new TEIFormatter(doc, this); + StringBuilder tei = new StringBuilder(); + try { + List fundings = new ArrayList<>(); + + List annexStatements = new ArrayList<>(); + + // acknowledgement is in the back + StringBuilder acknowledgmentStmt = getSectionAsTEI("acknowledgement", "\t\t\t", doc, SegmentationLabels.ACKNOWLEDGEMENT, + teiFormatter, null, config); + + if (acknowledgmentStmt.length() > 0) { + MutablePair,List,List>> localResult = + parsers.getFundingAcknowledgementParser().processingXmlFragment(acknowledgmentStmt.toString(), config); + + if (localResult != null && localResult.getLeft() != null) { + String local_tei = localResult.getLeft().toXML(); + local_tei = local_tei.replace(" xmlns=\"http://www.tei-c.org/ns/1.0\"", ""); + annexStatements.add(local_tei); + } + else { + annexStatements.add(acknowledgmentStmt.toString()); + } + + if (localResult != null && localResult.getRight() != null && localResult.getRight().getLeft() != null) { + List localFundings = localResult.getRight().getLeft(); + if (localFundings.size()>0) { + fundings.addAll(localFundings); + } + } + } + + // funding in header + StringBuilder fundingStmt = new StringBuilder(); + if (StringUtils.isNotBlank(resHeader.getFunding())) { + List headerFundingTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_FUNDING); + + Pair> headerFundingProcessed = processShort(headerFundingTokens, doc); + if (headerFundingProcessed != null) { + fundingStmt = teiFormatter.processTEIDivSection("funding", + "\t\t\t", + headerFundingProcessed.getLeft(), + headerFundingProcessed.getRight(), + null, + config); + } + if (fundingStmt.length() > 0) { + MutablePair,List,List>> localResult = + parsers.getFundingAcknowledgementParser().processingXmlFragment(fundingStmt.toString(), config); + + if (localResult != null && localResult.getLeft() != null) { + String local_tei = localResult.getLeft().toXML(); + local_tei = local_tei.replace(" xmlns=\"http://www.tei-c.org/ns/1.0\"", ""); + annexStatements.add(local_tei); + } else { + annexStatements.add(fundingStmt.toString()); + } + + if (localResult != null && localResult.getRight() != null && localResult.getRight().getLeft() != null) { + List localFundings = localResult.getRight().getLeft(); + if (localFundings.size()>0) { + fundings.addAll(localFundings); + } + } + } + } + + // funding statements in non-header part + fundingStmt = getSectionAsTEI("funding", + "\t\t\t", + doc, + SegmentationLabels.FUNDING, + teiFormatter, + null, + config); + if (fundingStmt.length() > 0) { + MutablePair,List,List>> localResult = + parsers.getFundingAcknowledgementParser().processingXmlFragment(fundingStmt.toString(), config); + + if (localResult != null && localResult.getLeft() != null){ + String local_tei = localResult.getLeft().toXML(); + local_tei = local_tei.replace(" xmlns=\"http://www.tei-c.org/ns/1.0\"", ""); + annexStatements.add(local_tei); + } else { + annexStatements.add(fundingStmt.toString()); + } + + if (localResult != null && localResult.getRight() != null && localResult.getRight().getLeft() != null) { + List localFundings = localResult.getRight().getLeft(); + if (localFundings.size()>0) { + fundings.addAll(localFundings); + } + } + } + + tei.append(teiFormatter.toTEIHeader(resHeader, null, null, null, fundings, config)); + tei.append("\t\t"); + + for (String annexStatement : annexStatements) { + 
tei.append("\n\t\t\t"); + tei.append(annexStatement); + } + + if (fundings != null && fundings.size() >0) { + tei.append("\n\t\t\t\n"); + for(Funding funding : fundings) { + if (funding.isNonEmptyFunding()) + tei.append(funding.toTEI(4)); + } + tei.append("\t\t\t\n"); + } + + tei.append("\t\t\n"); + + tei.append("\t\n"); + tei.append("\n"); + } catch (Exception e) { + throw new GrobidException("An exception occurred while running Grobid.", e); + } + doc.setTei(tei.toString()); + } + private StringBuilder getSectionAsTEI(String xmlType, String indentation, Document doc, diff --git a/grobid-service/src/main/java/org/grobid/service/GrobidPaths.java b/grobid-service/src/main/java/org/grobid/service/GrobidPaths.java index 2ae0f640f2..f860d8fe78 100755 --- a/grobid-service/src/main/java/org/grobid/service/GrobidPaths.java +++ b/grobid-service/src/main/java/org/grobid/service/GrobidPaths.java @@ -30,6 +30,11 @@ public interface GrobidPaths { */ String PATH_HEADER = "processHeaderDocument"; + /** + * path extension for processing document headers and funding information. + */ + String PATH_HEADER_FUNDING = "processHeaderFundingDocument"; + /** * path extension for processing document headers HTML. */ diff --git a/grobid-service/src/main/java/org/grobid/service/GrobidRestService.java b/grobid-service/src/main/java/org/grobid/service/GrobidRestService.java index 498ba06524..54f6b3e502 100755 --- a/grobid-service/src/main/java/org/grobid/service/GrobidRestService.java +++ b/grobid-service/src/main/java/org/grobid/service/GrobidRestService.java @@ -165,6 +165,24 @@ public Response processHeaderDocumentReturnXml_post( ); } + @Path(PATH_HEADER_FUNDING) + @Consumes(MediaType.MULTIPART_FORM_DATA) + @Produces(MediaType.APPLICATION_XML) + @POST + public Response processHeaderFundingDocumentReturnXml_post( + @FormDataParam(INPUT) InputStream inputStream, + @DefaultValue("0") @FormDataParam(CONSOLIDATE_HEADER) String consolidateHeader, + @DefaultValue("0") @FormDataParam(CONSOLIDATE_FUNDERS) String consolidateFunders, + @DefaultValue("0") @FormDataParam(INCLUDE_RAW_AFFILIATIONS) String includeRawAffiliations) { + int consolHeader = validateConsolidationParam(consolidateHeader); + int consolFunders = validateConsolidationParam(consolidateFunders); + return restProcessFiles.processStatelessHeaderFundingDocument( + inputStream, consolHeader, consolFunders, + validateIncludeRawParam(includeRawAffiliations) + ); + } + + @Path(PATH_HEADER) @Consumes(MediaType.MULTIPART_FORM_DATA) @Produces(MediaType.APPLICATION_XML) diff --git a/grobid-service/src/main/java/org/grobid/service/process/GrobidRestProcessFiles.java b/grobid-service/src/main/java/org/grobid/service/process/GrobidRestProcessFiles.java index a6ea881eb0..14caf8b218 100644 --- a/grobid-service/src/main/java/org/grobid/service/process/GrobidRestProcessFiles.java +++ b/grobid-service/src/main/java/org/grobid/service/process/GrobidRestProcessFiles.java @@ -136,6 +136,86 @@ public Response processStatelessHeaderDocument( return response; } + /** + * Uploads the origin document which shall be extracted into TEI and + * extracts only the header and funding information, this still requires a full read and segmentation of the document, + * but non-relevant parts are skipt. 
+ * + * @param inputStream the data of origin document + * @param consolidateHeader consolidation parameter for the header extraction + * @param consolidateFunders consolidation parameter for the funder extraction + * @return a response object which contains a TEI representation of the header part + */ + public Response processStatelessHeaderFundingDocument( + final InputStream inputStream, + final int consolidateHeader, + final int consolidateFunders, + final boolean includeRawAffiliations + ) { + LOGGER.debug(methodLogIn()); + String retVal = null; + Response response = null; + File originFile = null; + Engine engine = null; + try { + engine = Engine.getEngine(true); + // conservative check, if no engine is free in the pool a NoSuchElementException is normally thrown + if (engine == null) { + throw new GrobidServiceException( + "No GROBID engine available", Status.SERVICE_UNAVAILABLE); + } + + MessageDigest md = MessageDigest.getInstance("MD5"); + DigestInputStream dis = new DigestInputStream(inputStream, md); + + originFile = IOUtilities.writeInputFile(dis); + byte[] digest = md.digest(); + + if (originFile == null) { + LOGGER.error("The input file cannot be written."); + throw new GrobidServiceException( + "The input file cannot be written. ", Status.INTERNAL_SERVER_ERROR); + } + + String md5Str = DatatypeConverter.printHexBinary(digest).toUpperCase(); + + // starts conversion process + retVal = engine.processHeaderFunding( + originFile, + md5Str, + consolidateHeader, + consolidateFunders, + includeRawAffiliations + ); + + if (GrobidRestUtils.isResultNullOrEmpty(retVal)) { + response = Response.status(Response.Status.NO_CONTENT).build(); + } else { + response = Response.status(Response.Status.OK) + .entity(retVal) + .header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_XML + "; charset=UTF-8") + .build(); + } + + } catch (NoSuchElementException nseExp) { + LOGGER.error("Could not get an engine from the pool within configured time. Sending service unavailable."); + response = Response.status(Status.SERVICE_UNAVAILABLE).build(); + } catch (Exception exp) { + LOGGER.error("An unexpected exception occurs. ", exp); + response = Response.status(Status.INTERNAL_SERVER_ERROR).entity(exp.getMessage()).build(); + } finally { + if (originFile != null) + IOUtilities.removeTempFile(originFile); + + if (engine != null) { + GrobidPoolingFactory.returnEngine(engine); + } + } + + LOGGER.debug(methodLogOut()); + return response; + } + /** * Uploads the origin document which shall be extracted into TEI. *
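The new `processHeaderFunding` methods added to `Engine` above can be called the same way when GROBID is used as a Java library. The following is a minimal sketch only, mirroring what `processStatelessHeaderFundingDocument` does on the service side (borrow a pooled engine, process, return the engine in a `finally` block); the grobid-home path, PDF path and consolidation values are illustrative, and the imports assume the usual `org.grobid.core` package layout.

```java
import java.io.File;
import java.util.Arrays;

import org.grobid.core.engines.Engine;
import org.grobid.core.factory.GrobidPoolingFactory;
import org.grobid.core.main.GrobidHomeFinder;
import org.grobid.core.utilities.GrobidProperties;

public class HeaderFundingExample {
    public static void main(String[] args) throws Exception {
        // point GROBID to its grobid-home directory (illustrative path)
        GrobidHomeFinder grobidHomeFinder = new GrobidHomeFinder(Arrays.asList("/opt/grobid/grobid-home"));
        GrobidProperties.getInstance(grobidHomeFinder);

        // borrow an engine from the pool, as the REST service does
        Engine engine = Engine.getEngine(true);
        try {
            // header + funding extraction: consolidateHeader=1, consolidateFunders=1,
            // includeRawAffiliations=false
            String tei = engine.processHeaderFunding(new File("/path/to/article.pdf"), 1, 1, false);
            System.out.println(tei);
        } finally {
            // always return the engine to the pool
            GrobidPoolingFactory.returnEngine(engine);
        }
    }
}
```

Over HTTP, the same processing is exposed by the service at the new `processHeaderFundingDocument` path (`GrobidPaths.PATH_HEADER_FUNDING`).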