Merge pull request #1060 from kermitt2/pre-release-0.8.0
Pre release 0.8.0
kermitt2 authored Nov 26, 2023
2 parents 169dcfd + b7d62a7 commit df28769
Showing 18 changed files with 533 additions and 78 deletions.
28 changes: 28 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,34 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [0.8.0] - 2023-11-19

### Added

+ Extraction of funder and funding information with a dedicated new model, see https://github.com/kermitt2/grobid/pull/1046 for details
+ Optional consolidation of funders with the CrossRef Funder Registry
+ Identification of acknowledged entities in the acknowledgement section
+ Optional coordinates in title elements

### Changed

+ Dropwizard upgrade to 4.0
+ Minimum JDK/JVM requirement for building/running the project is now Java 11
+ Logging now with Logback (Log4j2 removed), optional logs in JSON format
+ General review of logs
+ Enable GitHub Actions / disable CircleCI

### Fixed

+ Set dynamic memory limit in pdfalto_server #1038
+ Logging to files when training models now works as expected
+ Various dependency upgrades
+ Fix #1051 with a possibly problematic PDF
+ Fix #1036 for the pdfalto memory limit
+ Fix Read the Docs build #1040
+ Fix for null equation #1030
+ Other minor fixes

## [0.7.3] - 2023-05-13

### Added
8 changes: 4 additions & 4 deletions Dockerfile.delft
@@ -2,14 +2,14 @@

## See https://grobid.readthedocs.io/en/latest/Grobid-docker/

- ## usage example with version 0.7.3:
- ## docker build -t grobid/grobid:0.7.3 --build-arg GROBID_VERSION=0.7.3 --file Dockerfile.delft .
+ ## usage example with version 0.8.0:
+ ## docker build -t grobid/grobid:0.8.0 --build-arg GROBID_VERSION=0.8.0 --file Dockerfile.delft .

## no GPU:
- ## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.3
+ ## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.8.0

## allocate all available GPUs (only Linux with proper nvidia driver installed on host machine):
- ## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.3
+ ## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.8.0

# -------------------
# build builder image
4 changes: 2 additions & 2 deletions build.gradle
@@ -46,8 +46,8 @@ subprojects {
publishing {
publications {
mavenJava(MavenPublication) {
- //from components.java
- artifact jar
+ from components.java
+ //artifact jar
}
}
repositories {
6 changes: 4 additions & 2 deletions doc/Deep-Learning-models.md
@@ -8,7 +8,7 @@ These architectures have been tested on Linux 64bit and macOS 64bit. The support...

Integration is realized via Java Embedded Python [JEP](https://github.com/ninia/jep), which uses a JNI binding to CPython. This integration is two times faster than the TensorFlow Java API and significantly faster than RPC serving (see [here](https://www.slideshare.net/FlinkForward/flink-forward-berlin-2017-dongwon-kim-predictive-maintenance-with-apache-flink)). Additionally, it does not require modifying DeLFT, as would be the case with a Py4J gateway (socket-based).

- There are currently no neural model for the segmentation and the fulltext models, because the input sequences for these models are too large for the current supported Deep Learning architectures. The problem would need to be formulated differently for these tasks or to use alternative DL architectures (with sliding window, etc.).
+ There is currently no neural model for the fulltext model, because the input sequences for this model are too large for the currently supported Deep Learning architectures. The problem would need to be formulated differently for this task, or alternative DL architectures (with sliding windows, etc.) would need to be used.

Low-level models not using layout features (author names, dates, affiliations...) usually perform better than CRF and do not require a feature channel. When layout features are involved, neural models with an additional feature channel (e.g. `BidLSTM_CRF_FEATURES` in DeLFT) should be preferred to those without a feature channel.

@@ -20,7 +20,7 @@ Current neural models can be up to 50 times slower than CRF, depending on the architecture...

By default, only CRF models are used by Grobid. You need to select the Deep Learning models you would like to use in the GROBID configuration yaml file (`grobid/grobid-home/config/grobid.yaml`). See [here](https://grobid.readthedocs.io/en/latest/Configuration/#configuring-the-models) for more details on how to select these models. The most convenient way to use the Deep Learning models is to use the full GROBID Docker image and pass a configuration file at launch of the container describing the selected models to be used instead of the default CRF ones. Note that the full GROBID Docker image is already configured to use Deep Learning models for bibliographical reference and affiliation-address parsing.
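
For example, a minimal sketch of launching the full image with a custom configuration, following the same mount pattern as the Docker examples above (the local path to `grobid.yaml` is an example to adapt):

```console
docker run --rm --init -p 8070:8070 -p 8071:8071 \
  -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro \
  grobid/grobid:0.8.0
```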

- For current GROBID version 0.7.3, we recommend considering the usage of the following Deep Learning models:
+ For the current GROBID version 0.8.0, we recommend considering the use of the following Deep Learning models:

- `citation` model: for bibliographical parsing, the `BidLSTM_CRF_FEATURES` architecture currently provides the best accuracy, significantly better than CRF (+3 to +5 points in F1-score). With a GPU, there is normally no runtime impact from selecting this model. The fine-tuned SciBERT model currently performs at lower accuracy.

@@ -30,6 +30,8 @@ For current GROBID version 0.7.3, we recommend considering the usage of the following...

- `header` model: this model extracts the header metadata; the `BidLSTM_CRF_FEATURES` or `BidLSTM_ChainCRF_FEATURES` (a faster variant) architecture provides slightly better results than CRF, especially for less mainstream domains and publishers. With a GPU, there is normally almost no runtime impact from selecting this DL model.

+ - `funding-acknowledgement` model: this is a typical NER model that extracts funder names, funding information, acknowledged persons and organizations, etc. The `BidLSTM_CRF_FEATURES` architecture provides more accurate results than CRF.

Other Deep Learning models do not show better accuracy than old-school CRF according to our benchmarks, so we do not recommend using them in general at this stage. However, some of them tend to be more portable and can be more reliable than CRF for document layouts and scientific domains far from what is available in the training data.

Finally, the model `fulltext` (structuring the content body of a document) is currently only based on CRF, due to the long input sequences to be processed.
42 changes: 34 additions & 8 deletions doc/Frequently-asked-questions.md
@@ -11,26 +11,52 @@ Exploiting the `503` mechanism is already implemented in the different GROBID clients...

## Could we have some guidance for server configuration in production?

- The exact server configuration will depend on the service you want to call. We present here the configuration used to process with `processFulltextDocument` around 10.6 PDF per second (around 915,000 PDF per day, around 20M pages per day) with the node.js client listed above during one week on a 16 CPU machine (16 threads, 32GB RAM, no SDD). It ran without any crash during 7 days at this rate. We processed 11.3M PDF in a bit less than 7 days with two 16-CPU servers like that in one of our projects.
+ The exact server configuration will depend on the service you want to call, the models selected in the GROBID configuration file (`grobid-home/config/grobid.yaml`), and the availability of a GPU. We consider here the complete full-text processing of PDFs (`processFulltextDocument`).

- - if your server has 8-10 threads available, you can use the default settings of the docker image, otherwise you would rather need to build and start the service yourself to tune the parameters
+ 1) Using CRF models only, for example via the lightweight Docker image (https://hub.docker.com/r/lfoppiano/grobid/tags)

- - keep the concurrency at the client (number of simultaneous calls) slightly higher than the available number of threads at the server side, for instance if the server has 16 threads, use a concurrency between 20 and 24 (it's the option `n` in the above mentioned clients, in my case I used 24)
+ - in `grobid/grobid-home/config/grobid.yaml` set the parameter `concurrency` to your number of available threads at server side or slightly higher (e.g. 16 to 20 for a 16-thread machine)

- - in `grobid/grobid-home/config/grobid.yaml` set the parameter `concurrency` to your number of available thread at server side or slightly higher (e.g. 16 to 20 for a 16 threads-machine, in my case I used 20)
+ - keep the concurrency at the client (number of simultaneous calls) slightly higher than the available number of threads at the server side, for instance if the server has 16 threads, use a concurrency between 20 and 24 (it's the option `n` in the above-mentioned clients)

- - set `modelPreload` to `true`in `grobid/grobid-home/config/grobid.yaml`, it will avoid some strange behavior at launch
+ These settings will ensure that the CPUs are fully used when processing a large set of PDFs.

- - in the query, `consolidateHeader` can be `1` or `2` if you are using the biblio-glutton or CrossRef consolidation. It significantly improves the accuracy and add useful metadata.
+ For example, with these settings, we processed with `processFulltextDocument` around 10.6 PDF per second (around 915,000 PDF per day, around 20M pages per day) with the node.js client during one week on a 16 CPU machine (16 threads, 32GB RAM, no SSD). It ran without any crash during 7 days at this rate. We processed 11.3M PDF in a bit less than 7 days with two 16-CPU servers in one of our projects.

- - If you want to consolidate all the bibliographical references and use `consolidateCitations` as `1` or `2`, CrossRef query rate limit will avoid scale to more than 1 document per second... For scaling the bibliographical reference resolution, you will need to use a local consolidation service, [biblio-glutton](https://github.com/kermitt2/biblio-glutton). The overall capacity will depend on the biblio-glutton service then, and the number of elasticsearch nodes you can exploit. From experience, it is difficult to go beyond 300K PDF per day when using consolidation for every extracted bibliographical references.
+ Note: if your server has 8-10 threads available, you can use the default settings of the docker image, otherwise you will need to modify the configuration file to tune the parameters, as [documented](Configuration.md).
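
As an illustration of these settings, a hedged sketch using the Python client ([grobid-client-python](https://github.com/kermitt2/grobid-client-python)); the input/output paths are placeholders to adapt:

```console
## server side, in grobid-home/config/grobid.yaml: set concurrency to 20
## client side: process a directory of PDFs with client concurrency 24 (option --n)
grobid_client --input ~/pdfs_in --output ~/tei_out --n 24 processFulltextDocument
```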

2) Using Deep Learning models, for example via the full Docker image (<https://hub.docker.com/r/grobid/grobid/tags>)

2.1) If the server has a GPU

In case the server has a GPU, which has its own memory, the Deep Learning inferences are automatically parallelized on this GPU, without impacting the CPU and RAM. The settings given above in 1) can normally be used similarly.

2.2) If the server has CPU only

When the Deep Learning models run on CPU as a fallback, the CPUs are used more intensively (DL models push CPU computation quite a lot) and more irregularly (Deep Learning models are called at certain points in the overall process, but not continuously), and additional RAM is used to load these larger models. For DL inference on CPU, an additional thread is created, allocating its own memory. We can see up to 2 times more CPU used at peaks, and approximately up to 50% more memory.

The settings should thus be adapted as follows:

- in `grobid/grobid-home/config/grobid.yaml` set the parameter `concurrency` to your number of available threads at server side divided by 2 (e.g. with 8 threads available, set concurrency to `4`)

- keep the concurrency at the client (number of simultaneous calls) at the same level as the `concurrency` parameter at server side, for instance if the server has 16 threads, use a `concurrency` of `8` and the client concurrency at `8` (it's the option `n` in the clients)

In addition, consider provisioning more RAM when running Deep Learning models on CPU, e.g. 24-32GB with concurrency at `8` instead of 16GB.
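
For instance, a sketch for a CPU-only 8-thread server; the `--memory` value and config path are assumptions to adapt, and `concurrency` would be set to `4` in the mounted `grobid.yaml`:

```console
docker run --rm --init -p 8070:8070 -p 8071:8071 --memory 24g \
  -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro \
  grobid/grobid:0.8.0
```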

3) In general, consider also these settings:

- Set `modelPreload` to `true` in `grobid/grobid-home/config/grobid.yaml`; it avoids some strange behavior at launch (this is the default setting).

- Regarding the query parameters, `consolidateHeader` can be `1` or `2` if you are using the biblio-glutton or CrossRef consolidation. It significantly improves the accuracy and adds useful metadata (see the example request after this list).

- If you want to consolidate all the bibliographical references and use `consolidateCitations` as `1` or `2`, the CrossRef query rate limit will make scaling to more than 1 document per second impossible (Grobid would typically spend 90% or more of its time waiting for CrossRef API responses)... For scaling the bibliographical reference resolution, you will need to use a local consolidation service, [biblio-glutton](https://github.com/kermitt2/biblio-glutton). The overall capacity will then depend on the biblio-glutton service and the number of elasticsearch nodes you can exploit. From experience, it is difficult to go beyond 300K PDF per day when using consolidation for every extracted bibliographical reference.
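
For example, a minimal request enabling header consolidation (host and file name are placeholders):

```console
curl -v --form input=@./article.pdf --form consolidateHeader=1 \
  localhost:8070/api/processFulltextDocument
```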

## I would also like to extract images from PDFs

You will get the embedded images converted into `.png` by using the normal batch command. For instance:

```console
- java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.7.3-onejar.jar -gH grobid-home -dIn ~/test/in/ -dOut ~/test/out -exe processFullText
+ java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn ~/test/in/ -dOut ~/test/out -exe processFullText
```

There is a web service doing the same, returning everything in a big zip file, `processFulltextAssetDocument`, still usable but deprecated.
