Segmentation Model #1042

uneetsingh · 2023-08-23T05:47:00Z

Hi, thank you for building such a good open-source product.

For my use case, I am looking for the architecture details and training process of the segmentation model. In DeLFT there are models for other cases (header, citation, etc.) but not for segmentation.

My use case is building a solution for docx. The docx -> pdf -> grobid route wasn't promising because of the limitations that pdfalto or any other OCR tool has.

If you can share or point me to the documentation for the segmentation model, that would be very helpful.

kermitt2 · 2023-08-23T11:30:55Z

Hi @uneetsingh

By default, the segmentation model runs with CRF using custom features. This model works at the line level, not at the token level like the others.
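To make "line level" concrete, here is a minimal sketch of computing one feature vector per line rather than per token. The function name `line_features` and every feature in it are hypothetical illustrations, not Grobid's actual feature set, which also relies on layout information.

```python
def line_features(lines):
    """Return one feature dict per line.

    Illustrative only: these features are assumptions chosen for the
    example, not the features Grobid's segmentation model uses.
    """
    n = len(lines)
    feats = []
    for i, line in enumerate(lines):
        tokens = line.split()
        first = tokens[0] if tokens else ""
        feats.append({
            "first_token": first,                         # lexical cue from the line start
            "n_tokens": len(tokens),                      # short lines are often headings
            "starts_upper": first[:1].isupper(),
            "all_caps": line.isupper(),
            "has_digit": any(c.isdigit() for c in line),
            "rel_position": round(i / max(n - 1, 1), 2),  # 0.0 = first line, 1.0 = last line
        })
    return feats


if __name__ == "__main__":
    for f in line_features(["ABSTRACT", "We present 3 models.", "References"]):
        print(f)
```

A sequence labeller would then assign one label per line (for example header, body, references) instead of one label per token.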

You can also train and use the segmentation model with an RNN via DeLFT (see #964), but for the moment it is less accurate than the CRF. I didn't upload this model to DeLFT, but you can retrain it with:

python3 delft/applications/grobidTagger.py segmentation train_eval --architecture BidLSTM_CRF_FEATURES --input data/sequenceLabelling/grobid/segmentation/segmentation-110322.train

Four years ago, I created a branch with Grobid supporting docx as input, see #515.

It simply used Apache POI to parse docx files and convert them to PDF. This conversion gave poor results (lots of docx parsing failures), so indeed the docx -> pdf -> grobid route is not promising. I wanted to try docx4j, but I lost interest in the topic :)

A kind of docx -> xml converter that keeps some layout information would be the best way to support docx, I think. The segmentation model would then need to be revisited and retrained to support this new input and its new layout features.
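As a rough sketch of what such a converter could look like, the snippet below reads `word/document.xml` from the docx archive with the Python standard library and emits one `<line>` element per paragraph, keeping the paragraph style and a bold flag as layout hints. The output schema (`<document>`, `<line>`, the `style` and `bold` attributes) and the function name `docx_to_layout_xml` are assumptions made for illustration; this is not an existing Grobid input format.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def docx_to_layout_xml(docx_bytes: bytes) -> str:
    """Convert a .docx (given as bytes) to a small XML string with one
    <line> per non-empty paragraph. The output schema is hypothetical."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    out = ET.Element("document")
    for para in root.iter(f"{{{W_NS}}}p"):
        # Concatenate all text runs of the paragraph
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if not text.strip():
            continue
        style = para.find("w:pPr/w:pStyle", NS)
        style_val = style.get(f"{{{W_NS}}}val") if style is not None else None
        # Simplification: any <w:b> on a run counts as bold (w:val="false" is ignored)
        bold = para.find("w:r/w:rPr/w:b", NS) is not None
        line = ET.SubElement(out, "line")
        line.set("style", style_val or "")
        line.set("bold", "true" if bold else "false")
        line.text = text
    return ET.tostring(out, encoding="unicode")
```

Usage would be along the lines of `docx_to_layout_xml(open("paper.docx", "rb").read())`, where `paper.docx` is a placeholder file name. Features such as style name or boldness could then replace the PDF-derived layout features when retraining the segmentation model.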
