Segmentation Model #1042

Open
uneetsingh opened this issue Aug 23, 2023 · 1 comment

@uneetsingh

Hi, thank you for building such a good open-source product.

For my use case, I am looking for the architecture details and training process of the segmentation model. DeLFT has models for the other cases (header, citation, etc.) but not for segmentation.

My use case is building a solution for docx input. The docx -> pdf -> grobid route wasn't promising because of the limitations that pdfalto or any other OCR tool has.

If you can share or point me to the documentation for the segmentation model, that would be very helpful.

@kermitt2
Owner

Hi @uneetsingh

By default, the segmentation model runs with CRF using custom features. This model works at line level, not at token level like the other models.
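To illustrate what working at line level means, here is a minimal sketch that builds one feature record per line rather than one per token. The function name `line_features` and every feature in it are hypothetical, chosen only for illustration; they are not Grobid's actual feature set.

```python
import re


def line_features(lines):
    """Return one feature dict per non-empty line (illustrative features only)."""
    kept = [ln for ln in lines if ln.strip()]
    total = len(kept)
    features = []
    for index, line in enumerate(kept):
        tokens = line.split()
        features.append({
            # Lexical cues taken from the start of the line
            "first_token": tokens[0],
            "second_token": tokens[1] if len(tokens) > 1 else "",
            "n_tokens": len(tokens),
            # Simple shape cues computed over the whole line
            "all_caps": line.isupper(),
            "has_digit": bool(re.search(r"\d", line)),
            # Position of the line among the kept lines, from 0.0 to 1.0
            "relative_position": index / max(total - 1, 1),
        })
    return features
```

A sequence labeller would then assign one label per line (for example a header, body or reference zone) instead of one label per token.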

You can train and use the segmentation model with an RNN and DeLFT (see #964), but for the moment it is less accurate than CRF. I didn't upload this model to DeLFT, but you can retrain it with:

python3 delft/applications/grobidTagger.py segmentation train_eval --architecture BidLSTM_CRF_FEATURES --input data/sequenceLabelling/grobid/segmentation/segmentation-110322.train

Four years ago, I created a branch with Grobid supporting docx as input, see #515.

It simply used Apache POI to parse docx files and convert them to PDF. This conversion gave poor results (lots of docx parsing failures), so indeed the docx -> pdf -> grobid route is not promising. I wanted to try docx4j, but I lost interest in the topic :)

I think a kind of docx -> xml converter that keeps some layout information would be the best way to support docx. The segmentation model would then need to be revisited and retrained to support this new input and its new layout features.
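As a rough sketch of what such a converter could look like, the snippet below reads `word/document.xml` from a docx archive using only the Python standard library and emits one `<line>` element per paragraph with a few layout attributes. The function name `docx_to_layout_xml`, the output element names and the choice of attributes are assumptions made for illustration; this is not an existing Grobid or DeLFT format. A docx has no physical lines, so paragraphs are used here as a stand-in.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def docx_to_layout_xml(docx_file):
    """Convert a .docx (path or file-like object) into a small XML tree with
    one <line> element per non-empty paragraph (hypothetical output format)."""
    with zipfile.ZipFile(docx_file) as archive:
        source = ET.fromstring(archive.read("word/document.xml"))

    root = ET.Element("document")
    for para in source.iter(W + "p"):
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        if not text.strip():
            continue  # skip empty paragraphs
        style = para.find(W + "pPr/" + W + "pStyle")
        sizes = [sz.get(W + "val") for sz in para.iter(W + "sz")]
        sizes = [s for s in sizes if s]
        # <w:b/> turns bold on unless its w:val is "0" or "false"
        bold = any(b.get(W + "val") not in ("0", "false")
                   for b in para.iter(W + "b"))
        line = ET.SubElement(root, "line")
        line.set("style", style.get(W + "val") if style is not None else "")
        line.set("bold", "true" if bold else "false")
        # w:sz is expressed in half-points; only the first run's size is kept
        line.set("fontSize", str(int(sizes[0]) / 2) if sizes else "")
        line.text = text
    return root
```

The result can be serialized with `ET.tostring(root, encoding="unicode")`. Attributes such as style name, boldness and font size are the kind of layout signal that new segmentation features could be derived from; page coordinates are not available in docx without a layout engine.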
