Hi, Thank you for building such a good open source product.
For my use case, I am looking for architecture details and the training process for the segmentation model. In DeLFT there are models for the other cases (header, citation, etc.) but not for segmentation.
My use case is building a solution for docx. The route of docx -> pdf -> grobid wasn't promising because of the limitations of pdfalto and other OCR tools.
If you can share or point me to the documentation for the segmentation model, that would be very helpful.
By default, the segmentation model runs with CRF using custom features. This model works at line level, not at token level like the other models.
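To illustrate what "line level" means here, below is a minimal sketch of computing one feature vector per line instead of per token. The function names and the feature set are hypothetical and chosen only for illustration; they are not Grobid's actual segmentation features.

```python
# Hypothetical line-level features: one dict per line, not per token.
# This is NOT Grobid's real feature set, only an illustration of the idea.

def line_features(line: str, index: int, total: int) -> dict:
    tokens = line.split()
    first = tokens[0] if tokens else ""
    return {
        "first_token": first.lower(),                 # lexical cue from the line start
        "n_tokens": len(tokens),                      # rough line length
        "starts_upper": first[:1].isupper(),          # capitalized first token
        "has_digit": any(c.isdigit() for c in line),  # numbers anywhere in the line
        # relative position of the line in the sequence, from 0.0 to 1.0
        "relative_pos": round(index / max(total - 1, 1), 2),
    }


def sequence_features(lines: list) -> list:
    """Return one feature dict per line, in document order."""
    return [line_features(l, i, len(lines)) for i, l in enumerate(lines)]
```

A sequence labeler (CRF or RNN) would then assign one label per line from feature vectors like these, rather than one label per token.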
You can train and use the segmentation model with RNN and DeLFT (see #964), but for the moment it is less accurate than CRF. I didn't upload this model to DeLFT, but you can retrain it with:
Four years ago, I created a branch with Grobid supporting docx as input, see #515
It simply used Apache POI to parse docx files and convert them to PDF. This conversion gave poor results (lots of docx parsing failures), so indeed the route of docx -> pdf -> grobid is not promising. I wanted to try docx4j but I lost interest in the topic :)
A kind of docx -> xml converter, keeping some layout information, would be the best way to support docx I think. Then the segmentation model should be revisited/retrained to support this new input and new layout features.
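As a rough sketch of what such a docx -> xml converter could look like, the following reads `word/document.xml` directly from the docx archive using only the Python standard library and emits one `<p>` element per paragraph with two layout hints. The output format, the function name `docx_to_layout_xml`, and the choice of hints (paragraph style and bold) are assumptions for illustration, not an existing Grobid input format.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def docx_to_layout_xml(docx_bytes: bytes) -> str:
    """Convert a docx (given as bytes) to a small XML with per-paragraph layout hints."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    out = ET.Element("document")
    for p in root.iter(f"{{{W}}}p"):  # every paragraph, including those in tables
        style_el = p.find("w:pPr/w:pStyle", NS)
        style = style_el.get(f"{{{W}}}val", "") if style_el is not None else ""
        # Simplification: any <w:b> in a run counts as bold (ignores w:val="false")
        bold = p.find("w:r/w:rPr/w:b", NS) is not None
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        el = ET.SubElement(out, "p", style=style, bold=str(bold).lower())
        el.text = text
    return ET.tostring(out, encoding="unicode")
```

Attributes such as `style` and `bold` could then serve as layout features for a retrained segmentation model, in place of the coordinates and font information that come from the PDF route.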