Documentation: schematic and algorithms or heuristics used in-between. #119

Open
prhbrt opened this issue Jan 8, 2024 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

@prhbrt

prhbrt commented Jan 8, 2024

I'm trying to get a better understanding of your work and to create a workflow that allows batching pages without reloading the models (which currently takes a lot of time). However, your code is sometimes hard to follow. Could you provide a (crude) schematic of the different models you're using, as a graph, and a quick summary of the algorithmic (non-neural-network) parts?

Currently I'm mostly confused by how the reading order is decided. What is the algorithm there?

@cneud
Member

cneud commented Jan 8, 2024

Hi @prhbrt, thank you for your questions. A rough diagram showing the flow of the data through the various models can be found here.

And here is an excerpt from our paper describing the heuristics used for reading order detection:

We sort columns from left to right and the text regions they contain from top to bottom. We then divide the whole page into boxes based on separators and headings.
At this early stage we need the coordinates of the separators and headings and the X-coordinates of the columns. The algorithm can be explained as follows:
First, separators (or headings) that span the whole width of the page, i.e. all of its columns, define the main boxes, which are read from top to bottom. Then the X-coordinates of the columns in each main box are detected by summing the text regions along the Y-axis; the minima of this sum give the X-coordinates of the columns. If a main box includes separators covering multiple columns, the box is divided into upper and lower boxes, and the new boxes inside the main box are then ordered from left to right. Reading order inside boxes with multiple columns is again from left to right. Finally, to get the reading order for text contours, the contours inside each box are ordered from top to bottom.
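
To make the excerpt more concrete, here is a minimal Python sketch of the heuristic as I read it. It is not Eynollah's actual code: the function names `column_separators` and `reading_order`, the bounding-box representation `(x0, y0, x1, y1)` and the `threshold` parameter are all assumptions made for illustration. It also skips the step that splits a main box at separators covering only some of the columns, and it uses boxes instead of contours.

```python
from bisect import bisect_right

import numpy as np


def column_separators(mask, threshold=0):
    """Return x-coordinates of gaps between columns (hypothetical helper).

    mask is a 2D binary array with 1 where text regions are. Summing along
    the Y-axis gives one value per x; runs of minimal values that do not
    touch the page edge are treated as column gaps.
    """
    profile = mask.sum(axis=0)          # one sum per x-coordinate
    low = profile <= threshold
    width = low.size
    seps = []
    x = 0
    while x < width:
        if low[x]:
            start = x
            while x < width and low[x]:
                x += 1
            # x is now one past the end of the run; skip page margins
            if start > 0 and x < width:
                seps.append((start + x - 1) // 2)
        else:
            x += 1
    return seps


def reading_order(regions, full_width_separators_y, page_shape):
    """Order region indices: main boxes top to bottom, columns left to
    right, regions inside a column top to bottom (simplified sketch)."""
    height, width = page_shape
    cuts = sorted(full_width_separators_y)

    # 1. Full-width separators split the page into main boxes.
    boxes = {}
    for idx, (x0, y0, x1, y1) in enumerate(regions):
        box = bisect_right(cuts, (y0 + y1) / 2)
        boxes.setdefault(box, []).append(idx)

    order = []
    for box in sorted(boxes):
        members = boxes[box]
        # 2. Detect columns inside this main box from the projection profile.
        mask = np.zeros((height, width), dtype=np.uint8)
        for i in members:
            x0, y0, x1, y1 = regions[i]
            mask[y0:y1, x0:x1] = 1
        seps = column_separators(mask)

        # 3. Sort by column index first, then by top edge.
        def key(i, seps=seps):
            x0, y0, x1, _ = regions[i]
            return (bisect_right(seps, (x0 + x1) / 2), y0)

        order.extend(sorted(members, key=key))
    return order


if __name__ == "__main__":
    # A heading above a full-width separator at y=12, then two columns.
    regions = [
        (5, 0, 95, 10),    # 0: heading
        (5, 15, 45, 40),   # 1: left column, upper
        (55, 15, 95, 40),  # 2: right column, upper
        (5, 45, 45, 80),   # 3: left column, lower
        (55, 45, 95, 80),  # 4: right column, lower
    ]
    print(reading_order(regions, [12], (100, 100)))
```

On this toy page the left column is read in full before the right one, which is the behaviour the excerpt describes for a main box with several columns.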

Note that @vahidrezanezhad is currently working on a version that infers the reading order using a machine learning model, see the most recent commits here.

@cneud
Member

cneud commented Jan 8, 2024

that allows batching pages without reloading the models

btw, since version 0.3.0, Eynollah also has a batch mode (using the -di <directory> flag) that allows processing all images in a directory without having to reload the models for each - might perhaps be useful for you?

@cneud cneud added the documentation Improvements or additions to documentation label Jan 8, 2024