reverse order of glyphs inside words in PAGE-File for RTL languages #185

Open
MihoMahi opened this issue Mar 8, 2022 · 3 comments

@MihoMahi

MihoMahi commented Mar 8, 2022

When using, for example, an Arabic model, recognition works fine, but the words inside the generated PAGE-XML contain reversed letters. The sequence of the words themselves is correct. Here is an example.
Generated word with the wrong sequence of letters:

                <pc:Word id="region0001_line0001_word0000">
                    <pc:Coords points="1620,372 1620,402 1703,402 1703,375 1647,376"/>
                    <pc:TextEquiv conf="0.877831573486328">
                        <pc:Unicode>رصم</pc:Unicode>
                    </pc:TextEquiv>
                </pc:Word>

But the line containing the recognized word should look like this:

                        <pc:Unicode>مصر</pc:Unicode>

(I know it is not easy to see that the word is reversed, because Arabic letters change appearance depending on their position inside a word, but that is handled by the font.)
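
Since contextual shaping hides the reversal on screen, one way to make it visible is to list the Unicode character names in logical (storage) order. The following is only a small sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration, and the two strings are the ones quoted above, written as escapes so that bidirectional display cannot reorder them.

```python
import unicodedata


def describe(text):
    """Return the Unicode name of each code point, in storage order."""
    return [unicodedata.name(ch) for ch in text]


# Text as found in the generated PAGE-XML (letters reversed)
generated = "\u0631\u0635\u0645"
# Text as it should be stored
expected = "\u0645\u0635\u0631"

for label, word in (("generated", generated), ("expected", expected)):
    print(label, describe(word))
```

The generated word starts with REH and ends with MEEM, while the expected word starts with MEEM and ends with REH.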

Here is the equivalent portion of the image:
[Image: the word Msr]

REMARK:
When using Tesseract standalone and generating ALTO, the sequence is correct!

@bertsky
Collaborator

bertsky commented Mar 8, 2022

Thanks @MihoMahi for the report!

Does this only happen in the default textequiv_level=word, or also with textequiv_level=glyph?

@MihoMahi
Author

MihoMahi commented Mar 9, 2022

@bertsky thank you for the hint. In fact, I had tested textequiv_level=glyph before, but saw a lot of glyphs which I couldn't assign. Now I have examined the generated XML again and found that the word itself is represented correctly. I now realize that the many preceding letters simply list the alternative recognition results on glyph level, each with its confidence score.

@bertsky
Collaborator

bertsky commented Mar 9, 2022

Yes, you might want to ignore the glyph level, as it contains alternative OCR hypotheses.

But the difference in the word level tells us that the blame is actually on Tesseract: it yields the wrong order when querying the result iterator on word level (and – I presume – on line and region level) for RTL script.

(The reason that the standalone CLI with ALTO renderer gets it right is merely because that only uses the glyph/symbol level iterator.)

@stweil I have not seen any examples for using the iterators on RTL data – is this a bug in Tesseract, or can we do something about it here (perhaps using ParagraphIsLtr)?
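
For illustration only, a post-processing workaround on the caller's side could look roughly like the sketch below. It is not what the processor or Tesseract does; it assumes (unconfirmed in this thread) that the word-level text is a plain per-code-point reversal of the logical order, and it decides directionality from the Unicode bidirectional classes instead of asking Tesseract via ParagraphIsLtr. The function names `is_rtl_word` and `fix_word_order` are hypothetical.

```python
import unicodedata

# Bidirectional classes of strong right-to-left characters
# ("R" covers e.g. Hebrew, "AL" covers Arabic letters).
RTL_CLASSES = {"R", "AL"}


def is_rtl_word(text):
    """True if strong RTL characters outnumber strong LTR characters."""
    rtl = sum(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)
    ltr = sum(unicodedata.bidirectional(ch) == "L" for ch in text)
    return rtl > ltr


def fix_word_order(text):
    """Reverse the code points of a predominantly RTL word.

    Assumption: the input is an exact per-code-point reversal of the
    logical order. Words mixing RTL letters with digits or Latin text
    would need real bidi handling and are not covered here.
    """
    return text[::-1] if is_rtl_word(text) else text


# The word from the report, written as escapes to avoid display reordering
print(fix_word_order("\u0631\u0635\u0645") == "\u0645\u0635\u0631")
```

Whether such a heuristic is acceptable depends on how Tesseract actually orders word-level results for mixed-direction text, which is exactly the open question above.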
