reverse order of glyphs inside words in PAGE-File for RTL languages #185

MihoMahi · 2022-03-08T15:32:03Z

When using, for example, an Arabic model, recognition works fine, but the words in the generated PAGE-XML contain reversed letters. The sequence of the words themselves is correct. Here is an example:
Generated word with the wrong letter order:

                <pc:Word id="region0001_line0001_word0000">
                    <pc:Coords points="1620,372 1620,402 1703,402 1703,375 1647,376"/>
                    <pc:TextEquiv conf="0.877831573486328">
                        <pc:Unicode>رصم</pc:Unicode>
                    </pc:TextEquiv>
                </pc:Word>

But the line containing the recognized word should look like this:

                        <pc:Unicode>مصر</pc:Unicode>

(I know it is not easy to see that the word is reversed, because Arabic letters change appearance depending on their position inside the word, but this is handled by the font.)
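Since contextual shaping hides the reversal, one way to make it visible is to list the Unicode names of the code points in storage order. The following is only a small sketch using the Python standard library; the helper name `describe` is made up for illustration, and the two strings are the ones quoted above, written as escapes so that no renderer can reorder them.

```python
import unicodedata


def describe(text):
    """Return the Unicode character names of `text` in storage (logical) order."""
    return [unicodedata.name(ch) for ch in text]


# String found in the generated PAGE-XML (REH, SAD, MEEM)
reported = "\u0631\u0635\u0645"
# String that was expected (MEEM, SAD, REH)
expected = "\u0645\u0635\u0631"

for label, word in (("reported", reported), ("expected", expected)):
    print(label, describe(word))

# The reported word is the exact mirror image of the expected one.
print(reported[::-1] == expected)
```

Comparing the two name lists shows the same three letters in opposite order, independent of how a font or editor displays them.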

Here is the equivalent portion of the image:

REMARK:
When using Tesseract standalone and generating ALTO, the sequence is correct!


bertsky · 2022-03-08T15:55:47Z

Thanks @MihoMahi for the report!

Does this only happen in the default textequiv_level=word, or also with textequiv_level=glyph?

MihoMahi · 2022-03-09T07:35:20Z

@bertsky thank you for the hint. In fact, I had tested textequiv_level=glyph before, but saw a lot of glyphs which I couldn't assign. Now I have examined the generated XML again and found that the word itself is represented correctly. I now realize that the many preceding letters simply list the alternative recognition results on glyph level with their confidence scores.

bertsky · 2022-03-09T11:07:36Z

Yes, you might want to ignore the glyph level, as it contains alternative OCR hypotheses.

But the difference in the word level tells us that the blame is actually on Tesseract: it yields the wrong order when querying the result iterator on word level (and – I presume – on line and region level) for RTL script.

(The reason that the standalone CLI with ALTO renderer gets it right is merely because that only uses the glyph/symbol level iterator.)

@stweil I have not seen any examples for using the iterators on RTL data – is this a bug in Tesseract, or can we do something about it here (perhaps using ParagraphIsLtr)?
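For illustration only, a downstream workaround could look roughly like the sketch below. It is not the fix adopted by this project or by Tesseract, and it does not call the Tesseract API at all: it assumes, based solely on the report above, that word-level text for RTL script arrives as the exact mirror image of the logical order. The function names `is_rtl_word` and `to_logical_order` are hypothetical, and the RTL check relies on the Unicode bidirectional classes from the standard library rather than on ParagraphIsLtr.

```python
import unicodedata

# Bidi classes of strongly right-to-left characters (Hebrew etc. / Arabic).
RTL_CLASSES = {"R", "AL"}
STRONG_CLASSES = RTL_CLASSES | {"L"}


def is_rtl_word(text):
    """True if the word has strong characters and all of them are RTL."""
    strong = [
        cls
        for cls in (unicodedata.bidirectional(ch) for ch in text)
        if cls in STRONG_CLASSES
    ]
    return bool(strong) and all(cls in RTL_CLASSES for cls in strong)


def to_logical_order(text):
    """Mirror a purely RTL word; leave LTR or mixed words untouched.

    ASSUMPTION: the input is a plain character-by-character reversal.
    Combining marks and embedded digits would need extra care.
    """
    return text[::-1] if is_rtl_word(text) else text


# Word from the report (REH, SAD, MEEM) mirrored to MEEM, SAD, REH.
fixed = to_logical_order("\u0631\u0635\u0645")
print(fixed == "\u0645\u0635\u0631")
```

A plain reversal like this would misplace diacritics and reverse numbers inside RTL text, so a proper fix in the iterator handling would be preferable to patching the strings afterwards.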
