When using, for example, an Arabic model, recognition works fine, but the words inside the generated PAGE-XML contain reversed letters. The sequence of the words themselves is correct. Here is an example:
Generated word with the wrong sequence of letters:
But the line containing the recognized word should look like this:
<pc:Unicode>مصر</pc:Unicode>
(I know it is not easy to see that it is reversed, because Arabic letters change appearance depending on their position inside the word, but this is handled by the font.)
Here is the equivalent portion of the image:
REMARK:
When using Tesseract standalone and generating ALTO, the sequence is correct!
@bertsky thank you for the hint. In fact, I had tested textequiv_level=glyph before, but saw a lot of glyphs which I couldn't assign. Now I have examined the generated XML again and found that the word itself is represented correctly. I now realize that the many preceding letters are simply the alternative recognition results on glyph level, listed with their confidence scores.
Yes, you might want to ignore the glyph level, as it contains alternative OCR hypotheses.
But the difference in the word level tells us that the blame is actually on Tesseract: it yields the wrong order when querying the result iterator on word level (and – I presume – on line and region level) for RTL script.
(The reason that the standalone CLI with ALTO renderer gets it right is merely because that only uses the glyph/symbol level iterator.)
@stweil I have not seen any examples for using the iterators on RTL data – is this a bug in Tesseract, or can we do something about it here (perhaps using ParagraphIsLtr)?
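To make the symptom discussed above concrete, here is a minimal sketch of a diagnostic and workaround in plain Python. It does not call Tesseract or tesserocr at all: `word_text` and `glyph_texts` are hypothetical stand-ins for what the word-level and symbol-level iterators would return, and the helper names `has_rtl` and `repair_word` are made up for illustration. It also assumes that the reversal happens glyph by glyph, which has not been verified here.

```python
import unicodedata


def has_rtl(text: str) -> bool:
    """Return True if any character has a strong right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def repair_word(word_text: str, glyph_texts: list[str]) -> str:
    """Prefer the glyph-level order if the word-level text is its exact reversal.

    word_text   -- hypothetical result of querying the iterator on word level
    glyph_texts -- hypothetical results on glyph/symbol level, in iteration order
    """
    logical = "".join(glyph_texts)
    reversed_glyphs = "".join(reversed(glyph_texts))
    if has_rtl(word_text) and word_text != logical and word_text == reversed_glyphs:
        # The word level came back in reversed order: rebuild it from the glyphs.
        return logical
    # Otherwise leave the word-level text untouched.
    return word_text


# The example word from this issue, written with escapes to avoid display ambiguity.
glyphs = ["\u0645", "\u0635", "\u0631"]   # meem, sad, reh in logical order
wrong = "\u0631\u0635\u0645"              # same letters in reversed order
fixed = repair_word(wrong, glyphs)
```

A check like this would only paper over the problem on the caller's side; whether the word-level iterator itself should return logical order is still the open question for Tesseract.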