Tess4j - Error opening tessdata file by non-ASCII path #190

Open
AliaksandrKi opened this issue Jul 22, 2020 · 4 comments

Comments

@AliaksandrKi

OS: Windows 10
IDE: IntelliJ
tess4j: 4.5.1

I have two folders on my disk with identical 'eng.traineddata' files:

c:/data/eng.traineddata
c:/дата/eng.traineddata 

Tesseract fails when running the following code:

Tesseract instance = new Tesseract();
// instance.setDatapath("c:/data");    // works without issues
instance.setDatapath("c:/дата");    // see Error message below
instance.setLanguage("eng");

String result = instance.doOCR(new File("c:/numbers.jpg"));

Error message:

Error opening data file c:/дата/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
@nguyenq
Owner

nguyenq commented Jul 22, 2020

The error is pretty clear: you can't have non-ASCII characters in the tessdata path. 'д' is not an ASCII character.

@Snipx

Snipx commented Jul 23, 2020

@nguyenq thanks for the feedback! Could you provide a bit more context here? For example, is the root cause on the Tesseract side or on the wrapper side, are there any workarounds available, and are there any plans to support non-ASCII paths?

@nguyenq
Owner

nguyenq commented Jul 23, 2020

It could be JNA or it could be inside Tesseract native code. On Linux, Tesseract and its tessdata directory are placed in standard system directories, so I doubt Tesseract code would ever need to deal with non-ASCII characters in those paths.

On Windows, you may want to try a relative path that contains no non-ASCII characters to see if it works.
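
A minimal sketch of that suggestion, for illustration only: it expresses the tessdata directory relative to a base directory and checks whether the resulting string is ASCII-only. The class name `TessdataPathCheck`, both helper methods, and the `tessdata` subdirectory are hypothetical. Whether Tess4J or the native Tesseract code resolves a relative datapath against the JVM working directory is an assumption that has not been verified in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TessdataPathCheck {

    /** Returns true when every character of the string is 7-bit ASCII. */
    public static boolean isAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    /** Expresses dataDir relative to baseDir, using forward slashes. */
    public static String relativeDatapath(Path baseDir, Path dataDir) {
        return baseDir.relativize(dataDir).toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        // Assumption: the tessdata directory sits below the working directory,
        // so the non-ASCII parent folders drop out of the relative path.
        Path workingDir = Paths.get("").toAbsolutePath();
        Path dataDir = workingDir.resolve("tessdata"); // hypothetical location

        String relative = relativeDatapath(workingDir, dataDir);
        System.out.println(relative + " is ASCII-only: " + isAscii(relative));
    }
}
```

If the relative path turns out to be ASCII-only, it could be passed to `instance.setDatapath(...)` in place of the absolute path to test whether the error goes away.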

Maybe related to Issue #75.

@Mararsh

Mararsh commented Oct 8, 2020

The failure may happen when non-ASCII characters appear in the source file name, the data file names, or the target file name.
Meanwhile, the same file names work when the tesseract command is run through ProcessBuilder.

You are right that the cause may be on the Java side, in how it passes file names to the native API.
A possibly related JDK bug:
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8205991
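
The ProcessBuilder approach mentioned above can be sketched roughly as follows. This is an illustration, not code from this thread: the class name `TesseractCli` and its methods are hypothetical, and it assumes the `tesseract` executable is on the PATH and accepts the `--tessdata-dir` and `-l` options and `stdout` as the output base.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class TesseractCli {

    /** Builds the argument list for the tesseract command-line tool. */
    public static List<String> buildCommand(String image, String tessdataDir, String lang) {
        return Arrays.asList(
                "tesseract",                   // assumption: executable is on PATH
                image,
                "stdout",                      // send recognized text to standard output
                "--tessdata-dir", tessdataDir,
                "-l", lang);
    }

    /** Runs tesseract as a child process and returns the recognized text. */
    public static String runOcr(String image, String tessdataDir, String lang)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(image, tessdataDir, lang));
        // Keep diagnostics out of the OCR text by forwarding stderr to the parent.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();

        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append(System.lineSeparator());
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text.toString();
    }
}
```

A call such as `TesseractCli.runOcr("c:/numbers.jpg", "c:/дата", "eng")` would mirror the failing Tess4J example; whether it succeeds on a given system still needs to be checked there.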

nguyenq changed the title from "Tessj4 - Error opening tessdata file by non-ASCII path" to "Tess4j - Error opening tessdata file by non-ASCII path" on Jul 26, 2021