Converting JSON to HOCR (Segmentation Fault) #21

Open
pauf opened this issue Jan 17, 2019 · 7 comments

@pauf

pauf commented Jan 17, 2019

First off, thanks for an awesome piece of software. For the most part, it works great!

For some reason, after converting many thousands of pages, I've come across this error for one page only:

gcv2hocr "/mydir/error1.json" "/mydir/test.hocr"

Response: "Segmentation fault"

Initially I wondered whether the JSON was too complex, or whether it contained so much information that something overflowed, but judging by some of the other pages I've run through the software, this does not appear to be the case.

Hope this helps.

@pauf
Author

pauf commented Jan 17, 2019

After some further experimentation, I think I've found the issue:

    {
      "description": "R&D",
      "boundingPoly": {
        "vertices": [
          {
            "x": 1307,
            "y": 1130
          },
          {
            "x": 1342,
            "y": 1129
          },
          {
            "x": 1342,
            "y": 1141
          },
          {
            "x": 1307,
            "y": 1142
          }
        ]
      }
    },

Doesn't work (Segfault)

    {
      "description": "RAD", <--------------------------- CHANGE
      "boundingPoly": {
        "vertices": [
          {
            "x": 1307,
            "y": 1130
          },
          {
            "x": 1342,
            "y": 1129
          },
          {
            "x": 1342,
            "y": 1141
          },
          {
            "x": 1307,
            "y": 1142
          }
        ]
      }
    },

Does work.

It would seem the C version of the code (I haven't checked the Python implementation) doesn't like the ampersand character (&). As this is valid output from Google, it's probably worth fixing where possible.
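In case it helps anyone locate other affected pages, here is a minimal sketch in Python. It is not part of gcv2hocr; the helper name `find_special_descriptions` and the inline sample are made up for illustration. It assumes the annotations are a list of objects with a `description` field, as in the snippets above, and lists the entries whose text contains characters that need escaping in XML.

```python
import json

# Characters that have a special meaning in XML/HTML output.
XML_SPECIAL = set("&<>\"'")


def find_special_descriptions(annotations):
    """Return (index, description) pairs whose text contains XML-special characters."""
    hits = []
    for i, ann in enumerate(annotations):
        text = ann.get("description", "")
        if any(ch in XML_SPECIAL for ch in text):
            hits.append((i, text))
    return hits


# Hypothetical sample modelled on the two snippets above (bounding boxes shortened).
sample = json.loads("""
[
  {"description": "R&D", "boundingPoly": {"vertices": [{"x": 1307, "y": 1130}]}},
  {"description": "RAD", "boundingPoly": {"vertices": [{"x": 1307, "y": 1130}]}}
]
""")

print(find_special_descriptions(sample))  # [(0, 'R&D')]
```

Only the "R&D" entry is reported, which matches the behaviour I saw: the page converts once that one description is changed.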

@dinosauria123
Owner

Thank you for using gcv2hocr and for finding the issue.

I will fix it; please wait a while.

“&” has to be replaced with “&amp;”. This has already been implemented for single letters, but this problem comes from a word that contains the character.

@pauf
Author

pauf commented Jan 17, 2019

Thanks for the quick reply!

No problem, I found a solution in the meantime, which might help while we wait:

sed -i -e 's/&/&amp;/g' /path/to/json/file.json
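One caveat with the one-liner above: it rewrites every ampersand, so running it twice on the same file escapes the already escaped text again. A rough Python sketch of a variant that is safe to re-run follows. The function name `escape_ampersands_once` is my own, and it assumes the only entities that can already be present are the five predefined XML ones.

```python
import re

# Match "&" only when it does not already start one of the predefined XML entities.
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|quot|apos);)")


def escape_ampersands_once(text):
    """Escape bare ampersands, leaving existing entities such as &amp; untouched."""
    return BARE_AMPERSAND.sub("&amp;", text)


once = escape_ampersands_once('"description": "R&D"')
twice = escape_ampersands_once(once)
print(once)           # "description": "R&amp;D"
print(once == twice)  # True
```

The trade-off is that a page whose recognised text literally contains something like "&amp;" would be left as-is, so treat this as a stopgap rather than a proper fix.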

@junior1q94
Contributor

Hello, @dinosauria123 @pauf

I have encountered the same issue and decided to make a patch. It should work for any XML entity that needs to be escaped.
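For readers who just want the idea, here is a small Python sketch of escaping recognised text before it is written into hOCR markup. It is an illustration only, not the patch itself (which targets the C code); `escape_for_hocr` and `word_span` are hypothetical names, and the span layout is a simplified assumption rather than gcv2hocr's exact output.

```python
from xml.sax.saxutils import escape


def escape_for_hocr(text):
    """Escape &, < and > (handled by escape()) plus both quote characters."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


def word_span(text, bbox):
    """Build a simplified hOCR word span; bbox is (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    return (
        f'<span class="ocrx_word" title="bbox {x0} {y0} {x1} {y1}">'
        f"{escape_for_hocr(text)}</span>"
    )


# Bounding box taken from the min/max of the vertices in the report above.
print(word_span("R&D", (1307, 1129, 1342, 1142)))
# <span class="ocrx_word" title="bbox 1307 1129 1342 1142">R&amp;D</span>
```

Escaping at the point where text is emitted, rather than editing the JSON beforehand, keeps the input file untouched and covers "<", ">" and quotes as well as "&".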

Hope this is useful.

@IAutil

IAutil commented Sep 30, 2019

Hi @dinosauria123 and everybody,
I have an issue with gcv2hocr; it looks like Google has changed something. When I run gcv2hocr on the test.json that ships with the project, the result is fine. But if I run Google OCR on test.jpg and pass the resulting JSON to gcv2hocr, I get a different hOCR. The most important differences I saw are that the "lang" field wasn't parsed and that the letters are now numbers. It looks like an encoding mistake or something similar, but it's really difficult to handle.

Here is an example comparing the project's test.hocr with mine:
1. test.hocr of the project:

O p t i c a l

===============================================================

2. test.hocr of new gcv execution:

81 104 194 104 338 104 80 179 119 177 197 177 221 178

@dinosauria123
Owner

Thank you for your report.
I will check the JSON output, but patches may be delayed because I am busy with my job right now.

@dinosauria123
Owner

I have checked gcv2hocr, but the output seems to be fine.
Did you use gcvocr.sh to get the JSON output?
Please attach your JSON output to your comment.
