gcv2ocr.py does not convert json #35

Open
sarepal opened this issue May 27, 2020 · 6 comments

Comments

@sarepal

sarepal commented May 27, 2020

I'm working with the attached JSON file from GCV, but when I run gcv2ocr.py, the resulting hOCR contains only metadata and no content. Attachment: osh-sample-1911a-0001.json.zip

@dinosauria123
Owner

Thank you for your report.
Did you use gcvocr.sh to get the JSON file?

@sarepal
Author

sarepal commented Jun 2, 2020

No, I used a script based on a Google Cloud Vision tutorial. I'll look into using the shell script instead.

@svamsip

svamsip commented Jul 8, 2020

@sarepal @dinosauria123
Is there any update on how to convert the JSON file attached above to hOCR? Thanks in advance.

@sarepal
Author

sarepal commented Nov 18, 2020

Update: I got the correct API key to generate the json using gcvocr.sh and was able to convert it to hocr with gcv2ocr.py.

However, I noticed that the hOCR output wraps every word, rather than every line of text, in a <span class='ocr_line'....>.

@dinosauria123 does gcv2ocr.py only deal with the data in the json's "textAnnotations" and not the data in "fullTextAnnotation"? Thanks.
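For context on the question above: in a Google Cloud Vision response, "textAnnotations" is a flat list of word-level entries, while "fullTextAnnotation" nests pages, blocks, paragraphs, words, and symbols, and each symbol can carry a detectedBreak marker. The following is a minimal sketch of how lines might be rebuilt from that hierarchy. It is not taken from gcv2ocr.py or gcv2hocr2.py; the function name `lines_from_full_text` is hypothetical, and treating HYPHEN as a line ending is an assumption made for illustration.

```python
# Break types that add a space after a word versus those that end a line.
SPACE_BREAKS = {"SPACE", "SURE_SPACE"}
LINE_BREAKS = {"EOL_SURE_SPACE", "LINE_BREAK", "HYPHEN"}  # HYPHEN handling is an assumption


def lines_from_full_text(annotation):
    """Return the text lines found in a fullTextAnnotation dict (hypothetical helper)."""
    lines = []
    current = ""
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    symbols = word.get("symbols", [])
                    current += "".join(s.get("text", "") for s in symbols)
                    if not symbols:
                        continue
                    # The break marker, if any, sits on the last symbol of the word.
                    prop = symbols[-1].get("property", {})
                    break_type = prop.get("detectedBreak", {}).get("type")
                    if break_type in SPACE_BREAKS:
                        current += " "
                    elif break_type in LINE_BREAKS:
                        lines.append(current)
                        current = ""
    if current:
        # Flush trailing text that was not closed by a line break.
        lines.append(current.rstrip())
    return lines
```

Each returned string corresponds to one candidate line, which is the unit an ocr_line span would normally wrap.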

@sarepal
Author

sarepal commented Nov 19, 2020

I see that gcv2hocr2.py does handle fullTextAnnotation. When I try to run it, this is the output I receive:

python ../gcv2hocr2.py osh-sample-1911a-0001.jpg.json > output.hocr

Traceback (most recent call last):
  File "../gcv2hocr2.py", line 184, in <module>
    page = fromResponse(resp, str(args.gcv_file.rsplit('.',1)[0]), **args.__dict__)
  File "../gcv2hocr2.py", line 103, in fromResponse
    for page_id, page_json in enumerate(resp['fullTextAnnotation']['pages']):
KeyError: 'fullTextAnnotation'

The JSON does contain a fullTextAnnotation object so I don't know why this error would occur. I'm attaching the JSON I tried to process. If there's a way to get this script to successfully run, I would be very grateful. Thanks again.
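A KeyError like this can occur even when the key is present somewhere in the file, if it sits deeper in the structure than the script expects. The sketch below is one way to check where a key actually lives in a parsed JSON document. The helper name `find_key_paths` is hypothetical and is not part of gcv2hocr2.py.

```python
import json


def find_key_paths(obj, key, path=""):
    """Return every path at which `key` appears in a nested dict/list structure."""
    paths = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            child = f"{path}.{k}" if path else k
            if k == key:
                paths.append(child)
            paths.extend(find_key_paths(v, key, child))
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            paths.extend(find_key_paths(item, key, f"{path}[{i}]"))
    return paths


if __name__ == "__main__":
    import sys

    # Usage: python find_key.py osh-sample-1911a-0001.jpg.json
    with open(sys.argv[1], encoding="utf-8") as fh:
        data = json.load(fh)
    print(find_key_paths(data, "fullTextAnnotation"))
```

If the printed path starts with something other than the bare key, the script is looking one level too high.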
osh-sample-1911a-0001.jpg.json.zip

@sarepal
Author

sarepal commented Nov 19, 2020

UPDATE: I now have gcv2hocr2.py working. I just edited line 103 to this and it worked:

for page_id, page_json in enumerate(resp['responses'][0]['fullTextAnnotation']['pages']):
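The edit above hard-codes the wrapped layout, so it would presumably fail on a JSON file where fullTextAnnotation is at the top level, which is what the original line 103 expected. A minimal sketch of a lookup that accepts both layouts follows. The helper name `get_full_text_annotation` is hypothetical, not something gcv2hocr2.py provides, and it only looks at the first entry of "responses", as the edit above does.

```python
def get_full_text_annotation(resp):
    """Return the fullTextAnnotation dict from either JSON layout (hypothetical helper)."""
    # Layout 1: the annotation sits at the top level of the parsed JSON.
    if "fullTextAnnotation" in resp:
        return resp["fullTextAnnotation"]
    # Layout 2: the annotation is wrapped in a "responses" list; use the first entry.
    responses = resp.get("responses") or []
    if responses and "fullTextAnnotation" in responses[0]:
        return responses[0]["fullTextAnnotation"]
    raise KeyError("fullTextAnnotation")
```

With a helper like this, the loop could read `for page_id, page_json in enumerate(get_full_text_annotation(resp)['pages']):` and work for both kinds of file.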
