Azure ocr with ocrmypdf #595

sandipan1 · 2020-07-20T08:01:59Z

ocrmypdf works great on PDFs of scanned images. However, with handwritten letters the Tesseract OCR engine often struggles.
How do I use the Azure OCR API as the OCR engine while keeping everything else the same?

jbarlow83 · 2020-07-20T08:50:26Z

OCRmyPDF has a plugin interface that would allow you to replace Tesseract with a different OCR engine such as Azure. To the best of my knowledge no one has published a plugin that does this (or for that matter, any plugin, since the plugin interface is quite new).

OCRmyPDF can only interpret the hOCR format or a text only PDF, so you'd have to convert Azure's output to one of those two as well, since unfortunately it does not support either standard (last time I looked, anyway).

sandipan1 · 2020-07-23T20:33:38Z

The Azure output looks something like this:
{"status": "Succeeded", "recognitionResult": {"lines": [{"boundingBox": [292, 146, 780, 144, 781, 218, 293, 220], "text": "string1", "words": [{"boundingBox": [297, 150, 774, 145, 775, 218, 300, 218], "text": "string2"}]}, {"boundingBox": [327, 215, 748, 219, 747, 255, 326, 252], "text": "string3 string4", "words": [{"boundingBox": [330, 219, 496, 219, 498, 253, 332, 251], "text": "string3"}, {"text": "string4"}]}]}}

Is it possible to convert this into one of the formats that you mentioned?

PackElend · 2020-12-23T18:23:11Z

> Is it possible to convert this into one of the formats that you mentioned?

If you look at the hOCR format example given on Wikipedia, I would say yes. Besides, have a look here:
https://stackoverflow.com/questions/62074677/generate-hocr-from-microsoft-computer-vision-ocr
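
As a rough illustration of what such a conversion could look like, here is a minimal sketch for the JSON shape posted above. It collapses each 8-number polygon into an axis-aligned `bbox` and emits `ocr_line` and `ocrx_word` spans. The function names, the element ids, and the `page_width`/`page_height` arguments are assumptions made for this example (the posted response carries no page size), and whether OCRmyPDF accepts exactly this structure has not been checked here.

```python
from html import escape

# Sample taken from the response posted earlier in this thread.
# The last word has no boundingBox in that post, so it cannot be placed.
SAMPLE = {
    "status": "Succeeded",
    "recognitionResult": {"lines": [
        {"boundingBox": [292, 146, 780, 144, 781, 218, 293, 220],
         "text": "string1",
         "words": [{"boundingBox": [297, 150, 774, 145, 775, 218, 300, 218],
                    "text": "string2"}]},
        {"boundingBox": [327, 215, 748, 219, 747, 255, 326, 252],
         "text": "string3 string4",
         "words": [{"boundingBox": [330, 219, 496, 219, 498, 253, 332, 251],
                    "text": "string3"},
                   {"text": "string4"}]},
    ]},
}


def quad_to_bbox(quad):
    """Turn [x1, y1, ..., x4, y4] into an hOCR 'bbox x0 y0 x1 y1' string."""
    xs, ys = quad[0::2], quad[1::2]
    return f"bbox {min(xs)} {min(ys)} {max(xs)} {max(ys)}"


def azure_to_hocr(result, page_width, page_height):
    """Build a one-page hOCR document (hypothetical helper, not an OCRmyPDF API)."""
    lines = []
    for i, line in enumerate(result["recognitionResult"]["lines"], start=1):
        words = [
            f'<span class="ocrx_word" id="word_{i}_{j}" '
            f'title="{quad_to_bbox(w["boundingBox"])}">{escape(w["text"])}</span>'
            for j, w in enumerate(line.get("words", []), start=1)
            if "boundingBox" in w  # words without geometry are skipped
        ]
        lines.append(
            f'<span class="ocr_line" id="line_{i}" '
            f'title="{quad_to_bbox(line["boundingBox"])}">{" ".join(words)}</span>'
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        '<head><meta name="ocr-capabilities" '
        'content="ocr_page ocr_par ocr_line ocrx_word"/></head>\n'
        f'<body><div class="ocr_page" id="page_1" '
        f'title="bbox 0 0 {page_width} {page_height}">\n'
        '<p class="ocr_par" id="par_1">\n' + "\n".join(lines) + "\n</p>\n"
        "</div></body>\n</html>\n"
    )


hocr = azure_to_hocr(SAMPLE, page_width=1000, page_height=1400)
print(hocr)
```

The page size has to come from the image that was sent to the service, since hOCR coordinates are in pixels of that image.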

Another alternative could be https://github.com/JaidedAI/EasyOCR, but it outputs only a simple list. I think that could be converted to hOCR easily.
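
A possible sketch of that conversion, assuming the list holds `(corners, text, confidence)` tuples where `corners` is four `[x, y]` points, which is the shape EasyOCR's `readtext` returns by default. The helper name and the sample detections are made up for illustration. Each detection is treated as one line containing one word, which is a simplification because a detection may span several words.

```python
from html import escape


def detections_to_hocr_lines(detections):
    """Return one hOCR ocr_line span per detection (hypothetical helper)."""
    spans = []
    for i, (corners, text, confidence) in enumerate(detections, start=1):
        xs = [round(x) for x, _ in corners]
        ys = [round(y) for _, y in corners]
        bbox = f"bbox {min(xs)} {min(ys)} {max(xs)} {max(ys)}"
        # hOCR expresses word confidence as an integer percentage (x_wconf).
        wconf = round(confidence * 100)
        spans.append(
            f'<span class="ocr_line" id="line_{i}" title="{bbox}">'
            f'<span class="ocrx_word" id="word_{i}_1" '
            f'title="{bbox}; x_wconf {wconf}">{escape(text)}</span></span>'
        )
    return spans


# Made-up detections in the assumed shape.
example = [
    ([[10, 20], [110, 20], [110, 50], [10, 50]], "Dear Sir", 0.98),
    ([[12, 60], [90, 61], [90, 88], [12, 87]], "R&D", 0.75),
]
for span in detections_to_hocr_lines(example):
    print(span)
```

The spans would still need to be wrapped in an `ocr_page` element that carries the page size.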

All3xJ · 2022-02-11T14:42:59Z

I asked in #915 about the Google API, but it is the same question as yours. The ideal would be to grab the information from these websites' APIs and then "paste" it inside the PDF as invisible text, like OCRmyPDF already does.

kkrell2016 · 2022-08-10T00:29:00Z

> I asked in #915 about the Google API, but it is the same question as yours. The ideal would be to grab the information from these websites' APIs and then "paste" it inside the PDF as invisible text, like OCRmyPDF already does.

It's not that hard to implement. I have some very basic Python code that uses the Google Vision API to get better results. For the orientation of the page I use Tesseract, because it's the easiest way. But for text recognition the image is sent to the Vision API, the JSON response gets converted to hOCR, and you have your text layer.

isspid · 2023-03-29T13:58:58Z

@kkrell2016 May I ask how the conversion from JSON to hOCR happens? Have you written your own script for that purpose?

kkrell2016 · 2023-03-29T19:05:14Z

@isspid I found a project called gcv2hocr and combined it with some custom Python code. The custom Python script can be run as a plugin in ocrmypdf. I also uploaded it to my GitHub; it should be publicly available. I had to modify gcv2hocr a bit to make it work with the current Google Vision API.

If you have any questions, please contact me.

RAbraham · 2023-06-28T20:01:58Z

I think the above plugin interface (e.g. generate_pdf(input_file, output_pdf, output_text, options)) will be called for each page in the PDF rather than once for the whole PDF? Is there an interface that receives the entire PDF, so that we can return a list of hOCR files generated from an Azure OCR result, or a single hOCR file covering many pages (if that is possible in the hOCR format; I still have to learn)?

Then I can call

    # something like this for multiple pages?
    from ocrmypdf import hocrtransform

    helper = hocrtransform.HocrTransform(
        hocr_filename=hocr_file,  # or list_of_hocr_files
        dpi=150,
    )

    helper.to_pdf(out_filename=output_pdf)  # a multi-page PDF

Our use case is that we send a batch of pages to Azure OCR (otherwise it would be very slow for us to process many pages) and it returns an Azure OCR result object for all the pages. I can loop through each page object of the result and generate either a list of hOCR objects (one per page) or a single hOCR object (if that's possible).

RAbraham · 2023-06-28T20:25:42Z

I guess I can call the Azure engine in the global part of the file, cache the result, and then pick it up from there when generate_pdf is called. But how do I know which key in the cache to pick? For example, if I key the cache by page number, I won't know from generate_pdf which page it is for, as it does not provide a page number, iiuc?
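
One way this caching idea could be sketched, under a loud assumption: that the per-page path handed to the hook starts with a zero-padded page number in its file name (for example `000003_ocr.png`). Nothing in this thread confirms that naming, so it would need to be checked against the files the hook actually receives. `BatchCache`, `page_number_from_path`, and `fetch_batch` are hypothetical names for this example.

```python
import re
from pathlib import Path


def page_number_from_path(path):
    """Return the leading integer in the file name, or None if there is none."""
    match = re.match(r"(\d+)", Path(path).name)
    return int(match.group(1)) if match else None


class BatchCache:
    """Run one batch OCR request lazily and serve results page by page."""

    def __init__(self, fetch_batch):
        # fetch_batch: callable returning {page_number: per-page result}
        self._fetch_batch = fetch_batch
        self._pages = None

    def get(self, page_path):
        if self._pages is None:  # first request triggers the batch call
            self._pages = self._fetch_batch()
        page_no = page_number_from_path(page_path)
        if page_no is None:
            raise KeyError(f"no page number in {page_path!r}")
        return self._pages[page_no]


# Stand-in for the real batch request; returns made-up per-page results.
def fake_fetch_batch():
    return {1: "hocr for page 1", 2: "hocr for page 2", 3: "hocr for page 3"}


cache = BatchCache(fake_fetch_batch)
print(cache.get("/tmp/work/000003_ocr.png"))
```

If pages are handled in separate worker processes, an in-memory cache like this may not be shared between them, so the batch result might have to be written to disk instead.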

I noticed generate_pdfa. Would that be useful here? I could then call the helper above in a loop and merge the single-page files produced by helper.to_pdf?

shamoon · 2024-03-02T20:07:25Z

Curious about all this myself. Does anyone have a working example of converting e.g. the Azure output to hOCR?

deajan · 2024-03-15T12:12:15Z

@shamoon Looks like we found the same thread ^^
I'm currently trying to make EasyOCR compatible with paperless-ngx; see paperless-ngx/paperless-ngx#6056 (reply in thread).
I found an Azure-to-hOCR script, and will probably write my own for EasyOCR.

hcoona · 2024-06-27T02:37:29Z

Although the documentation says Azure can produce hOCR, I cannot find any workable solution.

Doc: https://learn.microsoft.com/en-us/azure/architecture/solution-ideas/articles/cognitive-search-with-skillsets#azure-computer-vision

Issue: Azure-Samples/cognitive-services-REST-api-samples#109

hcoona · 2024-06-27T02:38:36Z

This may help: https://learn.microsoft.com/en-us/samples/azure-samples/azure-search-power-skills/azure-hocr-generator-sample/

code: https://github.com/azure-samples/azure-search-power-skills/tree/main/Vision/HocrGenerator

ThioJoe · 2024-11-28T00:49:18Z

I have an hOCR file and a PDF file, but how do I actually apply the hOCR file to the PDF?

It came from Google Cloud via this: https://cloud.google.com/document-ai/docs/samples/documentai-toolbox-document-to-hocr

jonashaag · 2024-12-16T19:44:13Z

@jbarlow83 I'm considering building Azure support for OCRmyPDF. Would you recommend going the hOCR route or the direct-to-PDF route from the EasyOCR plugin?

ThioJoe · 2024-12-17T18:52:36Z

> @jbarlow83 I'm considering building Azure support for OCRmyPDF. Would you recommend going the hOCR route or the direct-to-PDF route from the EasyOCR plugin?

Personally I say just direct to PDF. The PDF standard is already a nightmare as it is without having to figure out how to even create the hOCR file in the first place 💀

sandipan1 added the enhancement label Jul 20, 2020

PackElend mentioned this issue Dec 23, 2020

PDF support for embed words JaidedAI/EasyOCR#49

Closed

All3xJ mentioned this issue Feb 11, 2022

Add the possibility to OCR in external websites to have more accuracy #915

Closed
