
Content recognition (NVDA+R): let users choose from a list of available content recognition engines/strategies #17406

Open
josephsl opened this issue Nov 16, 2024 · 15 comments
Labels: feature, feature/ocr, p4, triaged

Comments

@josephsl
Collaborator

Hi,

Follow-up to #17405:

Background:

In addition to Windows OCR, add-ons have been developed that offer alternative content recognition strategies, such as the use of online AI/language model services. However, to use these third-party strategies, users must rely on the interfaces and commands provided by the add-ons themselves.

### Is your feature request related to a problem? Please describe.
At the moment NVDA+R (content recognition) relies on Windows OCR to perform text recognition.

### Describe the solution you'd like

Allow users to choose between different strategies when performing content recognition, enabling scenarios such as pressing NVDA+R to recognize content via a third-party service.

### Describe alternatives you've considered

Leave the experience as is.

### Additional context

There may be cases where Windows OCR is not optimal despite being a built-in Windows feature (exposed through the Universal Windows Platform (UWP) API). Therefore, let users choose the recognition strategy that best suits their needs and context.
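
To make the idea concrete, here is a minimal, purely hypothetical sketch of what a pluggable recognition strategy registry could look like; none of these names are existing NVDA APIs, they only illustrate NVDA+R dispatching to whichever strategy the user selected:

```python
# Hypothetical sketch: a registry of content recognition strategies.
# None of these names are existing NVDA APIs.
from typing import Callable, Dict

# Maps a strategy ID to a callable that takes image bytes and returns text.
_RECOGNIZERS: Dict[str, Callable[[bytes], str]] = {}

def register_recognizer(strategy_id: str, recognize: Callable[[bytes], str]) -> None:
    """Called by NVDA core or by an add-on to expose a recognition strategy."""
    _RECOGNIZERS[strategy_id] = recognize

def recognize_with(strategy_id: str, image: bytes) -> str:
    """What NVDA+R would ultimately call, using the user's chosen strategy."""
    try:
        recognize = _RECOGNIZERS[strategy_id]
    except KeyError:
        raise ValueError(f"No recognition strategy registered as {strategy_id!r}")
    return recognize(image)

# Built-in Windows OCR would be registered by NVDA itself...
register_recognizer("windowsOcr", lambda image: "text from Windows OCR (stub)")
# ...while an add-on wrapping an online service could register its own strategy.
register_recognizer("onlineService", lambda image: "text from an online service (stub)")

print(recognize_with("windowsOcr", b"\x89PNG..."))
```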

Thanks.

@Adriani90
Collaborator

Great proposal. I propose a list which appears when pressing NVDA+R and which is populated with all available recognition services. When pressing enter on a service, recognition starts according to the settings for that specific service.

@seanbudd added the p4, feature, feature/ocr and triaged labels on Nov 19, 2024
@Emil-18
Contributor

Emil-18 commented Nov 19, 2024

@Adriani90 I would personally prefer to select the recognition engine in settings instead. That way, you can perform OCR on inaccessible objects that disappear when losing focus. An alternative would be to make the selection menu virtual as well.

@Adriani90
Collaborator

Adriani90 commented Nov 19, 2024 via email

@CyrilleB79
Collaborator

I have also had the feeling that displaying a menu could have side effects due to focus changes, but I have not found concrete real-world examples.

@Emil-18, could you describe a concrete example of inaccessible objects that disappear when losing focus?

Also, I wonder if the following would not be better:

  • Gather content recognition engines by category, e.g. OCR, image description, etc.

  • For each category in the Recognition settings, allow the user to select the engine to use, e.g. for OCR, choose between Windows OCR and Tesseract (if that add-on is installed).

  • Assign one gesture per engine type, e.g. NVDA+R for OCR, NVDA+I for image recognition, etc.
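
As a rough illustration of this proposal, assuming entirely hypothetical setting names and engine identifiers (nothing here is an existing NVDA API or configuration key):

```python
# Hypothetical sketch of "one gesture per recognizer category".
from typing import Dict

# What a "Recognition" settings category might persist: one selected
# engine per recognizer category.
recognition_settings: Dict[str, str] = {
    "ocr": "windowsOcr",                        # used by NVDA+R
    "imageDescription": "aiContentDescriber",   # used by e.g. NVDA+I
}

# Engines that NVDA core and add-ons have made available, per category.
available_engines: Dict[str, Dict[str, str]] = {
    "ocr": {"windowsOcr": "Windows OCR", "tesseract": "Tesseract (add-on)"},
    "imageDescription": {"aiContentDescriber": "AI Content Describer (add-on)"},
}

def run_recognition(category: str) -> None:
    """What a gesture such as NVDA+R (ocr) or NVDA+I (imageDescription) would do."""
    engine_id = recognition_settings[category]
    label = available_engines[category][engine_id]
    print(f"Recognizing via {label} ({category})")

run_recognition("ocr")               # NVDA+R
run_recognition("imageDescription")  # NVDA+I
```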

@Adriani90
Collaborator

@CyrilleB79 would it not be better to have a cycling gesture and choose from a list in the content recognition settings which services should be included in the cycling command? Similar to speech modes?
In my view this would be more comfortable and would not require as many keystrokes. I can imagine that the number of OCR services can become larger in the future.

@CyrilleB79
Collaborator

> @CyrilleB79 would it not be better to have a cycling gesture and choose from a list in the content recognition settings which services should be included in the cycling command? Similar to speech modes? In my view this would be more comfortable and would not require as many keystrokes. I can imagine that the number of OCR services can become larger in the future.

In my proposal, having many OCRs in the future would not be a problem: with NVDA+R, I will run my preferred OCR. I guess that the majority of people will have one preferred OCR and use only that one. If a minority of people use two different OCRs, e.g. Windows OCR for general images on the web and Tesseract for specific images (e.g. containing tables), they can still use profiles.

But as for recognizer types, I do not expect their number to grow much. I have only two recognizer types in mind today: OCR and image description. Thus, this makes only two gestures.

@Adriani90
Collaborator

Adriani90 commented Nov 20, 2024

@CyrilleB79 I see your point; in general it is OK to have these gestures, but why assign multiple gestures when you can have only one which provides the same functionality by cycling?

  • It is easier from a UX perspective.
  • Users do not have to bother as much with the Input gestures dialog, which is already quite crowded.

I still think we should try to make processes as easy as possible while minimizing the number of keystrokes people have to keep in their minds.

Note that the following OCR services all provide APIs:
Google Vision
Free online OCR API
OCRopus
AWS
Calamari
Clarifai
DocuClipper
Optiic
ocr.space
Baidu OCR

to name only a few of the services available, apart from those you already mentioned. And if we start with two services, the community will definitely try out and propose new services in the future.

@CyrilleB79
Collaborator

@Adriani90, it seems that you do not fully understand my point.

There would not be multiple gestures to assign. You seem to be mixing up recognition types with OCR services.
OCR is one recognizer type and image description is a second one. I suggest keeping NVDA+R for OCR and, for example, NVDA+I for image description. This makes only two gestures, not many.

In the settings dialog, you will have two combo-boxes:

  • The OCR engine/service combo-box, to select the OCR service you want to use when pressing NVDA+R. In this list you will have, for example: Windows OCR, Tesseract, ocr.space, Baidu OCR, etc. (depending on the add-ons you have installed, of course).
  • The Image description engine/service combo-box, to select the image description service that will be launched upon an image description request with NVDA+I. In this list, you will have, for example: AIContentDescriber (add-on), XPoseImage Captioner (add-on), Google Vision, etc.

Depending on the selection in the combo-boxes, other controls may appear to configure the selected recognizer's settings.
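
A rough sketch of how such a settings panel could look, following the usual NVDA SettingsPanel pattern for add-ons (only meaningful inside NVDA; the panel, its choices and the idea of persisting the selection are all hypothetical, not an existing NVDA feature):

```python
# Hypothetical "Content recognition" settings panel with the two combo-boxes
# described above. The panel itself, its choices and any config keys are made up.
import wx
from gui import guiHelper
from gui.settingsDialogs import SettingsPanel


class ContentRecognitionPanel(SettingsPanel):
    title = "Content recognition"

    def makeSettings(self, settingsSizer):
        sHelper = guiHelper.BoxSizerHelper(self, sizer=settingsSizer)
        # In a real implementation, these lists would be collected from NVDA core and add-ons.
        self.ocrChoice = sHelper.addLabeledControl(
            "OCR engine/service (used by NVDA+R):",
            wx.Choice,
            choices=["Windows OCR", "Tesseract (add-on)", "ocr.space (add-on)"],
        )
        self.ocrChoice.SetSelection(0)
        self.descriptionChoice = sHelper.addLabeledControl(
            "Image description engine/service (used by NVDA+I):",
            wx.Choice,
            choices=["AIContentDescriber (add-on)", "XPoseImage Captioner (add-on)"],
        )
        self.descriptionChoice.SetSelection(0)

    def onSave(self):
        # Persisting the selections is omitted: the configuration keys for this
        # proposal do not exist yet.
        pass

# An add-on would normally register the panel with:
# gui.settingsDialogs.NVDASettingsDialog.categoryClasses.append(ContentRecognitionPanel)
```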

@Adriani90
Collaborator

Adriani90 commented Nov 20, 2024 via email

@Adriani90
Collaborator

We could then cycle between available engines with NVDA+shift+R for OCR or NVDA+shift+I for image description.

@CyrilleB79
Collaborator

> We could then cycle between available engines with NVDA+shift+R for OCR or NVDA+shift+I for image description.

I agree that creating a cycling command for OCR engines and another one for image description engines could be useful for some. Though, I'd keep them unassigned by default, because my assumption is that people who use two engines of the same type (e.g. Windows OCR and Tesseract OCR) are a minority.
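
A minimal sketch of what such a cycling command could do, similar to how speech modes cycle; the engine names and settings structure are assumed for illustration and are not existing NVDA APIs:

```python
# Hypothetical sketch of an (unassigned by default) cycling command per
# recognizer category. Engine lists and names are illustrative only.
from typing import Dict, List

available: Dict[str, List[str]] = {
    "ocr": ["Windows OCR", "Tesseract (add-on)"],
    "imageDescription": ["AIContentDescriber (add-on)", "XPoseImage Captioner (add-on)"],
}
selected_index: Dict[str, int] = {"ocr": 0, "imageDescription": 0}

def cycle_engine(category: str) -> str:
    """What e.g. NVDA+shift+R (ocr) or NVDA+shift+I (imageDescription) could do:
    move to the next available engine, wrapping around, and report the new selection."""
    engines = available[category]
    selected_index[category] = (selected_index[category] + 1) % len(engines)
    return engines[selected_index[category]]

print(cycle_engine("ocr"))  # -> "Tesseract (add-on)"
print(cycle_engine("ocr"))  # -> "Windows OCR" (wraps around)
```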

@ruifontes
Contributor

Just a comment...
Tesseract OCR, for now, does not recognize screen content...
Maybe someday, but I don't foresee when.

@cary-rowen
Contributor

Hi @CyrilleB79,
For LLM-based image description, the user may wish to provide a prompt. Could this be considered a third type?

@cary-rowen
Contributor

Add-ons relevant to this discussion:
https://github.com/huaiyinfeilong/xyOCR

@CyrilleB79
Collaborator

In my proposal, the prompt is not a third type of recognition service. It's just a parameter (setting) of one of them. It's worth noting that each service should be able to expose its own specific parameters, e.g.:

  • For OCR services: language(s) if the OCR is multilingual, whether to try to detect text orientation, etc.
  • For image description: prompt, level of detail.
  • For online services (OCR or image description): API key.

I wonder whether auto-update could remain a global parameter or not. Maybe not, if one wants OCR to auto-update but the image description service not to.
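
A small sketch of how per-service parameters, including a per-category auto-update flag, might be modelled; all field names here are assumptions for illustration, not existing NVDA settings:

```python
# Hypothetical per-service settings, with auto-update kept per category
# rather than global. All field names are made up.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class OcrServiceSettings:
    languages: List[str] = field(default_factory=lambda: ["en"])
    detectTextOrientation: bool = False
    autoRefresh: bool = False          # per-category rather than global
    apiKey: Optional[str] = None       # only needed for online services

@dataclass
class ImageDescriptionSettings:
    prompt: str = "Describe this image."
    detailLevel: str = "medium"        # e.g. "low", "medium", "high"
    autoRefresh: bool = False
    apiKey: Optional[str] = None

windows_ocr = OcrServiceSettings(languages=["en", "fr"], autoRefresh=True)
online_describer = ImageDescriptionSettings(detailLevel="high", apiKey="...")
print(windows_ocr, online_describer, sep="\n")
```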
