
Content recognition (NVDA+R): let users choose from a list of available content recognition engines/strategies #17406

Open
josephsl opened this issue Nov 16, 2024 · 15 comments
Labels: feature, feature/ocr, p4, triaged

Comments

@josephsl
Collaborator

Hi,

Follow-up to #17405:

Background:

In addition to Windows OCR, add-ons have been developed that offer alternative content recognition strategies, such as the use of online AI/language model services. However, to use these third-party strategies, users must rely on the interfaces and commands provided by the add-ons themselves.

### Is your feature request related to a problem? Please describe.
At the moment NVDA+R (content recognition) relies on Windows OCR to perform text recognition.

### Describe the solution you'd like

Allow users to choose between different strategies when performing content recognition, enabling scenarios such as pressing NVDA+R to recognize content via a third-party service.

### Describe alternatives you've considered

Leave the experience as is.

### Additional context

There may be cases where Windows OCR is not optimal despite being a built-in Windows feature (exposed through the Universal Windows Platform (UWP) API). Therefore, let users choose the recognition strategy that best suits their needs and context.
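
To make the idea concrete, here is a minimal, purely hypothetical sketch of what a pluggable recognition strategy registry could look like; none of these names are existing NVDA APIs, they only illustrate NVDA+R dispatching to whichever strategy the user selected:

```python
# Hypothetical sketch: a registry of content recognition strategies.
# None of these names are existing NVDA APIs.
from typing import Callable, Dict

# Maps a strategy ID to a callable that takes image bytes and returns text.
_RECOGNIZERS: Dict[str, Callable[[bytes], str]] = {}

def register_recognizer(strategy_id: str, recognize: Callable[[bytes], str]) -> None:
    """Called by NVDA core or by an add-on to expose a recognition strategy."""
    _RECOGNIZERS[strategy_id] = recognize

def recognize_with(strategy_id: str, image: bytes) -> str:
    """What NVDA+R would ultimately call, using the user's chosen strategy."""
    try:
        recognize = _RECOGNIZERS[strategy_id]
    except KeyError:
        raise ValueError(f"No recognition strategy registered as {strategy_id!r}")
    return recognize(image)

# Built-in Windows OCR would be registered by NVDA itself...
register_recognizer("windowsOcr", lambda image: "text from Windows OCR (stub)")
# ...while an add-on wrapping an online service could register its own strategy.
register_recognizer("onlineService", lambda image: "text from an online service (stub)")

print(recognize_with("windowsOcr", b"\x89PNG..."))
```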

Thanks.

@Adriani90
Collaborator

Great proposal. I propose a list which appears when pressing NVDA+R and which is populated with all available recognition services. When pressing enter on a service, recognition starts according to the settings for that specific service.

@seanbudd added the p4, feature, feature/ocr and triaged labels on Nov 19, 2024
@Emil-18
Contributor

Emil-18 commented Nov 19, 2024

@Adriani90 I would personally prefer to select the recognition engine in settings instead. That way, you can perform OCR on inaccessible objects that disappear when losing focus. An alternative would be to make the selection menu virtual as well.

@Adriani90
Collaborator

Adriani90 commented Nov 19, 2024 via email

@CyrilleB79
Collaborator

I have also had the feeling that displaying a menu could have side effects due to focus changes, but I have not found concrete real-world examples.

@Emil-18, could you describe a concrete example of inaccessible objects that disappear when losing focus?

Also, I wonder if the following would not be better:

  • Gather content recognition engines by category, e.g. OCR, image description, etc.

  • For each category in the Recognition settings, allow the user to select the engine to use, e.g. for OCR, choose between Windows OCR and Tesseract (if that add-on is installed).

  • Assign one gesture per engine type, e.g. NVDA+R for OCR, NVDA+I for image recognition, etc.
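
As a rough illustration of this proposal, assuming entirely hypothetical setting names and engine identifiers (nothing here is an existing NVDA API or configuration key):

```python
# Hypothetical sketch of "one gesture per recognizer category".
from typing import Dict

# What a "Recognition" settings category might persist: one selected
# engine per recognizer category.
recognition_settings: Dict[str, str] = {
    "ocr": "windowsOcr",                        # used by NVDA+R
    "imageDescription": "aiContentDescriber",   # used by e.g. NVDA+I
}

# Engines that NVDA core and add-ons have made available, per category.
available_engines: Dict[str, Dict[str, str]] = {
    "ocr": {"windowsOcr": "Windows OCR", "tesseract": "Tesseract (add-on)"},
    "imageDescription": {"aiContentDescriber": "AI Content Describer (add-on)"},
}

def run_recognition(category: str) -> None:
    """What a gesture such as NVDA+R (ocr) or NVDA+I (imageDescription) would do."""
    engine_id = recognition_settings[category]
    label = available_engines[category][engine_id]
    print(f"Recognizing via {label} ({category})")

run_recognition("ocr")               # NVDA+R
run_recognition("imageDescription")  # NVDA+I
```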

@Adriani90
Collaborator

@CyrilleB79 would it not be better to have a cycling gesture and choose from a list in the content recognition settings which services should be included in the cycling command? Similar to speech modes?
In my view this would be more comfortable and would not require as many keystrokes. I can imagine that the number of OCR services can become larger in the future.

@CyrilleB79
Collaborator

> @CyrilleB79 would it not be better to have a cycling gesture and choose from a list in the content recognition settings which services should be included in the cycling command? Similar to speech modes? In my view this would be more comfortable and would not require as many keystrokes. I can imagine that the number of OCR services can become larger in the future.

In my proposal, having many OCRs in the future would not be a problem: with NVDA+R, I will run my preferred OCR. I guess that the majority of people will have one preferred OCR and use only that one. If a minority of people use two different OCRs, e.g. Windows OCR for general images on the web and Tesseract for specific images (e.g. containing tables), they can still use profiles.

But as for recognizer types, I do not expect their number to grow much. I have only two recognizer types in mind today: OCR and image description. Thus, this makes only two gestures.

@Adriani90
Collaborator

Adriani90 commented Nov 20, 2024

@CyrilleB79 I see your point; in general it is OK to have these gestures, but why assign multiple gestures when you can have only one which provides the same functionality by cycling?

  • It is easier from a UX perspective.
  • Users do not have to bother as much with the Input gestures dialog, which is already quite crowded.

I still think we should try to make processes as easy as possible while minimizing the number of keystrokes people have to keep in their minds.

Note that the following OCR services all provide APIs:
Google Vision
Free online OCR API
OCRopus
AWS
Calamari
Clarifai
DocuClipper
Optiic
ocr.space
Baidu OCR

to name only a few of the services available, apart from those you already mentioned. And if we start with two services, the community will definitely try out and propose new services in the future.

@CyrilleB79
Collaborator

@Adriani90, it seems that you do not fully understand my point.

There would not be multiple gestures to assign. You seem to be mixing up recognition types with OCR services.
OCR is one recognizer type and image description is a second one. I suggest keeping NVDA+R for OCR and, for example, NVDA+I for image description. This makes only two gestures, not many.

In the settings dialog, you will have two combo-boxes:

  • The OCR engine/service combo-box, to select the OCR service you want to use when pressing NVDA+R. In this list you will have, for example: Windows OCR, Tesseract, ocr.space, Baidu OCR, etc. (depending on the add-ons you have installed, of course).
  • The Image description engine/service combo-box, to select the image description service that will be launched upon an image description request with NVDA+I. In this list, you will have, for example: AIContentDescriber (add-on), XPoseImage Captioner (add-on), Google Vision, etc.

Depending on the selection in the combo-boxes, other controls may appear to configure the selected recognizer's settings.
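
A rough sketch of how such a settings panel could look, following the usual NVDA SettingsPanel pattern for add-ons (only meaningful inside NVDA; the panel, its choices and the idea of persisting the selection are all hypothetical, not an existing NVDA feature):

```python
# Hypothetical "Content recognition" settings panel with the two combo-boxes
# described above. The panel itself, its choices and any config keys are made up.
import wx
from gui import guiHelper
from gui.settingsDialogs import SettingsPanel


class ContentRecognitionPanel(SettingsPanel):
    title = "Content recognition"

    def makeSettings(self, settingsSizer):
        sHelper = guiHelper.BoxSizerHelper(self, sizer=settingsSizer)
        # In a real implementation, these lists would be collected from NVDA core and add-ons.
        self.ocrChoice = sHelper.addLabeledControl(
            "OCR engine/service (used by NVDA+R):",
            wx.Choice,
            choices=["Windows OCR", "Tesseract (add-on)", "ocr.space (add-on)"],
        )
        self.ocrChoice.SetSelection(0)
        self.descriptionChoice = sHelper.addLabeledControl(
            "Image description engine/service (used by NVDA+I):",
            wx.Choice,
            choices=["AIContentDescriber (add-on)", "XPoseImage Captioner (add-on)"],
        )
        self.descriptionChoice.SetSelection(0)

    def onSave(self):
        # Persisting the selections is omitted: the configuration keys for this
        # proposal do not exist yet.
        pass

# An add-on would normally register the panel with:
# gui.settingsDialogs.NVDASettingsDialog.categoryClasses.append(ContentRecognitionPanel)
```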

@Adriani90
Collaborator

Adriani90 commented Nov 20, 2024 via email

@Adriani90
Collaborator

We could then cycle between available engines with NVDA+shift+R for OCR or NVDA+shift+I for image description.

@CyrilleB79
Collaborator

> We could then cycle between available engines with NVDA+shift+R for OCR or NVDA+shift+I for image description.

I agree that creating a cycling command for OCR engines and another one for image description engines could be useful for some. Though, I'd keep them unassigned by default, because my assumption is that people who use two engines of the same type (e.g. Windows OCR and Tesseract OCR) are a minority.
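
A minimal sketch of what such a cycling command could do, similar to how speech modes cycle; the engine names and settings structure are assumed for illustration and are not existing NVDA APIs:

```python
# Hypothetical sketch of an (unassigned by default) cycling command per
# recognizer category. Engine lists and names are illustrative only.
from typing import Dict, List

available: Dict[str, List[str]] = {
    "ocr": ["Windows OCR", "Tesseract (add-on)"],
    "imageDescription": ["AIContentDescriber (add-on)", "XPoseImage Captioner (add-on)"],
}
selected_index: Dict[str, int] = {"ocr": 0, "imageDescription": 0}

def cycle_engine(category: str) -> str:
    """What e.g. NVDA+shift+R (ocr) or NVDA+shift+I (imageDescription) could do:
    move to the next available engine, wrapping around, and report the new selection."""
    engines = available[category]
    selected_index[category] = (selected_index[category] + 1) % len(engines)
    return engines[selected_index[category]]

print(cycle_engine("ocr"))  # -> "Tesseract (add-on)"
print(cycle_engine("ocr"))  # -> "Windows OCR" (wraps around)
```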

@ruifontes
Contributor

Just a comment...
Tesseract OCR, for now, does not recognize screen content...
Maybe someday, but I don't foresee when.

@cary-rowen
Contributor

Hi @CyrilleB79,
For LLM-based image description, the user may wish to provide a prompt. Could this be considered a third type?

@cary-rowen
Contributor

Add-ons relevant to this discussion:
https://github.com/huaiyinfeilong/xyOCR

@CyrilleB79
Collaborator

In my proposal, the prompt is not a third type of recognition service. It's just a parameter (setting) of one of them. It's worth noting that each service should be able to expose its own specific parameters, e.g.:

  • For OCR services: language(s) if the OCR is multilingual, whether to try to detect text orientation, etc.
  • For image description: prompt, level of detail.
  • For online services (OCR or image description): API key.

I wonder whether auto-update could remain a global parameter or not. Maybe not, if one wants OCR to auto-update but the image description service not to.
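
A small sketch of how per-service parameters, including a per-category auto-update flag, might be modelled; all field names here are assumptions for illustration, not existing NVDA settings:

```python
# Hypothetical per-service settings, with auto-update kept per category
# rather than global. All field names are made up.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class OcrServiceSettings:
    languages: List[str] = field(default_factory=lambda: ["en"])
    detectTextOrientation: bool = False
    autoRefresh: bool = False          # per-category rather than global
    apiKey: Optional[str] = None       # only needed for online services

@dataclass
class ImageDescriptionSettings:
    prompt: str = "Describe this image."
    detailLevel: str = "medium"        # e.g. "low", "medium", "high"
    autoRefresh: bool = False
    apiKey: Optional[str] = None

windows_ocr = OcrServiceSettings(languages=["en", "fr"], autoRefresh=True)
online_describer = ImageDescriptionSettings(detailLevel="high", apiKey="...")
print(windows_ocr, online_describer, sep="\n")
```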
