First of all, thank you for creating marker - it's an excellent and very useful tool for PDF text extraction. The quality of the OCR and the overall functionality is impressive.
While using this great tool, I noticed a couple of areas where the user experience could potentially be enhanced:
Language Code Documentation Inconsistency:
In the README.md under "Convert a single file" > Options, the example shows:
--languages TEXT : Optionally specify which languages to use for OCR processing.
Example: --languages "eng,fra,deu" for English, French, and German.
However, Surya OCR actually uses "en" (not "eng"), resulting in:
KeyError: 'eng'
The "here" link in the documentation lists "en" as the correct code, which contradicts the README example and causes confusion.
Process Flow Optimization Opportunity:
Current behavior with large documents (e.g., 700+ pages):
User runs command with incorrect language code
System performs full bbox detection process (taking several minutes)
Error about invalid language code is reported only after this process
As a result, users wait several minutes before learning about a simple parameter error.
Suggested Improvements:
Documentation:
Either update README to use "en" instead of "eng"
Or modify Surya's language mapping to accept both codes
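To illustrate the second option, here is a minimal sketch of accepting both code styles by normalizing three-letter codes to two-letter ones before lookup. The alias table and the function names `normalize_language` and `normalize_languages` are hypothetical and only cover the codes mentioned in this issue; this is not Surya's actual language mapping.

```python
# Hypothetical alias table: three-letter codes mapped to two-letter codes.
# Illustrative subset only, not Surya's real language table.
LANGUAGE_ALIASES = {
    "eng": "en",
    "fra": "fr",
    "deu": "de",
}


def normalize_language(code: str) -> str:
    """Return the two-letter form of a code, leaving unknown codes unchanged."""
    cleaned = code.strip().lower()
    return LANGUAGE_ALIASES.get(cleaned, cleaned)


def normalize_languages(argument: str) -> list[str]:
    """Split a comma-separated --languages value and normalize each entry."""
    return [normalize_language(part) for part in argument.split(",") if part.strip()]


if __name__ == "__main__":
    print(normalize_languages("eng,fra,deu"))
```

With a step like this in front of the lookup, the README example `--languages "eng,fra,deu"` would resolve to the two-letter codes instead of raising `KeyError: 'eng'`.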
Process Flow:
Add early validation for command arguments before processing starts
Validate:
Language codes
File paths
Other parameter syntax
Expected behavior:
$ marker_single input.pdf --languages "eng"
Error: Invalid language code "eng". Available codes are: "en" (English), "de" (German), etc.
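The early validation could look roughly like the sketch below, which checks the file path and language codes before any detection work starts. The `SUPPORTED_LANGUAGES` table and the functions `check_languages` and `check_arguments` are assumptions made for illustration; they are not part of marker or Surya, and the real list of supported codes would come from Surya itself.

```python
import os

# Hypothetical subset of supported codes, for illustration only.
SUPPORTED_LANGUAGES = {"en": "English", "de": "German", "fr": "French"}


def check_languages(argument: str) -> list[str]:
    """Return one error message per unsupported code in a comma-separated value."""
    available = ", ".join(
        f'"{code}" ({name})' for code, name in sorted(SUPPORTED_LANGUAGES.items())
    )
    errors = []
    for part in argument.split(","):
        code = part.strip()
        if code and code not in SUPPORTED_LANGUAGES:
            errors.append(
                f'Invalid language code "{code}". Available codes are: {available}.'
            )
    return errors


def check_arguments(pdf_path: str, languages: str) -> list[str]:
    """Collect all argument errors up front, before any expensive processing."""
    errors = []
    if not os.path.isfile(pdf_path):
        errors.append(f'File not found: "{pdf_path}".')
    errors.extend(check_languages(languages))
    return errors


if __name__ == "__main__":
    for message in check_arguments("input.pdf", "eng"):
        print(f"Error: {message}")
```

Collecting every error in one pass, rather than stopping at the first, lets the user fix all argument problems before rerunning on a large document.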
Benefits:
Even better user experience for this already great tool
Saves processing time and computational resources
Clearer documentation for new users
Would you consider implementing these improvements to further enhance this valuable tool?
Thank you again for maintaining this excellent project!
codeplay1997 changed the title from "Documentation Inconsistency: Language Code for English ("eng" vs "en")" to "Suggestions for Improving User Experience in This Great Tool" on Dec 21, 2024