Multilingual support #1699
Comments
Hi @decadance-dance 👋, Have you already tried: Depends a bit if there is any data from mindee we could use. |
Hi, @felixdittrich92 |
Ah, let's keep this issue open, there is more to do I think :) |
Happy about any feedback on how it works for you :) |
Unfortunately, we don't have such data |
@decadance-dance In general we would need the help of the community to collect documents (newspapers, receipt photos, etc.) in diverse languages (can be unlabeled). / This would need a license agreement to sign so that we can freely use this data. But I'm not sure how to trigger such an "event" 😅 @odulcy-mindee |
Hello =) |
Moreover, it could be interesting for Chinese detection models to place multiple recognition samples in the same image without overlap. This should help a Chinese detection model perform better without real detection data. |
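The idea above (several recognition crops on one canvas, none intersecting) can be sketched with simple rejection sampling. This is only an illustration under assumptions: `place_crops` and `boxes_intersect` are hypothetical helper names, not part of docTR, and crops are reduced to their `(width, height)` sizes.

```python
import random


def boxes_intersect(a, b):
    """Return True if two axis-aligned boxes (x0, y0, x1, y1) overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def place_crops(canvas_size, crop_sizes, max_tries=200, seed=0):
    """Pick a non-overlapping position for each crop on the canvas.

    Crops that do not fit, or for which no free spot is found within
    max_tries random attempts, are skipped.
    """
    rng = random.Random(seed)
    width, height = canvas_size
    placed = []
    for w, h in crop_sizes:
        if w > width or h > height:
            continue  # crop is larger than the canvas
        for _ in range(max_tries):
            x0 = rng.randint(0, width - w)
            y0 = rng.randint(0, height - h)
            box = (x0, y0, x0 + w, y0 + h)
            if not any(boxes_intersect(box, other) for other in placed):
                placed.append(box)
                break
    return placed


boxes = place_crops((1024, 1024), [(200, 40), (150, 32), (300, 48)])
```

The returned boxes double as detection ground truth for the synthetic page, which is the point of the suggestion: detection labels come for free from recognition data.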
Hi @nikokks 😃 Collecting multilingual data for detection is troublesome because it should be real data (or, if possible, really good generated data, for example from a fine-tuned FLUX model maybe!?) |
Can you estimate how much data we need to provide multilingual capabilities on the same level as English-only OCR? |
Hi @decadance-dance 👋, I think if we could collect ~100-150 different types of documents for each language we would have a good starting point (in the end the language doesn't matter, it's more about the different char sets / fonts / text sizes) - for example: In the end it's more critical to make sure that we can really use such images legally. The tricky part is the detection because we need completely real data. If we have this, the recognition part should be much easier: we could create some synthetic data and evaluate on the already collected real data. I think if we are able to collect the data by the end of January I could provide pre-labeling via Azure's Document AI. Currently missing parts are:
Lang list: https://github.com/eymenefealtun/all-words-in-all-languages |
@felixdittrich92, thank you for the detailed answer. |
@decadance-dance Not yet. Maybe the easiest would be to create a Hugging Face space for this, because from there you could also easily take pictures with your smartphone, and under the hood we push the taken or uploaded images into an HF dataset. In this case we could also add an agreement before any data can be uploaded, in which the uploader confirms that they hold all rights to the image and knowingly provide it openly to everyone who downloads the dataset. Wdyt? Again CC @odulcy-mindee :D |
I found one possible dataset of printed documents in multiple languages: Wikisource. They have text and images at the page level, originally created using some existing OCR (Google Vision/Tesseract), and the data has then been corrected/proofread by people. They have annotations to differentiate what has been proofread and what has not. An example - https://te.wikisource.org/wiki/పుట%3AAandhrakavula-charitramu.pdf/439. The license would be CC-BY-SA and I expect them to have pulled only books whose copyright has expired. Collecting fonts for various languages is a bigger problem though (because of licenses). |
Thanks @ramSeraph for sharing, I will have a look 👍 @decadance-dance @nikokks I created a space which can be used to collect some data (only raw data to start with), wdyt? https://huggingface.co/spaces/Felix92/docTR-multilingual-Datacollector Later on, once we have collected enough raw data, we can filter it and pre-label with Azure Document AI. |
Sounds good to me. Thanks |
@decadance-dance @nikokks @ramSeraph @allOther I created a request to the mindee team to provide support on this task. Would be nice if you could write a comment in the thread about your needs to support this 🙏 |
The first stage would be to improve the detection models; for the second stage, the recognition part, we could generate additional synthetic data |
Short update here: I collected ~30k samples containing: Now I need to find a way to annotate all this data - AWS Textract & Azure Document AI failed as useful pre-labeling solutions. Best results were reached with docTR/OnnxTR (detection only), but there are still too many issues to include it directly in our dataset for pretraining. |
Why did they fail? |
Detection results were really bad for many samples |
Which way of generating synthetic word text do you think is more beneficial? |
How did you evaluate them? As I understood your data is not annotated yet. |
maybe easy-ocr will work for you? |
I OCR'd some samples with Azure Document AI and Textract and wrote a script to visualize these samples. For OnnxTR I pre-labeled all files and also checked the same files manually |
Haven't tested it with this data yet, but if I remember correctly docTR was more accurate in most cases |
I would go with option b and augment a fixed part of this data (words) with low-frequency characters (like the % symbol). I did the same to train the multilingual parseq model :) |
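The augmentation described above (injecting low-frequency characters into a fixed share of the words) might look roughly like the following. This is a sketch under assumptions, not the script used for the multilingual parseq model: `augment_with_rare_chars`, its parameters, and the toy vocab are all hypothetical.

```python
import random
from collections import Counter


def augment_with_rare_chars(words, vocab, fraction=0.2, num_rare=5, seed=0):
    """Insert one rare vocab character into a fixed fraction of the words.

    "Rare" means the num_rare vocab characters that occur least often
    in the given word list (characters that never occur count as 0).
    """
    rng = random.Random(seed)
    counts = Counter(ch for word in words for ch in word)
    rare = sorted(vocab, key=lambda ch: counts[ch])[:num_rare]
    n_aug = int(len(words) * fraction)
    chosen = set(rng.sample(range(len(words)), n_aug))
    augmented = []
    for i, word in enumerate(words):
        if i in chosen:
            pos = rng.randint(0, len(word))  # any insertion point, incl. both ends
            word = word[:pos] + rng.choice(rare) + word[pos:]
        augmented.append(word)
    return augmented


words = ["invoice", "total", "amount", "date", "tax"] * 4
vocab = "abcdefghijklmnopqrstuvwxyz%€"  # toy stand-in for a real vocab string
augmented = augment_with_rare_chars(words, vocab, fraction=0.25)
```

Keeping the augmented share fixed means the word distribution stays mostly natural while every character in the vocab still gets some training signal.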
I think the only option is to label a part of the data manually -> fine tune -> pre-label -> correct and again in an iterative process 🙈😅 (really time consuming) |
🚀 The feature
Support for multiple languages (according to VOCABS["multilingual"]) in the pretrained models.
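To make the request concrete: a recognition model can only output characters in its vocab, so a quick coverage check shows which characters of a text it could never predict. The sketch below uses a short stand-in vocab string so that it runs without docTR installed; with docTR, the string would instead come from `VOCABS["multilingual"]` in `doctr.datasets`. The helper `uncovered_chars` is a hypothetical name.

```python
def uncovered_chars(text, vocab):
    """Return the sorted non-whitespace characters of text missing from vocab."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in vocab})


# Stand-in vocab (assumption): Latin letters, digits and a few German characters.
stand_in_vocab = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789äöüß"
)

missing = uncovered_chars("Größe 42 στο", stand_in_vocab)
print(missing)  # the Greek letters are reported as not covered
```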
Motivation, pitch
It would be great to have models which support multiple languages because it would significantly improve the user experience in various cases.
Alternatives
No response
Additional context
No response