Add support for HEIC files (convert to JPEG) #1745
Comments
No idea. Could you share the file you are trying to index? |
Yes, attached. After looking into this some more it does appear to index the file but it doesn't process OCR and the file isn't searchable. |
Ok. So could you run fscrawler with |
I will do this when able, probably won’t be until tomorrow. From some research it looks like tesseract doesn’t support heic. Is it possible to code fscrawler to generate a temporary jpeg file of the image so tesseract can run OCR on it, and then remove the temporary file? |
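To make the idea above concrete, here is a rough sketch of the convert, OCR, clean-up flow. It is not FSCrawler or Tika code. It assumes ImageMagick 7 (`magick`) built with HEIC support and the `tesseract` CLI are both on the PATH, and the function names `build_convert_command` and `ocr_via_temp_jpeg` are made up for illustration.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def build_convert_command(src: Path, dst: Path) -> list[str]:
    # Assumption: ImageMagick 7 is installed and was built with HEIC (libheif) support.
    return ["magick", str(src), str(dst)]


def ocr_via_temp_jpeg(src: Path, run=subprocess.run) -> str:
    """Convert src to a temporary JPEG, OCR it with tesseract, then delete the JPEG."""
    fd, tmp_name = tempfile.mkstemp(suffix=".jpg")
    os.close(fd)  # the converter writes the file itself; we only need the path
    tmp = Path(tmp_name)
    try:
        run(build_convert_command(src, tmp), check=True)
        # "stdout" as the output base makes tesseract print the recognized text.
        result = run(
            ["tesseract", str(tmp), "stdout"],
            check=True,
            capture_output=True,
            text=True,
        )
        return result.stdout
    finally:
        # Remove the temporary JPEG whether or not conversion or OCR succeeded.
        tmp.unlink(missing_ok=True)
```

Calling `ocr_via_temp_jpeg(Path("photo.heic"))` would return the recognized text, if any. The `run` parameter is only there so the external commands can be swapped out when testing.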
FSCrawler is "just" using OCR provided by Tika. So maybe you should open an issue in the Tika issue tracker for this? |
Okay, that probably isn’t the issue then. Please give me some time to get home and run the debug command before closing the issue. jpeg images process correctly and can be searched, so it seems to be an issue specific to heic. |
I think you are right with this:
But I think the best place to support such a thing is in Tika... My 2 cents |
It is indexing the characters but it seems to be not saving them or something to Elasticsearch? I'm not sure. The content doesn't show up when querying the Elasticsearch backend directly or with samba+elasticsearch. However, when using a jpeg file it works as intended.
|
Great. Could you run the same thing again with
and
|
Okay so it's definitely not parsing OCR on the images (lang detected null, etc) but everything else seems to work.
|
Not supported by tesseract. |
Any chance you could do a band-aid solution in fscrawler by generating temporary jpeg versions to scan? This would be great for heic, jpeg-xl and avif formats. It's frustrating because the Mac generates OCR output locally with Spotlight when using heic. Alternatively, is there any chance you could code fscrawler to use whatever the Mac is using to generate heic OCR? I wouldn't mind running fscrawler on my Mac and connecting remotely to the Elasticsearch server. |
Are you aware of any library which would allow this? |
I'm using the 2.10-snapshot and I'm scanning my library but it doesn't appear to be indexing heic files. I have heic added as an included filetype in the config file.
According to https://fscrawler.readthedocs.io/en/latest/user/formats.html it supports everything Tika supports.
Tika has heic/heif as a supported format. My system is also configured with support for heic/heif files.
Any idea why it's not working?