Add support for HEIC files (convert to JPEG) #1745

Open
vid-bin opened this issue Oct 24, 2023 · 13 comments
Labels
feature_request for feature request

Comments

@vid-bin

vid-bin commented Oct 24, 2023

I'm using the 2.10-SNAPSHOT build to scan my library, but it doesn't appear to index HEIC files. I have added heic as an included file type in the config file.

According to https://fscrawler.readthedocs.io/en/latest/user/formats.html it supports everything Tika supports.

Tika has heic/heif as a supported format. My system is also configured with support for heic/heif files.

Any idea why it's not working?

@vid-bin vid-bin added the check_for_bug Needs to be reproduced label Oct 24, 2023
@dadoonet
Owner

No idea. Could you share the file you are trying to index?

@vid-bin
Author

vid-bin commented Oct 24, 2023

Yes, attached.

After looking into this some more, it does appear to index the file, but it doesn't run OCR on it, so the file's text isn't searchable.

test.heic.zip

@dadoonet
Owner

Ok. So could you run fscrawler with --debug --restart options and share the full logs here?
Please have only one file in the directory to avoid too many logs ;)

@vid-bin
Author

vid-bin commented Oct 24, 2023

I will do this when able, probably won’t be until tomorrow.

From some research, it looks like Tesseract doesn't support HEIC. Would it be possible for FSCrawler to generate a temporary JPEG copy of the image so Tesseract can run OCR on it, and then remove the temporary file?
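
The workaround proposed above (convert to a temporary JPEG, run OCR, delete the temporary file) could look roughly like the following. This is only a sketch of the idea, not how FSCrawler or Tika work: the function name `ocr_via_temp_jpeg` is hypothetical, and it assumes the `heif-convert` (libheif) and `tesseract` command-line tools are installed and on the PATH.

```python
import os
import subprocess
import tempfile


def ocr_via_temp_jpeg(heic_path, run=subprocess.run, lang="eng"):
    """Convert a HEIC file to a temporary JPEG, OCR it, then delete the JPEG.

    Assumption: the `heif-convert` and `tesseract` executables are on PATH.
    `run` is injectable so the flow can be exercised without those tools.
    """
    # Reserve a temporary .jpg path that the converter will overwrite.
    fd, jpeg_path = tempfile.mkstemp(suffix=".jpg")
    os.close(fd)
    try:
        # Step 1: HEIC -> JPEG (the converter choice is an assumption).
        run(["heif-convert", heic_path, jpeg_path], check=True)
        # Step 2: OCR the JPEG; "stdout" makes tesseract print the text.
        result = run(
            ["tesseract", jpeg_path, "stdout", "-l", lang],
            check=True, capture_output=True, text=True,
        )
        return result.stdout
    finally:
        # Step 3: always remove the temporary file, even on failure.
        if os.path.exists(jpeg_path):
            os.remove(jpeg_path)
```

The `try`/`finally` keeps the cleanup step unconditional, so a failed conversion or OCR run does not leave stray JPEG files behind.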

@dadoonet
Owner

From some research, it looks like Tesseract doesn't support HEIC. Would it be possible for FSCrawler to generate a temporary JPEG copy of the image so Tesseract can run OCR on it, and then remove the temporary file?

FSCrawler is "just" using the OCR provided by Tika. So maybe you should open an issue in the Tika issue tracker for this?

@vid-bin
Author

vid-bin commented Oct 24, 2023

Okay that probably isn’t the issue then. Please give me some time to get home and run the debug command before closing the issue.

JPEG images are processed correctly and can be searched, so this seems to be an issue specific to HEIC.

@dadoonet
Owner

I think you are right with this:

From some research, it looks like Tesseract doesn't support HEIC. Would it be possible for FSCrawler to generate a temporary JPEG copy of the image so Tesseract can run OCR on it, and then remove the temporary file?

But I think the best place to support such a thing is in Tika...

My 2 cents

@vid-bin
Author

vid-bin commented Oct 25, 2023

It seems to index the file, but the extracted text doesn't appear to be saved to Elasticsearch. I'm not sure why. The content doesn't show up when querying the Elasticsearch backend directly or with samba+elasticsearch. However, with a JPEG file it works as intended.
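
One way to check this directly against Elasticsearch is to fetch the indexed document and see whether it carries any extracted text. The sketch below assumes the text lives in a top-level `content` field and that the file name is stored in `file.filename`; the helper names are hypothetical, and authentication and TLS are not handled.

```python
import json
import urllib.request


def build_query(filename):
    """Search body matching one file by the `file.filename` field."""
    return {"query": {"match": {"file.filename": filename}}, "size": 1}


def has_ocr_text(hit):
    """True if a search hit carries a non-empty `content` field."""
    content = hit.get("_source", {}).get("content")
    return bool(content and content.strip())


def fetch_first_hit(base_url, index, filename):
    """Run the search and return the first hit, or None if nothing matched."""
    req = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(build_query(filename)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        hits = json.load(resp)["hits"]["hits"]
    return hits[0] if hits else None
```

For the setup in this thread, a call might look like `fetch_first_hit("http://127.0.0.1:9200", "thisuserhere", "IMG_9543.heic")`, followed by `has_ocr_text(hit)` on the result.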

16:17:39,184 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [1003.9mb/15.6gb=6.25%], RAM [11.2gb/62.7gb=18.0%], Swap [0b/0b=0.0].
16:17:39,185 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
16:17:39,185 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
16:17:39,185 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_wpsearch_settings.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [8/_settings.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [8/_settings_folder.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [8/_wpsearch_settings.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [icloud]...
16:17:39,187 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [icloud]...
16:17:39,347 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
16:17:39,409 WARN [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
16:17:39,427 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/home/thisuserhere/Desktop/fscrawler/fscrawler-distribution-2.10-SNAPSHOT/lib/log4j-slf4j-impl-2.21.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
16:17:39,652 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version returns 7.17.10 and 7 as the major version number
16:17:39,652 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 7.17.10
16:17:39,654 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service started
16:17:39,655 WARN [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
16:17:39,656 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version
16:17:39,680 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version returns 7.17.10 and 7 as the major version number
16:17:39,680 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 7.17.10
16:17:39,680 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Elasticsearch Document Service started
16:17:39,681 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [thisuserhere]
16:17:39,689 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Error while running PUT http://127.0.0.1:9200/thisuserhere: {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [thisuserhere/Wz3K5zsZQj6-Wz7Nxq3siA] already exists","index_uuid":"Wz3K5zsZQj6-Wz7Nxq3siA","index":"thisuserhere"}],"type":"resource_already_exists_exception","reason":"index [thisuserhere/Wz3K5zsZQj6-Wz7Nxq3siA] already exists","index_uuid":"Wz3K5zsZQj6-Wz7Nxq3siA","index":"thisuserhere"},"status":400}
16:17:39,689 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Response for create index [thisuserhere]: HTTP 400 Bad Request
16:17:39,689 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [thisuserhere_folder]
16:17:39,692 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Error while running PUT http://127.0.0.1:9200/thisuserhere_folder: {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [thisuserhere_folder/w6w1RObWQ_2XClZVw2bHWA] already exists","index_uuid":"w6w1RObWQ_2XClZVw2bHWA","index":"thisuserhere_folder"}],"type":"resource_already_exists_exception","reason":"index [thisuserhere_folder/w6w1RObWQ_2XClZVw2bHWA] already exists","index_uuid":"w6w1RObWQ_2XClZVw2bHWA","index":"thisuserhere_folder"},"status":400}
16:17:39,692 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Response for create index [thisuserhere_folder]: HTTP 400 Bad Request
16:17:39,693 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [thisuserhere] for [/home/thisuserhere/storage/iCloud/temp] every [3m]
16:17:39,693 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [thisuserhere] for [/home/thisuserhere/storage/iCloud/temp] every [3m]
16:17:39,694 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [thisuserhere] is now running. Run #1...
16:17:39,700 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [/home/thisuserhere/storage/iCloud/temp] content
16:17:39,700 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from /home/thisuserhere/storage/iCloud/temp
16:17:39,703 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 2 local files found
16:17:39,703 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/thisuserhere/storage/iCloud/temp, /home/thisuserhere/storage/iCloud/temp/.DS_Store) = /.DS_Store
16:17:39,703 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/.DS_Store], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]], excludes = [[/~]]
16:17:39,704 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/.DS_Store], excludes = [[/~]]
16:17:39,704 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/.DS_Store], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]]
16:17:39,704 DEBUG [f.p.e.c.f.FsParserAbstract] [/.DS_Store] can be indexed: [false]
16:17:39,704 DEBUG [f.p.e.c.f.FsParserAbstract] - ignored file/dir: .DS_Store
16:17:39,704 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/thisuserhere/storage/iCloud/temp, /home/thisuserhere/storage/iCloud/temp/IMG_9543.heic) = /IMG_9543.heic
16:17:39,705 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/IMG_9543.heic], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]], excludes = [[/~]]
16:17:39,705 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/IMG_9543.heic], excludes = [[/~]]
16:17:39,705 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/IMG_9543.heic], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]]
16:17:39,705 DEBUG [f.p.e.c.f.FsParserAbstract] [/IMG_9543.heic] can be indexed: [true]
16:17:39,705 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /IMG_9543.heic
16:17:39,705 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/home/thisuserhere/storage/iCloud/temp],[IMG_9543.heic]
16:17:39,706 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/thisuserhere/storage/iCloud/temp, /home/thisuserhere/storage/iCloud/temp/IMG_9543.heic) = /IMG_9543.heic
16:17:39,710 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
16:17:39,711 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
16:17:39,721 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
16:17:39,733 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [ocr_and_text] and tesseract was found.
16:17:39,734 INFO [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
16:17:40,587 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/thisuserhere/storage/iCloud/temp, /home/thisuserhere/storage/iCloud/temp/IMG_9543.heic) = /IMG_9543.heic
16:17:40,605 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing thisuserhere/IMG_9543.heic?pipeline=null
16:17:40,606 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [/home/thisuserhere/storage/iCloud/temp]...
16:17:40,615 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/thisuserhere/storage/iCloud/temp, /home/thisuserhere/storage/iCloud/temp/IMG_9553.heic) = /IMG_9553.heic
16:17:40,615 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/IMG_9553.heic], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]], excludes = [[/~]]
16:17:40,615 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/IMG_9553.heic], excludes = [[/~]]
16:17:40,615 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/IMG_9553.heic], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]]
16:17:40,615 DEBUG [f.p.e.c.f.FsParserAbstract] Deleting thisuserhere/IMG_9553.heic
16:17:40,616 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Deleting thisuserhere/IMG_9553.heic
16:17:40,616 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [/home/thisuserhere/storage/iCloud/temp]...
16:17:40,621 INFO [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
16:17:40,696 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [thisuserhere]
16:17:40,697 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
16:17:40,697 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
16:17:40,697 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Closing BulkProcessor
16:17:40,697 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] BulkProcessor is now closed
16:17:40,701 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service stopped
16:17:40,701 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
16:17:40,701 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Closing BulkProcessor
16:17:40,701 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] BulkProcessor is now closed
16:17:40,702 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Executing [2] remaining actions
16:17:40,702 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Going to execute new bulk composed of 2 actions
16:17:40,708 DEBUG [f.p.e.c.f.c.ElasticsearchEngine] Sending a bulk request of [2] documents to the Elasticsearch service
16:17:40,708 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk a ndjson of 2687 characters
16:17:40,742 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Executed bulk composed of 2 actions
16:17:40,743 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Elasticsearch Document Service stopped
16:17:40,743 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
16:17:40,743 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [thisuserhere] stopped
16:17:40,744 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [thisuserhere]
16:17:40,745 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
16:17:40,745 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
16:17:40,745 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service stopped
16:17:40,745 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
16:17:40,745 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Elasticsearch Document Service stopped
16:17:40,745 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
16:17:40,745 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [thisuserhere] stopped
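
The include/exclude decisions in the log above (`.DS_Store` rejected, `IMG_9543.heic` accepted) can be approximated with glob matching. This is a rough sketch and not FSCrawler's actual matcher. It also assumes that the patterns printed as `/.heic` and `/~` were originally `*/*.heic` and `*/~*`, with the asterisks lost when the page was rendered; that is a guess, not something the log confirms.

```python
from fnmatch import fnmatchcase

# Assumed original patterns (asterisks restored by guesswork, see above).
INCLUDES = ["*/*.doc", "*/*.txt", "*/*.pdf", "*/*.jpeg", "*/*.jpg",
            "*/*.heic", "*/*.png", "*/*.tiff", "*/*.mov", "*/*.mp4"]
EXCLUDES = ["*/~*"]


def can_be_indexed(virtual_path, includes=INCLUDES, excludes=EXCLUDES):
    """Approximate the file filter: excludes win, then an include must match."""
    if any(fnmatchcase(virtual_path, p) for p in excludes):
        return False
    return any(fnmatchcase(virtual_path, p) for p in includes)
```

Under these assumptions, `can_be_indexed("/.DS_Store")` is false and `can_be_indexed("/IMG_9543.heic")` is true, matching the `can be indexed` lines in the log. So the file filter is not what keeps the HEIC text out of the index.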

@dadoonet
Owner

Great. Could you run the same thing again with --trace --restart and share the logs again?
You can just share what is between:

DEBUG [f.p.e.c.f.FsParserAbstract] [/IMG_9543.heic] can be indexed: [true]

and

DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [/home/thisuserhere/storage/iCloud/temp]...
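
To cut a long trace log down to just the part between those two markers, a small helper like the one below may be enough. It is a minimal sketch: the function name is hypothetical and it does plain substring matching on each line.

```python
def slice_between(lines, start_marker, end_marker):
    """Return the lines from the first start_marker through the next end_marker."""
    out, inside = [], False
    for line in lines:
        # Start collecting at the first line containing the start marker.
        if not inside and start_marker in line:
            inside = True
        if inside:
            out.append(line)
            # Stop right after the line containing the end marker.
            if end_marker in line:
                break
    return out
```

It could be called with the lines of the saved log file, `"can be indexed: [true]"` as the start marker, and `"Looking for removed files in"` as the end marker.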

@vid-bin
Author

vid-bin commented Oct 25, 2023

Okay, so it's definitely not running OCR on the images (the detected language is NONE and the indexed JSON has no content field), but everything else seems to work.

01:42:33,973 DEBUG [f.p.e.c.f.FsParserAbstract] [/IMG_4000.heic] can be indexed: [true]
01:42:33,974 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /IMG_4000.heic
01:42:33,974 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/home/userhere/Array/temp],[IMG_4000.heic]
01:42:33,975 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/userhere/Array/temp, /home/userhere/Array/temp/IMG_4000.heic) = /IMG_4000.heic
01:42:33,997 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [/home/userhere/Array/temp/IMG_4000.heic]
01:42:34,020 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
01:42:34,029 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
01:42:34,030 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
01:42:34,077 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
01:42:34,204 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [ocr_and_text] and tesseract was found.
01:42:34,204 INFO [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
01:42:35,216 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
01:42:35,697 TRACE [f.p.e.c.f.t.TikaDocParser] Main detected language: [: NONE (0.000000)]
01:42:35,700 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
01:42:35,700 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Null or empty content always matches.
01:42:35,700 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/userhere/Array/temp, /home/userhere/Array/temp/IMG_4000.heic) = /IMG_4000.heic
01:42:35,713 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing userhere/IMG_4000.heic?pipeline=null
01:42:35,714 TRACE [f.p.e.c.f.c.ElasticsearchClient] JSon indexed : {"meta":{"date":"2023-10-23T17:05:50.000+00:00","created":"2023-10-23T17:05:50.000+00:00","raw":{"ICC:Profile Connection Space":"XYZ","Minor Version":"0","ICC:Profile Copyright":"1 enUS(Copyright Apple Inc., 2016)","Exif SubIFD:Time Zone Digitized":"-07:00","X-TIKA:Parsed-By-Full-Set":"org.apache.tika.parser.DefaultParser","ICC:Class":"Input Device","ICC:Unknown tag (0x61617079)":"data (0x64617461): 14 bytes","ICC:Device manufacturer":"APPL","Exif SubIFD:Exif Image Width":"1290 pixels","ICC:Signature":"acsp","Exif SubIFD:User Comment":"Screenshot","ICC:Media White Point":"(0.9642, 1, 0.8251)","ICC:CMM Type":"appl","Exif SubIFD:Sub-Sec Time Original":"000","resourceName":"IMG_4000.heic","ICC:Version":"4.0.0","Exif IFD0:Orientation":"Top, left side (Horizontal / normal)","tiff:Orientation":"1","Major Brand":"heic","ICC:Profile Size":"30252","X-TIKA:Parsed-By":"org.apache.tika.parser.DefaultParser","Bits Per Channel":"8 8 8","ICC:Tag Count":"8","Exif IFD0:Date/Time":"2023:10:23 10:05:50","Exif SubIFD:Time Zone":"-07:00","tiff:ImageLength":"2796","dcterms:created":"2023-10-23T10:05:50","dcterms:modified":"2023-10-23T10:05:50","Exif SubIFD:Sub-Sec Time":"000","ICC:Profile Date/Time":"2016:01:01 00:00:00","Compatible Brands":"mif1 miaf MiHB heic","Exif SubIFD:Color Space":"sRGB","ICC:Profile Description":"1 enUS(Apple Wide Color Sharing Profile)","ICC:AToB 0":"mAB (0x6D414220): 29772 bytes","ICC:AToB 1":"mAB (0x6D414220): 29772 bytes","ICC:AToB 2":"mAB (0x6D414220): 29772 bytes","Height":"512 pixels","Width":"512 pixels","ICC:Color space":"RGB","Content-Type":"image/heic","Exif SubIFD:Date/Time Original":"2023:10:23 10:05:50","Exif SubIFD:Sub-Sec Time Digitized":"000","ICC:XYZ values":"0.964 1 0.825","exif:DateTimeOriginal":"2023-10-23T10:05:50","Rotation":"0 degrees","Exif SubIFD:Time Zone Original":"-07:00","Exif SubIFD:Exif Image Height":"2796 pixels","ICC:Primary Platform":"Apple Computer, Inc.","ICC:Chromatic Adaptation":"sf32 (0x73663332): 44 bytes","Exif SubIFD:Date/Time Digitized":"2023:10:23 10:05:50","tiff:ImageWidth":"1290"}},"file":{"extension":"heic","content_type":"image/heic","created":"2023-10-24T23:17:14.007+00:00","last_modified":"2023-10-24T23:17:14.007+00:00","last_accessed":"2023-10-25T07:15:32.774+00:00","indexing_date":"2023-10-25T08:42:33.975+00:00","filesize":243979,"filename":"IMG_4000.heic","url":"file:///home/userhere/Array/temp/IMG_4000.heic"},"path":{"root":"ead0d21913015c4a9d9472e67e9e2d","virtual":"/IMG_4000.heic","real":"/home/userhere/Array/temp/IMG_4000.heic"}}
01:42:35,714 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [/home/userhere/Array/temp]...
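
The `JSon indexed` trace line above is the quickest way to see what was actually sent to Elasticsearch. The sketch below pulls the JSON payload out of such a line and reports whether it has any extracted text. It assumes the payload sits on a single line and that extracted text would appear in a top-level `content` field; the function names are hypothetical.

```python
import json

MARKER = "JSon indexed : "


def indexed_document(trace_line):
    """Parse the JSON payload that follows the marker in a TRACE log line."""
    _, _, payload = trace_line.partition(MARKER)
    return json.loads(payload)


def missing_content(doc):
    """True when the document has no non-empty top-level `content` field."""
    return not str(doc.get("content") or "").strip()
```

For the document logged in this comment, the top-level keys are `meta`, `file`, and `path`, so `missing_content` would return true. That is consistent with the metadata being indexed while no OCR text was produced.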

@dadoonet
Owner

Not supported by tesseract.

tesseract-ocr/tesseract#2930

@vid-bin
Author

vid-bin commented Oct 29, 2023

Any chance you could do a band-aid solution on fscrawler by generating temporary jpeg versions to scan? This would be great for heic, jpeg-xl and avif formats.

It's frustrating because the Mac generates OCR output locally with Spotlight for HEIC files.

Alternatively is there any chance you could code fscrawler to use whatever the Mac is using to generate heic ocr? I wouldn't mind running fscrawler on my Mac and connecting remotely to the elasticsearch server.
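
If such a conversion step were added, it would first need to decide which files to convert. One option is to look at the first bytes of the file instead of trusting the extension. The sketch below is an illustration only: the function names are hypothetical, and it recognises just a handful of signatures and brands (HEIC files such as the one in the trace above report the major brand `heic`), so real-world files may need a longer list.

```python
def sniff_format(header):
    """Guess the image format from the first bytes of a file.

    Returns "heic", "avif", "jxl", "jpeg", or None when unrecognised.
    """
    # JPEG: starts with the SOI marker FF D8 FF.
    if header[:3] == b"\xff\xd8\xff":
        return "jpeg"
    # JPEG XL: bare codestream (FF 0A) or the 12-byte container signature.
    if header[:2] == b"\xff\x0a" or header[:12] == b"\x00\x00\x00\x0cJXL \r\n\x87\n":
        return "jxl"
    # HEIF family: an ISO BMFF "ftyp" box whose major brand names the codec.
    if header[4:8] == b"ftyp":
        brand = header[8:12]
        if brand in (b"heic", b"heix"):
            return "heic"
        if brand in (b"avif", b"avis"):
            return "avif"
    return None


def needs_conversion(header):
    """True for the three formats named in this thread as conversion candidates."""
    return sniff_format(header) in {"heic", "avif", "jxl"}
```

Reading the first 12 bytes of a file is enough for every check here, for example `open(path, "rb").read(12)`.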

@dadoonet
Owner

Alternatively is there any chance you could code fscrawler to use whatever the Mac is using to generate heic ocr?

Are you aware of any library which would allow this?

@dadoonet dadoonet added feature_request for feature request and removed check_for_bug Needs to be reproduced labels Jan 15, 2024
@dadoonet dadoonet changed the title Not scanning HEIC files Add support for HEIC files (convert to JPEG) Jan 15, 2024