Add dependency on native Tesseract OCR executable for pytesseract #1348

Merged

TikhonJelvis
Contributor

The pytesseract package needs the `tesseract` executable to be available at runtime.

By default, the Python package looks for the `tesseract` executable on the `PATH`. That lookup fails here, so we need to override `tesseract_cmd` with the path to the `tesseract` executable we pulled in from Nix. I did this with a patch [based on how pytesseract is set up in Nixpkgs][1].
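To illustrate why pinning the path helps, here is a minimal stdlib-only sketch of the difference between a bare command name and a path with a directory component. It does not use pytesseract or Nix; the temporary directory is a hypothetical stand-in for the Nix store path, `resolve` is a helper name made up for this example, and it assumes a POSIX-like system.

```python
import os
import shutil
import tempfile


def resolve(cmd, search_path):
    """Return the file that would run for `cmd`, or None if lookup fails."""
    return shutil.which(cmd, path=search_path)


with tempfile.TemporaryDirectory() as store_dir, \
        tempfile.TemporaryDirectory() as empty_dir:
    # Hypothetical stand-in for the Nix store path of the real binary.
    pinned = os.path.join(store_dir, "tesseract")
    with open(pinned, "w") as handle:
        handle.write("#!/bin/sh\nexit 0\n")
    os.chmod(pinned, 0o755)

    # A bare name depends on the search path; here the path lacks the binary.
    bare_result = resolve("tesseract", empty_dir)
    # A path with a directory component is checked directly, ignoring PATH.
    pinned_result = resolve(pinned, empty_dir)

print("bare name resolved to:", bare_result)
print("pinned path resolved:", pinned_result is not None)
```

The bare name only resolves if the environment happens to provide it, while the pinned path resolves regardless of what the search path contains.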

The patching code feels a bit fiddly. I don't know the idiomatic way to do this sort of thing.

I included a test that fails if pytesseract cannot find the `tesseract` executable. The test passed for me with both `preferWheels = true` and `preferWheels = false`, but I only included one of them in the test suite here. I'm not sure whether it makes sense to have both, since the actual patching code had to be a bit different depending on whether the source was a wheel or not.

[1]: https://github.com/NixOS/nixpkgs/blob/master/pkgs/development/python-modules/pytesseract/tesseract-binary.patch
'';
in
super.pytesseract.overridePythonAttrs (old: {
buildInputs = (old.buildInputs or [ ]) ++ [ pkgs.tesseract4 ];
Contributor

Doesn't this need to be in propagatedBuildInputs to be available at runtime? If so, you might not need the patch.

Contributor Author

Good question, let me try it. I saw that the pytesseract default.nix in Nixpkgs used buildInputs, but I don't know if that's the best way to do it.

For reproducibility it seems better to have an explicit path to the executable rather than relying on the PATH variable where the script runs, so maybe that's the consideration.

Contributor Author

Okay, I tried changing it to propagatedBuildInputs and removing the patch, and the test case I wrote failed:

pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.
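As far as I can tell, pytesseract produces this error when launching the command fails because no such file exists, which is what happens when a bare `tesseract` name is not on the runtime `PATH`. The sketch below mimics that translation using only the standard library. `ExecutableNotFoundError` and `run_version` are hypothetical names for this example, not pytesseract APIs.

```python
import subprocess


class ExecutableNotFoundError(EnvironmentError):
    """Hypothetical stand-in for pytesseract's TesseractNotFoundError."""


def run_version(cmd):
    """Run `cmd --version` and return its stdout, or raise if cmd is missing."""
    try:
        result = subprocess.run(
            [cmd, "--version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError as exc:
        # The OS could not find the executable, e.g. it is not on PATH.
        raise ExecutableNotFoundError(
            f"{cmd} is not installed or it's not in your PATH"
        ) from exc
    return result.stdout
```

Calling `run_version` with an absolute path avoids the `PATH` lookup entirely, which is the effect the patch aims for.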

cpcloud added this pull request to the merge queue on Oct 21, 2023.
Merged via the queue into nix-community:master with commit 67dade9 on Oct 21, 2023.
118 checks passed