[py-tx] embeded tx hash for unletterboxing #1684

Mackay-Fisher · 2024-11-06T23:36:17Z

Preprocessing and Hashing Enhancements in ThreatExchange CLI

This update introduces new preprocessing functionality to remove black letterbox borders from images before hashing. This PR resolves issue #1666. Key changes include an unletterbox method in photo.py, command-line enhancements in hash_cmd.py for preprocessing control, and added tests in test_pdq_letterboxing.py. The preprocessing is integrated with rotation so that it can also be used in brute-force lookup.

Summary of Changes

New `unletterbox` Method in `photo.py`

The unletterbox method preprocesses images by removing black letterbox borders, with options for customization:

Parameters:
- black_threshold: Sets the brightness threshold to detect black borders (default: 40).
- save_output: If True, saves the unletterboxed image as a new file with _unletterboxed appended to the filename.
Returns: Cropped image bytes, enabling flexibility for in-memory processing or file-based hashing.

Updated `HashCommand` in `hash_cmd.py`

The HashCommand class now supports additional command-line arguments to control preprocessing:

--preprocess: Specifies the preprocessing type. Currently this covers unletterbox for black border removal, and rotations (see PR "[pytx] Add --rotations to hash_cmd" #1678 for more details).
--black-threshold: Sets brightness threshold for border detection, allowing sensitivity adjustments.
--save-output: saves the processed image as a new file, which is used for hashing. If False, hashing is done directly on the processed image bytes.

These parameters enable better control over image processing for different workflows.

Testing in `test_pdq_letterboxing.py`

Added a new unit test to python-threatexchange/threatexchange/tests/hashing/test_pdq_letterboxing.py to validate unletterbox functionality:

Key Tests:
- test_unletterbox_image: Confirms that unletterboxed image bytes match the expected PDQ hash.
- test_unletterboxfile_creates_output_file: Verifies file creation when save_output=True and checks for file existence.

These tests cover scenarios for both in-memory processing and output file creation workflows.

Usage Examples with Expected Hash Outputs

1. Basic Hashing without Preprocessing

tx hash -S pdq photo original.png
pdq facefacefaceface  # Hash for original image

tx hash -S pdq photo letterboxed.png
pdq 0000faceface0000  # Hash includes letterboxed borders

2. Hashing with sensitivity threshold

Preprocessing `letterboxed.png` with Custom `black-threshold`

tx hash -S pdq --preprocess=unletterbox --black-threshold=int photo letterboxed.png
pdq newfacefaceface  # hash adjusted for sensitivity of borders

3. Hashing with In-Memory Preprocessing

Preprocessing `letterboxed.png` with Default `black-threshold=40`

tx hash -S pdq --preprocess=unletterbox photo letterboxed.png
pdq facefacefaceface  # Matches `original.png` hash after removing letterbox borders

4. Hashing with Saved Output File

Saving Preprocessed Output with Default `black-threshold=40`

tx hash -S pdq --preprocess=unletterbox --save-output photo letterboxed.png
Unletterboxed image saved to: letterboxed_unletterboxed.png
pdq facefacefaceface  # Matches `original.png` hash

Summary of Usage

Parameters:
- --black-threshold: Controls border detection sensitivity. Lower values increase sensitivity, while higher values decrease sensitivity.
- --save-output: Saves the processed image as <original_file>_unletterboxed.png

Dcallies · 2024-11-08T15:01:31Z

Before I look at the code, thank you for the very comprehensive summary and test writeup. You could probably even skimp a little bit in future PRs!

I also like that you looked for other potential improvements that made sense in the area. I was considering suggesting a --save option during @haianhng31 --rotation diffs as well.

Mackay-Fisher · 2024-11-08T15:08:28Z

Of course, thank you!

I can also update the save output option and file output for the rotation if you would like.

Dcallies

As mentioned, thanks for the strong summary writeup.

Toplevel things:

I think there's a /data/ directory in python-threatexchange. We've also previously loaded test images from the pdq/data directory - there's a trick you can use to find relative imports from the current file path you can see in a couple of tests. What do you think about synthesizing a new letterboxed bridge photo from the bridge-mods directory and use the meta logo only for the manual testing for this diff? Why - I don't want folks to be concerned about image rights, and the bridge-mods are only for the purpose of this repo.
I think we should simplify the version of the interface we use on PhotoContent for now, and move the specific letterboxing into its own directory under content_type, or at least its own file. I foresee us adding more preprocessing tricks like this in the future.
Can you explain more how you came to choose the thresholds, e.g. 40?

I found this random library that does something similar - you might want to give their code a read for ideas: https://github.com/Animenosekai/bordercrop/blob/main/bordercrop/bordercrop.py, though as mentioned we are trying to limit the number of external dependencies, otherwise we could even just use the library as is. We don't need all the features they have.

Overall strong work, thanks for the effort put in!

Dcallies · 2024-11-08T15:02:20Z

python-threatexchange/threatexchange/cli/hash_cmd.py

        ap.add_argument(
            "--rotations",


(no changes needed): Do you think we should combine rotations into your generalization of preprocessing?

Alternatively, what do you think about making rotation and unletterboxing mutually exclusive?

Yes, I thought it made the most sense to combine them, but I didn't want to jump ahead and do it before asking. I can add it to the list of actions.

Also, would there ever be a workflow where someone may want to both check for rotation and process the image for letterboxing?
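For reference, if rotation and unletterboxing were made mutually exclusive, argparse can enforce that on its own. This is a minimal sketch under stated assumptions: the parser, the `choices` list, and the flag shapes are hypothetical stand-ins, not the definitions in hash_cmd.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the hash command's parser.
    ap = argparse.ArgumentParser(prog="tx-hash-sketch")
    group = ap.add_mutually_exclusive_group()
    group.add_argument("--rotations", action="store_true")
    group.add_argument("--preprocess", choices=["unletterbox"])
    return ap


args = build_parser().parse_args(["--preprocess", "unletterbox"])
print(args.preprocess, args.rotations)
```

Passing both flags at once makes argparse exit with a usage error, so the command body never has to handle the combined case.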

Dcallies · 2024-11-08T16:06:41Z

python-threatexchange/threatexchange/cli/hash_cmd.py

+            type=bool,
+            default=False,


nit: If you make the action store_true then the default is false IIRC.
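To illustrate the nit: with action="store_true" the default is already False, while type=bool does not parse "False" the way one might expect, because any non-empty string is truthy. A small sketch using only the standard library (the flag name mirrors the one in this PR, and `--flag` is purely illustrative):

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("--save-output", action="store_true")

absent = ap.parse_args([]).save_output                  # flag omitted
present = ap.parse_args(["--save-output"]).save_output  # flag given
print(absent, present)  # False True

# The type=bool pitfall: bool("False") is True, so the flag can never be turned off.
pitfall = argparse.ArgumentParser()
pitfall.add_argument("--flag", type=bool, default=False)
print(pitfall.parse_args(["--flag", "False"]).flag)  # True
```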

Dcallies · 2024-11-08T16:08:01Z

python-threatexchange/threatexchange/cli/hash_cmd.py

@@ -118,7 +144,17 @@ def execute(self, settings: CLISettings) -> None:
        if not self.rotations:
            for file in self.files:
                for hasher in hashers:
-                    hash_str = hasher.hash_from_file(file)
+                    if isinstance(hasher, PdqSignal) and (


Why only PdqSignal? Wouldn't other image perceptual hashing algorithms benefit from this?

We also generally want to avoid places where we do isinstance(<interface_implementation>) in preference of it being handled by the interface itself.

Okay, I thought it was specific to PdqSignal. As for why I bypassed the interface: not all of the hashers have a hash_from_bytes method, and I did not always want to create a new file with the updated image bytes. Is there a way around this, or would it be better to always create the new file, even if only temporarily, and then keep the output if the user passes the flag to save it?

It's not specific to PdqSignal, and the hash_from_bytes method is itself part of a wider interface.

I think I eventually decided it was simpler to write everything to tmpfiles, which is how we ended up with the current implementation.

However, similar to the feedback I gave for --rotations, we can make our life a lot easier by having the preprocessing happen in between the file input and the for file in self.files.

I like your idea of providing a way to pass flag to save it, but let's save that for a followup.
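A rough sketch of that shape, assuming a hypothetical `preprocess_files` helper that sits between the file input and the existing loop. The hasher below is a plain SHA-256 stand-in, not a perceptual hash, and the "unletterbox" step just strips zero bytes so the example runs without image libraries.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List


def preprocess_files(
    files: List[Path], preprocess: Callable[[Path], bytes], tmp_dir: Path
) -> List[Path]:
    """Write preprocessed bytes to temp files and return the new paths."""
    out: List[Path] = []
    for idx, file in enumerate(files):
        target = tmp_dir / f"{idx}_{file.name}"  # keeps the original extension
        target.write_bytes(preprocess(file))
        out.append(target)
    return out


def fake_hash_from_file(path: Path) -> str:
    # Stand-in for hasher.hash_from_file; NOT a perceptual hash.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def demo() -> List[str]:
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        src = tmp_dir / "letterboxed.bin"
        src.write_bytes(b"\x00\x00image\x00")
        # Stand-in for unletterboxing: strip the zero "border" bytes.
        files = preprocess_files(
            [src], lambda p: p.read_bytes().strip(b"\x00"), tmp_dir
        )
        # The hashing loop itself stays unchanged and needs no isinstance checks.
        return [fake_hash_from_file(f) for f in files]


print(demo())
```

Because every hasher only ever sees a file path, the loop works the same for any signal type that implements hash_from_file.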

Dcallies · 2024-11-08T16:08:26Z

python-threatexchange/threatexchange/cli/hash_cmd.py

@@ -118,7 +144,17 @@ def execute(self, settings: CLISettings) -> None:
        if not self.rotations:
            for file in self.files:
                for hasher in hashers:
-                    hash_str = hasher.hash_from_file(file)
+                    if isinstance(hasher, PdqSignal) and (
+                        self.content_type.get_name() == "photo"


Hmm, more specialization.

Dcallies · 2024-11-08T16:12:28Z

python-threatexchange/threatexchange/content_type/photo.py

+
+    @classmethod
+    def detect_top_border(
+        cls, grayscale_img: Image.Image, black_threshold: int = 10
+    ) -> int:
+        """
+        Detect the top black border by counting rows with only black pixels.
+        Uses a default black threshold of 10 so that only rows with pixel brightness
+        of 10 or lower will be removed.
+
+        Returns the first row that is not all blacked out from the top.
+        """
+        width, height = grayscale_img.size
+        for y in range(height):
+            row_pixels = list(grayscale_img.crop((0, y, width, y + 1)).getdata())
+            if all(pixel < black_threshold for pixel in row_pixels):
+                continue
+            return y
+        return height
+
+    @classmethod
+    def detect_bottom_border(
+        cls, grayscale_img: Image.Image, black_threshold: int = 10
+    ) -> int:
+        """
+        Detect the bottom black border by counting rows with only black pixels from the bottom up.
+        Uses a default black threshold of 10 so that only rows with pixel brightness
+        of 10 or lower will be removed.
+
+        Returns the first row that is not all blacked out from the bottom.
+        """
+        width, height = grayscale_img.size
+        for y in range(height - 1, -1, -1):
+            row_pixels = list(grayscale_img.crop((0, y, width, y + 1)).getdata())
+            if all(pixel < black_threshold for pixel in row_pixels):
+                continue
+            return height - y - 1
+        return height
+
+    @classmethod
+    def detect_left_border(
+        cls, grayscale_img: Image.Image, black_threshold: int = 10
+    ) -> int:
+        """
+        Detect the left black border by counting columns with only black pixels.
+        Uses a default black threshold of 10 so that only columns with pixel brightness
+        of 10 or lower will be removed.
+
+        Returns the first column from the left that is not all blacked out in the column.
+        """
+        width, height = grayscale_img.size
+        for x in range(width):
+            col_pixels = list(grayscale_img.crop((x, 0, x + 1, height)).getdata())
+            if all(pixel < black_threshold for pixel in col_pixels):
+                continue
+            return x
+        return width
+
+    @classmethod
+    def detect_right_border(
+        cls, grayscale_img: Image.Image, black_threshold: int = 10
+    ) -> int:
+        """
+        Detect the right black border by counting columns with only black pixels from the right.
+        Uses a default black threshold of 10 so that only columns with pixel brightness
+        of 10 or lower will be removed.
+
+        Returns the first column from the right that is not all blacked out in the column.
+        """
+        width, height = grayscale_img.size
+        for x in range(width - 1, -1, -1):
+            col_pixels = list(grayscale_img.crop((x, 0, x + 1, height)).getdata())
+            if all(pixel < black_threshold for pixel in col_pixels):
+                continue
+            return width - x - 1
+        return width


By putting these all at the top level, we are signaling that they are part of the "official" interface for photos.

Instead, let's move this functionality into its own file in a new /preprocessing directory. We can add unletterbox.py as its own module, with these 4 methods then as standalone.
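As a dependency-free illustration of what the standalone functions could look like, here is a sketch that works on a plain list-of-rows grayscale grid instead of a Pillow image. That substitution is an assumption made so the example runs anywhere; the function names, the default of 10, and the `<` comparison follow the diff above, and the bottom/left/right detectors reuse the top one on a reversed or transposed grid.

```python
from typing import List

GrayGrid = List[List[int]]  # rows of 0-255 brightness values


def detect_top_border(grid: GrayGrid, black_threshold: int = 10) -> int:
    """Count leading rows whose pixels are all below the threshold."""
    for y, row in enumerate(grid):
        if not all(value < black_threshold for value in row):
            return y
    return len(grid)


def detect_bottom_border(grid: GrayGrid, black_threshold: int = 10) -> int:
    return detect_top_border(grid[::-1], black_threshold)


def detect_left_border(grid: GrayGrid, black_threshold: int = 10) -> int:
    columns = [list(col) for col in zip(*grid)]  # transpose rows to columns
    return detect_top_border(columns, black_threshold)


def detect_right_border(grid: GrayGrid, black_threshold: int = 10) -> int:
    columns = [list(col) for col in zip(*grid)]
    return detect_top_border(columns[::-1], black_threshold)


grid = [
    [0, 0, 0, 0],
    [0, 255, 255, 0],
    [0, 255, 255, 0],
    [0, 0, 0, 0],
]
borders = (
    detect_top_border(grid),
    detect_bottom_border(grid),
    detect_left_border(grid),
    detect_right_border(grid),
)
print(borders)  # (1, 1, 1, 1)
```

Note that for an all-black input every detector returns the full dimension, so a caller would need to handle that case before cropping.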

Dcallies · 2024-11-08T16:13:04Z

python-threatexchange/threatexchange/content_type/photo.py

+
+    @classmethod
+    def unletterbox(
+        cls, file_path: Path, save_output: bool = False, black_threshold: int = 40


blocking: Instead of making save_output an argument here, it might be better to compose it from the outside by taking the bytes, which will give the caller more control over the file directory.

blocking q: Can you tell more about how you picked 40? We may want to be very conservative by default (even to only 100% black).
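One way to read the first suggestion: the preprocessing function only returns bytes, and saving is composed by the caller. A minimal sketch, where `unletterbox` is a hypothetical stand-in that returns the file contents unchanged and `save_bytes` is an illustrative caller-side helper, neither taken from the PR:

```python
import tempfile
from pathlib import Path


def unletterbox(file_path: Path) -> bytes:
    # Hypothetical stand-in: real code would crop and re-encode the image.
    return file_path.read_bytes()


def save_bytes(data: bytes, out_dir: Path, name: str) -> Path:
    """Caller-side helper: the caller chooses the directory and file name."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / name
    target.write_bytes(data)
    return target


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "letterboxed.png"
        src.write_bytes(b"fake image bytes")
        cleaned = unletterbox(src)  # no save_output argument needed
        saved = save_bytes(cleaned, Path(tmp) / "out", src.name)
        return saved.read_bytes()


print(demo())
```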

Dcallies · 2024-11-08T16:13:16Z

python-threatexchange/threatexchange/content_type/photo.py

+
+        Then removing the edges to give back a cleaned image bytes.
+
+        Return the new hash of the cleaned image with an option to create a new output file as well


I don't think this returns the hash, no?

Ahh, no. I had it returning the hash at first, but that created a circular dependency, so I removed it and did not update the comment. I will update it.

Dcallies · 2024-11-08T16:15:31Z

python-threatexchange/threatexchange/content_type/photo.py

+
+            # Convert the cropped image to bytes for hashing
+            with io.BytesIO() as buffer:
+                cropped_img.save(buffer, format=img.format)


Why .img? Should we use the same format that was passed in?

Dcallies · 2024-11-08T16:16:03Z

python-threatexchange/threatexchange/content_type/photo.py

+        """
+        # Open the original image
+        with Image.open(file_path) as img:
+            grayscale_img = img.convert("L")


blocking q: Hmm, why convert to grayscale first? Won't this convert some full colors to black?

I revised this to check each of the pixel's R, G, and B values individually.
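To make the concern concrete: Pillow documents the "L" conversion as the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The sketch below applies that formula with integer division, which is an approximation of Pillow's own rounding, and compares it with a per-channel check. The helper names are illustrative only.

```python
from typing import Tuple

Pixel = Tuple[int, int, int]


def luma(pixel: Pixel) -> int:
    # ITU-R 601-2 weights; Pillow's exact rounding may differ slightly.
    r, g, b = pixel
    return (r * 299 + g * 587 + b * 114) // 1000


def gray_is_black(pixel: Pixel, threshold: int) -> bool:
    return luma(pixel) < threshold


def channels_are_black(pixel: Pixel, threshold: int) -> bool:
    return all(channel < threshold for channel in pixel)


pure_blue = (0, 0, 255)
print(luma(pure_blue))                    # 29
print(gray_is_black(pure_blue, 40))       # True: treated as border
print(channels_are_black(pure_blue, 40))  # False: kept as content
```

With the original threshold of 40, a fully saturated blue row would have been cropped by the grayscale check, which is the failure mode the per-channel comparison avoids.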

Mackay-Fisher · 2024-11-13T06:10:27Z

What do you think about synthesizing a new letterboxed bridge photo from the bridge-mods directory and use the meta logo only for the manual testing for this diff? Why - I don't want folks to be concerned about image rights, and the bridge-mods are only for the purpose of this repo.

I will update the images.

Mackay-Fisher · 2024-11-13T06:12:04Z

I found this random library that does something similar - you might want to give their code a read for ideas: https://github.com/Animenosekai/bordercrop/blob/main/bordercrop/bordercrop.py, though as mentioned we are trying to limit the number of external dependencies, otherwise we could even just use the library as is. We don't need all the features they have.

This was super helpful. I can go ahead and use the Pillow library and implement it in a similar way, which keeps the dependencies simple and the code cleaner. It should work the same way. However, if you would like me to use this library instead, I can do that as well.

Mackay-Fisher · 2024-11-13T16:06:54Z

I found this random library that does something similar - you might want to give their code a read for ideas: https://github.com/Animenosekai/bordercrop/blob/main/bordercrop/bordercrop.py, though as mentioned we are trying to limit the number of external dependencies, otherwise we could even just use the library as is. We don't need all the features they have.

This was super helpful. I can go ahead and use the Pillow library and implement it in a similar way, which keeps the dependencies simple and the code cleaner. It should work the same way. However, if you would like me to use this library instead, I can do that as well.

Also, I looked into it further: the library also handles image URLs and more than the basic image formats we are using. If supporting those is intended, I can add a pass-through for it.

Dcallies

Some response to comments - use "request review" in the upper right hand of the summary page to send it back to me for review (little refresh logo).

Dcallies

Getting closer!

blocking: Instead of adding the test file to /threatexchange, can you add it to https://github.com/facebook/ThreatExchange/tree/main/pdq/data/bridge-mods instead?

Because threatexchange is in the same repo as PDQ, you can read the pdq directory from ThreatExchange tests using the trick I noted inline with __file__.
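A sketch of the __file__ trick, assuming the test lives at the path quoted in the PR description (python-threatexchange/threatexchange/tests/hashing/test_pdq_letterboxing.py). The `bridge_mods_dir` helper and the parent index of 4 are assumptions derived from that path; existing tests in the repo may compute it differently.

```python
from pathlib import Path


def bridge_mods_dir(test_file: Path) -> Path:
    """Walk up from a test file to the repo root, then into pdq/data."""
    # parents[0]=hashing, [1]=tests, [2]=threatexchange,
    # [3]=python-threatexchange, [4]=repository root
    repo_root = test_file.parents[4]
    return repo_root / "pdq" / "data" / "bridge-mods"


# In a real test this would be bridge_mods_dir(Path(__file__).resolve()).
example = Path(
    "/repo/python-threatexchange/threatexchange/tests/hashing/"
    "test_pdq_letterboxing.py"
)
print(bridge_mods_dir(example))
```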

Optional / For your consideration: At the rate we are going, this feels like it will take a few more passes. You can simplify the PR by breaking out the changes to unletterboxing.py and photo to its own PR (the bottom/simplest part of the stack), though splitting a PR in git is pretty daunting for the first time. Here's a random article describing a method using cherry-pick, but there are others.

Dcallies · 2024-11-14T00:31:02Z

python-threatexchange/threatexchange/cli/hash_cmd.py

+            default=10,
+            help=(
+                "Set the black threshold for unletterboxing (default: 5)."


blocking: documentation for default seems off. There also might be a default argparse option that will display the default for you.

blocking q: Can you tell me how you chose 10 for this?
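On displaying the default automatically: argparse ships ArgumentDefaultsHelpFormatter, which appends the default to each help string so the text cannot drift from the code. A small sketch, where the parser, the metavar, and the help wording are illustrative rather than the PR's actual strings:

```python
import argparse

ap = argparse.ArgumentParser(
    prog="tx-hash-sketch",
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
ap.add_argument(
    "--black-threshold",
    type=int,
    default=10,
    metavar="N",
    help="set the black threshold for unletterboxing",
)

help_text = ap.format_help()
print(help_text)  # the option's help line ends with "(default: 10)"
```

An alternative with the same effect is to write `%(default)s` inside the help string, which argparse substitutes when rendering help.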

Dcallies · 2024-11-14T00:34:16Z

python-threatexchange/threatexchange/cli/hash_cmd.py

+            "--save-output",
            action="store_true",
-            help="for photos, generate all 8 simple rotations",
+            help="If true, saves the processed image as a new file.",


blocking: To help a user understand what this option does, suggest naming it --save-preprocess

Since store_true doesn't take an argument, suggest this as an alternative help:

save the preprocessed image data as new files

Dcallies · 2024-11-14T00:35:51Z

python-threatexchange/threatexchange/cli/hash_cmd.py

+                rotation_type = []
+                if self.photo_preprocess == "unletterbox":
+                    updated_bytes.append(
+                        PhotoContent.unletterbox(str(file), self.black_threshold)


fine as a followup: whoops, we should make unletterbox take a Path object for consistency with the rest of the library

Dcallies · 2024-11-14T00:38:37Z

python-threatexchange/threatexchange/cli/hash_cmd.py

+                    with tempfile.NamedTemporaryFile() as temp_file:
+                        temp_file.write(bytes_data)
                        temp_file_path = pathlib.Path(temp_file.name)
                        for hasher in hashers:
                            hash_str = hasher.hash_from_file(temp_file_path)
                            if hash_str:
-                                print(rotation_type.name, hasher.get_name(), hash_str)
+                                print(
+                                    f"{rotation_type[idx].name if rotation_type else ''} {hasher.get_name()} {hash_str}"
+                                )
+                    if self.save_output:


ignorable: We can simplify this logic by using the delete= keyword of NamedTemporaryfile: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile

delete=not self.save_output
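A sketch of how that keyword could be used, with a hypothetical `write_processed` helper standing in for the loop body. One detail worth noting: the data should be flushed before anything else opens the file by name, otherwise buffered bytes may not be on disk yet.

```python
import os
import tempfile


def write_processed(data: bytes, save_output: bool) -> str:
    """Write bytes to a temp file that only survives when save_output is set."""
    with tempfile.NamedTemporaryFile(delete=not save_output) as tmp:
        tmp.write(data)
        tmp.flush()  # make the bytes visible to readers opening tmp.name
        # hashers would call hash_from_file(tmp.name) here
        return tmp.name


kept = write_processed(b"image bytes", save_output=True)
gone = write_processed(b"image bytes", save_output=False)
print(os.path.exists(kept), os.path.exists(gone))  # True False
os.remove(kept)  # clean up the file this demo kept
```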

Dcallies · 2024-11-14T00:41:54Z

python-threatexchange/threatexchange/cli/hash_cmd.py

+                        output_path = file.with_stem(f"{file.stem}{suffix}")
+                        with open(output_path, "wb") as output_file:
+                            output_file.write(bytes_data)
+                        print(f"Processed image saved to: {output_path}")


This might get a bit messy - you can include files from many locations. Additionally, do we know the format of the resulting image? Without the extension the file might not be usable.
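On the extension question, a small sketch of how the output name can keep the source's directory and extension. `output_path_for` is a hypothetical helper; it uses with_name so it also runs on Python versions older than 3.9, where with_stem is not available. It only guarantees a usable file if the bytes were re-encoded in the source's format, as the PR does with format=image.format.

```python
from pathlib import Path


def output_path_for(source: Path, suffix: str = "_unletterboxed") -> Path:
    # Same directory and extension as the source; only the stem changes.
    return source.with_name(f"{source.stem}{suffix}{source.suffix}")


with_ext = output_path_for(Path("a/letterboxed.png"))
without_ext = output_path_for(Path("b/letterboxed"))
print(with_ext.name)     # letterboxed_unletterboxed.png
print(without_ext.name)  # letterboxed_unletterboxed
```

Since each output lands next to its own source, inputs from many locations do not collide, but they also end up scattered across those locations, which is the messiness raised above.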

Dcallies · 2024-11-14T00:54:08Z

python-threatexchange/threatexchange/content_type/photo.py

+            cropped_img = image.crop((left, top, width - right, height - bottom))
+
+            with io.BytesIO() as buffer:
+                cropped_img.save(buffer, format=image.format)


ignorable: Ah I see, we keep the original format, I like this choice.

Dcallies · 2024-11-14T00:55:04Z

python-threatexchange/threatexchange/content_type/preprocess/unletterboxing.py

@@ -0,0 +1,69 @@
+from PIL import Image


blocking: You need to add the Meta copyright header, or I get nagged to add it myself. You can copy it from the other files.

Dcallies · 2024-11-14T00:55:55Z

python-threatexchange/threatexchange/content_type/preprocess/unletterboxing.py

+    Check if each color channel in the pixel is below the threshold
+    """
+    r, g, b = pixel
+    return r < threshold and g < threshold and b < threshold


blocking q: Shouldn't this be <=? Your default threshold is 0. Can it be negative?
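To spell out the question: channel values are never below 0, so with a default threshold of 0 a strict `<` can never match, not even for pure black. A sketch of an inclusive, typed variant follows; the signature is an assumption, not the final code in the PR.

```python
from typing import Tuple


def is_pixel_black(pixel: Tuple[int, int, int], threshold: int = 0) -> bool:
    """Return True when every channel is at or below the threshold."""
    r, g, b = pixel
    return r <= threshold and g <= threshold and b <= threshold


print(is_pixel_black((0, 0, 0)))     # True: pure black matches at threshold 0
print(is_pixel_black((0, 0, 1)))     # False
print(is_pixel_black((5, 5, 5), 5))  # True: the threshold itself is included
```

A negative threshold would make the function return False for every pixel, so rejecting negative values at argument-parsing time may be worth considering.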

Dcallies · 2024-11-14T00:56:08Z

python-threatexchange/threatexchange/content_type/preprocess/unletterboxing.py

+from PIL import Image
+
+
+def is_pixel_black(pixel, threshold):


blocking: missing typing

Dcallies · 2024-11-14T01:02:23Z

python-threatexchange/threatexchange/content_type/preprocess/unletterboxing.py

+    """
+    width, height = image.size
+    for y in range(height):
+        row_pixels = list(image.crop((0, y, width, y + 1)).getdata())


I couldn't tell from reading the pillow docs, but can you use the returned core.image object as an iterator without wrapping it in a list?

Mackay-Fisher requested a review from Dcallies as a code owner November 6, 2024 23:36

facebook-github-bot added the CLA Signed label Nov 6, 2024

Mackay-Fisher changed the title ~~[py-tx] embeded tx hash passthrough for file generation and byte augm…~~ [py-tx] embeded tx hash for unletterboxing Nov 6, 2024

[py-tx] embeded tx hash passthrough for file generation and byte augm…

926801e

…entation passes

Mackay-Fisher force-pushed the Issue-1666-Letter-Unboxing branch from 5944b03 to 926801e Compare November 6, 2024 23:53

Dcallies requested changes Nov 8, 2024

Dcallies self-requested a review November 13, 2024 16:37

Dcallies requested changes Nov 13, 2024

Mackay-Fisher force-pushed the Issue-1666-Letter-Unboxing branch from b50a0f9 to 01c2bf9 Compare November 13, 2024 22:48

[py-tx] Updated for pr revisions added cobinded unletterboxing and mo…

9dba274

…ved source files for unboxing and pytest

Mackay-Fisher force-pushed the Issue-1666-Letter-Unboxing branch from 01c2bf9 to 9dba274 Compare November 13, 2024 22:56

Merge branch 'main' into Issue-1666-Letter-Unboxing

bfc10da

Mackay-Fisher requested a review from Dcallies November 13, 2024 23:00

Dcallies requested changes Nov 14, 2024
