[py-tx][tat] TAT API implementation doesn't work correctly with HMA #1668

aokolish · 2024-10-24T20:58:56Z

I'm testing out hasher-matcher-actioner (HMA) with the Tech Against Terrorism (TAT) API and noticed a log line from the server:

INFO in fetcher: TAT_HASHES[tat] Is a NoCheckpointing class, which hopefully is a test type, and we have a checkpoint. Considering complete

This leads me to believe that HMA would never pull new hashes from the TAT API. Can this somehow be fixed in the TAT exchange implementation?

Here are their API docs - https://www.terrorismanalytics.org/docs/hash-list-v1

There are probably easier steps to reproduce, but my steps were...

1. Deploy HMA
2. Configure the TAT exchange with prod credentials
3. Run the HMA fetcher
4. Check the logs for this message


Bruce-Pouncey-TAT · 2024-10-24T23:09:02Z

Hi @aokolish
Bruce here from TAT, perhaps I can be of assistance.

Our hash list API delivers a JSON file which is updated on a nightly basis as you have seen in our documentation.
In this scenario a checkpoint would be difficult to keep track of as the list is delivered in full every time via a single request.

In the current implementation, running threatexchange fetch would download the entire hash list again, including any new hashes that were not in the previous fetch. We want the system to assume the list is stale on every fetch.

I'm happy to go over this further to help you find a solution, or if there is something I'm misunderstanding.

Dcallies · 2024-10-28T17:45:37Z

Hey @aokolish, @Bruce-Pouncey-TAT - I looked into this briefly, and this might be missing functionality in HMA, which currently can't handle APIs like TAT's that don't provide deltas but whose contents still change over time. Since there is no efficient way to discover updated records with this kind of API, we'd need to write something new in HMA that loads all the previously downloaded TAT records into memory and then computes the diff itself (removals especially). We could also force-clear all the hashes on every fetch, but as the number of hashes grows over time, this creates weird inconsistencies in the database (hashes disappearing and reappearing in the index) that might have real production impact.
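As a rough illustration of the "load the previous records and diff" idea, here is a minimal sketch. It is not HMA or python-threatexchange code: `SnapshotDiff` and `diff_snapshots` are hypothetical names, and it assumes each full download can be represented as a dict keyed by hash value.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SnapshotDiff:
    """Changes needed to turn the previous snapshot into the current one."""

    added: dict[str, dict] = field(default_factory=dict)
    changed: dict[str, dict] = field(default_factory=dict)
    removed: set[str] = field(default_factory=set)


def diff_snapshots(
    previous: dict[str, dict], current: dict[str, dict]
) -> SnapshotDiff:
    """Compare two full downloads, both keyed by hash value (an assumption)."""
    diff = SnapshotDiff()
    for key, record in current.items():
        if key not in previous:
            diff.added[key] = record
        elif previous[key] != record:
            diff.changed[key] = record
    # Anything we stored before that is absent from the new list was removed.
    diff.removed = set(previous) - set(current)
    return diff
```

Applying only `added`, `changed`, and `removed` to the database would avoid the clear-and-reload pattern, at the cost of holding both snapshots in memory during the comparison.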

This feature doesn't exist today, and so by default TAT isn't correctly supported by HMA, hence the logs that Alex is seeing.

To evaluate the potential solutions:

@Bruce-Pouncey-TAT - my top recommendation is for Tech Against Terrorism to switch to a delta-based API, like those of NCMEC, StopNCII, and ThreatExchange. In the long term, your users will thank you, as the cost of keeping a correct copy grows with the size of your database, and it's not too hard as long as you are storing hashes in a backing database. I have implemented multiple versions of this type of API, and helped other programs make this same jump. This is by far the easiest solution.
If for whatever reason TAT can't update their API, I can more fully describe the implementation outlined in my opening paragraph and add it to an issue for someone to attempt.
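For readers unfamiliar with the term, a delta-based API can be sketched as follows. This is an in-memory toy under stated assumptions, not the actual design of NCMEC, StopNCII, ThreatExchange, or TAT: `HashRecord`, `DeltaStore`, `updated_seq`, and `fetch_delta` are hypothetical names. The idea is that every change gets an increasing sequence number, deletions are kept as tombstones, and a client asks only for changes after its last checkpoint.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class HashRecord:
    hash_value: str
    updated_seq: int  # hypothetical monotonically increasing change counter
    deleted: bool = False  # tombstone, so clients can learn about removals


class DeltaStore:
    """Toy server-side store that can answer "what changed since N?"."""

    def __init__(self) -> None:
        self._records: dict[str, HashRecord] = {}
        self._seq = 0

    def upsert(self, hash_value: str) -> None:
        self._seq += 1
        self._records[hash_value] = HashRecord(hash_value, self._seq)

    def delete(self, hash_value: str) -> None:
        # Keep a tombstone instead of dropping the row.
        self._seq += 1
        self._records[hash_value] = HashRecord(hash_value, self._seq, deleted=True)

    def fetch_delta(self, since: int = 0) -> tuple[list[HashRecord], int]:
        """Return records changed after `since`, plus the new checkpoint."""
        updates = sorted(
            (r for r in self._records.values() if r.updated_seq > since),
            key=lambda r: r.updated_seq,
        )
        return updates, self._seq
```

A client stores the returned checkpoint and passes it as `since` on the next fetch, so the cost of each fetch scales with the number of changes rather than the size of the whole list.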

aokolish changed the title ~~[py-tx] TAT~~ [py-tx] Cannot fetch TAT hashes more than once Oct 24, 2024

Dcallies changed the title ~~[py-tx] Cannot fetch TAT hashes more than once~~ [py-tx][tat] TAT API implementation doesn't work correctly with HMA Oct 28, 2024
