[py-tx][tat] TAT API implementation doesn't work correctly with HMA #1668

Open
aokolish opened this issue Oct 24, 2024 · 2 comments

Comments


aokolish commented Oct 24, 2024

I'm testing out hasher-matcher-actioner (HMA) with the Tech Against Terrorism (TAT) API and noticed this log line from the server:

INFO in fetcher: TAT_HASHES[tat] Is a NoCheckpointing class, which hopefully is a test type, and we have a checkpoint. Considering complete

This leads me to believe that HMA would never pull new hashes from the TAT API. Can this somehow be fixed in the TAT exchange implementation?

Here are their API docs - https://www.terrorismanalytics.org/docs/hash-list-v1

There are probably easier steps to reproduce, but my steps were...

  • deploy HMA
  • configure the TAT exchange with prod credentials
  • run HMA fetcher
  • check logs for this message
@aokolish aokolish changed the title [py-tx] TAT [py-tx] Cannot fetch TAT hashes more than once Oct 24, 2024
Contributor

Bruce-Pouncey-TAT commented Oct 24, 2024

Hi @aokolish
Bruce here from TAT, perhaps I can be of assistance.

Our hash list API delivers a JSON file which is updated on a nightly basis as you have seen in our documentation.
In this scenario, a checkpoint would be difficult to keep track of, as the list is delivered in full every time via a single request.

In the current implementation, running threatexchange fetch downloads the entire hash list again, including any new hashes that were not part of the previous fetch. We want the system to treat the list as stale on every fetch.
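To make "treat the list as stale on every fetch" concrete, here is a minimal sketch of a checkpoint that only remembers when the full list was last downloaded. The class name, field, and `max_age` parameter are hypothetical and are not python-threatexchange's actual checkpoint API; they only illustrate the behavior described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class FullListCheckpoint:
    """Hypothetical checkpoint: records only when the full list was last fetched."""

    last_fetch_time: datetime

    def is_stale(
        self,
        now: Optional[datetime] = None,
        max_age: timedelta = timedelta(0),
    ) -> bool:
        # With the default max_age of zero, every later fetch sees a stale list.
        # A caller could pass timedelta(days=1) to match a nightly update cycle.
        now = now or datetime.now(timezone.utc)
        return now - self.last_fetch_time >= max_age


checkpoint = FullListCheckpoint(datetime(2024, 10, 24, tzinfo=timezone.utc))
one_hour_later = datetime(2024, 10, 24, 1, tzinfo=timezone.utc)
always_refetch = checkpoint.is_stale(now=one_hour_later)
nightly_refetch = checkpoint.is_stale(now=one_hour_later, max_age=timedelta(days=1))
```

With the default, `always_refetch` is true; with a one-day `max_age`, the same checkpoint is not yet stale an hour after the fetch.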

I'm happy to go over this further to help you find a solution, or if there is something I'm misunderstanding.

Contributor

Dcallies commented Oct 28, 2024

Hey @aokolish, @Bruce-Pouncey-TAT - I looked into this briefly, and this looks like missing functionality in HMA, which currently can't handle APIs like TAT's that don't provide deltas but whose contents still change over time. Since there is no efficient way to discover updated records with this kind of API, we'd need to write something new in HMA that loads all the previously downloaded TAT records into memory and then computes the diff itself (removals especially). We could also force-clear all the hashes on every fetch, but as the number of hashes grows over time, this creates weird inconsistencies in the database (hashes disappearing from and reappearing in the index) that might have real production impact.

This feature doesn't exist today, and so by default TAT isn't correctly supported by HMA, hence the logs that Alex is seeing.
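As a rough illustration of the snapshot diff described above, the sketch below compares the previously stored records with a freshly downloaded full list and reports additions, changes, and removals. It assumes records can be keyed by their hash string and compared by value; `SnapshotDiff` and `diff_snapshots` are hypothetical names, not existing HMA or python-threatexchange functions.

```python
from typing import Dict, NamedTuple, Set


class SnapshotDiff(NamedTuple):
    added: Dict[str, dict]    # in the new list only
    changed: Dict[str, dict]  # in both lists, but with different metadata
    removed: Set[str]         # in the stored list only


def diff_snapshots(
    previous: Dict[str, dict], current: Dict[str, dict]
) -> SnapshotDiff:
    """Compare two full hash lists, each keyed by hash string."""
    added = {h: rec for h, rec in current.items() if h not in previous}
    changed = {
        h: rec
        for h, rec in current.items()
        if h in previous and previous[h] != rec
    }
    removed = set(previous) - set(current)
    return SnapshotDiff(added, changed, removed)


stored = {"aaa": {"type": "pdq"}, "bbb": {"type": "pdq"}}
fetched = {"bbb": {"type": "md5"}, "ccc": {"type": "pdq"}}
diff = diff_snapshots(stored, fetched)
```

Applying only `diff.added`, `diff.changed`, and `diff.removed` to the database avoids the clear-and-reload approach, at the cost of holding both lists in memory during the comparison.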

To evaluate the potential solutions:

  1. @Bruce-Pouncey-TAT - my top recommendation is to switch the Tech Against Terrorism side to a delta-based API, like those of NCMEC, StopNCII, and ThreatExchange. In the long term, your users will thank you, since with a full-list API the cost of keeping a correct copy grows with the size of your database. It's not too hard as long as you are storing hashes in a backing database. I have implemented multiple versions of this type of API, and helped other programs make this same jump. This is by far the easiest solution.
  2. If for whatever reason TAT can't update their API, I can more fully describe the implementation outlined in my opening paragraph and add it to an issue for someone to attempt.
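For option 1, the general shape of a delta-based feed can be sketched as follows. This is an in-memory toy under stated assumptions, not how NCMEC, StopNCII, or ThreatExchange actually implement their APIs: every write gets an increasing sequence number, deletions are kept as tombstones, and a client asks only for changes after its last checkpoint. All names (`DeltaFeed`, `Change`, `changes_since`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Change:
    seq: int         # position in the update stream
    hash_value: str
    deleted: bool    # True marks a tombstone for a removed hash


class DeltaFeed:
    """Toy server side: tracks the latest change per hash."""

    def __init__(self) -> None:
        self._seq = 0
        self._latest: Dict[str, Change] = {}

    def upsert(self, hash_value: str) -> None:
        self._record(hash_value, deleted=False)

    def delete(self, hash_value: str) -> None:
        self._record(hash_value, deleted=True)

    def _record(self, hash_value: str, deleted: bool) -> None:
        self._seq += 1
        self._latest[hash_value] = Change(self._seq, hash_value, deleted)

    def changes_since(
        self, checkpoint: int, limit: int = 100
    ) -> Tuple[List[Change], int]:
        """Return up to `limit` changes after `checkpoint`, plus the next checkpoint."""
        pending = sorted(
            (c for c in self._latest.values() if c.seq > checkpoint),
            key=lambda c: c.seq,
        )
        page = pending[:limit]
        next_checkpoint = page[-1].seq if page else checkpoint
        return page, next_checkpoint


feed = DeltaFeed()
feed.upsert("aaa")
feed.upsert("bbb")
first_page, checkpoint = feed.changes_since(0)
feed.delete("aaa")
second_page, checkpoint = feed.changes_since(checkpoint)
```

After the first fetch, the client only receives the single tombstone for "aaa" instead of the whole list, so the cost of staying in sync tracks the number of changes rather than the size of the database.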

@Dcallies Dcallies changed the title [py-tx] Cannot fetch TAT hashes more than once [py-tx][tat] TAT API implementation doesn't work correctly with HMA Oct 28, 2024
3 participants