
Current __hash__ implementation of OOI is broken #4000

Open · originalsouth opened this issue Dec 31, 2024 · 11 comments
Labels: bug (Something isn't working), discussion, octopoes (Issues related to octopoes)

Comments

@originalsouth (Contributor)

The current __hash__ implementation of OOIs:

def __hash__(self):
    return hash(self.primary_key)

is broken because it considers only the primary key, meaning that OOIs that differ in fields not recorded in the primary key are erroneously deemed to be the same object, causing collisions in Python's built-in hash-dependent structures.

Since OOIs are based on Pydantic's BaseModel, we can easily generate a dict of the object using model_dump. Assuming this is the best object on which to base our __hash__ implementation, the question becomes how best to hash a dict (as Python still has no native way to do this).

The natural question arises: why not hash model_dump_json? Because there is no guarantee that it is stable; see pydantic/pydantic#10343.
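For context, a minimal illustration (not part of the benchmark below) of why a dict needs special treatment before hashing at all:

```python
# Dicts are mutable, so Python refuses to hash them directly;
# any dict-based __hash__ needs a canonicalization step first.
d = {"a": 1, "b": 2}

try:
    hash(d)
except TypeError as exc:
    print(f"TypeError: {exc}")

# Freezing into an immutable, key-sorted tuple makes it hashable,
# independent of insertion order:
frozen = tuple(sorted(d.items()))
print(hash(frozen) == hash(tuple(sorted({"b": 2, "a": 1}.items()))))  # True
```

This only works for flat dicts with hashable scalar values; nested lists and dicts need the recursive treatment the algorithms below implement.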

Hence, here I compare two algorithms with benchmarks:

#!/usr/bin/env python

import random
import string
import time
from collections.abc import Iterable
from typing import Any

import jcs  # third-party package implementing RFC 8785 (JSON Canonicalization Scheme)

N = 8
MAX_DEPTH = 4
K = 10**6


def random_string():
    return ''.join(random.choice(string.ascii_letters) for _ in range(random.randint(1, N)))


def random_obj(depth: int = 0, k: int = 7) -> Any:
    if depth < MAX_DEPTH:
        rand_choice = random.randint(0, k)
        if rand_choice == 0:
            return random.random()  # float
        elif rand_choice == 1:
            return random.randint(0, 10**6)  # int
        elif rand_choice == 2:
            return random_string()  # string
        elif rand_choice == 3:
            return [random_obj(depth + 1, 2) for _ in range(random.randint(1, N))]  # list
        elif rand_choice == 4:
            return [random_obj(depth + 1, 2) for _ in range(random.randint(1, N))]  # list
            # Non JSON compatible so broken for hasher_2 but hasher_1 digests it
            # return {random_obj(depth + 1, 2) for _ in range(random.randint(1, N))}  # set
        else:
            return {random_string(): random_obj(depth + 1) for _ in range(random.randint(1, N))}  # dict[str, Any]
            # Non JSON compatible so broken for hasher_2 but hasher_1 digests it
            # return {random_obj(depth + 1, 2): random_obj(depth + 1) for _ in range(random.randint(1, N))}  # dict[Any, Any]
    else:
        return random_string()


targets = [random_obj() for _ in range(K)]


def hasher_1(obj: Any) -> int:
    def freeze(obj: Iterable[Any | Iterable[Any]]) -> Iterable[int]:
        if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
            if isinstance(obj, dict):
                for key, value in obj.items():
                    yield hash(key)
                    yield from freeze(value)
            else:
                for item in obj:
                    yield from freeze(item)
        else:
            yield hash(obj)
    return hash(tuple(freeze(obj)))


def hasher_2(obj: Any) -> int:
    return hash(jcs.canonicalize(obj))


st = time.time_ns()
_ = list(map(hasher_1, targets))
dt = time.time_ns() - st

print(f"hasher_1: {dt / 10**9 / K}s")


st = time.time_ns()
_ = list(map(hasher_2, targets))
dt = time.time_ns() - st

print(f"hasher_2: {dt / 10**9 / K}s")

Resulting in:

hasher_1: 2.213041571e-05s
hasher_2: 3.159127834e-05s

Personally, I would opt for hasher_1 as it is more flexible and faster, but hasher_2 is easier to maintain; I am also open to other suggestions.
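A third possibility worth noting (a sketch only, under the assumption that repr() is stable for the scalar types involved): sidestep the seeded built-in hash() entirely by feeding a canonical traversal into blake2b, which gives the same value across processes and platforms regardless of PYTHONHASHSEED.

```python
import hashlib
from typing import Any


def stable_hash(obj: Any) -> int:
    """Order-insensitive, process-stable hash sketch: dict keys are sorted
    and bytes are digested with blake2b instead of the seeded built-in hash()."""
    h = hashlib.blake2b(digest_size=8)

    def feed(o: Any) -> None:
        if isinstance(o, dict):
            h.update(b"{")
            for key in sorted(o, key=repr):  # canonical key order
                feed(key)
                feed(o[key])
        elif isinstance(o, (set, frozenset)):
            h.update(b"s")
            for item in sorted(o, key=repr):  # canonical element order
                feed(item)
        elif isinstance(o, (list, tuple)):
            h.update(b"[")
            for item in o:  # sequences stay order-sensitive
                feed(item)
        else:
            h.update(repr(o).encode())  # scalars: assumes repr is stable

    feed(obj)
    return int.from_bytes(h.digest(), "big")


print(stable_hash({"a": 1, "b": 2}) == stable_hash({"b": 2, "a": 1}))  # True
```

Like hasher_1 it digests sets and non-string keys, but like hasher_2 its output does not depend on the interpreter's hash randomization.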

So how do we proceed to solve this problem?

@originalsouth added the bug, discussion, and octopoes labels Dec 31, 2024
@github-project-automation github-project-automation bot moved this to Incoming features / Need assessment in KAT Dec 31, 2024
@originalsouth originalsouth moved this from Incoming features / Need assessment to To be discussed in KAT Dec 31, 2024
@originalsouth (Contributor Author)

This ticket is related to #3808 and has to be solved either in that branch or before that branch is merged.

@originalsouth (Contributor Author)

originalsouth commented Jan 1, 2025

Also we'll have to consider whether we want for:

d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}

to have different hashes (i.e., hash(d1) == hash(d2) or hash(d1) != hash(d2)), given that Python dicts preserve insertion order.
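A small sketch makes the two options concrete (ordered_hash and sorted_hash are illustrative names, not proposals); note that d1 == d2 already holds, which matters for Python's hash contract:

```python
d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}  # same mapping, different insertion order

def ordered_hash(d: dict) -> int:
    # insertion-order traversal, as hasher_1 above does for dicts
    return hash(tuple(d.items()))

def sorted_hash(d: dict) -> int:
    # canonical key order first, as a JCS-style approach does
    return hash(tuple(sorted(d.items())))

print(d1 == d2)                              # True: dict equality ignores order
print(ordered_hash(d1) == ordered_hash(d2))  # almost certainly False: order leaks in
print(sorted_hash(d1) == sorted_hash(d2))    # True: order canonicalized away
```

Since equal objects are required to have equal hashes, an order-sensitive hash is only valid if the accompanying __eq__ is made order-sensitive as well.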

@noamblitz (Contributor)

Also we'll have to consider whether we want for:

d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}

to have different hashes (ie. hash(d1) == hash(d2) or hash(d1) != hash(d2)). (this because python dicts are ordered)

My gut tells me we don't want them to be equal. Let's say we have a nibble that creates a finding when two URLs are similar; we would want the finding to exist on both URLs.

@originalsouth (Contributor Author)

originalsouth commented Jan 1, 2025

Right (although I don't quite follow your reasoning...), but this seems to play counter-intuitively with Pydantic; Pydantic seems to work accordingly.

@jpbruinsslot (Contributor)

Started an RFD for this issue here: #4004

@jpbruinsslot (Contributor)

Would a subset of OOI fields be considered unique/similar enough to hash? I could imagine that an object still needs to be considered identical when, for instance, a field contains a timestamp (I don't know if this is a likely scenario, though).

@originalsouth (Contributor Author)

@underdarknl wrote in #4004

I'd like a little more explanation on why we chose to override the default hashing and implement hashing only on the primary-key fields: Because we cannot have two different objects in the graph with the same primary key, it makes sense to also have the rest of the codebase 'think' they are the same by hashing only those fields that would lead to different objects in the graph.

This makes sense from an XTDB perspective. From an Octopoes perspective, however, objects with the same primary key could differ there as attributes are mutated by Octopoes.

I'd also like a little more info on why and where this is problematic for the python code itself if we disregard the notion that we can only keep one version of two objects in the database If their primary key is the same.

It is only a performance impact as long as we ensure that the objects are unequal. The built-in Python structures are able to deal with collisions at a performance cost. It is, however, bad practice to have very similar objects share the same hash, as we cannot guarantee that every implementation we use will handle such collisions properly.

Furthermore a bit more details on why this specifically is a problem in the context of dealing with Objects in nibbles where we want to 'see' changes on any field that is used/consumed/read/queried by the nibble, regardless of what the primary key's values are. For example, having TTL for a DNSRecord in the primary key is useless for our Graph, but you could still envision a nibble that alerts users of DNSRecords with TTL's that are either too long, or too short. Not being able to see TTL changes (because from the outside the hash of the ooi stayed the same) would mean we only get to run the Nibble once, which is obviously not what you'd want.

In particular, for nibbles the problem is twofold:
1. Nibbles xxhash OOIs for recalculation; the underlying serialization (necessary for the resolution of this problem) must be reliable for hashing.
2. Nibbles are written with uniqueness of objects in mind and use dicts and sets as such; they will likely suffer a performance impact, both because the data structures get slower and because nibbles will run/rerun unexpectedly.

n.b. A test nibble doing exactly this (min / max on TTL based on a config) would be a great way of demonstrating this, and adding functionality.

I will see if I can make such a test; in principle, the hashing as done now is stable, see:

def test_nibbles_update(xtdb_octopoes_service: OctopoesService, event_manager: Mock, valid_time: datetime):
    xtdb_octopoes_service.nibbler.nibbles = {find_network_url_nibble.id: find_network_url_nibble}
    network1 = Network(name="internetverbinding")
    xtdb_octopoes_service.ooi_repository.save(network1, valid_time)
    event_manager.complete_process_events(xtdb_octopoes_service)
    network2 = Network(name="internet")
    xtdb_octopoes_service.ooi_repository.save(network2, valid_time)
    event_manager.complete_process_events(xtdb_octopoes_service)
    url1 = URL(network=network1.reference, raw="https://potato.ls/")
    xtdb_octopoes_service.ooi_repository.save(url1, valid_time)
    event_manager.complete_process_events(xtdb_octopoes_service)
    url2 = URL(network=network2.reference, raw="https://mispo.es/")
    xtdb_octopoes_service.ooi_repository.save(url2, valid_time)
    event_manager.complete_process_events(xtdb_octopoes_service)
    assert xtdb_octopoes_service.ooi_repository.list_oois({Finding}, valid_time).count == 1
    assert xtdb_octopoes_service.ooi_repository.list_oois({KATFindingType}, valid_time).count == 0
    find_network_url_nibble_v2 = find_network_url_nibble.model_copy(deep=True)
    find_network_url_nibble_v2._payload = getattr(sys.modules[__name__], "find_network_url_v2")
    find_network_url_nibble_v2._checksum = "deadbeef"
    xtdb_octopoes_service.nibbler.update_nibbles(
        valid_time, {find_network_url_nibble_v2.id: find_network_url_nibble_v2}
    )
    event_manager.complete_process_events(xtdb_octopoes_service)
    assert xtdb_octopoes_service.ooi_repository.list_oois({Finding}, valid_time).count == 1
    assert xtdb_octopoes_service.ooi_repository.list_oois({KATFindingType}, valid_time).count == 1

but it has been justly questioned whether this is true for all cases on all platforms; see:
https://github.com/minvws/nl-kat-coordination/pull/3808/files/c4c561d469d88cb9f75cba403ebded744925e81f#r1893873228

Hope that satisfies at least part of your requests.

@underdarknl (Contributor)

Would a subset of ooi fields be considered unique/similar enough to hash? I could imagine that an object still needs to be considered identical when for instance a field contains a timestamp (don't know if this is a likely scenario though).

It depends on the use case. For uniqueness in the graph, the primary-key fields are enough. For detecting field-value changes when dealing with nibbles, the minimal set of fields that we hash could be based on the full set of fields possibly accessed and/or processed in that specific nibble.
This leads us to two (or more) separate hashing needs, unless we find a way to cover them all with a single hash; however, we'd need to do this without triggering unneeded false positives in situations where we were not interested in all those extra fields being in our hash.
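A per-nibble subset hash along those lines could be sketched as follows (stdlib only; the field names and the precomputed per-nibble field set are hypothetical, not an existing Octopoes API, and the dict stands in for a model_dump() result):

```python
import hashlib
import json
from typing import Any


def subset_hash(dumped: dict[str, Any], fields: set[str]) -> int:
    """Hash only the fields a given nibble reads, so changes to unrelated
    attributes do not retrigger it. `dumped` stands in for an OOI's
    model_dump() output; `fields` is the set of fields the nibble consumes."""
    subset = {k: dumped[k] for k in fields if k in dumped}
    canonical = json.dumps(subset, sort_keys=True, default=str)
    digest = hashlib.blake2b(canonical.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


# Hypothetical DNS record and a nibble that only inspects hostname + TTL:
record = {"hostname": "mispo.es", "ttl": 300, "value": "1.2.3.4"}
ttl_nibble_fields = {"hostname", "ttl"}

h1 = subset_hash(record, ttl_nibble_fields)
h2 = subset_hash({**record, "value": "5.6.7.8"}, ttl_nibble_fields)
h3 = subset_hash({**record, "ttl": 60}, ttl_nibble_fields)
print(h1 == h2)  # True: the nibble does not read `value`
print(h1 == h3)  # False: the nibble does read `ttl`
```

Each nibble would then carry its own hash over exactly the fields it consumes, which avoids the false positives a single all-fields hash would produce.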

@originalsouth (Contributor Author)

It depends on the usecase. For Uniqueness in the graph, the primary-key fields are enough. For detecting field-value changes when dealing with nibbles, the minimal set of fields that we could hash could be based on the full set of fields possibly accessed and or processes in that specific nibble.

As mentioned before, I think it is important to separate XTDB and Octopoes cases.

For the XTDB case, primary keys are sufficient.
For the Octopoes case (i.e., the Python side of the story), which mutates OOIs, this is not so a priori.

Or to put it in other words: if one puts two OOIs with the same primary key/hash in a set/dict, what would be the desired behavior? Everything else has to be implemented accordingly.

One can use this code to fool around with the possibilities, by changing the __hash__/__eq__ functions accordingly:

#!/usr/bin/env python

import random
from typing import Any

from pydantic import BaseModel


class Potato(BaseModel):
    potato_properties: dict[str, Any]

    # Deliberately pathological: a random __hash__ and an always-True __eq__
    # make it easy to observe how sets and dicts behave under mismatched
    # __hash__/__eq__ pairs.
    def __hash__(self) -> int:
        return random.randint(0, 0xFFFFFFFF)

    def __eq__(self, other: object) -> bool:
        return True


potato1 = Potato(potato_properties={"type": "piper"})
potato2 = Potato(potato_properties={"type": "king-edward"})
potato3 = Potato(potato_properties={"type": "kennebec"})
potato4 = Potato(potato_properties={"type": "piper"})

s = set([potato1, potato2, potato3, potato4])
d = {potato1: "fried", potato2: "baked", potato3: "boiled", potato4: "fried"}

print(s)
print(d)

Note that there are possibly other routines that use __hash__ under the hood, which we might use and should take note of, like ChainMaps, caches, Counters, unions and the like.
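To make that concrete: any hash-based stdlib container will silently collapse same-primary-key objects once both __hash__ and __eq__ key on the primary key alone (Item and its fields are illustrative stand-ins, not the OOI model):

```python
from collections import Counter


class Item:
    """Illustrative stand-in for an OOI: hash and equality key on pk only."""

    def __init__(self, pk: str, payload: str) -> None:
        self.pk = pk
        self.payload = payload

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Item) and self.pk == other.pk

    def __hash__(self) -> int:
        return hash(self.pk)


a = Item("Network|internet", "v1")
b = Item("Network|internet", "v2")  # same pk, different payload

print(len({a, b}))      # 1: the set silently drops one version
counts = Counter([a, b])
print(len(counts))      # 1: Counter merges both into a single key
print(counts[a])        # 2
```

The same collapsing happens in dict keys, frozensets, functools caches keyed on the object, and set unions, so the choice of __hash__/__eq__ semantics propagates much further than explicit set() calls.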

@dekkers (Contributor)

dekkers commented Jan 3, 2025

Did everyone read the definition of hash, __hash__, and hashable? If not, please do so, because this is crucial.

The only purpose of the hash is to provide an integer that can be used by hash tables. This is necessary for hash-table-based data structures such as sets and dictionaries. Defining __hash__ on a class that also defines __eq__ makes it possible to use instances of such a class in sets and as keys in a dictionary.

The first requirement of __hash__ is that the return value must be equal for objects that compare equal. That the hashes of two objects are equal does not mean the objects compare equal; it is possible for two different objects to have the same hash value. This isn't a problem because hash tables need to deal with collisions anyway. If it happens too often, it might result in performance problems, or even a DoS attack if an attacker can cause a lot of collisions.

The other requirement is that the hash value of an object must never change. This is logical given that it is used to determine the hash table bucket of the object; if the hash changed, the object would end up in the wrong bucket of the hash table. This is also the reason a mutable dictionary doesn't have a hash and can't be used in sets or as a dictionary key: there is no way to satisfy both of these requirements.
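The wrong-bucket effect can be demonstrated directly (Box is an illustrative class that deliberately violates the never-changes requirement):

```python
class Box:
    """Deliberately broken: the hash depends on mutable state."""

    def __init__(self, value: int) -> None:
        self.value = value

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Box) and self.value == other.value

    def __hash__(self) -> int:
        return hash(self.value)


b = Box(1)
d = {b: "here"}     # stored in the bucket for hash(1)
b.value = 2         # hash changes: b now "lives" in the wrong bucket
print(b in d)       # False: lookup probes the bucket for hash(2)
print(b in list(d)) # True: a linear scan still finds it by identity
```

This is exactly why mutating primary-key attributes of an OOI that already sits in a set or dict would corrupt lookups.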

So is our current __hash__ implementation broken, given these requirements? As far as I can see, this doesn't seem to be the case. We don't really support mutating OOI Python objects, at least not changing attributes that are part of the primary key, because the primary_key attribute is only computed once in model_post_init. This does not seem to break any of Python's requirements for __hash__. It does give some surprising results if you do mutate attributes that are used in the primary key:

In [2]: n1 = Network(name="test")

In [3]: n2 = Network(name="test")

In [4]: n3 = Network(name="different")

In [5]: n1 == n2
Out[5]: True

In [6]: n1 == n3
Out[6]: False

In [7]: n2.name = "different"

In [8]: n1 == n2
Out[8]: False

In [9]: n2 == n3
Out[9]: False

In [10]: s = set() ; s.add(n1); s.add(n2) ; s.add(n3)

In [11]: s
Out[11]:
{Network(object_type='Network', scan_profile=None, user_id=None, primary_key='Network|different', name='different'),
 Network(object_type='Network', scan_profile=None, user_id=None, primary_key='Network|test', name='different'),
 Network(object_type='Network', scan_profile=None, user_id=None, primary_key='Network|test', name='test')}

I think in most places we create new Python objects, so this doesn't result in problems. It would have been better if those attributes were immutable to prevent mistakes, but I don't think it is worth spending time on changing that, given that we want to move to XTDB 2.

Note that the Django ORM does something similar to what we do: it defines __hash__ as the hash of the primary key. The difference is that Django also defines an __eq__ method that considers objects of the same type with the same primary key to be equal.
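That Django-style pairing can be sketched like this (illustrative names, not a drop-in for the OOI classes); note how a hash collision between different types is harmless because __eq__ disambiguates:

```python
class PKModel:
    """Illustrative Django-style base: equality and hash both key on pk."""

    def __init__(self, pk: str) -> None:
        self.pk = pk

    def __eq__(self, other: object) -> bool:
        # same concrete type and same primary key => equal
        return type(self) is type(other) and self.pk == other.pk

    def __hash__(self) -> int:
        return hash(self.pk)


class Network(PKModel): ...


class Host(PKModel): ...


print(Network("Network|test") == Network("Network|test"))  # True
print(Network("x") == Host("x"))                           # False: types differ
print(len({Network("x"), Host("x")}))                      # 2: same hash, unequal objects
```

The last line shows the hash table resolving a collision through __eq__, which is the behavior dekkers describes: equal hashes never merge objects that compare unequal.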

@originalsouth (Contributor Author)

@dekkers This isn't an XTDB1/XTDB2 issue. The point is that Octopoes can generate OOIs with the same PK but with other attributes. While we have a way of dealing with that in XTDB (by "adding the fields"), in the Octopoes/Python domain these are distinct objects that now have the same hash. This is wrong, and that is what this issue is about, including how to deal with it.
