-
Notifications
You must be signed in to change notification settings - Fork 58
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add initial rfd, adapted from issue #4000
- Loading branch information
1 parent
b803415
commit d3acd14
Showing
1 changed file
with
115 additions
and
12 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,22 +1,125 @@ | ||
--- | ||
authors: { author1 } <{ email }>, { author2 } ({ github username }) | ||
state: { draft|discussion|published|implemented|abandoned } | ||
discussion: http://github.com/nl-kat-coordination/rfd/pull/{number} | ||
labels: { comma, separated, list } | ||
authors: @originalsouth | ||
state: draft | ||
discussion: https://github.com/minvws/nl-kat-coordination/pull/4004 | ||
labels: hashing, pydantic, ooi | ||
--- | ||
|
||
# RFD {number}: {title} | ||
# RFD 0003: OOI Hashing | ||
|
||
## Introduction | ||
## Background | ||
|
||
{ Introduction to the RFD } | ||
The current `__hash__` implementation of OOI's: | ||
|
||
https://github.com/minvws/nl-kat-coordination/blob/8730e188e9dad6276a363aaeaead8fc1deb82ac9/octopoes/octopoes/models/__init__.py#L242-L243 | ||
|
||
is broken because it only considers the primary key; meaning that OOI's with | ||
fields _not_ recorded in the primary key are erroneously deemed to be the same | ||
objects, causing Python's built-in hash dependent structures to find collapses. | ||
|
||
## Proposal | ||
|
||
{ The proposal section should contain the main content of the RFD. This is | ||
where the idea is explained in detail. } | ||
Since we are dealing with OOI based on Pydantic BaseModel's we can easily | ||
generate a dict of the object using `model_dump`. Assuming that this is the | ||
best object to start to begin our `__hash__` implementation on, the question | ||
becomes how to best hash a dict (as python still hasn't figured out how to do | ||
this natively). | ||
|
||
The natural question arises why not hash `model_dump_json`? Because there is no | ||
guarantee it is stable see | ||
https://github.com/pydantic/pydantic/discussions/10343. | ||
|
||
## Evaluation | ||
|
||
Hence, here I compare two algorithms with benchmarks: | ||
|
||
```python | ||
#!/usr/bin/env python | ||
|
||
from typing import Iterable, Any | ||
import jcs | ||
import random | ||
import string | ||
import time | ||
|
||
N = 8 | ||
MAX_DEPTH = 4 | ||
K = 10**6 | ||
|
||
|
||
def random_string(): | ||
return ''.join(random.choice(string.ascii_letters) for _ in range(random.randint(1, N))) | ||
|
||
|
||
def random_obj(depth: int = 0, k: int = 7) -> Any: | ||
if depth < MAX_DEPTH: | ||
rand_choice = random.randint(0, k) | ||
if rand_choice == 0: | ||
return random.random() # float | ||
elif rand_choice == 1: | ||
return random.randint(0, 10**6) # int | ||
elif rand_choice == 2: | ||
return random_string() # string | ||
elif rand_choice == 3: | ||
return [random_obj(depth + 1, 2) for _ in range(random.randint(1, N))] # list | ||
elif rand_choice == 4: | ||
return [random_obj(depth + 1, 2) for _ in range(random.randint(1, N))] # list | ||
# Non JSON compatible so broken for hasher_2 but hasher_1 digests it | ||
# return {random_obj(depth + 1, 2) for _ in range(random.randint(1, N))} # set | ||
else: | ||
return {random_string(): random_obj(depth + 1) for _ in range(random.randint(1, N))} # dict[str, Any] | ||
# Non JSON compatible so broken for hasher_2 but hasher_1 digests it | ||
# return {random_obj(depth + 1, 2): random_obj(depth + 1) for _ in range(random.randint(1, N))} # dict[Any, Any] | ||
else: | ||
return random_string() | ||
|
||
|
||
targets = [random_obj() for _ in range(K)] | ||
|
||
|
||
def hasher_1(obj: Any) -> int: | ||
def freeze(obj: Iterable[Any | Iterable[Any]]) -> Iterable[int]: | ||
if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)): | ||
if isinstance(obj, dict): | ||
for key, value in obj.items(): | ||
yield hash(key) | ||
yield from freeze(value) | ||
else: | ||
for item in obj: | ||
yield from freeze(item) | ||
else: | ||
yield hash(obj) | ||
return hash(tuple(freeze(obj))) | ||
|
||
|
||
def hasher_2(obj: Any) -> int: | ||
return hash(jcs.canonicalize(obj)) | ||
|
||
|
||
st = time.time_ns() | ||
_ = list(map(hasher_1, targets)) | ||
dt = time.time_ns() - st | ||
|
||
print(f"hasher_1: {dt / 10**9 / K}s") | ||
|
||
|
||
st = time.time_ns() | ||
_ = list(map(hasher_2, targets)) | ||
dt = time.time_ns() - st | ||
|
||
print(f"hasher_2: {dt / 10**9 / K}s") | ||
``` | ||
|
||
Resulting in: | ||
|
||
``` | ||
hasher_1: 2.213041571e-05s | ||
hasher_2: 3.159127834e-05s | ||
``` | ||
|
||
## Determinations | ||
|
||
## Footnotes | ||
Personally, I would opt for `hasher_1` as it more flexible and faster, but | ||
`hasher_2` is easier to maintain; also open to other suggestions. | ||
|
||
{ Footnotes should be used to provide references to external sources or to | ||
provide additional information. } | ||
So how do we proceed to solve this problem? |