Fix point hash to be well distributed. #76153
Conversation
is it possible to backport both of these to 0.H? |
Yep, easily. |
i'll do it then |
hmm, |
Yeah we don't need to backport #76009, this is enough for the performance benefit. |
It's ironic that we apparently have a test for the hasher that is now broken. |
It's populating a vector with all the hash results for the coordinates from (-300,-300) to (300,300), uniquifying them, then asserting that there are more than 0.9 * total_values_generated unique values. It might be interesting to do something like count the number of instances that collide with each other instead, but I don't have any really good grounding here. The other complication is that a given std::unordered_map instance isn't necessarily going to even look at all the bits of the returned hash, so I'm not sure that evaluating the whole number returned from the hash is even meaningful? |
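For concreteness, a minimal sketch of that test as described (the point stand-in and the hasher parameter are illustrative, not the actual test code):

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct point { int x, y; }; // stand-in for the game's point type

// Hash every coordinate in the grid, deduplicate, and require that more than
// 90% of the generated values were unique.
template <typename Hash>
void check_hash_uniqueness( Hash hasher ) {
    std::vector<std::size_t> hashes;
    for( int x = -300; x <= 300; ++x ) {
        for( int y = -300; y <= 300; ++y ) {
            hashes.push_back( hasher( point{ x, y } ) );
        }
    }
    const std::size_t total = hashes.size();
    std::sort( hashes.begin(), hashes.end() );
    hashes.erase( std::unique( hashes.begin(), hashes.end() ), hashes.end() );
    assert( hashes.size() > 0.9 * total );
}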
Thanks for that context. I did some more thinking and I'm even more bothered by the point hasher. The output is functionally dominated by the y component, which actually touches all the bits of the hash. The x component touches at most the lower 32 bits (maybe 33 if it overflows), but especially in this map it only touches the lower 7-8 bits, so the hash is functionally mostly just |
Given that
We could use something like the following to measure the quality of the hash function. I would at least like to approach optimality (badness -> 0) for the case of bubble coordinates (or, to future-proof, double the current bubble coordinates and 10 bits for z-levels). https://artificial-mind.net/blog/2021/10/09/unordered-map-badness

template <class Map>
double unordered_map_badness(Map const& map)
{
    auto const lambda = map.size() / double(map.bucket_count());
    auto cost = 0.;
    for (auto const& [k, _] : map)
        cost += map.bucket_size(map.bucket(k));
    cost /= map.size();
    return std::max(0., cost / (1 + lambda) - 1);
} |
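For instance, a usage sketch (some_point_hash is a placeholder for whichever hasher is being evaluated):

std::unordered_map<point, int, some_point_hash> m;
for( int x = -300; x <= 300; ++x ) {
    for( int y = -300; y <= 300; ++y ) {
        m.emplace( point( x, y ), 0 );
    }
}
double badness = unordered_map_badness( m ); // ~0.0 is near-ideal; larger means clustering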
On VS and for libc++ at least, the internal map from hash to index is something like
As a side note, do we use libc++? And if we do, are we using power-of-2 or prime-number mode? |
That is wild, considering (I thought) that an ideal hash should have negligible correlation between input bits and output bits. Changing one bit in the input 'should' change half the output bits. I just checked msvc and it's using fnv hash for all trivial types including integers, but indeed godbolt asserts gcc/clang (with their default settings, which I guess means linking libstdc++) do not. We don't do any standard library shenanigans, but idk which we are linking against for releases. Probably not llvm except on mac builds. |
Oh hmm, I checked running the previous interleave-based value through std::hash and yep it had exactly the same results as without hashing. |
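A quick way to see that identity behavior directly (libstdc++'s std::hash for integral types returns the value unchanged, so hashing an already-combined key is a no-op there; assumes a 64-bit size_t):

#include <cstdint>
#include <cstdio>
#include <functional>

int main() {
    const uint64_t v = 0x123456789abcdef0ULL;
    // Prints "identity: yes" on libstdc++; msvc's fnv-based hash prints "no".
    printf( "identity: %s\n", std::hash<uint64_t> {}( v ) == v ? "yes" : "no" );
}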
I've been drafting a test that tries to characterize unordered map badness, but with much less refinement; I'll see about switching to that snippet. |
I will grab an OSS fnv hash, which is what msvc already uses. This then solves tripoint as well. |
In practice, the current state of the PR that just concatenates x and y (and runs it through the hash function that may not do anything) is getting a 0.0 on that badness evaluation when I run the coordinates { -300, -300 } to { 300, 300 } through it. I'm not against chasing down further improvements, but this is fine, actually, and I need to double check but I think it's actually collision-free in the positive quadrant.
I'm building on a kind of ancient gcc 9.4.0. So the current lead is this xor-shift; it's also much faster than FNV since it's whole-integer operations instead of a bunch of individual octet operations.
For reference and review, this is my FNV
EDIT |
After sorting out the issues in the FNV algorithm around unpacking the members into octets, it's scoring a 0.0 on the badness algorithm with the map-scale test. |
Then, switching from MSVC std::hash to using |
That's library specific then, because I'm seeing results with multiple entries per bucket. I'll switch to map and see how it goes though.
This is the same symptom that was appearing earlier with concatenation: sign extension of the y member was stomping over the data from the x member. |
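A minimal illustration of that stomping (values are arbitrary, not code from the PR):

#include <cstdint>
#include <cstdio>

int main() {
    const int x = 5, y = -1;
    // Buggy concatenation: y sign-extends to all ones and ORs over the x half.
    const uint64_t bad  = static_cast<uint64_t>( x ) << 32 | static_cast<uint64_t>( y );
    // Fixed: truncate y through uint32_t first so the high half keeps x.
    const uint64_t good = static_cast<uint64_t>( x ) << 32 | static_cast<uint32_t>( y );
    printf( "bad:  %016llx\n", static_cast<unsigned long long>( bad ) );  // ffffffffffffffff
    printf( "good: %016llx\n", static_cast<unsigned long long>( good ) ); // 00000005ffffffff
}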
msvc with fnv (and unordered_map) is a clear winner over the symmetric range (-300,-300) -> (300,300). 'new hash' and 'fnv hash' giving identical results is good for validating that the handrolled fnv matches spec.
For the positive quadrant, fnv gives perfect results and xor is worst again.
I hacked up some braindead equivalents of the hashes for tripoint, and it does not suffer the downsides of the old hash as badly. Basic hash is still bad at 1.3 something but the rest pass, even just expanding concathash to 'xor with hash of z'. |
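Here 'xor with hash of z' means something like the following sketch (the tripoint stand-in is declared inline; this is not the actual test code):

#include <cstdint>
#include <functional>

struct tripoint { int x, y, z; }; // stand-in for the game's tripoint type

struct tripoint_concat_hash {
    std::size_t operator()( const tripoint &k ) const noexcept {
        // Concatenate x and y as in the point hash, then fold in z with xor.
        const uint64_t xy = static_cast<uint64_t>( k.x ) << 32 | static_cast<uint32_t>( k.y );
        return std::hash<uint64_t> {}( xy ) ^ std::hash<int> {}( k.z );
    }
};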
Try replacing the xor_test (given that it is clearly bad) with a simple xorshift. I get 0.069279 with this and 1.11878 with FNV across (-300,-300) -> (300,300) on MSVC 19.40.33811:

struct xor_test_hash {
    std::size_t operator()( const point &k ) const noexcept {
        uint64_t x = static_cast<uint64_t>( k.x ) << 32 | static_cast<uint32_t>( k.y );
        return x * 0xd989bcacc137dcd5ull >> 32u;
    }
};

Code:

#include <iostream>
#include <unordered_map>
#include <array>
#include <algorithm>  // std::max
#include <cstdint>    // uint64_t, uint32_t
#include <cstdio>     // printf
#include <functional> // std::hash
struct point {
    int x;
    int y;
    friend bool operator==( const point &p, const point &other ) {
        return p.x == other.x && p.y == other.y;
    }
    point( int x, int y ) : x( x ), y( y ) {}
};

const int MAX_COORDINATE = 300;
struct new_test_hash {
    std::size_t operator()( const point &k ) const noexcept {
        return std::hash<uint64_t> {}( static_cast<uint64_t>( k.x ) << 32 | static_cast<uint32_t>( k.y ) );
    }
};

struct old_test_hash {
    std::size_t operator()( const point &k ) const noexcept {
        constexpr uint64_t a = 2862933555777941757;
        size_t result = k.y;
        result *= a;
        result += k.x;
        return result;
    }
};

struct xor_test_hash {
    std::size_t operator()( const point &k ) const noexcept {
        uint64_t x = static_cast<uint64_t>( k.x ) << 32 | static_cast<uint32_t>( k.y );
        return x * 0xd989bcacc137dcd5ull >> 32u;
    }
};
#define FNV_64_PRIME static_cast<uint64_t>(0x100000001b3ULL)
#define FNV1_64_INIT static_cast<uint64_t>(0xcbf29ce484222325ULL)

struct fnv_test_hash {
    std::size_t operator()( const point &k ) const noexcept {
        const uint32_t x = k.x;
        const uint32_t y = k.y;
        uint64_t hval = FNV1_64_INIT;
        hval ^= static_cast<uint64_t>( x & 0xFF );
        hval *= FNV_64_PRIME;
        hval ^= static_cast<uint64_t>( ( x >> 8 ) & 0xFFul );
        hval *= FNV_64_PRIME;
        hval ^= static_cast<uint64_t>( ( x >> 16 ) & 0xFFul );
        hval *= FNV_64_PRIME;
        hval ^= static_cast<uint64_t>( ( x >> 24 ) & 0xFFul );
        hval *= FNV_64_PRIME;
        hval ^= static_cast<uint64_t>( y & 0xFFul );
        hval *= FNV_64_PRIME;
        hval ^= static_cast<uint64_t>( ( y >> 8 ) & 0xFFul );
        hval *= FNV_64_PRIME;
        hval ^= static_cast<uint64_t>( ( y >> 16 ) & 0xFFul );
        hval *= FNV_64_PRIME;
        hval ^= static_cast<uint64_t>( ( y >> 24 ) & 0xFFul );
        hval *= FNV_64_PRIME;
        return hval;
    }
};
template <class Map>
double unordered_set_badness( Map &test_map )
{
    new_test_hash hash; // only used by the commented-out debug printf below
    for( int x = -MAX_COORDINATE; x <= MAX_COORDINATE; ++x ) {
        for( int y = -MAX_COORDINATE; y <= MAX_COORDINATE; ++y ) {
            test_map.emplace( point( x, y ), 1 );
        }
    }
    size_t num_buckets = test_map.bucket_count();
    double const lambda = test_map.size() / double( num_buckets );
    std::unordered_map<uint64_t, uint64_t> histogram;
    double cost = 0.0;
    for( auto it = test_map.begin(); it != test_map.end(); ++it ) {
        auto i = test_map.bucket( it->first );
        // printf("(x, y): (%d, %d) b: %x h: %x \n", it->first.x, it->first.y, i, hash(it->first));
        int size = test_map.bucket_size( i );
        // Note: this line runs once per element, so a bucket of size s
        // contributes s * s * s in total -- the cubing discrepancy discussed
        // later in the thread. The blog version adds plain bucket_size() per
        // element, which contributes s * s per bucket.
        cost += size * size;
        auto result = histogram.find( size );
        if( result != histogram.end() ) {
            result->second += 1;
        } else {
            histogram.emplace( size, 1 );
        }
    }
    printf( "Bucket Size Distribution\n" );
    for( auto const &x : histogram ) {
        printf( "Size %llu: %llu \n", static_cast<unsigned long long>( x.first ),
                static_cast<unsigned long long>( x.second ) );
    }
    cost /= test_map.size();
    return std::max( 0.0, cost / ( 1 + lambda ) - 1 );
}
int main() {
    std::unordered_map<point, char, new_test_hash> new_map;
    std::unordered_map<point, char, fnv_test_hash> fnv_map;
    std::unordered_map<point, char, xor_test_hash> xor_map;
    printf( "New Hash:\n" );
    double bad = unordered_set_badness( new_map );
    printf( "%f \n\n", bad );
    printf( "FNV Hash:\n" );
    double bad2 = unordered_set_badness( fnv_map );
    printf( "%f \n\n", bad2 );
    printf( "XOR Hash:\n" );
    double bad3 = unordered_set_badness( xor_map );
    printf( "%f \n\n", bad3 );
    return 0;
} |
I'm kind of surprised you're getting different results than me with fnv hash. 64 bit right? |
Hey CLIDragon I think I see why you're seeing higher numbers: you're cubing the bucket cost instead of squaring it.
in my test it does: cost += map.bucket_size(map.bucket(k)); once per element, so a bucket of size s contributes s * s in total.
in your test you do: cost += size * size; once per element, so a bucket of size s contributes s * s * s. |
Yep, that was the issue. I now get 0.009281 for FNV (matching akrieger) and 0.039888 for concatenation then FNV (presumably due to poor handling of negatives), and 0.000000 for the xorshift based approach after concatenation. |
I'm starting to get results from the test PR: this matches what you both are reporting from windows.
On the linux build that has completed so far it matches my local results despite being on a different compiler. |
Now we have mac results. |
My questions are: 1) why does XORSHIFT perform so terribly on windows?
I suspect this is due to using a prime number bucket size instead of a power of 2 bucket size. See llvm/llvm-project@4cb38a8 for more information. |
Based on the symptoms it's only incorporating variations from either the x or y component into the active hash (the part std::unordered_map is using). I'm not at all clear why; this is a very standard and simple hash/prng step. I'm probably not going to have much testing time soon; the next thing I would check is whether it's rows or columns that are ending up in the same buckets, and looking at inputs vs outputs of the xor-shift to see if it's doing anything obviously weird. Based on how it's performing I'm suspicious it's doing something very weird like re-hashing the input hash. Regarding bucket count growth, I agree with your theory; also I'm not particularly concerned about it. |
A sample of output from the XOR hash is below (
Altering the x value with a constant y only alters the top few bytes. Adding a simple shift at the end of the process means that these bytes are part of the section considered by the map. As this discards the bottom 16 bits, I suspect this will cause a degenerate case around
However, there is a degenerate case with maps with less than ~40000 elements when including negative numbers, and less than ~10000 with only positive numbers.

struct xor_test_hash {
    std::size_t operator()( const point &k ) const noexcept {
        uint64_t x = static_cast<uint64_t>( k.x ) << 32 | static_cast<uint32_t>( k.y );
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return x >> 16; // Only this line has been changed. 16 seems to be optimal.
    }
}; |
How about we do a circular shift:
Or a chunky interleave:
Alternatively, there are other constants for xor-shift that should better propagate the low order x bits to the low order bits of the hash. If you have a candidate that sorts things out for you locally just paste it here and I can slap it into the test PR. Additionally, the ideal might be a full interleave as outlined here https://graphics.stanford.edu/~seander/bithacks.html#InterleaveTableObvious but that's potentially high impact for runtime. This is what the previous hash was trying to do. |
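For reference, a straightforward loop version of that full interleave (a sketch for illustration, not something the PR ships):

#include <cstdint>

// Interleave the bits of x and y (Morton order): x bits land in even positions,
// y bits in odd positions, so both coordinates reach the low bits of the hash.
uint64_t interleave_bits( uint32_t x, uint32_t y ) {
    uint64_t z = 0;
    for( int i = 0; i < 32; ++i ) {
        z |= static_cast<uint64_t>( ( x >> i ) & 1 ) << ( 2 * i );
        z |= static_cast<uint64_t>( ( y >> i ) & 1 ) << ( 2 * i + 1 );
    }
    return z;
}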
I did more digging into the literature around hash prospector and found https://nullprogram.com/blog/2018/07/31/ where he states that for 64 bit hashes, there are a couple which he hasn't found better, and one is public domain: https://xoshiro.di.unimi.it/splitmix64.c.
and for 32 bit from the original source
So I would say we should try those, and dispatch to the right one for 64 or 32 bit hashes. imul is definitely Fast Enough on any silicon from the past decade. |
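For reference, the 64-bit mixer at that link is the splitmix64 finalizer (public domain); the mixing steps, minus the generator's counter increment, are:

#include <cstdint>

// splitmix64 finalizer, from https://xoshiro.di.unimi.it/splitmix64.c
uint64_t splitmix64_mix( uint64_t z ) {
    z = ( z ^ ( z >> 30 ) ) * 0xbf58476d1ce4e5b9ULL;
    z = ( z ^ ( z >> 27 ) ) * 0x94d049bb133111ebULL;
    return z ^ ( z >> 31 );
}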
I call it 'slosh hash' in the test because they self-describe it as 'sloshing bits around' (xor to the right, mul to the left). It does better than fnv on msvc and it's only 3 shifts, 2 muls, and 3 xors for point. For 64-to-32 we can just xor the high bits over the low bits and keep the same base hash. I pushed that hash and some more messaging to #76162. |
Oh see I tried to look up slosh hash and found "SLoSH: Set Locality Sensitive Hashing via Sliced-Wasserstein Embeddings" |
Looks like it's "fine" (< 0.02 badness for all tests) on both linux and windows. |
Looks like on mac the worst is set with badness |
I just realized I had copied the MIT licensed hash and not the public domain one I had referenced in comments. They are supposed to perform similarly well, but I have resubmitted the hash tests with the actually public domain one. |
Re-test confirms everything is just as 'fine' as before. |
Re: #76154, may I ask you to open a PR to backport it? I don't feel skilled enough to backport it properly |
Sure I can do it. |
Summary
Performance "Improve point hash functions to eliminate map overhead in nps los checks and elsewhere"
Purpose of change
Bad hash functions can have catastrophic effects on performance. #76009 indirectly revealed that the custom std::hash specialization for point was returning poorly distributed values, including only even values. This results in collisions in hash tables like the lru_cache used for potential LOS lookups, which causes 6-10x slowdowns in that function. Absolutely gross. Fix the hash so all tables keying off point don't suffer.
Describe the solution
Find a substantiated good 'low bias' public domain 64 bit mixer function. Apply it as-is to the existing point hashers. Don't worry about truncation in 32 bit builds because who uses those anyway, and in theory any bits in the output are good bits.
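As a sketch of what 'apply it as-is' looks like (mix64 stands in for the chosen mixer, and this is the shape discussed in the thread, not necessarily the exact merged code):

struct point_hash {
    std::size_t operator()( const point &k ) const noexcept {
        // Concatenate x and y into 64 bits (truncating y to avoid sign
        // extension), then run the well-distributed mixer over the result.
        const uint64_t v = static_cast<uint64_t>( k.x ) << 32 | static_cast<uint32_t>( k.y );
        return mix64( v ); // hypothetical name for the public domain mixer
    }
};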
Describe alternatives you've considered
Testing
Waited 30 ingame minutes in the save from #76009; the cost of has_potential_los has been almost entirely eliminated, from ~300ms to ~50ms, with the get call completely inlined.
Additional context
Bad hash functions are really bad.
I will look at tripoint later as it has a similar hash algorithm, but using 3 terms might mitigate the negative effects. I will need to find or implement a hash combiner to handle three ints.
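One common shape for such a combiner is the boost-style hash_combine; a sketch under that assumption (the golden-ratio constant is the classic choice, not necessarily what will be merged):

#include <cstdint>
#include <functional>

// Fold a new hash into a seed, using the 64-bit golden ratio plus some seed
// bits to break symmetry (after boost::hash_combine).
inline uint64_t hash_combine( uint64_t seed, uint64_t value ) {
    return seed ^ ( value + 0x9e3779b97f4a7c15ULL + ( seed << 6 ) + ( seed >> 2 ) );
}

inline uint64_t hash_three_ints( int x, int y, int z ) {
    uint64_t h = std::hash<int> {}( x );
    h = hash_combine( h, std::hash<int> {}( y ) );
    h = hash_combine( h, std::hash<int> {}( z ) );
    return h;
}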