API and speed are... problematic. #8

Closed
DoubleHyphen opened this issue Dec 21, 2020 · 8 comments

Comments

@DoubleHyphen

tl;dr Please, help me make both our crates better.

Hello! I'm John, creator and maintainer of rival crate "morton-encoding".

With min-const-generics arriving pretty soon, I wanted to refactor, and so I was doing some snooping around to see whether anyone else had implemented the same things. Thus, I chanced upon your crate, and decided to benchmark it. So I ran into...

Problem no. 1: The API is really, really hard to use with primitive types.
OK, time to get results from the crate. First try, [x1, x2].z_index()... No dice.
Maybe [x1, x2].iter().z_index()? Nope, that's not it either.
At this point, I gave up and did it with a tuple just as it shows in the documentation. But seriously, a tuple?! You want the user to input a certain quantity of homogeneous data, and your first idea is to use a generally heterogeneous type?
And don't even get me started on trying to transmute the result to a primitive type. Was result as Key too easy or something? Or was result.unwrap().iter_storage().fold(0 as Key, |acc, a| (acc<<usize_bits) | a as Key) the most ergonomic way you could manage? (Oh, and did I need to put a .rev() there or not? Who knows!)

Okay, fine. We've managed to make a function that takes an input ([u32; 2]) and gives a result (u64). That's something. Until, of course, I ran into...

Problem no. 2: The speed is low. Comically low. Abysmally, inexplicably low.
I ran cargo rustc --release --lib -- --emit asm and took a look at the emitted assembly.
Then, I called a crane to pick my jaw off the floor.
Then I wired it shut and looked again.
The one, single, paltry [u32; 2] -> u64 function ends up emitting three thousand lines of assembly. Half of those are the meat of the function, and many of them are calls into the other half. I didn't have the courage to benchmark it.
For reference, the naïve implementation, which uses masks to check each bit and copy it to the result? Six hundred instructions. morton-encoding does it in fewer than 50.

I'd like it if similar crates could somehow be consolidated. For that reason, I have to ask: Sir, what problem are you trying to solve? What needs do you have that morton-encoding cannot cover? Because, from what I can see, it's far superior, and it's, like... right there. You can use it, either directly as a dependency or indirectly as inspiration.

Help me make both our crates better, please?

@blakehawkins
Owner

Hello! Thanks for opening an issue. I'll preface my response by pointing out that this crate is sort of dead/hasn't been updated in a year, and that there are some critical issues that still need to be resolved for the basic intended functionality, namely #1, #2, and #3.

Anyway, on to your points:

Thus, I chanced upon your crate, and decided to benchmark it.

Can you tell me why you want to benchmark crates? Are you using morton encoding for performance-critical applications?

I think there's a huge context gap here because I've never actually used morton encoding in production. I'll comment on that further in a later part of your question.

The API is really, really hard to use with primitive types.

Yes, I agree. This is tracked in #6 and #7. You can take a look at the structs in the docs for a reference on which primitives have partner BitCollections provided. There are also a couple examples of custom BitCollections, since the provided ones are sort of half-way between reference and utility functions.

At this point, I gave up and did it with a tuple just as it shows in the documentation.

It seems a little bit weird to me to try writing some random code without first reading the 3-line example on the readme... In any case, though, I agree with your point.

But seriously, a tuple?! You want the user to input a certain quantity of homogeneous data, and your first idea is to use a generally heterogeneous type?

Well, no 😅 I designed this for use with heterogeneous types. Maybe you can explain how morton encoding is normally used? I think I am missing something, but heterogeneous input has definitely been my goal from the start.

Was result as Key too easy or something? Or was result.unwrap().iter_storage().fold(0 as Key, |acc, a| (acc<<usize_bits) | a as Key) the most ergonomic way you could manage? (Oh, and did I need to put a .rev() there or not? Who knows!)

Good point. I hope that you at least saw this example before trying to figure out the Vob APIs. I've opened #9 to track this.

The speed is low. Comically low. Abysmally, inexplicably low.

(Commenting on the whole of problem 2 now.) Something is a bit off about the conversation here, but it's not clear to me what the most crucial detail is. I think it comes down to not really knowing how morton encoding is used in production. Anyway:

  1. Is it actually slow - have you found an example input that's prohibitively slow to compute? If not, do you have a reference to some exponential complexity that I don't know about?
  2. I don't think the emitted code size is that interesting/comparable. Rust uses monomorphisation for generics, and zdex uses some known generic cases (tuples) that explode code size. I would test the speed myself, but this comes back to your last question:

I'd like it if similar crates could somehow be consolidated. For that reason, I have to ask: Sir, what problem are you trying to solve? What needs do you have that morton-encoding cannot cover? Because, from what I can see, it's far superior, and it's, like... right there. You can use it, either directly as a dependency or indirectly as inspiration.

I would definitely be interested in using morton-encoding as a dependency to provide the higher level APIs, but AFAICT your basic unit of API comprises flat functions with type parameters in the name, and the only way I can imagine using that in a generic way is with some macro kung-fu. (Is that true? It looks like the morton_encode function is generic, actually - in which case perhaps the specialised functions could be moved into a module for legibility)
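
To illustrate the sort of macro kung-fu I have in mind, something like the sketch below would be needed to route arrays through differently-named flat functions (every name here is made up for the sake of the sketch; this is not morton-encoding's actual API):

trait ZIndex2 {
    type Key;
    fn z_index_2(self) -> Self::Key;
}

// Generate one impl per (coordinate, key, flat function) triple.
macro_rules! impl_z_index_2 {
    ($coor:ty, $key:ty, $flat_fn:path) => {
        impl ZIndex2 for [$coor; 2] {
            type Key = $key;
            fn z_index_2(self) -> $key {
                $flat_fn(self[0], self[1])
            }
        }
    };
}

// e.g.: impl_z_index_2!(u32, u64, some_crate::some_flat_encode_u32_2d);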

And finally, the answer to your question, I'll repeat that I've never used morton encoding in production. I made this crate after reading this dynamodb blog post, which has a couple implications...:

  1. The expectation is to send morton-encoded values to AWS's front door, so the performance penalty of local computation is largely irrelevant
  2. Heterogeneous input types are very much an expected/needed feature
  3. Encoded values larger than 128 bits are needed
  4. I didn't even know that morton encoding was the same as z-order indexing until I finished writing most of the code + discovered and linked to your crate

BTW, both BitCollection and Vob (used heavily in zdex) are (a) probably not performance-optimised or production-ready and (b) seem to be optimised for size, which isn't really needed here

@DoubleHyphen
Author

DoubleHyphen commented Dec 22, 2020

Sorry about the delay! Time to answer.

First things first: Now that I re-read the original post without the sleep deprivation, I realise it was probably more confrontational than it should have been. Seriously, thank you for responding so positively.

Now, on to the contents themselves: I won't go through every point in order, but I hope to cover them all by the end.

  1. You keep mentioning "production". I'm not 100% sure what that means; I've never used this code for a company's benefit, in any event. Below, I will mention the use-case for which I wrote this code; I hope it's compelling enough.

  2. Morton encoding for heterogeneous data? Colour me perplexed. Thus far, I've been operating under the assumption that its biggest benefit is its ability to group together bits of equal importance. This would imply that each co-ordinate will need to have an equal amount of bits, so that each bit has exactly N-1 other bits of equal importance elsewhere. And, if we're assuming that they're all going to have the same amount of bits, we might as well assume that they're the same data-type, if nothing else... which is why I only implemented my own code for homogeneous data types. (If you're curious, this is why the first thing I tried was to use zdex on arrays, despite having seen the tuple example previously.) If you can think of a use case that would need legitimately heterogeneous data, however, I'd be very curious to know it! (Nb my own use-case normalises the data to an array of type [u{whatever}; N] before encoding it, which might be what you're actually after.)

  3. You are quite correct in noticing the macro kung-fu needed to treat arrays in a generic manner. This is, in fact, a short-coming of current stable Rust, and not of morton-encoding. To treat arrays of arbitrary dimensionality generically, one would need to parametrise functions not just by type but also by a constant value, something like fn morton_encode_array<Coordinate, Key, const N: usize> (input: [Coordinate; N]) -> Key (see the sketch after this list). As it just so happens, this will in all likelihood land in stable Rust mid-February! I've already re-written the crate to expose this new, way more ergonomic API, and I'm waiting until final stabilisation to push the code to crates.io. (Though, come to think of it, I can upload the code to GitHub in the meantime! I'll do that a bit later, so you can take a look at it.)

  4. I'd be wary of advising people to use the morton_encoding::morton_encode function; the new documentation will make that clearer. The code's speed depends crucially on the compiler having enough information available to fold all the constants, and especially the calls to get_mask; I'm talking 10x difference in speed. The new code will call pre-made functions depending on N, alleviating this for N up to 8; still, using the array functions will probably be faster.

  5. The use-case I was thinking of is actually fractal analysis of N-dimensional sets. Traditionally, the most useful method of measuring fractal dimension is as follows: You split your data-set in half along each axis, creating 2^N partitions, and count how many of them contain at least one element inside. You then split each partition in half recursively, and count, for each scale, how many sub-partitions are populated. Then you make a log-log graph of that, and its slope is the fractal dimension.
    Now, if each data point's co-ordinates are normalised in an unsigned integer type, and you take the first n bits of each co-ordinate, that immediately tells you which n-th-order partition the point belongs to. Therefore, if you take the Morton encoding of the co-ordinates of each data point and sort them all, the rest can be computed extremely rapidly.
    Now, the computation of Morton codes is O(N) and the sorting is O(N log N), so for a large enough N the sorting dominates; but even for 16-megapixel images, if you slow the Morton-code computation down by an order of magnitude (let alone two!), it starts to dominate again.
    Key point a: There is no guarantee that each co-ordinate will come pre-normalised. For instance, in an image that's 4096 by 4096 pixels, whereas the RGB co-ordinates come pre-normalised to u8s, the x and y co-ordinates will need to be normalised to u8s too, by throwing their last 4 bits out, before they can be part of a [u8; 5] that will be encoded to a u64.
    Key point b: I'm dreaming of allowing researchers to analyse myriads, maybe millions, of data-sets and then running the results through machine-learning algorithms. If a researcher can obtain valuable data by letting his/her machine run for an hour, that's extremely valuable. If the same researcher needs to let the machine run for weeks or even months, s/he is not going to bother.

  6. I skimmed through the article you linked. Suddenly, a lot of the more baffling API choices you made seem to make much more sense. I'll read it more closely and come back to you.
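
To make point 3 concrete, here is a minimal sketch of what such a const-generic function could look like, specialised to [u8; N] -> u64 for brevity; the bit-by-bit loop is purely illustrative and is not the implementation morton-encoding actually uses:

fn morton_encode_array<const N: usize>(input: [u8; N]) -> u64 {
    assert!(N * 8 <= 64, "a u64 key can hold at most eight u8 co-ordinates");
    let mut key = 0u64;
    for bit in 0..8 {
        for (i, &coordinate) in input.iter().enumerate() {
            // input[0] contributes the most significant bit within each group of N bits.
            let b = ((coordinate >> bit) & 1) as u64;
            key |= b << (bit * N + (N - 1 - i));
        }
    }
    key
}

// e.g., for the image example in point 5 (key point a):
// let key = morton_encode_array([(x >> 4) as u8, (y >> 4) as u8, r, g, b]);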

So, to-do list for later:

  1. Read the article you linked me more carefully
  2. Upload the new code to GitHub so you can take a look at it (it currently only works on nightly, at least until mid-February). Done.
  3. Run some actual benchmarks so I can have hard data to back up my description of zdex's speed. Done.

@DoubleHyphen
Author

DoubleHyphen commented Dec 22, 2020

So, I got around to benchmarking it. I whipped up a quick implementation of my fractal dimension counting algorithm and tested it on some colour image substitutes, using zdex, morton_encoding and a naïve implementation in order to power the [u8; 5] -> u64 transformations.

Results: For 16 megapixels (i.e. 1<<24 transformations), where the naïve implementation takes ~240 milliseconds, zdex consistently takes two orders of magnitude more.

And that's not even the worst part. The worst part would be...

zdex isn't even producing correct results. If you print the result after turning it into a u64, it turns out that zdex puts the lowest-order bits first and the highest-order bits last. I am confused as to why that happens, as your examples show that printing the Vob directly gives the bits in the correct order.

But can the Vob be compared as efficiently as a simple u64? Only one way to find out!
one futile try later...
I had expected Vob to be much slower than the u64.
I had not expected Vob to not implement Ord at all. Or Xor, for that matter.

Sir, am I correct in understanding that you're trying to make this crate so the result can be used as a key in a database? If so, wouldn't it need to be, y'know, comparable?

Please find the benchmarking code attached.
const TIME_ENTIRE_ALGORITHM: bool = false;

macro_rules! time {
    ($x: expr) => {{
        // eprintln!("Measuring expression: {}", stringify!($x));
        let begin = std::time::Instant::now();
        let result = $x;
        let tim = begin.elapsed();
        println!("Time elapsed: {:#?}\n", begin.elapsed());
        (result, tim)
    }};
}

fn morton_encode_u8_5d_zdex (input: [u8; 5]) -> u64 {
    use zdex::*;
    let usize_bits = 8*core::mem::size_of::<usize>();
    let transmute_input = |x: &u8| -> FromU8 {(*x).into()};
    input                       // Take the original input,
        .iter()                 // element by element...
        .map(transmute_input)   // Transform each to the custom input types zdex needs...
        .z_index()              // Compute the result...
        .unwrap()               // Panic if there's an error... (Can there be one? Who knows!)
        .iter_storage()         // Take the result usize by usize...
        .fold(0 as u64, |acc, a| (acc<<usize_bits) | a as u64)
        // ...and finally, unify the iterator of usizes into a single u64.
    // Can you just FEEL the ergonomics?
}

fn morton_encode_u8_5d_naive (input: [u8; 5]) -> u64 {
    let mut coordinate_mask = 1u8;
    let mut key_mask = 1u64;
    let mut result = 0u64;
    for _ in 0..8 {
        for &coordinate in input.iter().rev() {
            if coordinate & coordinate_mask != 0 {
                result |= key_mask;
            }
            key_mask <<= 1;
        }
        coordinate_mask <<= 1;
    }
    result
}

fn main() {

    println!("First, two small tests to ensure that the functions work correctly (Which, siiiiiiiigh...)");
    let argh_2 = [1, 2, 3, 4, 5u8];
    let argh = [128u8; 5usize]; 
    println!("Input:[\n{}]", argh.iter().map(|x| format!("{:08b}\n", x)).collect::<String>());
    println!("{:040b} {:040b}\nAs if all the rest wasn't enough, zdex puts the most significant bits at the end...", morton_encode_u8_5d_naive(argh), morton_encode_u8_5d_zdex(argh));
    //println!("{:040b}", morton_encode_u8_5d_zdex(argh).max(morton_encode_u8_5d_zdex(argh_2)));
    
    // Create the 4 images we usually use as touch-stones...
    let limit_1d: usize = 1<<12;
    let limit_2d: usize = limit_1d * limit_1d;
    let mut test_images: [Vec<[u8; 5]>; 4] = [Vec::with_capacity(limit_2d), Vec::with_capacity(limit_2d), Vec::with_capacity(limit_2d), Vec::with_capacity(limit_2d)]; 
    
    for x in 0..limit_1d {
        for y in 0..limit_1d {
            let x = ((x*256)/limit_1d) as u8;
            let y = ((y*256)/limit_1d) as u8;
            let pixel_fd_2 = [x, y, x, y, 128];
            let pixel_fd_3 = [x, y, x, y, rand::random::<u8>()];
            let pixel_fd_4 = [x, y, x, rand::random::<u8>(), rand::random::<u8>()];
            let pixel_fd_5 = [x, y, rand::random::<u8>(), rand::random::<u8>(), rand::random::<u8>()];
            test_images[0].push(pixel_fd_2);
            test_images[1].push(pixel_fd_3);
            test_images[2].push(pixel_fd_4);
            test_images[3].push(pixel_fd_5);
            // The fractal dimensions of the test images will be respectively equal to 2, 3, 4 and 5.
        }
    }
    
    // Time to get working!
    for i in 0..2 {
        let encod_fn = if i==0 {morton_encode_u8_5d_naive} else {morton_encode_u8_5d_zdex};
        println!("\n\n\nNow testing: {}!", if i==0 {"morton_encoding"} else {"zdex"});
        for test_image in &test_images {
            let _ = time!{{
                let mut partition_count = [0usize; 8];
                partition_count[0] += 1; // Count the partition that contains the first element; the pairwise windows below only add newly-populated partitions.
                let mut img_copy = test_image
                    .iter()
                    .map(|&x| encod_fn(x))
                    .collect::<Vec<_>>();
                if TIME_ENTIRE_ALGORITHM {
                    img_copy.sort();
                    img_copy
                        .windows(2)
                        .map(|x| x[0] ^ x[1])
                        .map(|x| x.leading_zeros() as u8)
                        .map(|x| x - 24) // The first 24 bits are always going to be zero. 
                        .map(|x| x/5) // We take it 5 bits at a time, because we always divide every axis in two simultaneously.
                        .filter(|&x| x<8) // If they're the same bit for bit, there's no reason to count them twice...
                        .for_each(|x|
                            partition_count[x as usize] += 1
                        );
                    for i in 0..7 {
                        partition_count[i+1] += partition_count[i];
                        // We want cumulative counts.
                    }
                    println!("{:?}", partition_count);
                }
            }};
        }
    }
}

@DoubleHyphen
Author

I read the article. With that, the afore-mentioned to-do list is complete.

So, with all that said: I now realise why exactly you wanted z-order keys of heterogeneous data. However, I don't really think there's a general solution to the issue.

For instance, let's take a look at the simple example you gave, about the latitude and the longitude.
Issue 1: By performing the f32 as i32 cast, you aren't transforming the f32 to its binary representation; you're throwing all the decimal digits away and only keeping the integer part. Thus, the longitude is left with 360 possible values and the latitude with 180, for a total of fewer than 1<<16 values. You're taking an i32 for each and encoding them in a u64, therefore essentially wasting 75% of the bits.
(Nb Morton encoding the binary representations of the floating-point values would have been many orders of magnitude worse, because then fully 75% of the available values would have been the places whose latitudes and longitudes are less than 1 degree from Point Zero, and of the rest over 90% would fall outside the globe.)
Issue 2: For a longitude/latitude z-order index, 64 bits are complete overkill. After some quick calculations, it turns out that Earth has less than 1<<54 square metres of area. Of course, we don't want metre precision; we aren't tracking an individual's every movement around town! Thus, if we use a 32-bit key, each of its values corresponds to an area of about 2 km by 2 km. I think that's more than enough for a key!
Issue 3: Subdividing the latitude and longitude into equal angles does not lead to equal areas of subdivision! Near the poles you'd have a very fine subdivision and near the equator very coarse. To keep the areas of subdivision equal, you'd need to actually take the sine (IIRC) of the latitude.

Thus, for an ideal z-order key of latitude and longitude, you would need to...

  1. Transform the latitude into its sine
  2. Split the available values of each co-ordinate into 65536 buckets, and assign each an unsigned integer ID
  3. Encode the two u16s that you end up with into a u32 (a sketch follows below).
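
For concreteness, here is what those three steps might look like as a stand-alone sketch (hypothetical code, part of neither crate; the hand-rolled bit interleaving at the end stands in for a proper Morton-encoding call):

// Assumes latitude in [-90, 90] and longitude in [-180, 180] degrees.
fn lat_lon_key(lat_deg: f64, lon_deg: f64) -> u32 {
    // 1. Transform the latitude into its sine, so that equal key ranges cover equal areas.
    let lat_t = lat_deg.to_radians().sin(); // now in [-1.0, 1.0]
    // 2. Split each co-ordinate's range into 65536 buckets and assign it a u16 ID.
    let quantise = |x: f64, min: f64, max: f64| -> u16 {
        ((x - min) / (max - min) * 65535.0).round().clamp(0.0, 65535.0) as u16
    };
    let a = quantise(lat_t, -1.0, 1.0);
    let b = quantise(lon_deg, -180.0, 180.0);
    // 3. Interleave the two u16s bit by bit into a u32 Morton key.
    let mut key = 0u32;
    for bit in 0..16 {
        key |= (((a >> bit) & 1) as u32) << (2 * bit + 1);
        key |= (((b >> bit) & 1) as u32) << (2 * bit);
    }
    key
}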

As you can see, even this was actually very complicated, and most importantly it does not generalise to other (f32, f32) pairs. I admire the work you set out to do, but I fear that it does not have a general solution.

@blakehawkins
Owner

Hey,

Sorry for the delay in response here. Happy new year! As mentioned above, I am not really investing in this crate at the moment. In any case:

Thus far, I've been operating under the assumption that its biggest benefit is its ability to group together bits of equal importance.

I understand you've read the AWS blog post now, but for any external readers: morton encoding indeed groups bits of equal importance, but if you right-pad unequal-sized inputs, you can still get a useful encoding on the output. Right-padding is not the only technique critical for useful z-order indexing, though, as described in more detail in the aforementioned blog.
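
As a purely hypothetical illustration of that padding step (not the zdex API): a u8 co-ordinate can be widened by shifting it into the high byte of a u16, so that its most significant bit carries the same importance as a genuine u16 co-ordinate's top bit before the two are interleaved.

// Hypothetical helper, for illustration only.
fn right_pad_u8(x: u8) -> u16 {
    (x as u16) << 8
}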

To treat arrays of arbitrary dimensionality generically, one would need to parametrise functions not just by type but also by a constant value, something like fn morton_encode_array<Coordinate, Key, const N: usize> (input: [Coordinate; N]) -> Key

I don't really understand why this is needed, but I am probably not quite invested enough. The approach taken in zdex uses trait bounds to allow input types to determine their own bit representation - i.e. zdex computes the morton encoding of bit collections and of types that have BitCollection representations. Providing a default BitCollection representation for primitive types is just a convenience.
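
A conceptual sketch of that design (deliberately not the real BitCollection/Vob types): each input decides how to expose its own bits, and the encoder just interleaves whatever heterogeneous bit streams it is handed, padding the shorter ones as mentioned above.

trait Bits {
    fn len(&self) -> usize;
    // i = 0 is the most significant bit.
    fn bit(&self, i: usize) -> bool;
}

fn interleave(inputs: &[&dyn Bits]) -> Vec<bool> {
    let longest = inputs.iter().map(|b| b.len()).max().unwrap_or(0);
    let mut out = Vec::with_capacity(longest * inputs.len());
    for i in 0..longest {
        for b in inputs {
            // Shorter inputs are right-padded with zeros.
            out.push(if i < b.len() { b.bit(i) } else { false });
        }
    }
    out
}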

The use-case I was thinking of is actually fractal analysis of N-dimensional sets.

Cool! Thanks for explaining.

Key point b: I'm dreaming of allowing researchers to analyse myriads, maybe millions, of data-sets and then running the results through machine-learning algorithms. If a researcher can obtain valuable data by letting his/her machine run for an hour, that's extremely valuable. If the same researcher needs to let the machine run for weeks or even months, s/he is not going to bother.

On a completely unrelated side-note, this doesn't seem like a big problem in today's world, as your desired work is trivially parallelisable

zdex isn't even producing correct results. If you print the result after turning it into a u64, it turns out that zdex puts the lowest-order bits first and the highest-order bits last. I am confused as to why that happens, as your examples show that printing the Vob directly gives the bits in the correct order.

Hmm, I'm not sure what's wrong. Is it possible that the code fails when iter_storage has more than one element? My example code does next().unwrap() and I haven't tested anything else.

Sir, am I correct in understanding that you're trying to make this crate so the result can be used as a key in a database? If so, wouldn't it need to be, y'know, comparable?

Yes, but I think this just comes back to #9

Panic if there's an error... (Can there be one? Who knows!)

For avoidance of doubt: at the moment, no, there cannot be, but I made the API use Result to avoid breaking changes in the future.

By performing the f32 as i32 cast, you aren't transforming the f32 to its binary representation; you're throwing all the decimal digits away and only keeping the integer part.

Yes, it's described in the DynamoDB blog post that z-order indexing requires thinking about your schema's bit-representation and making sure that your encoding works for your use case.

So I agree that there's no general purpose solution to the z-indexing database key problem, but disagree that zdex can't be used for it in the general sense.

For a longitude/latitude z-order index, 64 bits are complete overkill.

Probably it doesn't matter

Subdividing the latitude and longitude into equal angles does not lead to equal areas of subdivision!

The use-case is for database query constraints.

  1. The whole point is to use lon/lat min/max to create a bounding box, and that use-case doesn't itself have any problem with unequal subdivision, right?

  2. As described in the DynamoDB post, z-indexing produces results which are not in the bounding box, so even if you do some magic to produce equal-size bit-representation boxes, you still need to filter the output (see the sketch below). That's why is_relevant and next_jump_in #2 is needed.
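
Roughly, that filtering step amounts to something like the sketch below (the name is_relevant is borrowed from #2, but the signature is hypothetical and assumes the scanned keys have already been decoded back into co-ordinates):

fn is_relevant(point: &[u16; 2], min: &[u16; 2], max: &[u16; 2]) -> bool {
    // Keep only records whose decoded co-ordinates fall inside the bounding box.
    point.iter()
        .zip(min.iter().zip(max.iter()))
        .all(|(p, (lo, hi))| lo <= p && p <= hi)
}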

As you can see, even this was actually very complicated, and most importantly it does not generalise to other (f32, f32) pairs. I admire the work you set out to do, but I fear that it does not have a general solution.

I think this comes down to a question: is it worth providing default transforms for floating point types?

I think yes, as long as the limitations are documented.


I am not really sure what the next steps are. I didn't check your code, but I suspect that I can't yet make zdex utilise morton-encoding; is that right?

@DoubleHyphen
Author

First things first: Happy new year!

I'll answer your post in more detail tomorrow. For now, however, I need to ask you something much more important:

For your use-case, a Hilbert index would be in all ways completely superior to a Z-index, wouldn't it?

I ask this because over at the hilbert crate there's someone who asked the exact same things as you did, for the exact same use-case. Someone else chimed in and quoted the code he'd written which, he says, can do exactly that. (Note: I haven't yet had the time to study it more closely.) Thus, if the new guy's code can do what you want, better than you'd hoped, and faster than your current implementation, would your needs be met? After all, the only drawback that the Hilbert index has compared to the Z-index is that it's harder to compute, which (as you've made abundantly clear) is a complete non-issue for you.

@blakehawkins
Owner

It doesn't seem likely that understanding the Hilbert encoding would be a worthwhile investment, given that I need #2 and #3, which may take deep investigation/understanding.

@DoubleHyphen I recommend that we close this issue if there isn't a library readily available to act as a drop-in replacement for BitCollection with all of the requirements outlined above (and the snippet in that hilbert link is far from close enough, frankly)

@DoubleHyphen
Author

Yes, I agree with the closure of this issue. Any productive results that could have arisen out of it have already arisen.

Slinks away as inconspicuously as possible
