refactor: replace Range with a bounded implementation #112

baszalmstra · 2022-05-25T16:06:34Z

This PR replaces Range<T> with an implementation that allows inclusive or exclusive bounds to be used. This enables T to be any type that implements Ord + Clone and doesn't require the Version trait.

It also renames some of the functions from Range to be more aligned with the names used in the VersionSet trait.

This is a cleaned-up version of #111.

src/range.rs

src/internal/small_vec.rs

src/range.rs

Eh2406 · 2022-05-27T20:23:47Z

benches/large_case.rs


-fn bench<'a, P: Package + Deserialize<'a>, V: Version + Hash + Deserialize<'a>>(
+fn bench<'a, P: Package + Deserialize<'a>, V: VersionSet + Deserialize<'a>>(


Can we call it VS: instead of V:? I think that will be more consistent with our usage in the rest of the package.
Then the where clause can become where VS::V: Deserialize<'a>.

Eh2406

This looks good at this point. I have to admit my eyes glazed over while looking at intersection, but I have confidence in our tests.

Eh2406 · 2022-05-28T23:54:06Z

src/range.rs

+            .prop_map(|((start_bounded, end_bounded), mut vec)| {
+                // Ensure the bounds are increasing and non-repeating
+                vec.sort_by_key(|(value, _)| *value);
+                vec.dedup_by_key(|(value, _)| *value);


Thinking about this more I'm now realizing that this cannot generate [(Unbounded, Excluded(1)), (Excluded(1), Unbounded)]. But I'm not seeing an obvious way to fix it.

Here is an alternative strategy: ad38972. Thoughts?

I like the idea of using deltas instead of sorting values! I have not read the generation code in full detail; I think it needs a few more comments and some explanation. I also thought the names "bounded" and "unbounded" might be swapped, but it may be that I just didn't pay enough attention (either way, it may need more comments).

Here is what I have in mind after reading your proposal @Eh2406 :

generate random start that is one of included 0 ( = unbounded ) | excluded 0 | non 0 (included or excluded)

generate random vec of deltas

dedup successive 0 deltas, there can only be one 0 surrounded by non zeros

for each delta, alternate between start and end bounds

generate random Excluded | Included tags for each bound (with valid constraints for 0 deltas)

if the last bound is a start bound (depends on start and length of generated vector), add an unbounded at the end
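The outline above can be sketched without proptest as a plain function that turns already-generated deltas and tags into segments. This is only an illustration, not the strategy committed in the PR: the name segments_from_deltas, the (delta, inclusive) step encoding, and the choice to bump a repeated 0 delta to 1 are all assumptions made for this sketch.

```rust
use std::ops::Bound::{self, Excluded, Included, Unbounded};

type Segment = (Bound<u32>, Bound<u32>);

fn tag(value: u32, inclusive: bool) -> Bound<u32> {
    if inclusive { Included(value) } else { Excluded(value) }
}

/// Hypothetical helper: each step is (delta from the previous bound, requested tag).
/// Bounds alternate between start and end. Assumes the deltas sum to less than u32::MAX.
fn segments_from_deltas(start_unbounded: bool, steps: &[(u32, bool)]) -> Vec<Segment> {
    let mut segments: Vec<Segment> = Vec::new();
    let mut value = 0u32;
    // Start bound of the segment currently being built, if any.
    let mut open: Option<Bound<u32>> = if start_unbounded { Some(Unbounded) } else { None };
    let mut prev_zero = false;
    for &(delta, inclusive) in steps {
        // Dedup successive 0 deltas: a 0 may only be surrounded by non-zeros.
        let delta = if delta == 0 && prev_zero { 1 } else { delta };
        prev_zero = delta == 0;
        value += delta;
        match open.take() {
            // This bound opens a new segment.
            None => {
                let mut inclusive = inclusive;
                if delta == 0 {
                    if let Some(last) = segments.last_mut() {
                        // Zero gap between two segments: both bounds must be exclusive.
                        last.1 = Excluded(value);
                        inclusive = false;
                    }
                }
                open = Some(tag(value, inclusive));
            }
            // This bound closes the current segment.
            Some(start) => {
                if delta == 0 {
                    // Single-point segment: only doubly inclusive bounds are valid.
                    let start = if matches!(start, Unbounded) { Unbounded } else { Included(value) };
                    segments.push((start, Included(value)));
                } else {
                    segments.push((start, tag(value, inclusive)));
                }
            }
        }
    }
    // If the last bound was a start bound, close the set with an unbounded end.
    if let Some(start) = open {
        segments.push((start, Unbounded));
    }
    segments
}
```

With start_unbounded = true and steps [(1, false), (0, false)], this produces [(Unbounded, Excluded(1)), (Excluded(1), Unbounded)], the case the sort-and-dedup strategy could not generate.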

If that's already what you did, great! (You can add comments then.) If not, can you please let us know how it differs, and maybe show how your generator is better, or how we can mix this rough outline with the one you proposed to get something that has the best properties?

I tried rewriting the strategy based on your proposal (building on @Eh2406's code) and found that it's basically what you describe, @mpizenberg. I've committed a version that includes comments (c44639a).

Eh2406 · 2022-05-29T00:07:59Z

src/range.rs

    }

-    /// Compute the intersection of two sets of versions.
+    /// Computes the intersection of two sets of versions.
    pub fn intersection(&self, other: &Self) -> Self {


I spent time today trying to write a "more obviously correct" intersection, with the same perf. I did not succeed. I did find this method helpful for catching corner cases:

fn check_invariants(&self) {
    if cfg!(debug_assertions) {
        for (i, (s, e)) in self.segments.iter().enumerate() {
            if matches!(s, Unbounded) && i != 0 {
                panic!()
            }
            if matches!(e, Unbounded) && i != (self.segments.len() - 1) {
                panic!()
            }
        }
        for p in self.segments.as_slice().windows(2) {
            match (&p[0].1, &p[1].0) {
                (Included(l_end), Included(r_start)) => assert!(l_end < r_start),
                (Included(l_end), Excluded(r_start)) => assert!(l_end < r_start),
                (Excluded(l_end), Included(r_start)) => assert!(l_end < r_start),
                (Excluded(l_end), Excluded(r_start)) => assert!(l_end <= r_start),
                (_, Unbounded) => panic!(),
                (Unbounded, _) => panic!(),
            }
        }
        for (s, e) in self.segments.iter() {
            assert!(match (s, e) {
                (Included(s), Included(e)) => s <= e,
                (Included(s), Excluded(e)) => s < e,
                (Excluded(s), Included(e)) => s < e,
                (Excluded(s), Excluded(e)) => s < e,
                (Unbounded, _) | (_, Unbounded) => true,
            });
        }
    }
}

Perhaps we can add a call to it to the end of all methods that construct a range?

This is as close as I came:
6ac1e06?diff=split

Perf is slightly worse, but I find it more readable. What do you think?

I do like that it's a lot shorter and easier to follow; my code had a lot of cases. How much worse is the performance?

Rerunning the benchmarks now I am not seeing significant differences between our two implementations! I guess I shouldn't try benchmarking at 2 o'clock in the morning.

Haha, I'll copy your implementation into the MR. Should I also add the invariant check?

Or can I simply merge your changes?

Merge/copy/rewrite, as you wish.

I like the check_invariants function! Isn't the check in the first for loop already covered by the second one, btw?

I cherry-picked your code (crediting you) and added the suggestion about next_if by @mpizenberg

Eh2406 · 2022-06-06T20:27:10Z

I am comfortable merging this. I like my impl of intersection, but happy either way.
I would like to see a fix for #112 (comment), but do not see how.
@mpizenberg, what are you thinking about this PR?

mpizenberg · 2022-06-07T12:07:56Z

@Eh2406 I very much prefer the simplicity of your intersection function! It's clear and shorter, so much easier to maintain and to serve as an example for others. So if there is no performance cost, or only a slight one, I'm in favor of this.

Regarding proptest generators, I'd also be more confident if we could generate all kinds of possible version sets, so I'd like to have something like what @Eh2406 is proposing. I added some comments of my own on that proposal. @baszalmstra and @Eh2406, I'd love some feedback on those.

Side note. I've been a bit busy and also have an event this weekend. But I've arranged to start early July with my next job, so I should have the second part of June to make a push toward my docs goals.

Co-authored-by: Jacob Finkelman <[email protected]>

baszalmstra · 2022-06-07T16:54:32Z

I added the intersection code and proptest strategy (with comments) from @Eh2406 . Let me know what you guys think!

mpizenberg · 2022-06-07T17:01:28Z

src/range.rs

+        (Included(s), Excluded(e)) => s < e,
+        (Excluded(s), Included(e)) => s < e,
+        (Excluded(s), Excluded(e)) => s < e,
+        (Unbounded, _) | (_, Unbounded) => true,


Actually, there might be special cases where we have an invalid segment that this function cannot detect, like (Unbounded, Excluded(0)) when versions are u32. It depends on whether there are other ways to represent the left-most and right-most bounds with the version type.

What do you guys think of that? Negligible?

mpizenberg · 2022-06-07T17:55:52Z

Another remark on something I'm just realizing now: bounds for discrete version sets may introduce comparison errors. Can we end up with the following two sets of u32 that are the same but not structurally equal?

1: (3, 7)
2: (3, 4) (5,7)

Actually, now that I think again about the simple intersection code, I'm not even sure it preserves the "uniqueness of representation" property required for set comparisons. I didn't see special handling of identical boundaries. For example, what happens if we ask for the union of [x, y] and [y, z]? Does it answer [x, z] as it should?

If we translate this union into an intersection job, it means we will compute the intersection of ]_, x[ + ]y, _[ with ]_, y[ + ]z, _[. If I follow the simpler algorithm, that intersection gives ]_,x[ + ]y,y[ + ]z,_[, and since ]y,y[ is invalid, that segment is not pushed, so the result is ]_,x[ + ]z,_[ and inverting gives the correct [x,z]. Okay! (Sorry for thinking out loud in text.)

But is it possible to arrive at a situation where we have already pushed (a,b) and b is now also the start of a new segment? If that were possible, it would mean that b is at the same time the smallest of two end bounds and the biggest of two start bounds. But if the input sets are valid, consecutive segments in each VersionSet are valid, meaning it is only possible if the end bound comes from one input and the start bound from the other input. But if we pick the end bound of one input, it means the end bound of the other is >= the one we picked. So the following start bound of that other input is necessarily valid, since it was valid for that other segment. Okay, sorry for the wordiness. Let me write it in math.

Let vs1 and vs2 be two valid version sets whose intersection we want to compute.
Let (a1, b1) + (c1, d1) be a part of vs1 composed of two segments and (a2, b2) + (c2, d2) be a part of vs2 also composed of two segments.
If b1 <= b2 then (_, b1) + (c2,_) is also a valid succession of segments so we are all good when picking the smallest of two end bounds when computing intersections.
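To make the argument above concrete, here is a minimal sketch of an intersection that takes the larger of two start bounds and the smaller of two end bounds, and drops invalid segments. It is written for this discussion under the assumption that segments are stored as sorted (Bound<T>, Bound<T>) pairs; the helper names max_start, min_end and is_valid are hypothetical, and this is not necessarily the code merged in the PR.

```rust
use std::ops::Bound::{self, Excluded, Included, Unbounded};

type Segment<T> = (Bound<T>, Bound<T>);

/// The more restrictive (larger) of two start bounds.
fn max_start<T: Ord + Clone>(a: &Bound<T>, b: &Bound<T>) -> Bound<T> {
    match (a, b) {
        (Unbounded, x) | (x, Unbounded) => x.clone(),
        (Included(x), Included(y)) => Included(std::cmp::max(x, y).clone()),
        (Excluded(x), Excluded(y)) => Excluded(std::cmp::max(x, y).clone()),
        (Included(i), Excluded(e)) | (Excluded(e), Included(i)) => {
            if i > e { Included(i.clone()) } else { Excluded(e.clone()) }
        }
    }
}

/// The more restrictive (smaller) of two end bounds.
fn min_end<T: Ord + Clone>(a: &Bound<T>, b: &Bound<T>) -> Bound<T> {
    match (a, b) {
        (Unbounded, x) | (x, Unbounded) => x.clone(),
        (Included(x), Included(y)) => Included(std::cmp::min(x, y).clone()),
        (Excluded(x), Excluded(y)) => Excluded(std::cmp::min(x, y).clone()),
        (Included(i), Excluded(e)) | (Excluded(e), Included(i)) => {
            if i < e { Included(i.clone()) } else { Excluded(e.clone()) }
        }
    }
}

/// Same structural validity rule as the last loop of check_invariants.
fn is_valid<T: Ord>(start: &Bound<T>, end: &Bound<T>) -> bool {
    match (start, end) {
        (Included(s), Included(e)) => s <= e,
        (Included(s), Excluded(e))
        | (Excluded(s), Included(e))
        | (Excluded(s), Excluded(e)) => s < e,
        (Unbounded, _) | (_, Unbounded) => true,
    }
}

fn intersection<T: Ord + Clone>(left: &[Segment<T>], right: &[Segment<T>]) -> Vec<Segment<T>> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        let (l, r) = (&left[i], &right[j]);
        let start = max_start(&l.0, &r.0);
        let end = min_end(&l.1, &r.1);
        // Advance whichever input owns the chosen end bound (both if they tie).
        if end == l.1 { i += 1; }
        if end == r.1 { j += 1; }
        // Segments such as ]y, y[ are invalid and are simply not pushed.
        if is_valid(&start, &end) {
            out.push((start, end));
        }
    }
    out
}
```

Running it on ]_, 1[ + ]2, _[ and ]_, 2[ + ]3, _[ yields ]_, 1[ + ]3, _[: the candidate ]2, 2[ is rejected by is_valid, matching the reasoning in the comment above.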

mpizenberg · 2022-06-07T17:58:33Z

Ok sorry for the long text of thinking out loud. I think I'm convinced it works as intended for continuous spaces of versions (would love a confirmation from your thinking). But there is still the problem of non unique representations in the case of discrete spaces where things like [a, b] + [next(b), c] should be represented instead by [a, c]. What do you think? Does this mean we need more complex equality implementations that are not derived from structural equality?

Eh2406 · 2022-06-07T18:55:53Z

I would love to be able to apply formal methods to make sure our implementations are correct. Even so, based on my current understanding, if the inputs are structurally valid then all of the operations are correct (and structurally valid).

But there is still the problem of non unique representations in the case of discrete spaces where things like [a, b] + [next(b), c] should be represented instead by [a, c]. What do you think? Does this mean we need more complex equality implementations that are not derived from structural equality?

I don't think this is critical to correctness. The algorithm may end up having to ask the dependency provider whether there are versions in ]b, next(b)[ and be told that there don't happen to be any. This is inefficient, leading to more calls to the dependency provider and more meaningless fluff in the explanation of an error, but it will still be correct. This inefficiency is why I think it is worth documenting the DiscreteRange so that people can optimize for it if it matters to their use case.

mpizenberg · 2022-06-07T19:12:57Z

I was more thinking about the part that computes sets relations as said in the guide:

Checking if a term is satisfied by another term is accomplished in the code by verifying if the intersection of the two terms equals the second term. It is thus very important that terms have unique representations, and by consequence also that ranges have a unique representation.

So I'm wondering if we could end up with a situation where we have two terms t1 and t2 as follows.

     a            b    c    next(c)   d         e
t1 = [-----------------]      [-----------------]
t2 =              [-------------------]

And the intersection is computed as [b, c] + [next(c), d], which is not structurally equal to [b, d], so the computed relation (satisfied, not satisfied, etc.) is incorrect, which in turn sends the solver down the wrong branch of the code.

mpizenberg · 2022-06-07T19:21:49Z

But yeah you're probably right. Maybe this is still correct in a sense, and will just push the algorithm into a state where it needs more work to be done, and not to a wrong state. And then the only inconvenience is performance.

mpizenberg · 2022-06-07T19:29:28Z

I'd still love if we could add a warning about potential non-unique representations of version sets in the code. And potential implications it may have, even if it turns out that in practice, with only valid input we never end up compromising the solver properties. At least have it documented in code comments somewhere in the bounded implementation.

When that is done, and if you guys are confident we can go forward then let's go with this :)

baszalmstra · 2022-06-09T07:40:51Z

I'll add a comment in the range module documentation!

baszalmstra · 2022-06-10T18:27:43Z

I added a comment, does it help explain this potential issue?

Eh2406 · 2022-06-16T13:12:37Z

@mpizenberg do you think this is good for merge?

mpizenberg · 2022-06-19T16:17:30Z

Sorry for the long time before answering @baszalmstra. I've been moving around a lot.

So I'd like to be extra clear that unique representation is an assumption made by the solver, and not following that constraint is a possible source of bugs. Until now, this was only a comment in the guide, but since we are making bounded segments available in the API and this clearly enables different representations, it needs to be clearly mentioned in the code. What do you guys think of a comment like the following one?

In order to advance the solver front, comparisons of version sets are necessary in the algorithm. To do those comparisons between two sets S1 and S2 we use the mathematical property that S1 ⊆ S2 if and only if S1 ∩ S2 == S1. We can thus compute an intersection and evaluate an equality to answer whether S1 is a subset of S2. But this means that the implementation of equality must be semantically correct. In practice, if equality is derived automatically, this means sets must have unique representations.

By migrating from a custom representation for discrete sets in v0.2 to a generic bounded representation for continuous sets in v0.3 we are potentially breaking that assumption in two ways:

Minimal and maximal Unbounded values can be replaced by their equivalent if it exists.

Simplifying adjacent bounds of discrete sets cannot be detected and automated in the generic intersection code.

An example for each can be given when T is u32. First, we can have both segments S1 = (Unbounded, Included(42u32)) and S2 = (Included(0), Included(42u32)) that represent the same segment but are structurally different. Thus, a derived equality check would answer false to S1 == S2 even though the two sets are equal.

Second, both segments S1 = (Included(1), Included(5)) and S2 = (Included(1), Included(3)) + (Included(4), Included(5)) are equal. But without asking the user to provide a bump function for discrete sets, the algorithm is not able to tell that the space between the right Included(3) bound and the left Included(4) bound is empty. Thus the algorithm is not able to reduce S2 to its canonical S1 form while computing set operations like intersections in the generic code.

We are aware that this behavior may be a source of hard-to-track bugs, but considering how the intersection code and the rest of the solver are currently implemented, we did not find this to lead to bugs in practice. So we are keeping the requirements simple and keeping a single generic implementation for now. We are also keeping this warning until there is a formal proof that the code cannot lead to error states.
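Both u32 examples in the proposed comment can be checked with a few lines of Rust. This is a hedged sketch: sets are modelled as plain vectors of (Bound<u32>, Bound<u32>) pairs rather than the crate's Range type, set_contains and same_members are hypothetical helpers, and membership is only compared over a finite sample 0..=limit.

```rust
use std::ops::Bound::{self, Included, Unbounded};
use std::ops::RangeBounds;

type Segment = (Bound<u32>, Bound<u32>);

/// True if any segment of the set contains `v`.
/// Uses the standard RangeBounds impl for (Bound<T>, Bound<T>).
fn set_contains(set: &[Segment], v: u32) -> bool {
    set.iter().any(|segment| segment.contains(&v))
}

/// Semantic comparison over a finite sample: do both sets agree on every value in 0..=limit?
fn same_members(a: &[Segment], b: &[Segment], limit: u32) -> bool {
    (0..=limit).all(|v| set_contains(a, v) == set_contains(b, v))
}

/// First example: Unbounded versus the smallest u32 as the start bound.
fn unbounded_vs_zero() -> (Vec<Segment>, Vec<Segment>) {
    (
        vec![(Unbounded, Included(42))],
        vec![(Included(0), Included(42))],
    )
}

/// Second example: one segment versus two adjacent segments with no u32 in between.
fn merged_vs_split() -> (Vec<Segment>, Vec<Segment>) {
    (
        vec![(Included(1), Included(5))],
        vec![(Included(1), Included(3)), (Included(4), Included(5))],
    )
}
```

For each pair, the derived == on the vectors answers false, while same_members answers true for any sample limit. That gap between structural and semantic equality is exactly what the warning describes.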

mpizenberg · 2022-06-19T16:35:02Z

If you guys are ok with my comment above to add it in the code, or something similar, that's my last nitpick I think. After that it's ready to merge in my opinion, so no need to wait for me if I'm not responsive in the coming days.

Eh2406 · 2022-06-21T15:15:28Z

As usual your writing is articulate and clear! I have no objection to adding that anywhere you would like.

I would prefer a softer version of the last paragraph. (But not enough to stop getting things actually merged.) How about:

This is likely to lead to user-facing, theoretically correct but practically nonsensical ranges, like (Unbounded, Excluded(0)) or (Excluded(6), Excluded(7)). In general, nonsensical inputs often lead to hard-to-track bugs. But as far as we can tell, this should work in practice. So for now this crate only provides an implementation for continuous ranges. With the v0.3 API the user could choose to bring back the discrete implementation from v0.2, as documented in the guide. If doing so regularly fixes bugs seen by users, we will bring it back into the core library. If we do not see practical bugs, or we get a formal proof that the code cannot lead to error states, then we may remove this warning.

mpizenberg · 2022-06-21T17:14:28Z

Yep, your variation of the text is good too.
The best place should be the code documentation of the module where we have our bounded implementation of version sets, so currently I believe it's the range module. Where @baszalmstra chose to put it is fine I think.

baszalmstra · 2022-06-22T06:41:13Z

@Eh2406 I'm on holiday next week, are you able to make the above changes? “Allow edits by maintainers” is enabled.

Eh2406 · 2022-06-22T14:14:17Z

I will try. Enjoy your holiday!

baszalmstra · 2022-06-25T06:38:35Z

I just thought of something. I remember that in some places in the code a comparison is made to an empty set (like here

pubgrub/src/term.rs

Line 146 in 717289b

terms_intersection.intersection(self) == Self::empty()

). However, if we can have multiple representations of the empty set, like in the case with (Unbounded, Excluded(0)), the check will fail! I think that's one case where the solver will not properly progress further, right?
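A small sketch of why the comparison with the empty set can miss such cases when versions are u32. The canonical empty set is assumed here to be a set with no segments, and is_empty_u32 is a hypothetical discrete-aware check written only for this illustration; the generic bounded implementation cannot do this without knowing how to step between versions.

```rust
use std::ops::Bound::{self, Excluded, Included, Unbounded};

type Segment = (Bound<u32>, Bound<u32>);

/// Smallest u32 admitted by a start bound, or None if no value qualifies.
fn first_value(start: &Bound<u32>) -> Option<u32> {
    match start {
        Unbounded => Some(0),
        Included(s) => Some(*s),
        Excluded(s) => s.checked_add(1),
    }
}

/// Largest u32 admitted by an end bound, or None if no value qualifies.
fn last_value(end: &Bound<u32>) -> Option<u32> {
    match end {
        Unbounded => Some(u32::MAX),
        Included(e) => Some(*e),
        Excluded(e) => e.checked_sub(1),
    }
}

/// Semantic emptiness for u32: every segment admits no value at all.
fn is_empty_u32(set: &[Segment]) -> bool {
    set.iter().all(|(start, end)| {
        match (first_value(start), last_value(end)) {
            (Some(lo), Some(hi)) => lo > hi,
            _ => true,
        }
    })
}

/// The structurally valid but semantically empty set from the comment above.
fn odd_empty() -> Vec<Segment> {
    vec![(Unbounded, Excluded(0))]
}
```

Here odd_empty() != Vec::new() under derived equality, yet is_empty_u32 reports both as empty, so a check of the form x == Self::empty() would answer false for a set that holds no versions.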

mpizenberg · 2022-06-25T09:54:38Z

Please give me the afternoon before merging. There is something I want to check.

Eh2406 · 2022-06-25T16:34:39Z

I will try. Enjoy your holiday!

I did not get time over this work week for any open source work. Sorry. I have time today, if I can still be helpful.

However, if we can have multiple representations of the empty set, like in the case with (Unbounded, Excluded(0)), the check will fail! I think that's one case where the solver will not properly progress further, right?

It is https://github.com/pubgrub-rs/pubgrub/blob/dev/src/term.rs#L154 in the non-test code.
I think this is exactly the case where the solver will waste some cycles.
It should return Relation::Contradicted but will return Relation::Inconclusive.
That in turn affects https://github.com/pubgrub-rs/pubgrub/blob/dev/src/internal/incompatibility.rs#L219
It should return Relation::Contradicted but will return Relation::AlmostSatisfied or Relation::Inconclusive.
This is all happening in unit_propagation https://github.com/pubgrub-rs/pubgrub/blob/dev/src/internal/core.rs#L113
If relation gives Relation::AlmostSatisfied, then it will remove the (Unbounded, Excluded(0)) from the partial_solution. So, the problem will resolve itself next cycle.
If relation gives Relation::Inconclusive, then the mess stays in the partial_solution as is. When https://github.com/pubgrub-rs/pubgrub/blob/release/src/solver.rs#L110 the dependency_provider picks that package it will return that there are no versions in (Unbounded, Excluded(0)), and the problem will resolve itself next cycle.

Eh2406

Looking good. Document as you wish and Merge when you are happy.

mpizenberg · 2022-06-25T21:20:57Z

Never mind, I realized that what I wanted to check for the result of intersections was already in the check_invariants function. Though while re-reading that function, I removed the first for loop, which is already covered by the following for loop.

I also made two changes that make the random generator of ranges more restrictive. (1) If delta is 0 within a segment, only doubly inclusive segments are valid. There cannot be an inclusive and an exclusive bound, because that's an empty segment and these are not valid, as per the check_invariants function. And (2) if delta is 0 between two segments, it can only be a double exclusive, for the same reason: otherwise we have an empty space between two segments, and this is forbidden in check_invariants.

Considering these two situations were supposed to be possible previously in the random generator, I'm surprised we didn't end up with failing tests due to the call to check_invariants at the end of the generator. Do you have any idea why?

mpizenberg · 2022-06-25T21:37:45Z

I also renamed the variable start_bounded into start_unbounded in the generator. I think it was unintentionally swapped. Let me know if I'm mistaken.
Otherwise I think it's now good to merge!
Thanks a lot @baszalmstra and @Eh2406

* refactor: replace Range with a bounded implementation
* fix: rewrite range proptest strategy
* fix: deserialize SmallVec without Vec alloc
* fix: remove not_equals
* fix: re-add union and remove early out
* fix: renamed V to VS in bench
* refactor: simpler intersection
Co-authored-by: Jacob Finkelman <[email protected]>
* test: use deltas for range strategy
Co-authored-by: Jacob Finkelman <[email protected]>
* docs(range): added comment about discrete values
* More restrictive for valid random range generation
* Remove duplicate check in check_invariants
* Add warning about non-unique ranges representations
* Rename start_bounded into start_unbounded
Co-authored-by: Jacob Finkelman <[email protected]>
Co-authored-by: Matthieu Pizenberg <[email protected]>

refactor: replace Range with a bounded implementation

c3a9567

baszalmstra mentioned this pull request May 25, 2022

feat: add inclusive and exclusive bounds to Range #111

Closed