refactor: replace Range with a bounded implementation #112

Merged on Jun 25, 2022 (13 commits)

Conversation

baszalmstra
Contributor

This PR replaces Range<T> with an implementation that allows inclusive or exclusive bounds to be used. This enables T to be any type that implements Ord + Clone and doesn't require the Version trait.
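
To make the idea concrete, here is a minimal sketch of a single segment with inclusive or exclusive bounds, built on std::ops::Bound. The Segment alias and the segment_contains helper are hypothetical names used for illustration and are not taken from this PR; the sketch only assumes T: Ord + Clone, as stated above.

```rust
use std::ops::Bound::{self, Excluded, Included, Unbounded};

/// Hypothetical stand-in for one segment of a range: a start and an end bound.
type Segment<T> = (Bound<T>, Bound<T>);

/// Returns true if `v` lies inside `segment`.
fn segment_contains<T: Ord + Clone>(segment: &Segment<T>, v: &T) -> bool {
    let after_start = match &segment.0 {
        Included(s) => v >= s,
        Excluded(s) => v > s,
        Unbounded => true,
    };
    let before_end = match &segment.1 {
        Included(e) => v <= e,
        Excluded(e) => v < e,
        Unbounded => true,
    };
    after_start && before_end
}
```

With this, (Included(1), Excluded(3)) contains 1 and 2 but not 3, and the same helper works for any ordered type, such as &str, without a Version-like trait.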

It also renames some of the functions from Range to be more aligned with the names used in the VersionSet trait.

This is a cleaned-up version of #111 .


- fn bench<'a, P: Package + Deserialize<'a>, V: Version + Hash + Deserialize<'a>>(
+ fn bench<'a, P: Package + Deserialize<'a>, V: VersionSet + Deserialize<'a>>(
Member

Can we call it VS: rather than V:? I think that will be more consistent with our usage in the rest of the package.
Then the where clause can become where VS::V: Deserialize<'a>.

Contributor Author

Done

@Eh2406 Eh2406 added this to the v0.3 milestone May 27, 2022
Member

@Eh2406 left a comment

This looks good at this point. I have to admit my eyes glazed over while looking at intersection, but I have confidence in our tests.

src/range.rs Outdated
.prop_map(|((start_bounded, end_bounded), mut vec)| {
// Ensure the bounds are increasing and non-repeating
vec.sort_by_key(|(value, _)| *value);
vec.dedup_by_key(|(value, _)| *value);
Member

Thinking about this more I'm now realizing that this cannot generate [(Unbounded, Excluded(1)), (Excluded(1), Unbounded)]. But I'm not seeing an obvious way to fix it.
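
The limitation can be reproduced in isolation. The sketch below applies the same sort and dedup steps to a list of (value, inclusive) pairs; sort_and_dedup is a hypothetical helper, not code from the PR. The target range needs the value 1 twice, once as an end bound and once as the next start bound, but deduplicating by value keeps only one of them.

```rust
/// Mimics the sort + dedup step of the strategy on (value, inclusive) pairs.
fn sort_and_dedup(mut bounds: Vec<(u32, bool)>) -> Vec<(u32, bool)> {
    // Ensure the bounds are increasing...
    bounds.sort_by_key(|(value, _)| *value);
    // ...and non-repeating: any value that appears twice is collapsed to one.
    bounds.dedup_by_key(|(value, _)| *value);
    bounds
}
```

Calling sort_and_dedup(vec![(1, false), (1, false)]) returns a single (1, false) entry, so two bounds that share a value can never come out of this step.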

Member

Here is an alternative strategy: ad38972. Thoughts?

Member

I like the idea of using deltas instead of sorting values! I have not read the generation code in full detail; I think it needs a few more comments and some explanation. I also thought there might be a logic swap between "bounded" and "unbounded" in the names, but it may be that I just didn't pay enough attention (either way, it may need more comments).

Here is what I have in mind after reading your proposal @Eh2406 :

  • generate random start that is one of included 0 ( = unbounded ) | excluded 0 | non 0 (included or excluded)
  • generate random vec of deltas
  • dedup successive 0 deltas, there can only be one 0 surrounded by non zeros
  • for each delta, alternate between start and end bounds
  • generate random Excluded | Included tags for each bound (with valid constraints for 0 deltas)
  • if the last bound is a start bound (depends on start and length of generated vector), add an unbounded at the end
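
The outline above could be sketched as a deterministic builder, leaving the random generation of its inputs to proptest. This is an illustration under assumptions, not the strategy committed in the PR: segments_from_deltas, tag, and the (delta, inclusive) step encoding are hypothetical, values are u32, and deltas are assumed small enough not to overflow. A 0 delta reuses the previous value at most once; a second consecutive 0 is treated as 1.

```rust
use std::ops::Bound::{self, Excluded, Included, Unbounded};

type Segment = (Bound<u32>, Bound<u32>);

/// Turns a value and an inclusive flag into a bound.
fn tag(value: u32, inclusive: bool) -> Bound<u32> {
    if inclusive { Included(value) } else { Excluded(value) }
}

/// Builds segments from (delta, inclusive) steps, alternating start and end bounds.
/// Values only ever grow, so no sorting or deduplication is needed.
fn segments_from_deltas(start_unbounded: bool, steps: &[(u32, bool)]) -> Vec<Segment> {
    let mut segments: Vec<Segment> = Vec::new();
    // `Some(bound)` while a segment is open and waiting for its end bound.
    let mut open: Option<Bound<u32>> = if start_unbounded { Some(Unbounded) } else { None };
    let mut last: Option<u32> = None; // value of the previous bound, if any
    let mut shared = false; // did the previous bound reuse the value before it?
    for &(delta, inclusive) in steps {
        let reuse = delta == 0 && !shared && last.is_some();
        let value = match last {
            Some(v) if reuse => v,
            Some(v) => v + delta.max(1),
            None => delta,
        };
        match open.take() {
            // Closing a segment: equal bounds are only valid when both are inclusive.
            Some(start) => {
                let segment = if reuse {
                    (Included(value), Included(value))
                } else {
                    (start, tag(value, inclusive))
                };
                segments.push(segment);
            }
            // Opening a segment: sharing a value with the previous end
            // is only valid when both bounds are exclusive.
            None => {
                if reuse {
                    if let Some(prev) = segments.last_mut() {
                        prev.1 = Excluded(value);
                    }
                    open = Some(Excluded(value));
                } else {
                    open = Some(tag(value, inclusive));
                }
            }
        }
        last = Some(value);
        shared = reuse;
    }
    // A dangling start bound gets an unbounded end.
    if let Some(start) = open {
        segments.push((start, Unbounded));
    }
    segments
}
```

For example, segments_from_deltas(true, &[(1, false), (0, false)]) yields [(Unbounded, Excluded(1)), (Excluded(1), Unbounded)], the range that the sort-based strategy could not produce.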

If that's already what you did, great! (You can add comments then.) If not, can you please let us know how it differs, and maybe show how your generator is better, or how we can mix this rough outline with the one you proposed to get something that has the best properties?

Contributor Author

I tried rewriting the strategy based on your proposal (starting from @Eh2406's code) and found that it's almost exactly what you describe, @mpizenberg. I've committed a version that includes comments (c44639a).

}

- /// Compute the intersection of two sets of versions.
+ /// Computes the intersection of two sets of versions.
pub fn intersection(&self, other: &Self) -> Self {
Member

@Eh2406 May 29, 2022

I spent time today trying to write a "more obviously correct" intersection, with the same perf. I did not succeed. I did find this method helpful for catching corner cases:

    fn check_invariants(&self) {
        // Only run the checks when debug assertions are enabled.
        if cfg!(debug_assertions) {
            // Unbounded may only start the first segment or end the last one.
            for (i, (s, e)) in self.segments.iter().enumerate() {
                if matches!(s, Unbounded) && i != 0 {
                    panic!()
                }
                if matches!(e, Unbounded) && i != (self.segments.len() - 1) {
                    panic!()
                }
            }
            // Consecutive segments must be in increasing order and must not touch or overlap.
            for p in self.segments.as_slice().windows(2) {
                match (&p[0].1, &p[1].0) {
                    (Included(l_end), Included(r_start)) => assert!(l_end < r_start),
                    (Included(l_end), Excluded(r_start)) => assert!(l_end < r_start),
                    (Excluded(l_end), Included(r_start)) => assert!(l_end < r_start),
                    (Excluded(l_end), Excluded(r_start)) => assert!(l_end <= r_start),
                    (_, Unbounded) => panic!(),
                    (Unbounded, _) => panic!(),
                }
            }
            // Each segment must have its start before its end (equal only if both are inclusive).
            for (s, e) in self.segments.iter() {
                assert!(match (s, e) {
                    (Included(s), Included(e)) => s <= e,
                    (Included(s), Excluded(e)) => s < e,
                    (Excluded(s), Included(e)) => s < e,
                    (Excluded(s), Excluded(e)) => s < e,
                    (Unbounded, _) | (_, Unbounded) => true,
                });
            }
        }
    }

Perhaps we can add a call to it to the end of all methods that construct a range?

Member

@Eh2406 May 30, 2022

This is as close as I came:
6ac1e06?diff=split

Perf is slightly worse, but I find it more readable. What do you think?

Contributor Author

I do like that it's a lot shorter and easier to follow; my code had a lot of cases. How much worse is the performance?

Member

Rerunning the benchmarks now I am not seeing significant differences between our two implementations! I guess I shouldn't try benchmarking at 2 o'clock in the morning.

Contributor Author

Haha, I'll copy your implementation into the MR. Should I also add the invariant check?

Contributor Author

Or can I simply merge your changes?

Member

Merge, copy, or rewrite, as you wish.

Member

I like the check invariants function! Isn't the check in the first for loop already covered by the second one, btw?

Contributor Author

I cherry-picked your code (crediting you) and added the suggestion about next_if by @mpizenberg

@baszalmstra baszalmstra force-pushed the feat/inclusive_range branch from 536f041 to a1ea900 Compare May 29, 2022 20:17
@Eh2406
Member

Eh2406 commented Jun 6, 2022

I am comfortable merging this. I like my impl of intersection, but happy either way.
I would like to see a fix for #112 (comment), but do not see how.
@mpizenberg, what are you thinking about this PR?

@mpizenberg
Member

@Eh2406 I very much prefer the simplicity of your intersection function! It's clear and shorter, so much easier to maintain and to serve as an example for others. So if there is no performance cost, or only a slight one, I'm in favor of this.

Regarding proptest generators, I'd also be more confident if we could generate all kinds of possible version sets, so I'd like to have something like what @Eh2406 is proposing. I added some comments of my own on that proposal. @baszalmstra and @Eh2406, I'd love some feedback on those.

Side note. I've been a bit busy and also have an event this weekend. But I've arranged to start early July with my next job, so I should have the second part of June to make a push toward my docs goals.

baszalmstra and others added 2 commits June 7, 2022 18:15
@baszalmstra
Contributor Author

I added the intersection code and proptest strategy (with comments) from @Eh2406 . Let me know what you guys think!

(Included(s), Excluded(e)) => s < e,
(Excluded(s), Included(e)) => s < e,
(Excluded(s), Excluded(e)) => s < e,
(Unbounded, _) | (_, Unbounded) => true,
Member

Actually, there might be special cases of representation where we have an invalid segment that cannot be checked by this function, like (Unbounded, Excluded(0)) when versions are u32. It depends on whether there are other ways to represent the left-most and right-most bounds with the version type.

What do you guys think of that? Negligible?

@mpizenberg
Member

Another remark about something I'm just realizing now: could bounds for discrete version sets introduce comparison errors? Can we end up with the following two sets of u32 that are the same but not structurally equal?

1: [3, 7]
2: [3, 4] + [5, 7]

Actually, now that I think again about the simple intersection code, I'm not even sure it preserves the "uniqueness of representation" property required for set comparisons. I didn't see special handling of identical boundaries. For example, what happens if we ask for the union of [x, y] and [y, z]? Does it answer [x, z] as it should?

If we translate this union into an intersection job, it means we will compute the intersection of ]_, x[ + ]y, _[ with ]_, y[ + ]z, _[. If I follow the simpler algorithm, that intersection gives ]_,x[ + ]y,y[ + ]z,_[ and since ]y,y[ is invalid, that segment is not pushed, so the result is ]_,x[ + ]z,_[ and inverting gives the correct [x,z]. Okay! (sorry for thinking out loud with text)
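
The rewriting used here is De Morgan's law: the union is the complement of the intersection of the complements. The sketch below checks that identity on a small finite universe of u32 values using BTreeSet rather than bounded segments, so it illustrates the identity only, not the segment-level algorithm; complement and union_via_intersection are hypothetical helpers.

```rust
use std::collections::BTreeSet;

/// Complement of `set` within the finite universe `0..universe`.
fn complement(set: &BTreeSet<u32>, universe: u32) -> BTreeSet<u32> {
    (0..universe).filter(|v| !set.contains(v)).collect()
}

/// Union expressed as the complement of the intersection of the complements.
fn union_via_intersection(a: &BTreeSet<u32>, b: &BTreeSet<u32>, universe: u32) -> BTreeSet<u32> {
    let not_a = complement(a, universe);
    let not_b = complement(b, universe);
    let both: BTreeSet<u32> = not_a.intersection(&not_b).copied().collect();
    complement(&both, universe)
}
```

Taking a = {2, 3, 4} and b = {4, 5, 6} in a universe of 0..10 plays the role of [x, y] and [y, z] sharing the boundary y = 4; the result is {2, ..., 6}, matching [x, z].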

But is it possible to arrive at a situation where we have already pushed (a,b) and b is now also the start of a new segment? If that were possible, it would mean that b is at the same time the smallest of two end bounds and the biggest of two start bounds. But if the input sets are valid, consecutive segments in each VersionSet are valid, meaning this is only possible if the end bound comes from one input and the start bound from the other input. But if we pick the end bound of one input, it means the end bound of the other is >= the one we picked. So the following start bound of that other input is necessarily valid since it was valid for that other segment. Okay, sorry for being wordy. Let me write it in math.

Let vs1 and vs2 be two valid version sets whose intersection we want to compute.
Let (a1, b1) + (c1, d1) be a part of vs1 composed of two segments and (a2, b2) + (c2, d2) be a part of vs2 also composed of two segments.
If b1 <= b2 then (_, b1) + (c2,_) is also a valid succession of segments so we are all good when picking the smallest of two end bounds when computing intersections.

@mpizenberg
Member

mpizenberg commented Jun 7, 2022

Ok, sorry for the long text of thinking out loud. I think I'm convinced it works as intended for continuous spaces of versions (I would love confirmation from you). But there is still the problem of non-unique representations in the case of discrete spaces, where things like [a, b] + [next(b), c] should be represented instead by [a, c]. What do you think? Does this mean we need more complex equality implementations that are not derived from structural equality?

@Eh2406
Member

Eh2406 commented Jun 7, 2022

I would love to be able to apply formal methods to make sure our implementations are correct. Even so, based on my current understanding, if the inputs are structurally valid then all of the operations are correct (and structurally valid).

But there is still the problem of non unique representations in the case of discrete spaces where things like [a, b] + [next(b), c] should be represented instead by [a, c]. What do you think? Does this mean we need more complex equality implementations that are not derived from structural equality?

I don't think this is critical to correctness. The algorithm may end up having to ask the dependency provider whether there are versions in ]b, next(b)[ and be told by the dependency provider that there don't happen to be any such versions. This is inefficient, leading to more calls to the dependency provider and more meaningless fluff in the explanation of an error, but it will still be correct. This inefficiency is why I think it is worth documenting the DiscreteRange so that people can optimize for it if it matters to their use case.

@mpizenberg
Member

I was more thinking about the part that computes sets relations as said in the guide:

Checking if a term is satisfied by another term is accomplished in the code by verifying if the intersection of the two terms equals the second term. It is thus very important that terms have unique representations, and by consequence also that ranges have a unique representation.

So wondering if we could end with a situation where we have two terms t1 and t2 as follows.

     a            b    c    next(c)   d         e
t1 = [-----------------]      [-----------------]
t2 =              [-------------------]

And the intersection is computed as [b, c] + [next(c), d] which is not structurally equal to [b, d], resulting in the relation computed (satisfied, not satisfied, etc) being incorrect, and messing up with the following branch of the code taken in the solver.
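
A small sketch of the concern, using std::ops::Bound tuples as segments over u32. The members, canonical, and split helpers are hypothetical and only serve this example; the concrete values b = 2, c = 4, next(c) = 5, d = 7 are chosen arbitrarily. The two representations contain exactly the same integers, yet a derived (structural) equality reports them as different.

```rust
use std::ops::Bound::{self, Included};
use std::ops::RangeBounds;

type Segment = (Bound<u32>, Bound<u32>);

/// Lists the values of `0..limit` contained in any of the segments.
fn members(segments: &[Segment], limit: u32) -> Vec<u32> {
    (0..limit)
        .filter(|v| segments.iter().any(|seg| seg.contains(v)))
        .collect()
}

/// [b, d] with b = 2 and d = 7.
fn canonical() -> Vec<Segment> {
    vec![(Included(2), Included(7))]
}

/// [b, c] + [next(c), d] with c = 4 and next(c) = 5.
fn split() -> Vec<Segment> {
    vec![(Included(2), Included(4)), (Included(5), Included(7))]
}
```

Here canonical() != split() under the derived PartialEq, while members(&canonical(), 10) and members(&split(), 10) are identical, which is exactly the situation where an intersection-equals-term check could give a different relation than intended.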

@mpizenberg
Member

But yeah you're probably right. Maybe this is still correct in a sense, and will just push the algorithm into a state where it needs more work to be done, and not to a wrong state. And then the only inconvenience is performance.

@mpizenberg
Member

I'd still love if we could add a warning about potential non-unique representations of version sets in the code. And potential implications it may have, even if it turns out that in practice, with only valid input we never end up compromising the solver properties. At least have it documented in code comments somewhere in the bounded implementation.

When that is done, and if you guys are confident we can go forward then let's go with this :)

@baszalmstra
Contributor Author

I'll add a comment in the range module documentation!

@baszalmstra
Contributor Author

I added a comment, does it help explain this potential issue?

@Eh2406
Member

Eh2406 commented Jun 16, 2022

@mpizenberg do you think this is good for merge?

@mpizenberg
Member

mpizenberg commented Jun 19, 2022

Sorry for the long time before answering @baszalmstra. I've been moving around a lot.

So I'd like to be extra clear that unique representations is an assumption made by the solver and not following that constraint is a possible source of bugs. Until now, this was only a comment in the guide, but since we are making bounded segments available in the API and it clearly enables different representations this needs to be clearly mentioned in the code. What do you guys think of a comment like the following one.

In order to advance the solver front, comparisons of version sets are necessary in the algorithm. To do those comparisons between two sets S1 and S2 we use the mathematical property that S1 ⊆ S2 if and only if S1 ∩ S2 == S1. We can thus compute an intersection and evaluate an equality to answer whether S1 is a subset of S2. But this means that the implementation of equality must be semantically correct. In practice, if equality is derived automatically, this means sets must have unique representations.

By migrating from a custom representation for discrete sets in v0.2 to a generic bounded representation for continuous sets in v0.3 we are potentially breaking that assumption in two ways:

  1. Minimal and maximal Unbounded values can be replaced by their equivalent if it exists.
  2. Simplifying adjacent bounds of discrete sets cannot be detected and automated in the generic intersection code.

An example for each can be given when T is u32. First, we can have both segments S1 = (Unbounded, Included(42u32)) and S2 = (Included(0), Included(42u32)) that represent the same segment but are structurally different. Thus, a derived equality check would answer false to S1 == S2 while it's true.

Second, both S1 = (Included(1), Included(5)) and S2 = (Included(1), Included(3)) + (Included(4), Included(5)) are equal. But without asking the user to provide a bump function for discrete sets, the algorithm is not able to tell that the space between the right Included(3) bound and the left Included(4) bound is empty. Thus the algorithm is not able to reduce S2 to its canonical S1 form while computing set operations like intersections in the generic code.

We are aware that this behavior may be a source of hard-to-track bugs, but considering how the intersection code and the rest of the solver are currently implemented, we did not find this to lead to bugs in practice. So we are keeping the requirements simple and keeping a single generic implementation for now. We are also keeping this warning until there is a formal proof that the code cannot lead to error states.
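
The subset test described in this proposed comment can be sketched on plain sets. The example uses BTreeSet<u32>, whose equality is already semantic because each set has a single representation; is_subset_via_intersection is a hypothetical helper, not the solver's code.

```rust
use std::collections::BTreeSet;

/// Tests whether `s1` is a subset of `s2` by checking that `s1 ∩ s2 == s1`.
fn is_subset_via_intersection(s1: &BTreeSet<u32>, s2: &BTreeSet<u32>) -> bool {
    let intersection: BTreeSet<u32> = s1.intersection(s2).copied().collect();
    &intersection == s1
}
```

With s1 = {3, 4, 5} and s2 = {1, ..., 7}, the helper agrees with BTreeSet::is_subset in both directions. The same test on bounded segments is only reliable if the equality used in the last step treats every representation of the same set as equal.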

@mpizenberg
Member

If you guys are ok with my comment above to add it in the code, or something similar, that's my last nitpick I think. After that it's ready to merge in my opinion, so no need to wait for me if I'm not responsive in the coming days.

@Eh2406
Member

Eh2406 commented Jun 21, 2022

As usual your writing is articulate and clear! I have no objection to adding that anywhere you would like.

I would prefer a softer version of the last paragraph. (But not enough to stop getting things actually merged.) How about:

This is likely to lead to user-facing ranges that are theoretically correct but practically nonsensical, like (Unbounded, Excluded(0)) or (Excluded(6), Excluded(7)). In general, nonsensical inputs often lead to hard-to-track bugs. But as far as we can tell this should work in practice. So for now this crate only provides an implementation for continuous ranges. With the v0.3 API the user could choose to bring back the discrete implementation from v0.2, as documented in the guide. If doing so regularly fixes bugs seen by users, we will bring it back into the core library. If we do not see practical bugs, or we get a formal proof that the code cannot lead to error states, then we may remove this warning.

@mpizenberg
Member

Yep, your variation of the text is good too.
The best place should be the code documentation of the module where we have our bounded implementation of version sets, so currently I believe it's the range module. Where @baszalmstra chose to put it is fine I think.

@baszalmstra
Contributor Author

@Eh2406 I'm on holiday next week; are you able to make the above changes? “Allow edits by maintainers” is enabled.

@Eh2406
Member

Eh2406 commented Jun 22, 2022

I will try. Enjoy your holiday!

@baszalmstra
Contributor Author

I just thought of something. I remember that in some places in the code a comparison is made to an empty set, like here:

terms_intersection.intersection(self) == Self::empty()

However, if we can have multiple representations of the empty set, as with (Unbounded, Excluded(0)), the check will fail! I think that's one case where the solver will not properly progress further, right?
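
The mismatch can be shown with a hedged sketch over u32. Both helpers are hypothetical: is_structurally_empty stands in for a comparison against an empty list of segments, and has_no_member_below samples only the values in 0..limit rather than the whole type.

```rust
use std::ops::Bound::{self, Excluded, Unbounded};
use std::ops::RangeBounds;

type Segment = (Bound<u32>, Bound<u32>);

/// Structural emptiness: there are no segments at all.
fn is_structurally_empty(segments: &[Segment]) -> bool {
    segments.is_empty()
}

/// Sampled semantic emptiness: no value in `0..limit` falls in any segment.
fn has_no_member_below(segments: &[Segment], limit: u32) -> bool {
    !(0..limit).any(|v| segments.iter().any(|seg| seg.contains(&v)))
}

/// The problematic representation: everything strictly below 0.
fn below_zero() -> Vec<Segment> {
    vec![(Unbounded, Excluded(0))]
}
```

For below_zero(), is_structurally_empty returns false even though no u32 can satisfy the segment, so an equality check against the empty set would not recognize it as empty.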

@mpizenberg
Member

Please give me the afternoon to check something before merging.

@Eh2406
Member

Eh2406 commented Jun 25, 2022

I will try. Enjoy your holiday!

I did not get time over this work week for any open source work. Sorry. I have time today, if I can still be helpful.

However, if we can have multiple representations of the empty set, like in the case with (Unbounded, Excluded(0)) the check will fail! I think thats one case where the solver will not properly progress further right?

It is https://github.com/pubgrub-rs/pubgrub/blob/dev/src/term.rs#L154 in the non-test code.
I think this is exactly the case where the solver will waste some cycles.
It should return Relation::Contradicted but will return Relation::Inconclusive.
That will make https://github.com/pubgrub-rs/pubgrub/blob/dev/src/internal/incompatibility.rs#L219
It should return Relation::Contradicted but will return Relation::AlmostSatisfied or Relation::Inconclusive.
This is all happening in unit_propagation https://github.com/pubgrub-rs/pubgrub/blob/dev/src/internal/core.rs#L113
If relation gives Relation::AlmostSatisfied, then it will remove the (Unbounded, Excluded(0)) from the partial_solution. So, the problem will resolve itself next cycle.
If relation gives Relation::Inconclusive, then the mess stays in the partial_solution as is. When https://github.com/pubgrub-rs/pubgrub/blob/release/src/solver.rs#L110 the dependency_provider picks that package it will return that there are no versions in (Unbounded, Excluded(0)), and the problem will resolve itself next cycle.

Member

@Eh2406 left a comment

Looking good. Document as you wish and Merge when you are happy.

@mpizenberg
Member

mpizenberg commented Jun 25, 2022

Never mind, I realized that what I wanted to check for the result of intersections was already in the check_invariants function. Though, while re-reading that function, I removed the first for loop, which is already covered by the following for loop.

I also made two more restrictive changes in the random generator of ranges. (1) If the delta is 0 within a segment, only a double inclusive segment is valid. There cannot be an inclusive and an exclusive bound because that's an empty segment and these are not valid, as per the check_invariants function. And (2) if the delta is 0 between two segments, it can only be a double exclusive, for the same reason: otherwise we have an empty space between two segments and this is forbidden in check_invariants.
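
Both restrictions can be read off the comparison rules quoted earlier in this thread. The sketch below restates them as two standalone predicates over u32 bounds; valid_segment and valid_gap are hypothetical names, and the match arms mirror the check_invariants snippet rather than the final merged code.

```rust
use std::ops::Bound::{self, Excluded, Included, Unbounded};

/// A segment is valid if its start is before its end;
/// equal bounds are only allowed when both are inclusive.
fn valid_segment(start: &Bound<u32>, end: &Bound<u32>) -> bool {
    match (start, end) {
        (Included(s), Included(e)) => s <= e,
        (Included(s), Excluded(e))
        | (Excluded(s), Included(e))
        | (Excluded(s), Excluded(e)) => s < e,
        (Unbounded, _) | (_, Unbounded) => true,
    }
}

/// The gap between two consecutive segments is valid if they are ordered;
/// equal bounds are only allowed when both are exclusive.
fn valid_gap(left_end: &Bound<u32>, right_start: &Bound<u32>) -> bool {
    match (left_end, right_start) {
        (Excluded(l), Excluded(r)) => l <= r,
        (Included(l), Included(r))
        | (Included(l), Excluded(r))
        | (Excluded(l), Included(r)) => l < r,
        // An unbounded end or start cannot sit between two segments.
        (Unbounded, _) | (_, Unbounded) => false,
    }
}
```

With a delta of 0, both bounds carry the same value v: valid_segment accepts only (Included(v), Included(v)), and valid_gap accepts only (Excluded(v), Excluded(v)), which matches restrictions (1) and (2).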

Considering these two situations were supposed to be possible previously in the random generator, I'm surprised we didn't end with failing tests due to the call to check_invariants at the end of the generator. Do you have any idea?

@mpizenberg
Member

I also renamed the variable start_bounded to start_unbounded in the generator. I think it was unintentionally swapped. Let me know if I'm mistaken.
Otherwise I think it's now good to merge!
Thanks a lot @baszalmstra and @Eh2406

@Eh2406 Eh2406 merged commit 15ac3c7 into pubgrub-rs:dev Jun 25, 2022
zanieb pushed a commit to astral-sh/pubgrub that referenced this pull request Nov 8, 2023
* refactor: replace Range with a bounded implementation

* fix: rewrite range proptest strategy

* fix: deserialize SmallVec without Vec alloc

* fix: remove not_equals

* fix: re-add union and remove early out

* fix: renamed V to VS in bench

* refactor: simpler intersection

Co-authored-by: Jacob Finkelman

* test: use deltas for range strategy

Co-authored-by: Jacob Finkelman

* docs(range): added comment about discrete values

* More restrictive for valid random range generation

* Remove duplicate check in check_invariants

* Add warning about non-unique ranges representations

* Rename start_bounded into start_unbounded

Co-authored-by: Jacob Finkelman
Co-authored-by: Matthieu Pizenberg
This was referenced Nov 15, 2023

3 participants