Spike: Bug compatibility in Python Protocol API versioning #7477

SyntaxColoring · 2021-03-12T16:33:35Z

The Python Protocol API has a large surface area of undefined behavior
- Partially due to lack of parameter validation
The API sometimes also has bugs
I feel it's unduly difficult to fix these bugs because of our versioning scheme: the current version of the software must try to be bug-for-bug compatible with all prior versions of the software
This requires fully characterizing the behavior of existing bugs, which is laborious and error-prone.
- Bugs are sometimes the interaction of disparate areas of software. Maintaining those kinds of bugs sort of means we have to maintain the architecture that led to them.

What can/should we do about this?

Deliverables

Document describing:

Background (what problems do we feel exist, and why are they bad)
- Concrete examples
- We want to make it easier to develop correct protocols without scrutinizing a wall-of-text run log or watching the robot move
Non-exhaustive list of directions for solutions (at a high level)
Prior art?

Purpose

Aligning, Software-wide (or at least CPX-wide), on what to do about this.

Doing nothing is a valid answer.
Policy-level solutions are a valid answer (e.g. loosen our versioning policy).
Technical solutions are a valid answer (e.g. multiple versions of the Python API installed concurrently, via virtualenvs or whatever).

Implications for Protocol Engine?

Audience

Depending on the solutions that are explored:

All robotics platform software developers
Product (i.e. @mattwelch)
Applications Engineering
Support maybe (the high-level parts)

SyntaxColoring · 2021-03-12T16:50:09Z

Probably has implications with Protocol Engine...

We suspect PE would warrant a major software version bump anyway...maybe this is an opportunity to batch up "breaking" "changes" to the Python API like bug fixes like fix(api): Error instead of infinite-looping when disposal_volume is too high to make progress #6754?
Still, our bug-compatibility versioning policy will still be a problem in PE. How do we deal with that? We can "reset ourselves" with a 4.0->5.0 version bump, but the hairs will still eventually grow as long as we have our current versioning policy. (They'll just grow in different places.)

Mitigation strategies?

Make exhaustive unhappy-path tests a hard requirement of PE?
Make very strict parameter validation a hard requirement of PE?

Other things PE should consider?

Facilitation of non-fatal warnings from PE? (supporting app: Display warning messages #3536)
- Source mapping (step or line of code)?
- Don't necessarily want it to be part of the run log

mcous · 2021-03-12T18:25:32Z

We suspect PE would warrant a major software version bump anyway... with a 4.0->5.0 version bump

To add a little clarity here in case anyone reads this and panics: (my) thinking is that it would be a v3 of specifically the Python protocol API, without dropping support for v2 (e.g. by keeping some version of non-protocol-engine execution around) and without making any fundamental changes to the API itself. So, little to no rewriting of protocols, but a bump to make sure folks safely opt-in to the new execution engine.

If we don't drop support for Python Protocol API v2, then the system software stays on the 4.x

Implications for Protocol Engine?

To put down on "paper" what's been floating around in my head:

what behaviors (could) change?

How parameters in Protocol API methods are interpreted
How the Protocol API breaks down complex actions into Protocol Engine commands
- This would be our transfer problem referenced in the original post
Payload schemas for Protocol Engine command requests and results
The behavior of individual commands during execution

how would changes be addressed?

(1) and (2) live both entirely in the Protocol API and can continue to be covered by metadata.apiVersion
- Both functionalities can be implemented as pure functions
- Therefore, regression test coverage is straightforward for all possible apiVersion values
(3) can be handled by only ever adding optional fields to these models, never removing them
- Again, at this point it becomes a pure data concern
(4) is the most complex, so it requires a little breaking down

how could we address changes in robot behavior

In the Protocol Engine world, robot behavior changes could happen for the following reasons, all contained to the command's implementation.

Underlying state stored changes
- e.g. Calibration data format changes
State selector function changes its algorithm and returns different data
- e.g. We decide to change the height at which tips are dropped
Command implementation changes how it passes data between state selectors and the hardware control API
- e.g. An optional field has been added to the command payload and the implementation does something with it
Hardware controller changes how it interacts with hardware resources
- e.g. something something G-code something

Addressing each of these:

(1) is something to be managed carefully, but is mitigated by the fact that all state is accessed through selector functions and they're heavily unit tested
(2) can be handled a few different ways, depending on the complexity of the change:
- For small changes, build the change behind an optional parameter to the function
- For bigger changes, build the change behind a piece of configuration state in the store that must be opted into
(3) Is one to do very carefully, but the whole command model actually gives us a pretty powerful "out"
- For little stuff, see if changes ca be contained to a selector (or several), where unit testing is easier
- If the behavior change is big, write a whole new command and leave the old one alone!
- Then, the problem turns into making sure the Protocol API spits out the correct command, which is easier to test
(4) is the hardest one for us to manage and is likely going to involve some heavy work in terms of shaping the Hardware API for this purpose
- The short version is to make sure that any change in Hardware API behavior is encoded into optional method parameters or some sort of easily accessibly configuration API
- This is different than our current world, where a lot of this behavior is hidden in internal Hardware API state
- We need to get into a world where we can be confident that verifying Protocol Engine <> Hardware API interactions is enough to verify robot actions are correct
- (This relies on a good Hardware API test suite, too)

SyntaxColoring · 2021-03-17T20:17:28Z

Messy thoughts.

We had a few motivations for giving Python protocols mandatory version metadata. The motivations are related to each other, but distinct.

Safely changing default behavior. For example, if touch_tip works one way today and a different way tomorrow, people's protocols shouldn't behave differently when they update their robots.
Helpful error messages when, because a protocol uses new fea a protocol can't work on an out-of-date robot. For example, if we add a new heater-shaker module...to use the new module, you need to edit the file to use a newer apiLevel.
The above, but for bugfixes instead of features.

I think (1) is uncontroversial. I'm less sure about (2) and (3).

Suppose someone writes a protocol that uses a new fancy_transfer() feature on a new robot, and then they share it with their friends who might be on old robots.

With (2) and (3), we have this:

	Protocol requests modern `apiLevel`	Protocol requests old `apiLevel`
Robot supports modern `apiLevel`	The protocol is runnable.	Specific, actionable, easy-to-resolve, high-level error: "Use an apiLevel >= 2.10 to use fancy_transfer()."
Robot only supports older `apiLevel`s	Specific, actionable, high-level error: "Update your robot software to run this protocol." This should be easy to resolve, but users might require revalidation.	Low-level error: "No such method exists." Users might need help from Support.

Without (2) and (3):

The top-right cell would change to "The protocol is runnable."
The bottom-right cell would happen more.
Things might be more maintainable for us, because we wouldn't have to carry around as much old code?

Questions:

We should check Support cases and the analytics on error messages. How many people are running into the bottom-left cell today?
Would removing (2) and (3) actually be a win for maintainability?
In this formulation, where's the line between "bug fix" and "changing default behavior"?
Prior art. How does Android handle this? How have our existing PRs handled this?

SyntaxColoring · 2021-03-18T14:48:19Z

For near-term Protocol Engine work, the part of this that's especially relevant is:

Clarifying what problems we think exist in the status quo (and making sure we all agree that those problems are problems)
Coming up with a rough vision for how Protocol Engine will address those concerns (including hypothesizing early traps to avoid)

mattwelch · 2021-06-15T15:39:37Z

Closing in favor of a smaller follow up story that @SyntaxColoring will write for next sprint.

SyntaxColoring added robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). spike Unsized research spike ticket labels Mar 12, 2021

SyntaxColoring mentioned this issue Mar 16, 2021

fix(api): Error instead of infinite-looping when disposal_volume is too high to make progress #6754

Closed

mattwelch added this to the CPX Sprint 34 milestone Jun 2, 2021

mattwelch closed this as completed Jun 15, 2021

alexjoel42 mentioned this issue Sep 13, 2021

fix(api): Skip picking up and dropping tips for 0-volume steps in a transfer #8302

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spike: Bug compatibility in Python Protocol API versioning #7477

Spike: Bug compatibility in Python Protocol API versioning #7477

SyntaxColoring commented Mar 12, 2021 •

edited

Loading

SyntaxColoring commented Mar 12, 2021 •

edited

Loading

mcous commented Mar 12, 2021

SyntaxColoring commented Mar 17, 2021 •

edited

Loading

SyntaxColoring commented Mar 18, 2021

mattwelch commented Jun 15, 2021

Spike: Bug compatibility in Python Protocol API versioning #7477

Spike: Bug compatibility in Python Protocol API versioning #7477

Comments

SyntaxColoring commented Mar 12, 2021 • edited Loading

Deliverables

Purpose

Audience

SyntaxColoring commented Mar 12, 2021 • edited Loading

mcous commented Mar 12, 2021

what behaviors (could) change?

how would changes be addressed?

how could we address changes in robot behavior

SyntaxColoring commented Mar 17, 2021 • edited Loading

SyntaxColoring commented Mar 18, 2021

mattwelch commented Jun 15, 2021

SyntaxColoring commented Mar 12, 2021 •

edited

Loading

SyntaxColoring commented Mar 12, 2021 •

edited

Loading

SyntaxColoring commented Mar 17, 2021 •

edited

Loading