Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spike: Bug compatibility in Python Protocol API versioning #7477

Closed
SyntaxColoring opened this issue Mar 12, 2021 · 5 comments
Closed

Spike: Bug compatibility in Python Protocol API versioning #7477

SyntaxColoring opened this issue Mar 12, 2021 · 5 comments
Labels
robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). spike Unsized research spike ticket
Milestone

Comments

@SyntaxColoring
Copy link
Contributor

SyntaxColoring commented Mar 12, 2021

  • The Python Protocol API has a large surface area of undefined behavior
    • Partially due to lack of parameter validation
  • The API sometimes also has bugs
  • I feel it's unduly difficult to fix these bugs because of our versioning scheme: the current version of the software must try to be bug-for-bug compatible with all prior versions of the software
  • This requires fully characterizing the behavior of existing bugs, which is laborious and error-prone.
    • Bugs are sometimes the interaction of disparate areas of software. Maintaining those kinds of bugs sort of means we have to maintain the architecture that led to them.

What can/should we do about this?

Deliverables

Document describing:

  • Background (what problems do we feel exist, and why are they bad)
    • Concrete examples
    • We want to make it easier to develop correct protocols without scrutinizing a wall-of-text run log or watching the robot move
  • Non-exhaustive list of directions for solutions (at a high level)
  • Prior art?

Purpose

Aligning, Software-wide (or at least CPX-wide), on what to do about this.

  • Doing nothing is a valid answer.
  • Policy-level solutions are a valid answer (e.g. loosen our versioning policy).
  • Technical solutions are a valid answer (e.g. multiple versions of the Python API installed concurrently, via virtualenvs or whatever).

Implications for Protocol Engine?

Audience

Depending on the solutions that are explored:

  • All robotics platform software developers
  • Product (i.e. @mattwelch)
  • Applications Engineering
  • Support maybe (the high-level parts)
@SyntaxColoring SyntaxColoring added robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). spike Unsized research spike ticket labels Mar 12, 2021
@SyntaxColoring
Copy link
Contributor Author

SyntaxColoring commented Mar 12, 2021

Probably has implications with Protocol Engine...

  • We suspect PE would warrant a major software version bump anyway...maybe this is an opportunity to batch up "breaking" "changes" to the Python API like bug fixes like fix(api): Error instead of infinite-looping when disposal_volume is too high to make progress #6754?
  • Still, our bug-compatibility versioning policy will still be a problem in PE. How do we deal with that? We can "reset ourselves" with a 4.0->5.0 version bump, but the hairs will still eventually grow as long as we have our current versioning policy. (They'll just grow in different places.)

Mitigation strategies?

  • Make exhaustive unhappy-path tests a hard requirement of PE?
  • Make very strict parameter validation a hard requirement of PE?

Other things PE should consider?

  • Facilitation of non-fatal warnings from PE? (supporting app: Display warning messages #3536)
    • Source mapping (step or line of code)?
    • Don't necessarily want it to be part of the run log

@mcous
Copy link
Contributor

mcous commented Mar 12, 2021

We suspect PE would warrant a major software version bump anyway... with a 4.0->5.0 version bump

To add a little clarity here in case anyone reads this and panics: (my) thinking is that it would be a v3 of specifically the Python protocol API, without dropping support for v2 (e.g. by keeping some version of non-protocol-engine execution around) and without making any fundamental changes to the API itself. So, little to no rewriting of protocols, but a bump to make sure folks safely opt-in to the new execution engine.

If we don't drop support for Python Protocol API v2, then the system software stays on the 4.x

Implications for Protocol Engine?

To put down on "paper" what's been floating around in my head:

what behaviors (could) change?

  1. How parameters in Protocol API methods are interpreted
  2. How the Protocol API breaks down complex actions into Protocol Engine commands
    • This would be our transfer problem referenced in the original post
  3. Payload schemas for Protocol Engine command requests and results
  4. The behavior of individual commands during execution

how would changes be addressed?

  • (1) and (2) live both entirely in the Protocol API and can continue to be covered by metadata.apiVersion
    • Both functionalities can be implemented as pure functions
    • Therefore, regression test coverage is straightforward for all possible apiVersion values
  • (3) can be handled by only ever adding optional fields to these models, never removing them
    • Again, at this point it becomes a pure data concern
  • (4) is the most complex, so it requires a little breaking down

how could we address changes in robot behavior

In the Protocol Engine world, robot behavior changes could happen for the following reasons, all contained to the command's implementation.

  1. Underlying state stored changes
    • e.g. Calibration data format changes
  2. State selector function changes its algorithm and returns different data
    • e.g. We decide to change the height at which tips are dropped
  3. Command implementation changes how it passes data between state selectors and the hardware control API
    • e.g. An optional field has been added to the command payload and the implementation does something with it
  4. Hardware controller changes how it interacts with hardware resources
    • e.g. something something G-code something

Addressing each of these:

  • (1) is something to be managed carefully, but is mitigated by the fact that all state is accessed through selector functions and they're heavily unit tested
  • (2) can be handled a few different ways, depending on the complexity of the change:
    • For small changes, build the change behind an optional parameter to the function
    • For bigger changes, build the change behind a piece of configuration state in the store that must be opted into
  • (3) Is one to do very carefully, but the whole command model actually gives us a pretty powerful "out"
    • For little stuff, see if changes ca be contained to a selector (or several), where unit testing is easier
    • If the behavior change is big, write a whole new command and leave the old one alone!
    • Then, the problem turns into making sure the Protocol API spits out the correct command, which is easier to test
  • (4) is the hardest one for us to manage and is likely going to involve some heavy work in terms of shaping the Hardware API for this purpose
    • The short version is to make sure that any change in Hardware API behavior is encoded into optional method parameters or some sort of easily accessibly configuration API
    • This is different than our current world, where a lot of this behavior is hidden in internal Hardware API state
    • We need to get into a world where we can be confident that verifying Protocol Engine <> Hardware API interactions is enough to verify robot actions are correct
    • (This relies on a good Hardware API test suite, too)

@SyntaxColoring
Copy link
Contributor Author

SyntaxColoring commented Mar 17, 2021

Messy thoughts.

We had a few motivations for giving Python protocols mandatory version metadata. The motivations are related to each other, but distinct.

  1. Safely changing default behavior. For example, if touch_tip works one way today and a different way tomorrow, people's protocols shouldn't behave differently when they update their robots.
  2. Helpful error messages when, because a protocol uses new fea a protocol can't work on an out-of-date robot. For example, if we add a new heater-shaker module...to use the new module, you need to edit the file to use a newer apiLevel.
  3. The above, but for bugfixes instead of features.

I think (1) is uncontroversial. I'm less sure about (2) and (3).

Suppose someone writes a protocol that uses a new fancy_transfer() feature on a new robot, and then they share it with their friends who might be on old robots.

With (2) and (3), we have this:

Protocol requests modern apiLevel Protocol requests old apiLevel
Robot supports modern apiLevel The protocol is runnable. Specific, actionable, easy-to-resolve, high-level error: "Use an apiLevel >= 2.10 to use fancy_transfer()."
Robot only supports older apiLevels Specific, actionable, high-level error: "Update your robot software to run this protocol." This should be easy to resolve, but users might require revalidation. Low-level error: "No such method exists." Users might need help from Support.

Without (2) and (3):

  • The top-right cell would change to "The protocol is runnable."
  • The bottom-right cell would happen more.
  • Things might be more maintainable for us, because we wouldn't have to carry around as much old code?

Questions:

  • We should check Support cases and the analytics on error messages. How many people are running into the bottom-left cell today?
  • Would removing (2) and (3) actually be a win for maintainability?
  • In this formulation, where's the line between "bug fix" and "changing default behavior"?
  • Prior art. How does Android handle this? How have our existing PRs handled this?

@SyntaxColoring
Copy link
Contributor Author

For near-term Protocol Engine work, the part of this that's especially relevant is:

  • Clarifying what problems we think exist in the status quo (and making sure we all agree that those problems are problems)
  • Coming up with a rough vision for how Protocol Engine will address those concerns (including hypothesizing early traps to avoid)

@mattwelch mattwelch added this to the CPX Sprint 34 milestone Jun 2, 2021
@mattwelch
Copy link

Closing in favor of a smaller follow up story that @SyntaxColoring will write for next sprint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). spike Unsized research spike ticket
Projects
None yet
Development

No branches or pull requests

3 participants