Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use mailbox for thread-safe data transfer, Improve SPI reliability #107

Closed

Conversation

thomasfla
Copy link
Member

@thomasfla thomasfla commented Apr 8, 2022

Description

This PR addresses:

  • A bug fix due to multiprocessing ring buffer corruption that could happen on data transfer from network packet to SPI commands packets. Instead of handcraft ring buffer, we now use RTOS mailbox for thread safe data transfer.

  • The IMU uart driver has been rewritten in the same spirit using mailboxes to pass data from ISR to task, also fixing a core panic reboot that sometime happen (probably due to a hardware bug in the ESP32)

  • It also greatly improved the SPI reliability with uDriver v2 by adding a 1us delay on the CS line before transfer, letting more time to the uDriver v2 to handle the SPI interruption.

How I Tested

Tested on test bench In ethernet and Wifi, With IMU connected, scope on SPI and IMU UART to validate timing.

Do not merge before

Hardware test from reviewer

I fulfilled the following requirements

  • All new code is formatted according to our style guide (for C++ run clang-format, for Python, run flake8 and fix all warnings).
  • All new functions/classes are documented and existing documentation is updated according to changes.
  • No commented code from testing/debugging is kept (unless there is a good reason to keep it).

@thomasfla thomasfla changed the title Usemailbox as thread-safe data transfer, Improve SPI reliability Use mailbox for thread-safe data transfer, Improve SPI reliability Apr 8, 2022
@thomasfla thomasfla requested a review from luator April 11, 2022 08:05
Copy link
Member

@luator luator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I only reviewed the last three commits, as the older one seem to be covered by the powerboard PR.

@@ -85,6 +101,7 @@ bool spi_send(int slave, uint8_t *tx_data, uint8_t *rx_data, int len) {
trans.length=8*len;

esp_err_t err = spi_device_polling_transmit(spi, &trans);

//High CS
gpio_set_level(GPIO_DEMUX_OE, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be some inconsistency in the indentation (some lines using tabs, some spaces). Might be good to clean this up with some auto-formatter.

}
else
{
SPI_REG_u16(spi_tx_packet[i], SPI_COMMAND_MODE) = SPI_SWAP_DATA_TX(1<<15, 16); //In case of zero commands, keep the system enabled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a constant with a speaking name for the 1<<15 would make reading the code a bit easier.

Copy link
Member

@luator luator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that GitHub seems to be a bit buggy regarding filtering commits and didn't show me the last commit earlier.

I am not familiar enough with the technical details of the IMU and UART to really judge the changes but I can test it later today.

One general comment: There are magic numbers at various places in the code (e.g. when computing buffer sizes or checking specific bytes). Using constants for them would make it easier to understand their meaning when reading the code.

@luator
Copy link
Member

luator commented Apr 12, 2022

I tested the latest version using release build. IMU was working and SPI communication seems to be stable (recorded for around 1 minute with no bad CRCs).

@jviereck
Copy link
Contributor

Should this be merged?

@luator
Copy link
Member

luator commented May 16, 2022

I think this branch contains some of the commits of #98, so the latter should probably be merged first to avoid trouble with the git history.

@jviereck
Copy link
Contributor

Thinking about it: Would it be possible to make the use of the power board and the associated protocol change optional via a build flag? My understanding is, that landing the power board support will create incompatibility between the sdk and all the master boards that are flashed so far. This will cause a lot of breakage that I would like to avoid.

What do you think @thomasfla @luator ?

@thomasfla
Copy link
Member Author

This is an interesting point. However, having the support for power board optional will also be challenging to maintain later.

So far, the addition of new sensors / data fields directly impacted the length of the network datagrams exchanged between the robot and the control computer. It was the case for example with the IMU, which is optional on some setup, but occupies a data field in the protocol regardless of its presence.
It is also the case with the uDriver, which is fixed to 6 at compile time, even if your robot uses less than 12 dof.

Because of this, the users need indeed to update their SDK and FIRMWARE at the same time. To ensure this, the solution so far was to add a protocol version in the doc, firmware and sdk : 7e1786a

A more efficient way to deal with the situation would be to handshake on the different fields that the Master board needs to expose / parse. This would increase flexibility in the case of a user not asking for a sensor or command option that was not supported by the firmware he flashed in the past. More importantly, this would reduce the traffic on smaller systems which can be beneficial for wireless for example. Finally, this would open an easier support of new sensor types (For example, using lower cost IMU, being able to overwrite the RGB LED, set some configuration variables such as SPI frequency, retry, Timeout, etc.)
This looks complicated, but we introduced a change in the protocol with the INIT and ACK packets last year that makes this handshake easier to implement: https://github.com/open-dynamic-robot-initiative/master-board/blob/master/documentation/masterboard_communication.md#data-packet The challenges might be more on how to rethink all the fixed size data structures used by the firmware and SDK to parse and pack the data according to configuration unknown configuration at compile time.

This is clearly out of the scope of this PR obviously, but I took the opportunity to discuss here what I think would be a nicer long term solution.. We might want to open a feature request issue if we think this is the way to go..

To come back to this PR, I think we should merge it and advise the user to update their master board. Especially since this also include an important bug fix. I was reluctant to open the PR because of this compatibility issue, with only a partial support of the powerboard, but this is becoming quite old now.

@luator
Copy link
Member

luator commented May 17, 2022

I like the idea of a more flexible setup but I also agree with Thomas that it would be too much for now and we shouldn't block this PR for that.

To handle the breaking changes on the short term (and in general), I suggest to use releases following some form of semantic versioning. So we would make a release (using git tag and ideally also the GitHub release functionality) with a minor or patch version increase now, before we merge the changes. Any change that breaks compatibility between firmware and sdk would then be indicated by a major version increase.
This way, it is easier for users to see if they need to update the firmware after an SDK update and with the tag they can also relatively easily stick to the older version if they don't want to update directly (e.g. due to a deadline ahead).
Ideally the firmware should have a way to report its version for this (not sure if it already has), so the SDK could check the firmware version and show a proper error in case of incompatibility.

Of course this does not cover other packages which depend on updates of the SDK, so there could still be issues with those.

@jviereck
Copy link
Contributor

I like the idea about semantic versioning and we definitely should add some code (if possible) that tells the user clearly the SDK and firmware version are out of sync as a error message to the command line. Also a post to the ODRI form would be helpful.

Right now this PR has changes for the power board and the mailbox changes. I propose we cleanup this PR to have only the mailbox changes. My understanding is that these changes do not need an update to the SDK. We can land these changes first and then the power board changes. From looking at this PR this would boil down to doing a cherry-picking of 768f83f, 4706076, 4e3c911.

The reason I would like to do things this was is that we can land the mailbox improvements, which are important to stabilize experiments (we get hit with SPI timeouts quite often in New York) and give the other PR about the power board some more time until the power board is ready/used more often before landing on master.

What do you think?

I am happy to do the cherry-picking/new PR with only the mailbox changes. I don't have access to hardware anymore (I graduated). @luator , could you test a PR that has only the mailbox changes in Tübingen maybe?

Also, does anyone has an idea where the 17 minute maximum runtime error comes from at this point? We are planning to use the NYU Fingers for longer-running manipulation experiments.

@luator
Copy link
Member

luator commented May 17, 2022

(I graduated)

Oh, I missed that, congratulations!

Separating this to make it independent of the powerboard-related changes sounds good to me (assuming that it is easily possible). I have a setup here, were I could test it relatively easily.

Regarding the 17-min-issue: All I found out so far is that it fails around 17m 45s-50s after the boards are powered one (i.e. independent of when the software on the PC-side is started). My guess is that there is some time-related counter somewhere that is overflowing but this is really just a guess.

@thomasfla
Copy link
Member Author

I'm also in favour of splitting the PR in two, especially if @jviereck proposes to do it and @luator to test it ! ^^

I'm also in for the versioning. If we could also provide a binary of the firmware for every release, that would be much easier for the end user to update. > #101

For informing the user on protocol version mismatch on the PC side, we will need to slightly change the protocol to do it.
Right now, the PC sends a INIT packet:

Protocol version Session ID
2 bytes 2 bytes

The masterboard checks the protocol and only send an ACK packet if the protocol version is valid.

Session ID SPI connected
2 bytes 1 byte

We might change this packet for a larger one containing the protocol used by the firmware, but we might want to think about it to insure interoperability..

About the 17 minutes problem, let's open an issue to discuss it. > #108

@luator
Copy link
Member

luator commented May 17, 2022

Is there already a firmware version stored that can be queried from the PC? I think independent of the protocol version, this would be something nice to have, so we can check which version is currently running.

So maybe the new ACK packet could include the firmware version, the protocol version (albeit this could be derived from the firmware version, so might not be needed) and probably a flag indicating if the protocol matches or not.

@MaximilienNaveau
Copy link
Contributor

I merged #98 so someone need to clean the git here.
@thomasfla @nim65s or @luator ?

@MaximilienNaveau
Copy link
Contributor

@thomasfla you need to take a look at this PR and resolve the conflicts.
I can help if needed but it's mainly your code.

@thomasfla
Copy link
Member Author

Already implemented

@thomasfla thomasfla closed this Apr 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants