Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Virtual Boot can easily brick the chip. #340

Open
SpenceKonde opened this issue Apr 2, 2022 · 21 comments
Open

Virtual Boot can easily brick the chip. #340

SpenceKonde opened this issue Apr 2, 2022 · 21 comments

Comments

@SpenceKonde
Copy link
Contributor

I belive I communicated this serious defect to @WestfW in passing, but I realized it's not listed here. This has caused intense pain for someone (cost me thousands of dollars by losing me a very lucrative client, and i'm sure it cost him much more than that, because he had thousands of vulnerable devices in the field, and he'd lose a substantial fraction if them every season.)

This issue keeps coming up in the context of ATTinyCore, whch despite it's age is still extremely popular. Almost every ATTinyCore part's optiboot is impacted by this bug. Almost everyone working with an impacted configuration has had this happen to them at least once. Had I know about this bug when I added the support for optiboot - I'm not sure I would have added it.

In a nutshell It is possible to end up in a situation where neither the app nor the bootloader is capable of functioning and the bootloader is no longer run nor is a working app present. This occurs when a programming command on the first page is initiated, but for some unknown reason, fails to complete. The problem is that Optiboot with Virtual boot proceeds like normal optiboot, but this is completely inappropriate behavior. Why?

Consider when an interruption happens (brownout, reset, etc). This will most likely happen where? In the first page, because there's the highest likelihood that Vcc isn't settled or that the battery voltage plunges under the slight increase in current. Or a bungled handlining of the start of the upload (ex, another twiddle of reset after the erase starts). The most likely corrupted page will be the first page, the one where the vectors are both stored. What happens when this occurs?
Now - thje bootloader will never run, because the vector pointing to it is gone. The app will no longer run, because the vector to it is likewise gone. Hence execution skids along empty flash until the end of the erased section and then starts executing code from the middle of the sketch, with unpredictable but never correct results. The board is bricked from the persective of someone who can't reburn bootloader easily.

There is a very simple solution, but somewhat tricky to implement.
The application vector must be loacated in the last word of the last page of application flash, not somewhere else in the first page or two.
Whenever the first erase command is sent, instead of that one page, insead, all pages not earmarked for the bootlootloader must ALL be erased in reverse order, so the first thing to go is the page with the application vector relocated to, and the first thing last thing is the page with the reset vector.
Then, we proceed to write pages normally, with the reset vector rewritten so it's pointing to start of bootloader, Finally at the end (last page if we completely filled the flash, otherwise skip a bunch of middle pages to get to the last page, then write a jmp or rjmp to the application vector in the last byte of that flash.
This is 100% safe. An interruption before flash is done erasing will merely break the application. Since the first page erased is the one with app vector, it would simply run the bootloader so a new sketch could be uploaded.
An interruption at the point of 100% complete erase of the flash would have nothing but empty flash untll it hit the bootloader, Execution would skid along the 0xFFFF instructions (which as I understand is interpreted as a SBRS r31,7 - skip if bt in register 31 is set). This does not modify r31, hence depending on the value in that register, either every instruction is executed, or alternating SBRS and skipped SBRS. are. Either way, we ae guaranteed that hen t hts the even numbered start of the boot loader, t will not skip it, because stating from 0, with a constant r31, it is impossible that any even address would be skipped and the bootrloader starts at such an address.
Finally, there is the case of interruption while writing the flash,. In this case, the first page with bootloader vector is intact but some portion of later pages are not and the application vector is not written. In this final case, there is also no problem,. The bootloader can be reached normally, but the application, not uploaded completely, cannot,because the vector will not have been written.

This method is used by micronucleus and is rock-solid stable there, such that they have no qualms about disabling reset and relying on being able to update the bootloader with code that takes command of the reset vector itself, erases the bootloader, copies it's payload (new bootloadeer) in it's place, and then deletes itself.

So why haven't I fixed it?

In short, because I just can't figure out how to know when it's time to find the last page, what the last page is, and how to write to only those two (or 4) bytes. The version of the ATtiny Optimized codebase I am starting from is here: https://github.com/SpenceKonde/ATTinyCore/blob/v2.0.0-dev/avr/bootloaders/optiboot/source/optiboot.c - nothing terribly notable (do note the 2.0.0-dev branch. Everything in master is getting flushed in not very long)

Thanks
This is something I'd love to have a solution for - I don't even need someone else to do it for me, just to outline how to approach the task - right now the gulf between what I uinderstand and what is needed for this is such that I can't see it from my vantage point.. I would love to be able to drop my warnings against use of optiboot in critical use cases.

@WestfW
Copy link
Member

WestfW commented Apr 4, 2022

I don't think I had registered the seriousness of the problem. "Under some circumstances, an upload destroys the bootloader as well as the app, and you need to revert to ISP to recover" didn't sound awful.

I will re-examine the issue.

@mwudka
Copy link

mwudka commented Apr 9, 2022

I, too, would be interested in helping fix this. I'm also stuck on determining the start/end address of the bootloader.

Another challenge I'm unsure about: if my (very hazy and new) understanding of the protocol is correct, optiboot doesn't rely on the programmer telling it that finishing is done; it just lets the WDT cause a reset after an idle period. Since there's no reliable point at which optiboot knows for sure that programming is done, I'm not sure about when optiboot would perform the final write to the last byte in flash.

@SpenceKonde
Copy link
Contributor Author

OH! So is THAT how it communicates that it's done? Shit, no wonder I couldn't figure out how it told it to leave programming mode!

@WestfW
Copy link
Member

WestfW commented Apr 10, 2022

optiboot doesn't rely on the programmer telling it that finishing is done

It does detect "leave Progmode", at which point it shortens the WDT to run the applications "sooner":

    else if (ch == STK_LEAVE_PROGMODE) { /* 'Q' */
      // Adaboot no-wait mod
      watchdogConfig(WATCHDOG_16MS);
      verifySpace();
    }

@WestfW
Copy link
Member

WestfW commented Apr 16, 2022

I wonder whether almost everything could be fixed by using a program other than avrdude for the upload?
I've been ... annoyed ... at how "programmer" centric it is, whereas a program designed to work with a bootloader could offer more simplicity is both use and to-purpose modifications.
(Not that the bootloader shouldn't try to not brick chips, but...)

@SpenceKonde
Copy link
Contributor Author

SpenceKonde commented Apr 16, 2022

Well to optimize the bootloader size, like... you to make it very tightly optimized to the minimal protocol used. But that is irrelevant for two reasons:

  1. Avrdude is what people are going to be using
  2. This bug occurs ANY time the first page is erased and every non-bootloader page has not been erased, (unless the CRT is modified to ensure that a magic rjmp is located at the first two locations of the second page (IIRC, 0xFFFF is interpreted is skip bit 7 in working register 31, and working registers afaik are not initialized,so you can't count on the value of r31.

I don't see any other way around it. The only other popular bootloader already does it - it erases the whole non-bootloader flash area and the

Edit - to be clear I hate avrdude, for dozens of reasons beyond the scope of this thread. But don't think I can pin this one on avrdude. It is a failure in the design of the STK500 protocol itself because it wasn't meant to program through a bootloader it was meant to control an external programmer.

In any event, the to the extent that I can lay blame for the extent of the problem you cite, avrdude being written in C. is my top vote. Without instructions to make a working build environment, thus ensuring that only the High Priests of Arduino can make changes and actually distribute them without everyone yelling and screaming "my platform isn't supported", progress is greatly impeded

@mcuee
Copy link

mcuee commented Oct 15, 2022

Interesting discussion. I think avrdude has made a lot of progress here. With the deprecation of STK500v1 HW programmer (avrdudes/avrdude#1046), stk500.c in avrude git main is only for bootloaders and things like "Arduino as ISP". If necessary, we can also create a new programmer for optiboot if the existing -c arduino is not a good fit.

@mcuee
Copy link

mcuee commented Oct 15, 2022

The above does create a regression for optiboot_x and optiboot_dx.
avrdudes/avrdude#1120 (comment)

As mentioned by @stefanrueger, if you (@WestfW and @SpenceKonde) can specify what optiboot (and optiboot_x and optiboot_dx) needs, then the issue can be fixed.

@mcuee
Copy link

mcuee commented Oct 15, 2022

In any event, the to the extent that I can lay blame for the extent of the problem you cite, avrdude being written in C. is my top vote. Without instructions to make a working build environment, thus ensuring that only the High Priests of Arduino can make changes and actually distribute them without everyone yelling and screaming "my platform isn't supported", progress is greatly impeded

If you think Arduino guys are too slow to package good Arduino binaries, you can probably add your voice to the following avrdude issue. Personally I always try to build avrdude myself (and it is rather easy under Windows using MSYS2 mingw 32/64, Homebrew under macOS, and Linux), but I also understand why people would like to have an official binary from avrdude project.

I can see that MegaCoreX from has moved to Arduino's packaging of avrdude 7.0.
https://github.com/MCUdude/MegaCoreX

The users of MegaCoreX will potentially face an issue here (not yet fixed by Arduino) but the workaround is pretty simple..
arduino/avrdude-packing#15

Of course you can still wait until Arduino formally releases their packaging of avrdude 7.0.

@mcuee
Copy link

mcuee commented Dec 19, 2022

The above does create a regression for optiboot_x and optiboot_dx. avrdudes/avrdude#1120 (comment)

As mentioned by @stefanrueger, if you (@WestfW and @SpenceKonde) can specify what optiboot (and optiboot_x and optiboot_dx) needs, then the issue can be fixed.

The issue has been fixed in avrdude git main.

@mcuee
Copy link

mcuee commented Dec 19, 2022

Of course you can still wait until Arduino formally releases their packaging of avrdude 7.0.

For avrdude 7.0 release, this should be good enough for production.
https://github.com/arduino/avrdude-packing/releases/tag/7.0-arduino.3

You can also try avrdude git main snapshot (which will lead to 7.1 release).
https://github.com/arduino/avrdude-packing/releases/tag/7.0-arduino.4-rc1

For those who want to be more on the cutting edge, you can try my snapshot here.
https://github.com/mcuee/avrdude-packing/releases/tag/snapshot_22Dec2022

@mcuee
Copy link

mcuee commented Dec 23, 2022

@SpenceKonde

I belive one potential solution is to go to urboot which has pretty solid vector bootloader support and @MCUdude and I were not able to break it.

urboot:
https://github.com/stefanrueger/urboot

It needs to work together with -c urclock in the upcoming avrdude 7.1 release.

@stefanrueger
Copy link

@mcuee Thanks for advertising urboot. However, @SpenceKonde described the cause of the problem as

| Vcc isn't settled or that the battery voltage plunges under the slight increase in current

That isn't really something that the bootloader can do a lot about, or?

It is true that when the reset vector no longer points to the bootloader that's when the bootloader bricks, but it does not matter where the application vector points to. (Only when all flash below the bootloader is 0xff does the reset vector not matter). Yes, although bootloaders can protect the reset vector (urboot does this automatically if there is space left, but also on explicit compile-time option), I see it more a responsibility of the uploader to ensure the reset vector keeps pointing to the bootloader. @WestfW seems to agree here, as he asked

I wonder whether almost everything could be fixed by using a program other than avrdude for the upload?

AVRDUDE now has a dedicated bootloader programmer: Once v7.1 is released give avrdude -c urclock a try (instead of avrdude -c arduino) or try the current git main now. That knows a bit about bootloaders, and although it is geared towards urprotocol and urboot bootloaders it tries to compensate for the peculiarities of optiboot (that it fails to carry out chip erase, that fails to protect itself being overwritten, that it returns flash when eeeprom is requested, etc) and the peculiarities of certain parts (eg, four-page-erase parts ATtiny441, ATtiny841, ATtiny1634; the uploader needs to ensure that skipping holes in the input file are four-pages, too).

When writing the AVRDUDE programmer -c urclock I had thought hard about possible scenarios when vector bootloading could fail, and Vcc dropping owing to increased strain did enter my mind. I thought about writing flash from top to bottom, and verifying before the vector table is written that flash writing works. HOWEVER, in the end I decided against that b/c, well, if someone operates an MCU in an unsafe Vcc regime borderlining brownout resets one has to recognise there is a limit to what software can do to counter that...

Finally, I am unclear about in which situation a board with a bootloader can be borderline and dipping Vcc while the uploader host is a happy bunny? Over the air programming? Inaccessible devices? Does optibbot even do OTA?

@mcuee
Copy link

mcuee commented Dec 23, 2022

When writing the AVRDUDE programmer -c urclock I had thought hard about possible scenarios when vector bootloading could fail, and Vcc dropping owing to increased strain did enter my mind. I thought about writing flash from top to bottom, and verifying before the vector table is written that flash writing works. HOWEVER, in the end I decided against that b/c, well, if someone operates an MCU in an unsafe Vcc regime borderlining brownout resets one has to recognise there is a limit to what software can do to counter that...

Fair enough.

Finally, I am unclear about in which situation a board with a bootloader can be borderline and dipping Vcc while the uploader host is a happy bunny? Over the air programming? Inaccessible devices? Does optibbot even do OTA?

It may be difficult for the device or the host to detect power/battery low case. Sometimes it is due to cost reasons that the user may not afford to have additional power fail detection circuitry. Sometimes the power fail detection may not be so reliable. Sometimes the contact may get loose. Sometimes it is due to not enough hold-up capacitors to bridge the short power interruption. So it can be all kind of reasons.

Still I understand there is so much the bootloader firmware can do when there is HW issues like the above.

What about a noisy line? I can see quite some issues here when people use RS-485 to have better noise immunity. As for OTA, it seems that people use bluetooth with optiboot. Does optiboot/urboot have retry mechanism against serial communication errors? How robust will it be?

@mcuee
Copy link

mcuee commented Dec 23, 2022

@stefanrueger

Is this a good reason to recommend HW-supported bootloaders against vector bootloader in use case like the above?

Ref:
https://github.com/stefanrueger/urboot/blob/main/howtoselect.md

@stefanrueger
Copy link

| recommend HW-supported bootloaders against vector bootloader in use case like the above

I suspect this happened on a part without hardware boot section support, so there was no choice. There are not many cases where it makes sense to use Optiboot's virtual boot feature on a part with hardware boot section support as this costs some 110 additional code bytes.

Urboot bootloaders in contrast let avrdude -c urclock do the vector patching, so do not spend additional bytes on this.

@stefanrueger
Copy link

hold-up capacitors

So, the board might not have held enough charge in caps near the MCU's Vcc lines? Good board design has a suitable cap vvv close to every Vcc line of the chip.

@mcuee
Copy link

mcuee commented Dec 23, 2022

| recommend HW-supported bootloaders against vector bootloader in use case like the above

Maybe it is good to mention that in the documentation.

I suspect this happened on a part without hardware boot section support, so there was no choice. There are not many cases where it makes sense to use Optiboot's virtual boot feature on a part with hardware boot section support as this costs some 110 additional code bytes.

Indeed ATtinyCore has to use virtual bootloader as the core supports chips without the hardware boot section.

Urboot bootloaders in contrast let avrdude -c urclock do the vector patching, so do not spend additional bytes on this.

Nice one.

@mcuee
Copy link

mcuee commented Dec 23, 2022

hold-up capacitors

So, the board might not have held enough charge in caps near the MCU's Vcc lines? Good board design has a suitable cap vvv close to every Vcc line of the chip.

Most of the properly designed board will have the necessary decoupling capacitors.

I actually mean to say different thing -- hold-up capacitors against power interruption (eg: 10ms hold-up time is a common requirement, then plus x ms for the firmware upgrade). I tend to think this is not really a hard requirement in many product categories. But then it does create issues for situations like this.

@stefanrueger
Copy link

This conversation has made me think, it's better for the urboot bootloaders to erase flash from top to bottom. And guess what, saves two bytes code too. Have just created a PR over at urboot

@mcuee
Copy link

mcuee commented Mar 13, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants