-
I did a comparison between pyOCD and openocd flashing a 20kB file to a stm32f103. pyOCD runtime was 3.9s vs 2.45s for openocd.
-
Obviously a big part of that is simply that Python is slower than C. Are you using Python 3.9? It's noticeably faster than 2.7, or even earlier 3.x versions. Another option is pypy3. There are probably multiple optimizations that could be implemented within pyocd, but I'm not aware of anything that's low-hanging fruit. Profiling like you're doing is the best way to start.
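For reference, a whole programming run can be profiled from a short script with cProfile; something like this (the target name and file path are placeholders, substitute your own):

```python
# Minimal cProfile sketch of a pyOCD flash programming run, to find hot spots.
# "stm32f103rc" and firmware.bin are placeholders.
import cProfile
import pstats

from pyocd.core.helpers import ConnectHelper
from pyocd.flash.file_programmer import FileProgrammer

def program():
    with ConnectHelper.session_with_chosen_probe(target_override="stm32f103rc") as session:
        FileProgrammer(session).program("firmware.bin", base_address=0x08000000)

cProfile.run("program()", "flash.prof")
pstats.Stats("flash.prof").sort_stats("cumulative").print_stats(20)
```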
-
My tests were with Python 3.7.9. I'll try 3.9.1 to see if it is any faster. Do you know if pyOCD works with pypy3? My experience with it was not great in the past. I've noticed pyOCD only lightly loads 2 of 4 cores. If my Python programming skills were better, I'd consider modifying the AP code to use a separate thread for generating the sequence of commands required for the flash algorithm.
-
Python 3.9 doesn't run on Win7. I just tried upgrading to 3.8.7, and pyOCD is a bit faster now, taking 3.6s instead of 3.9s.
-
I haven't tested a ton with pypy3, but I didn't run into any problems. Ymmv. (And I've had plenty of trouble with it in other cases.)

The problem with multithreading in CPython is the GIL: the Global Interpreter Lock. That about says it. Performance is only improved for multithreaded CPython if your threads are mostly IO-bound, or call into C-based libraries that release the GIL while they run.

One option we've talked about is to have sort of a meta-flash-algorithm: code that runs on the target to orchestrate calls to the algo. For instance, instead of having to call-and-wait for every sector to be erased, drop an array of sector addresses in RAM and call the orchestrator, which will then call the algo for each address. The question is, would it really improve performance when double-buffering is being used? Usually the flash operations are the real bottleneck.

It's been several years since I looked at flash programming performance in detail, and there may very well be areas that could be improved that I'm not thinking of.
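From the host side, the orchestrator idea might look roughly like this with pyOCD's target API (the entry point, buffer address, and argument convention are all made up, and the target-side orchestrator code isn't shown):

```python
# Hypothetical host-side driver for a "meta-flash-algorithm" orchestrator.
# ORCH_ENTRY and ADDR_ARRAY are assumed addresses; stack setup, timeouts,
# and error checking are omitted.
ORCH_ENTRY = 0x20000400   # where the orchestrator code was loaded
ADDR_ARRAY = 0x20001000   # RAM buffer holding the sector address list

def erase_sectors(target, sector_addrs):
    # One block write carries the whole list, replacing a call-and-wait
    # round trip per sector.
    target.write_memory_block32(ADDR_ARRAY, sector_addrs)
    target.write_core_register('r0', ADDR_ARRAY)
    target.write_core_register('r1', len(sector_addrs))
    target.write_core_register('pc', ORCH_ENTRY)
    target.resume()
    while not target.is_halted():   # orchestrator hits a bkpt when finished
        pass
```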
-
Well, that's something at least… One thing that may improve pyocd performance is how it handles data buffers. Instead of using an efficient type like …
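For what it's worth, the cost of that kind of buffer conversion is easy to measure in isolation; a purely illustrative micro-benchmark:

```python
# Converting a 1 KB bytes page to a list of 32-bit words, as a pure-Python
# word-at-a-time loop vs a single struct.unpack. Purely illustrative;
# absolute numbers will vary by machine and interpreter.
import struct
import timeit

page = bytes(range(256)) * 4   # 1 KB of sample data

def slow_words(buf):
    return [int.from_bytes(buf[i:i+4], 'little') for i in range(0, len(buf), 4)]

def fast_words(buf):
    return list(struct.unpack('<%dI' % (len(buf) // 4), buf))

print(timeit.timeit(lambda: slow_words(page), number=10000))
print(timeit.timeit(lambda: fast_words(page), number=10000))
```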
-
I also don't think the data buffer handling is a performance problem. If it were, then I think write_block32 would be slow.

It's too bad the flash algorithms weren't simpler (and therefore more efficient). The way I'd do it is have the code that is loaded into SRAM loop through a chunk of data to write to flash, and jump back to the start. I'd have the debug probe set a breakpoint at the branch instruction, and continue execution when the next chunk of data is transferred to SRAM (a sketch follows below).

As for simpler ways of getting pyOCD to run faster, I just tried nuitka with a small program that failed with pypy on Windows 7. The nuitka-compiled version works fine. I'm not sure how to get it to compile a whole package like pyOCD, but from skimming the docs, I think nuitka can do that.
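The loop-in-SRAM scheme, seen from the host side, might look something like this (LOOP_TOP, BUF, and CHUNK are assumed values, and the SRAM-side loader itself isn't shown):

```python
# Sketch of host-side streaming against an SRAM loader that writes CHUNK
# bytes from BUF to flash, then branches back to LOOP_TOP, where a
# breakpoint is set. Addresses are assumptions; error handling is omitted.
import struct

LOOP_TOP = 0x20000000   # address of the branch-back instruction
BUF      = 0x20000100   # SRAM data buffer the loader reads from
CHUNK    = 1024

def stream_program(target, data):
    target.set_breakpoint(LOOP_TOP)
    for off in range(0, len(data), CHUNK):
        chunk = data[off:off + CHUNK].ljust(CHUNK, b'\xff')   # pad last chunk
        words = list(struct.unpack('<%dI' % (CHUNK // 4), chunk))
        target.write_memory_block32(BUF, words)   # hand the loader a chunk
        target.resume()                           # loader writes it to flash
        while not target.is_halted():             # stops at the breakpoint
            pass
    target.remove_breakpoint(LOOP_TOP)
```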
-
@flit pyOCD is really slow if flashing a large application. Set …
-
@Hoohaha That's a really good point. By default, pyocd will try to a) only erase and program sectors that change, and b) not lose data in a partially-programmed sector. These behaviours are controlled by the …

@nerdralph I didn't know about nuitka, very cool! I had been thinking about using cython.

That's basically how it works. First the algo is loaded to RAM and inited. (Sectors are erased first.) Then a buffer of data is copied to RAM, and the algo started running. It loops over the data, writing to flash in whatever size the flash word is (usually 4, 8, or 16 bytes, though more modern ~40nm parts have 256- or 512-byte words). When done, it returns to a …
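In pyOCD terms that sequence maps roughly onto the Flash class like this (here flash would be e.g. session.target.memory_map.get_boot_memory().flash; the inputs are placeholders and double buffering is left out):

```python
# Rough rendering of the erase-then-program flow using pyOCD's Flash class.
# changed_sectors and pages are placeholder inputs; double buffering and
# error checking are omitted.
from pyocd.flash.flash import Flash

def program_changed(flash, changed_sectors, pages):
    flash.init(Flash.Operation.ERASE)       # load algo into RAM, call Init()
    for addr in changed_sectors:
        flash.erase_sector(addr)            # call-and-wait per sector
    flash.uninit()

    flash.init(Flash.Operation.PROGRAM)
    for addr, data in pages:
        flash.program_page(addr, data)      # copy buffer to RAM, run algo
    flash.uninit()
```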
-
@flit Is there any formal documentation of the flash algorithm? I looked at the template code, so I can make some guesses: … And maybe the same ELF needs to have the FlashDevice global symbol defined: … It's not obvious to me how the ELF binary needs to be packaged. Are there any docs for that?

The algorithm could be optimized to reduce the setup for each flash block. I think setting up the 3 arguments to ProgramPage requires 2 write commands for each register, making 6 writes. Then the PC has to be set, and then the target has to be put into the running state. That's ~10 commands. There need to be one or two reads to check the core state to confirm that it reached the breakpoint after programming the page. I'm not sure why there were so many reads, though. In total, two full packets and one partial packet were required to initiate a page write.

If the page data and the 3 words for the arguments are done with block transfers, that would cut the overhead of setting up the ProgramPage arguments for each page. Instead of each page write terminating with a bkpt instruction, if it used my idea of a breakpoint set at a branch instruction, then the page write overhead is reduced to a couple of read commands to confirm the last page is programmed, then a couple of writes to put the target core into the running state.
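On the packaging question: as far as I know, pyOCD's built-in targets don't ship the ELF at all; the algo is pre-extracted into a Python dict along these lines (every value below is a placeholder, not a real part's):

```python
# General shape of the FLASH_ALGO dicts found in pyOCD's built-in target
# files. All values here are placeholders.
FLASH_ALGO = {
    'load_address': 0x20000000,         # where the algo blob goes in RAM
    'instructions': [0xE00ABE00, ...],  # code words; list truncated here
    'pc_init': 0x20000021,
    'pc_unInit': 0x20000045,
    'pc_erase_sector': 0x20000061,
    'pc_program_page': 0x2000008D,
    'static_base': 0x20000200,          # base address for the algo's statics
    'begin_stack': 0x20000800,
    'page_buffers': [0x20001000, 0x20001400],  # two buffers enable double buffering
    'min_program_length': 4,
}
```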
-
I just implemented double-buffering of OUT packets on my probe firmware, and pyOCD is a little faster. Flashing ~22kB to a stm32f103 now takes 3.1s vs 3.9s when PACKET_COUNT was 1. Strangely, openocd isn't any faster, despite the fact that it does support up to 3 queued DAP packets according to the source. I may take a stab at trying to compile pyOCD with nuitka. If I'm feeling more ambitious, I may try writing a basic DAP HID flashing tool in a compiled language like Go.
-
I managed to get probe.rs working with an ELF. 4.45s to flash 23 pages, so even worse than pyOCD.
-
I've been trying to optimize pyOCD flash write speed with my debug probe. After configuring trace debug logs (thanks Chris!), I noticed less than half the time is spent doing block writes. Most of the time seems to be spent generating the AP read and write commands in between block transfers. Here's an excerpt from the logs:

…

It takes until the 2504ms timestamp to get a full packet to write. In total, from the end of one write_block32 until the start of the next write_block32 is 45ms. pyOCD seems to be doing some sort of caching of the parsed flash algorithm, as the time between the 1st and 2nd block writes is about twice as long (~100ms). Is there any easy way to speed this up? The host CPU is a 3.2GHz 64-bit Intel Core.
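For anyone else digging into this, debug/trace logging can also be turned on from a script with stdlib logging; the logger names below are assumptions (pyOCD names its loggers after its modules), so check your version's docs for the exact trace logger names:

```python
# One way to enable the debug/trace logging referred to above, using stdlib
# logging. Logger names are assumptions based on pyOCD's module layout.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(relativeCreated)dms %(name)s %(message)s",  # ms timestamps
)
logging.getLogger("pyocd.probe").setLevel(logging.DEBUG)
logging.getLogger("pyocd.coresight").setLevel(logging.DEBUG)
```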