-
I did a comparison between pyOCD and openocd flashing a 20kB file to a stm32f103. pyOCD runtime was 3.9s vs 2.45s for openocd.
-
Obviously a big part of that is simply that Python is slower than C. Are you using Python 3.9? It's noticeably faster than 2.7, or even earlier 3.x versions. Another option is pypy3. There are probably multiple optimizations that could be implemented within pyocd, but I'm not aware of anything that's low-hanging fruit. Profiling like you're doing is the best way to start.
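For reference, a whole programming run can be profiled from a short script with cProfile; something like this (the target name and file path are placeholders, substitute your own):

```python
# Minimal cProfile sketch of a pyOCD flash programming run, to find hot spots.
# "stm32f103rc" and firmware.bin are placeholders.
import cProfile
import pstats

from pyocd.core.helpers import ConnectHelper
from pyocd.flash.file_programmer import FileProgrammer

def program():
    with ConnectHelper.session_with_chosen_probe(target_override="stm32f103rc") as session:
        FileProgrammer(session).program("firmware.bin", base_address=0x08000000)

cProfile.run("program()", "flash.prof")
pstats.Stats("flash.prof").sort_stats("cumulative").print_stats(20)
```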
-
My tests were with Python 3.7.9. I'll try 3.9.1 to see if it is any faster. Do you know if pyOCD works with pypy3? My experience with it was not great in the past. I've noticed pyOCD only lightly loads 2 of 4 cores. If my Python programming skills were better, I'd consider modifying the AP code to use a separate thread for generating the sequence of commands required for the flash algorithm.
-
Python 3.9 doesn't run on Win7. I just tried upgrading to 3.8.7, and pyOCD is a bit faster now, taking 3.6s instead of 3.9s.
-
I haven't tested a ton with pypy3, but I didn't run into any problems. Ymmv. (And I've had plenty of trouble with it in other cases.)

The problem with multithreading in CPython is the GIL: the Global Interpreter Lock. That about says it. Performance is only improved for multithreaded CPython if your threads are mostly IO-bound, or call into C-based libraries that release the GIL while they run.

One option we've talked about is to have sort of a meta-flash-algorithm: code that runs on the target to orchestrate calls to the algo. For instance, instead of having to call-and-wait for every sector to be erased, drop an array of sector addresses in RAM and call the orchestrator, which will then call the algo for each address. The question is, would it really improve performance when double-buffering is being used? Usually the flash operations are the real bottleneck.

It's been several years since I looked at flash programming performance in detail, and there may very well be areas that could be improved that I'm not thinking of.
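From the host side, the orchestrator idea might look roughly like this with pyOCD's target API (the entry point, buffer address, and argument convention are all made up, and the target-side orchestrator code isn't shown):

```python
# Hypothetical host-side driver for a "meta-flash-algorithm" orchestrator.
# ORCH_ENTRY and ADDR_ARRAY are assumed addresses; stack setup, timeouts,
# and error checking are omitted.
ORCH_ENTRY = 0x20000400   # where the orchestrator code was loaded
ADDR_ARRAY = 0x20001000   # RAM buffer holding the sector address list

def erase_sectors(target, sector_addrs):
    # One block write carries the whole list, replacing a call-and-wait
    # round trip per sector.
    target.write_memory_block32(ADDR_ARRAY, sector_addrs)
    target.write_core_register('r0', ADDR_ARRAY)
    target.write_core_register('r1', len(sector_addrs))
    target.write_core_register('pc', ORCH_ENTRY)
    target.resume()
    while not target.is_halted():   # orchestrator hits a bkpt when finished
        pass
```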
-
Well, that's something at least… One thing that may improve pyocd performance is how it handles data buffers. Instead of using an efficient type like …
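For what it's worth, the cost of that kind of buffer conversion is easy to measure in isolation; a purely illustrative micro-benchmark:

```python
# Converting a 1 KB bytes page to a list of 32-bit words, as a pure-Python
# word-at-a-time loop vs a single struct.unpack. Purely illustrative;
# absolute numbers will vary by machine and interpreter.
import struct
import timeit

page = bytes(range(256)) * 4   # 1 KB of sample data

def slow_words(buf):
    return [int.from_bytes(buf[i:i+4], 'little') for i in range(0, len(buf), 4)]

def fast_words(buf):
    return list(struct.unpack('<%dI' % (len(buf) // 4), buf))

print(timeit.timeit(lambda: slow_words(page), number=10000))
print(timeit.timeit(lambda: fast_words(page), number=10000))
```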
-
I also don't think the data buffer handling is a performance problem. If it were, then I think write_block32 would be slow.

It's too bad the flash algorithms weren't simpler (and therefore more efficient). The way I'd do it is have the code that is loaded into SRAM loop through a chunk of data to write to flash, and jump back to the start. I'd have the debug probe set a breakpoint at the branch instruction, and continue execution when the next chunk of data is transferred to SRAM (a sketch follows below).

As for simpler ways of getting pyOCD to run faster, I just tried nuitka with a small program that failed with pypy on Windows 7. The nuitka-compiled version works fine. I'm not sure how to get it to compile a whole package like pyOCD, but from skimming the docs, I think nuitka can do that.
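The loop-in-SRAM scheme, seen from the host side, might look something like this (LOOP_TOP, BUF, and CHUNK are assumed values, and the SRAM-side loader itself isn't shown):

```python
# Sketch of host-side streaming against an SRAM loader that writes CHUNK
# bytes from BUF to flash, then branches back to LOOP_TOP, where a
# breakpoint is set. Addresses are assumptions; error handling is omitted.
import struct

LOOP_TOP = 0x20000000   # address of the branch-back instruction
BUF      = 0x20000100   # SRAM data buffer the loader reads from
CHUNK    = 1024

def stream_program(target, data):
    target.set_breakpoint(LOOP_TOP)
    for off in range(0, len(data), CHUNK):
        chunk = data[off:off + CHUNK].ljust(CHUNK, b'\xff')   # pad last chunk
        words = list(struct.unpack('<%dI' % (CHUNK // 4), chunk))
        target.write_memory_block32(BUF, words)   # hand the loader a chunk
        target.resume()                           # loader writes it to flash
        while not target.is_halted():             # stops at the breakpoint
            pass
    target.remove_breakpoint(LOOP_TOP)
```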
-
@flit pyOCD is really slow if flashing a large application. Set …
-
@Hoohaha That's a really good point. By default, pyocd will try to a) only erase and program sectors that change, and b) not lose data in a partially-programmed sector. These behaviours are controlled by the …

@nerdralph I didn't know about nuitka, very cool! I had been thinking about using cython.

That's basically how it works. First the algo is loaded to RAM and inited. (Sectors are erased first.) Then a buffer of data is copied to RAM, and the algo started running. It loops over the data, writing to flash in whatever size the flash word is (usually 4, 8, or 16 bytes, though more modern ~40nm parts have 256- or 512-byte words). When done, it returns to a …
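In pyOCD terms that sequence maps roughly onto the Flash class like this (here flash would be e.g. session.target.memory_map.get_boot_memory().flash; the inputs are placeholders and double buffering is left out):

```python
# Rough rendering of the erase-then-program flow using pyOCD's Flash class.
# changed_sectors and pages are placeholder inputs; double buffering and
# error checking are omitted.
from pyocd.flash.flash import Flash

def program_changed(flash, changed_sectors, pages):
    flash.init(Flash.Operation.ERASE)       # load algo into RAM, call Init()
    for addr in changed_sectors:
        flash.erase_sector(addr)            # call-and-wait per sector
    flash.uninit()

    flash.init(Flash.Operation.PROGRAM)
    for addr, data in pages:
        flash.program_page(addr, data)      # copy buffer to RAM, run algo
    flash.uninit()
```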
-
@flit Is there any formal documentation of the flash algorithm? I looked at the template code, so I can make some guesses: … And maybe the same ELF needs to have the FlashDevice global symbol defined: … It's not obvious to me how the ELF binary needs to be packaged. Are there any docs for that?

The algorithm could be optimized to reduce the setup for each flash block. I think setting up the 3 arguments to ProgramPage requires 2 write commands for each register, making 6 writes. Then the PC has to be set, and then the target has to be put into the running state. That's ~10 commands. There need to be one or two reads to check the core state to confirm that it reached the breakpoint after programming the page. I'm not sure why there were so many reads, though. In total, two full packets and one partial packet were required to initiate a page write.

If the page data and the 3 words for the arguments are done with block transfers, that would cut the overhead of setting up the ProgramPage arguments for each page. Instead of each page write terminating with a bkpt instruction, if it used my idea of a breakpoint set at a branch instruction, then the page write overhead is reduced to a couple of read commands to confirm the last page is programmed, then a couple of writes to put the target core into the running state.
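On the packaging question: as far as I know, pyOCD's built-in targets don't ship the ELF at all; the algo is pre-extracted into a Python dict along these lines (every value below is a placeholder, not a real part's):

```python
# General shape of the FLASH_ALGO dicts found in pyOCD's built-in target
# files. All values here are placeholders.
FLASH_ALGO = {
    'load_address': 0x20000000,         # where the algo blob goes in RAM
    'instructions': [0xE00ABE00, ...],  # code words; list truncated here
    'pc_init': 0x20000021,
    'pc_unInit': 0x20000045,
    'pc_erase_sector': 0x20000061,
    'pc_program_page': 0x2000008D,
    'static_base': 0x20000200,          # base address for the algo's statics
    'begin_stack': 0x20000800,
    'page_buffers': [0x20001000, 0x20001400],  # two buffers enable double buffering
    'min_program_length': 4,
}
```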
-
I just implemented double-buffering of OUT packets on my probe firmware, and pyOCD is a little faster. Flashing ~22kB to a stm32f103 now takes 3.1s vs 3.9s when PACKET_COUNT was 1. Strangely, openocd isn't any faster, despite the fact that it does support up to 3 queued DAP packets according to the source. I may take a stab at trying to compile pyOCD with nuitka. If I'm feeling more ambitious, I may try writing a basic DAP HID flashing tool in a compiled language like Go.
-
I managed to get probe.rs working with an ELF. 4.45s to flash 23 pages, so even worse than pyOCD.
-
I've been trying to optimize pyOCD flash write speed with my debug probe. After configuring trace debug logs (thanks Chris!), I noticed less than half the time is spent doing block writes. Most of the time seems to be spent generating the AP read and write commands in between block transfers. Here's an excerpt from the logs:

…

It takes until the 2504ms timestamp to get a full packet to write. In total, from the end of one write_block32 until the start of the next write_block32 is 45ms. pyOCD seems to be doing some sort of caching of the parsed flash algorithm, as the time between the 1st and 2nd block writes is about twice as long (~100ms). Is there any easy way to speed this up? The host CPU is a 3.2GHz 64-bit Intel Core.
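For anyone else digging into this, debug/trace logging can also be turned on from a script with stdlib logging; the logger names below are assumptions (pyOCD names its loggers after its modules), so check your version's docs for the exact trace logger names:

```python
# One way to enable the debug/trace logging referred to above, using stdlib
# logging. Logger names are assumptions based on pyOCD's module layout.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(relativeCreated)dms %(name)s %(message)s",  # ms timestamps
)
logging.getLogger("pyocd.probe").setLevel(logging.DEBUG)
logging.getLogger("pyocd.coresight").setLevel(logging.DEBUG)
```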