New circular buffer/memory model #244
Comments
I agree with the need for most of these features.

Regarding elimination of copies, streaming to disk, and streaming to other destinations: I think the best way is to allow the application to do the allocation of buffers. Then a Java direct byte buffer or NumPy array can be used as the destination when the image is copied from the camera. This means that file saving can be written in Java or Python, which seems like a much more practical and flexible way than having C++ code writing to a file (which will be a maintenance burden; we should avoid coding in C++ when it's an option).

In this scheme, popping an image from the Core just means returning a reference, with the Core releasing its reference to the buffer. When the application is done processing the image, it can re-queue the buffer to the Core for its next use. (The Core should always have a stock of buffers to use; exhausting those is a "sequence buffer overflow".)

Such buffers could also be (slices of) memory-mapped files (provided that the Core preserves the order of buffers; this can be an optional setting because reusing the most recently used buffer first may offer better performance in some situations). Buffers wouldn't (normally) be re-queued in this case.

If also providing a live display, the buffer for the currently displayed image can be temporarily retained by the GUI and returned to circulation once that image is no longer being displayed. This can work even with memory-mapped file buffers without making an extra copy. In the case of small images at high frame rates, it might be necessary to bunch multiple images into each buffer to get high enough throughput (but this seems like an optional enhancement that can be added later).

Letting the application allocate the buffers also means that the application can decide when to release the memory, how much memory to allocate for each camera in each acquisition, and whether to use a single pool of buffers for multiple cameras or separate pools. This all seems much better than trying to teach the Core to handle all possible arrangements.

Regarding streaming to a GPU: I don't know what the full constraints would be. Presumably the most theoretically efficient way (aside from the camera talking directly to the GPU, which is probably outside the scope of MMCore) is to copy from the camera driver's DMA buffer directly to GPU memory, but I don't think this is possible with most camera SDKs. If I'm right about that (definitely not an expert here), then the above scheme of application-allocated buffers should be the next best thing, because there will be only one extra copy (which may be impossible to eliminate anyway) and GPU-accelerated Java and Python libraries can be used. And it won't tie the Core to any specific GPU interface du jour (or worse, multiple interfaces).

Another important thing for high performance is to eliminate polling. Ideally there should be no calls to
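To make the ownership hand-off concrete, one possible shape for such an application-allocated buffer exchange is sketched below. This is only an illustration; all of the type and method names are invented and are not an existing or proposed MMCore API.

```cpp
// Hypothetical sketch of application-allocated buffers handed to the Core.
// None of these names exist in MMCore today; this only illustrates the
// ownership hand-off described above.
#include <cstddef>
#include <cstdint>

struct ImageBuffer {
    uint8_t* data;      // memory allocated by the application
    size_t capacity;    // total bytes available
    size_t bytesUsed;   // filled in by the Core/device adapter
};

class BufferQueue {
public:
    // The application pre-registers buffers it has allocated (e.g. backed by
    // a Java direct byte buffer, a NumPy array, or a memory-mapped file slice).
    virtual void enqueueFree(ImageBuffer buf) = 0;

    // The camera thread takes the next free buffer to copy an image into;
    // running out of free buffers is the "sequence buffer overflow" condition.
    virtual bool takeFree(ImageBuffer& out) = 0;

    // The camera thread hands a filled buffer back; the application pops it,
    // processes/saves/displays it, and then calls enqueueFree() again
    // (or not, for one-shot buffers such as memory-mapped file slices).
    virtual void pushFilled(ImageBuffer buf) = 0;
    virtual bool popFilled(ImageBuffer& out) = 0;

    virtual ~BufferQueue() = default;
};
```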
Many helpful design principles for this in: #323 (comment)
I’m continuing here the discussion of circular buffers from the perspective of the Data Streamer device. The kinds of physical devices I had in mind while implementing the streamer continuously produce large amounts of data and stream it to the PC. Some hardware produces data that is easy to understand and visualize (like a stream of values from an analog-to-digital converter), while other data will be impossible to make sense of without a significant amount of processing (for example, time-tagged time-resolved photon counters that produce records of photon arrivals). In order to accommodate the broadest range of hardware, I think it is best for the data streamer to rely on a simple circular buffer that does not make any assumptions about what is contained in it.

Another feature of this data is that data transfers may not all be of the same size. Therefore, on the PC side, the device adapter should first ask the hardware for the expected size of the data transfer, allocate that amount of memory, request the transfer of data, and note if it received less data than expected.
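A minimal sketch of that query-then-transfer pattern is below; the two hardware calls are left as stand-ins for whatever the vendor SDK actually provides (hypothetical names, not real MMDevice or SDK calls).

```cpp
// Hypothetical sketch of the variable-size transfer pattern described above.
// The callbacks stand in for the vendor SDK's "how much data is coming" and
// "transfer it now" calls; they are not real MMDevice or SDK functions.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using QuerySizeFn = std::function<size_t()>;                  // expected transfer size
using TransferFn  = std::function<size_t(uint8_t*, size_t)>;  // returns bytes received

void StreamOneBlock(const QuerySizeFn& getExpectedSize,
                    const TransferFn& transfer,
                    std::vector<uint8_t>& block)
{
    const size_t expected = getExpectedSize();  // ask hardware how much to expect
    block.resize(expected);                     // allocate that amount of memory

    const size_t received = transfer(block.data(), expected);  // request the transfer
    if (received < expected) {
        block.resize(received);                 // note and record the short transfer
    }
    // `block` can now be pushed into the circular buffer, which makes no
    // assumptions about what the bytes mean
}
```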
Here are some additional thoughts based on the discussion above:
This makes sense. Though I think this only really matters for Java, because in Python you can use ctypes to wrap existing native memory in the same process (whereas I don't think this is true in Java, though I may be wrong about this).
Agree.
Yes.
Related to #323 (comment), I'm not clear what, if any, use case there is for using memory mapping, since it doesn't appear to speed up file saving.
Can you say more about this? When might reusing recent buffers boost performance?
Yes, essentially an alternative to
Yes, good to keep this in mind.
Thanks for the many good suggestions, @artmeln!
I think this is a good principle for this buffer in general. Or, for whatever assumptions do exist, have them configurable enough that they can work with use cases we haven't yet thought of.
One way this could work, consistent with the camera SDKs I've looked at, would be for the camera device adapter to request a buffer of a certain size and then signal when it is done with the block of memory. It sounds like you're advocating for making that "done" call take an optional parameter describing how much of the buffer was actually written to, which seems like a nice (optional) feature.
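As a rough illustration of that acquire/finish cycle from the device adapter's point of view (these names are made up for the sketch and are not a proposed API):

```cpp
// Hypothetical sketch of the acquire/finish cycle a camera adapter could use.
// AcquireWriteSlot() and FinishWriteSlot() are illustrative names only.
#include <cstddef>
#include <cstdint>

class ImageBufferPool {
public:
    // Adapter asks for a writable block of at least `bytes` bytes.
    virtual uint8_t* AcquireWriteSlot(size_t bytes) = 0;

    // Adapter signals it is done with the block; passing a bytesWritten value
    // smaller than the requested size records a short (partial) write, as
    // suggested above for variable-size producers.
    virtual void FinishWriteSlot(uint8_t* slot, size_t bytesWritten) = 0;

    virtual ~ImageBufferPool() = default;
};
```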
I think this would have to be left flexible, though perhaps with some special standardized keywords for metadata of particular device types.
This is an interesting idea. @marktsuchida suggests something similar in #323 (comment). @marktsuchida, as I read it, your comment above:
is maybe advocating for something different here? It would be good to figure out the details of what configuration makes most sense here. I don't think this should be explicitly hard-coded in any kind of required way though, because the vast majority of users who write their own processing/display code will not want to do it from C++. But it would be interesting to explore whether buffers can be kept general enough that one could create a second generic buffer and use it for display, without needing anything special.
I don't remember seeing any evidence that memory-mapped files do not speed up saving. They may not be as fast as unbuffered writing (#323, but have we seen a comparison?), but they could still be useful in some situations (e.g., particular file formats implemented in Python). Also, a special case of memory-mapped files (so to speak) is shared memory.

The general point is that allowing the application, not the Core, to allocate destination buffers can eliminate extra copies in several potential situations, including Java direct byte buffers. And allocation-by-Core can still be an optional mode under such a design. (You are right that Python can handle memory allocated by C++ directly and view it as an array, but there are still cases where allowing allocation on the Python side will offer greater flexibility, because it means Python code can use anything to allocate and requires nothing special to manage the buffer lifetime once handed back from the C++ side.)

However, this scheme may also make the buffer design significantly more complex, and given that a separate mechanism (#323) may be available for raw performance, it may not be an absolute requirement. It depends on what use cases you want to support (and BTW, writing out fairly concrete usage scenarios might be a good way to determine what constraints need to be satisfied by the design; without this, it is hard to argue for or against a particular design).
If they are small enough, they may fit in L2 or L3 cache. Even if they are not that small, reusing the same address range may decrease TLB misses. Whether this makes a meaningful difference in our use cases I do not know (and it can only be known by extensive testing). It is also entirely possible that prefetching may eliminate the difference. I wouldn't worry about this too much, especially at this early stage. When it is easy to do so, it is best to maximize spatial and temporal locality.
Yes, a different solution for a different problem. When using the unbuffered writing (#323; which I view as a special enhancement not used by everyone every time), the C++ layer needs to dictate memory allocation, and realistically the images for display (which should be a small fraction of total data) will probably need to be copied once to decouple them from saving. (Theoretically no copy is required, but I suspect it will be complicated to make that work.) In this case, we need at least 2 buffers, one for saving and one for sending images to the app. The latter need only contain a sampling of the images, and the app can choose not to enable it (e.g., if not displaying images in real time).

In contrast, here in this issue we are talking about conventional buffering, without saving in the Core. In this case (in a typical use case), every single image is forwarded to the Python/Java app and saved there; some may be displayed. To do this without extra copies, the app (Python/Java) needs to hold onto each image buffer for the duration of saving/display, so that it is not overwritten or deallocated, and then release it when done. This "releasing" is done by returning the finished buffer back to the Core for reuse or deallocation. This is mandatory if the buffer is allocated by the Core, and optional if allocated by the app. But in most cases reusing the buffers is desirable (unless they are special unique buffers, such as memory-mapped files) because allocation is usually very time-consuming.

In this scenario, you could look at it as the second buffer being implemented on the app side. This is similar to how things currently work: every image is popped from the Core and placed in a Datastore (playing the role of the second buffer); a subset is then displayed. And to tie the two scenarios together: the "second" buffer in the unbuffered-writing (#323) case can be the same thing as the regular sequence buffer used for the conventional case.
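As a sketch of the consuming side of that hand-back cycle (all names hypothetical; shown in C++ only for consistency with the other sketches, though in practice this loop would live in Java or Python):

```cpp
// Hypothetical consumer-side loop for the conventional-buffering case above:
// every image is popped as a reference, saved (and sometimes displayed), and
// the buffer is then handed back to the Core for reuse. None of these names
// are an existing MMCore API.
#include <cstddef>
#include <cstdint>
#include <functional>

struct PoppedImage {
    const uint8_t* pixels = nullptr;
    size_t numBytes = 0;
    uint64_t handle = 0;   // identifies the buffer so it can be released later
};

class SequenceSource {
public:
    virtual bool PopNextImage(PoppedImage& out) = 0;       // a reference, not a copy
    virtual void ReleaseImageBuffer(uint64_t handle) = 0;  // return buffer for reuse
    virtual ~SequenceSource() = default;
};

void ConsumeSequence(SequenceSource& core,
                     const std::function<void(const PoppedImage&)>& save,
                     const std::function<bool(const PoppedImage&)>& shouldDisplay,
                     const std::function<void(const PoppedImage&)>& display)
{
    PoppedImage img;
    while (core.PopNextImage(img)) {
        save(img);                           // every image is saved by the app
        if (shouldDisplay(img)) {
            display(img);                    // only a sampling needs display
        }
        core.ReleaseImageBuffer(img.handle); // hand the buffer back when done
    }
}
```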
I'm inferring this from 1) no one who has done really fast saving in C++ used memory-mapping, and 2) when I was making OME-TIFF/NDTiff I initially tried memory mapping, and the performance was significantly worse than just allocating direct byte buffers.
I'm definitely not against it, just unfamiliar with what these situations are. If you could provide more detail about this it would be helpful.
Yes, this all makes sense. I agree that it would be helpful to list specific key use cases.
I'm confused as to how it would be optional if allocated by the app. If a camera device adapter is requesting a buffer to write memory into, how does the Core know whether it still has an application-allocated buffer to write into?
This and everything else in the comment make sense. It seems appealing to me that display buffers, buffers for unbuffered writing (#323), and buffers that mirror the current setup of copying into the app memory space can all be instances of the same class with different configuration options, unless there is a reason why this cannot be the case.
I opened this issue to solicit ideas and begin planning for what is going to replace the circular buffer. @nicost, had you already opened an issue related to this? I did a quick search but nothing turned up.
@marktsuchida @nicost @nenada @jondaniels @bls337 @fadero @mattersoflight @ieivanov @edyoshikun @dpshepherd @carlkesselman (+ anyone else interested) would love to hear your thoughts and ideas.
A disorganized, partial list of current limitations and desired features for improvement:
Multi-camera support
Currently we use the same buffer for multiple cameras. Determining which image came from which camera is a nightmare as a result, and the image sizes/types of the cameras need to be the same. It may be possible in some cases to run multiple instances of the Core, but this comes with other (device-specific) pitfalls and requires hacks in higher-level code (i.e., the acquisition engine).
Forced copies
Currently camera adapters copy into the buffer with InsertImage. Higher-level code (Java/Python) then copies the data again to get access to it. A new buffer should ideally have the ability to handle pointers to image data, so that code can be written in Java/Python that writes to disk (in the C++ layer) without unnecessary copies. This would make the API more complex and require memory management, which would ideally be hidden from most users by being abstracted away by the acquisition engine.
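One way such a pointer-based insertion could look, purely as an illustration (InsertImagePointer and the release callback are invented names, not a proposed API):

```cpp
// Hypothetical zero-copy alternative to InsertImage(), purely illustrative.
// Instead of copying pixels into the circular buffer, the adapter hands the
// Core a pointer plus a callback that is invoked once higher-level code has
// finished with the image, so the adapter knows when it may reuse the memory.
#include <cstddef>
#include <cstdint>
#include <functional>

int InsertImagePointer(const uint8_t* pixels,
                       size_t numBytes,
                       std::function<void()> onReleased);
```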
Metadata handling
Metadata is not always necessary for every image in a stream, and may impact performance in some situations (e.g., when image size is small relative to metadata size). Flexibility to turn off metadata in certain situations would be a welcome feature.
Streaming directly to special locations
I can imagine scenarios where one might want to stream data directly to a GPU for real-time processing, or stream directly to a file or memory-mapped area with no intermediate processing. How feasible would it be to set up a general mechanism for doing this type of thing? What are the pitfalls we might run into?
Exotic camera types
There are a variety of new types of cameras. For example:
How can we make our memory model generic enough to support things like this, without making it overly complex? Is it possible to also make something that would be compatible with a point-scanning system in the future? We have the concepts of multi-component and multi-channel cameras currently. There may be ways to generalize and improve these.