Combining chunked+non-chunked packed_range APIs? #51

roystgnr · 2020-10-09T13:53:29Z

An idea of mine from some send/receive packed_range discussion with @loganharbour late last night, slightly edited:

What if we always do chunked sends conceptually, but:

We pack any metadata in with the buffer itself, so that we can use a Status to allocate the receive buffer and we don't have to send a size integer separately. To enable better computation+communication interleaving during nonblocking sends, we can avoid packing a total_buffer_size, and just pack a bool in each chunk that says whether subsequent chunks are expected.
If a probe for the first buffer succeeds, we just let probes for the packed receive return true and we allow "semi" non-blocking receives in the API: when a PostWaitUnpackBuffer handles the first received chunk's buffer, it checks to see if subsequent chunks are expected and it treats those with a receive-and-unpack loop (blocking as a whole, even if we use non-blocking vector receives internally to let receive and unpack steps overlap).

Then:

We get all the benefits of the existing chunked case: User code which wants to interleave communication and computation as much as possible within a single send/receive pair can set a moderate max chunk size, and it'll work the way the chunked case does now, so they can begin processing earlier chunks while later chunks are still in transit.

We improve slightly on the existing chunked case flexibility and performance: at least the first chunk of a receive can be non-blocking, and each subsequent chunk's receive can be started (non-blocking internally even if we're using the blocking user API) before the meat of the prior chunk gets unpacked, in cases where a probe for it succeeds.

We get all the benefits of the non-chunked case: User code can set a 2e9 max chunk size and it'll work the way the non-chunked case does now, except for one extra buffer entry and one extra if test. There still won't be the latency of waiting for a separate total_size message, and we'll still be able to allocate the incoming buffer based on a probed Status.

We improve slightly on the existing non-chunked case robustness: If not everything fits in 2e9 entries, and we set a 2e9 max chunk size and try to do a non-blocking receive, then in some cases we end up stupidly twiddling our thumbs while waiting for that next chunk, which sucks ... but right now what we do in that case is timpi_error_msg("Non-blocking packed range sends cannot exceed " << std::numeric_limits<int>::max() << "in size"); so we're still moving in the right direction.

We improve a lot on the flexibility of performance testing: We can experiment with using smaller max chunk sizes to see if we can get some more communication/computation overlap, and this can even be benchmarked without recompiling because the difference between the chunked and the unchunked API is now just the value of an integer variable, not two different methods with different argument lists.

Publishing because I don't have much time to play with this right now and I don't want to forget about it when that changes, but also because I'd love to know if @friedmud can poke any holes in this idea.

The text was updated successfully, but these errors were encountered:

friedmud · 2020-10-09T16:33:23Z

The boolean will work - but I don't like the blocking part. I wouldn't want my non-blocking algorithm to secretly be blocking instead... I would rather get an error.

Can we think of a way to do this completely non-blocking?

friedmud · 2020-10-09T16:37:41Z

BTW: 2e9 objects ought to be enough for anyone.... but seriously, I don't see hitting this limit.

As I was saying to @loganharbour I expect that modern MPI implementations do the "optimal" thing when sending huge messages - they do their own internal chunking that makes sense for their transport type (ethernet, infiniband, etc.). I don't think this is something we need to second guess at this point.

friedmud · 2020-10-09T16:38:59Z

Finally - if I'm sending something that huge (it would probably be more than 2GB!)... I really want that send/receive to be non-blocking! If it were secretly blocking I would be hella pissed!

roystgnr · 2020-10-09T17:01:27Z

they do their own internal chunking

Of course they do. But they then reassemble those chunks before you're allowed to see any of them, which means you can't work on an initial chunk until after the final chunk has arrived and been copied into place.

but seriously, I don't see hitting this limit.

In which case there's no problem with the proposed API here, right? Just pass INT_MAX (which we could leave as the default) if you want one chunk and you don't expect to hit that limit.

Can we think of a way to do this completely non-blocking?

I do wish. But can you at least agree with me that you'd be less pissed about the wait() being potentially partially blocking (the receive() call itself would be non-blocking) than about the current outcome the send crashing?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combining chunked+non-chunked packed_range APIs? #51

Combining chunked+non-chunked packed_range APIs? #51

roystgnr commented Oct 9, 2020

friedmud commented Oct 9, 2020

friedmud commented Oct 9, 2020

friedmud commented Oct 9, 2020

roystgnr commented Oct 9, 2020

Combining chunked+non-chunked packed_range APIs? #51

Combining chunked+non-chunked packed_range APIs? #51

Comments

roystgnr commented Oct 9, 2020

friedmud commented Oct 9, 2020

friedmud commented Oct 9, 2020

friedmud commented Oct 9, 2020

roystgnr commented Oct 9, 2020