Error handling, how to do it? #29
-
I think I disagree. I'm leaning towards exceptions, though I'm open to being convinced of another path.
-
I think the accelerator point is a good one. As long as we recognize that Rust has [...]
I agree we need a reasonably-usable MPI testing framework. I don't know that distributed error handling is really in scope for the project.
-
Just putting it here for reference: mpi-forum/mpi-issues#288. There is also some conversation on error handling. Seems like people generally lean towards exceptions.
-
KokkosComm should provide an abort mechanism that maps to the communication library's abort (like [...]).
I agree that exceptions are difficult (think collective destruction of windows, for example, which stops exception unwinding in a collective call).
I don't like having a mechanism similar to C return codes (which [...]).
A mechanism similar to error callbacks (similar to custom error handlers in MPI) could be useful. The application can either throw (if it knows that the call site can handle exceptions), abort, or force the return of an error code / set a flag to defer handling. It's the most flexible way for users but might require the development of best practices to educate users.
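To make the callback idea a bit more concrete, here is a minimal sketch, assuming C++17. Every name in it (`sketch::Action`, `ErrorHandler`, `Communicator`, `send`, and the three handler policies) is hypothetical and is not KokkosComm's API; `send` just takes a fake status code in place of a real communication call. It only shows how one handler hook could cover the three options: throw, abort, or record the error and return a code.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// All names below are hypothetical; none of this is KokkosComm's API.
namespace sketch {

// What the library should do once the handler returns.
enum class Action { Continue, Abort };

// Handler signature: receives an error code and a short message.
using ErrorHandler = std::function<Action(int code, const std::string& msg)>;

// Policy 1: treat every error as fatal.
inline Action abort_handler(int, const std::string&) { return Action::Abort; }

// Policy 2: throw, for call sites that can handle exceptions.
inline Action throwing_handler(int code, const std::string& msg) {
  throw std::runtime_error(msg + " (code " + std::to_string(code) + ")");
}

// Policy 3: store the code in a caller-owned flag so handling can be deferred.
// The flag must outlive the handler.
inline ErrorHandler deferred_handler(int& flag) {
  return [&flag](int code, const std::string&) {
    flag = code;
    return Action::Continue;
  };
}

class Communicator {
 public:
  explicit Communicator(ErrorHandler h = abort_handler)
      : handler_(std::move(h)) {}

  // Stand-in for a communication call. `status` plays the role of the
  // underlying library's return code (0 means success).
  int send(int status) {
    if (status != 0 && handler_(status, "send failed") == Action::Abort) {
      // A real library would call the communication layer's abort here
      // (MPI_Abort in the MPI case).
      std::abort();
    }
    return status;  // the error code is still returned to the caller
  }

 private:
  ErrorHandler handler_;
};

}  // namespace sketch
```

With `deferred_handler`, a failing `send` returns the nonzero code and sets the flag. With `throwing_handler`, it raises `std::runtime_error`. With the default, it terminates. The handler is invoked only on the rank that observed the error, so nothing here adds communication.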
-
Can I ask a dumb question - does anyone have an example of an application that checks and does something upon an MPI error? The applications I'm familiar with just treat all MPI errors as fatal. Is the only real use case here some kind of resilience thing?
-
Error handling is notoriously hard to do well, especially in HPC, when dealing with multiple threads, accelerators, and, in our case, MPI ranks.
We can list here our thoughts on designing the safest (and still efficient) API.
We must avoid exceptions (usually, C++ developers do not oppose this one!).
But then, how do you deal with errors?
* `int` return type, like `MPI`. I'm not too fond of this way, as it is easy not to check the return value.
* `abort` if we think the error is not recoverable. As a library, we should avoid this brutal termination.
* `std::expected`, but unfortunately, it is C++23. For some methods, we can use `std::optional`, even if it is not, strictly speaking, error handling.

The library must not check global coherency and must never initiate more communications than the user wants (so there is no collective to ensure every rank is okay).
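As a rough illustration of the return-value options above, here is a minimal sketch, assuming C++17 is available. `sketch::Result`, `sketch::Errc`, and `neighbor_rank` are hypothetical names invented for this example, not an existing or proposed KokkosComm API. `Result<T>` is a small stand-in for `std::expected<T, Errc>` built on `std::optional`, and the C++17 `[[nodiscard]]` attribute asks the compiler to warn when a caller ignores the result, which is the main weakness of plain `int` return codes.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// All names below are hypothetical, for illustration only.
namespace sketch {

enum class Errc { Ok = 0, InvalidRank };

// Minimal C++17 stand-in for std::expected<T, Errc>.
// [[nodiscard]] on the class makes compilers warn when a function's
// returned Result is discarded by the caller.
template <typename T>
class [[nodiscard]] Result {
 public:
  Result(T value) : value_(std::move(value)), error_(Errc::Ok) {}
  Result(Errc error) : error_(error) {}

  bool has_value() const { return value_.has_value(); }
  explicit operator bool() const { return has_value(); }

  // Precondition: has_value() is true.
  const T& value() const {
    assert(has_value());
    return *value_;
  }
  Errc error() const { return error_; }

 private:
  std::optional<T> value_;
  Errc error_;
};

// Purely local check: returns the next rank in a ring of `size` ranks,
// or an error if `rank` is out of range. No communication is involved.
inline Result<int> neighbor_rank(int rank, int size) {
  if (size <= 0 || rank < 0 || rank >= size) return Errc::InvalidRank;
  return (rank + 1) % size;
}

}  // namespace sketch
```

A caller would write something like `if (auto r = sketch::neighbor_rank(rank, size)) { use(r.value()); } else { handle(r.error()); }`, where `use` and `handle` are placeholders. Writing `sketch::neighbor_rank(rank, size);` on its own line should produce a compiler warning instead of silently dropping the error.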