Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor cleanup in executor.cpp #2750

Merged
merged 16 commits into from
Sep 8, 2024
Merged

Minor cleanup in executor.cpp #2750

merged 16 commits into from
Sep 8, 2024

Conversation

csarofeen
Copy link
Collaborator

@csarofeen csarofeen commented Aug 2, 2024

  • Add NVF_THROW
  • Add more information in bindInputs error
  • Add IValue debug string function
  • Cleanup some stale code

Was just going through executor a bit to get oriented and tried to do a bit of cleanup.

Improves error handling when invalid static sizes are provided to nvFuser. Specifically, when finding incompatible sizes within void ExpressionEvaluator::propagateBoundValuesThroughExactMaps(Fusion* fusion) {.

Before this code would throw an error like:

C++ exception with description "known_size == this_size INTERNAL ASSERT FAILED at "/opt/pytorch/Fuser/csrc/expr_evaluator.cpp":362, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Conflicting sizes: 654, 456

Now it looks like:

When trying to propagate constant tensor sizes through the graph a conflict was found with 2 different sizes across dimensions that are expected to match.
For Producer TV: T1_g[ iS2{( (nvfuser_index_t)(i0) )}, iS3{( (nvfuser_index_t)(i1) )} ] id: iS3{( (nvfuser_index_t)(i1) )} found size: 654
For Consumer TV: T2_g[ iS4{( (nvfuser_index_t)(i0) )}, iS5{( (nvfuser_index_t)(i1) )} ] id: iS5{( (nvfuser_index_t)(i1) )} found size: 456
With producer-consumer relationship through the expression: T2_g[ iS4{( (nvfuser_index_t)(i0) )}, iS5{( (nvfuser_index_t)(i1) )} ]
= T0_g[ iS0{( (nvfuser_index_t)(i0) )}, iS1{( (nvfuser_index_t)(i1) )} ]
+ T1_g[ iS2{( (nvfuser_index_t)(i0) )}, iS3{( (nvfuser_index_t)(i1) )} ];

- Add NVF_THROW
- Add more information in bindInputs error
- Add IValue debug string function
- Cleanup some stale code
#define NVF_ERROR(cond, ...) \
if ((!(cond))) { \
nvfuser::nvfErrorFail( \
#define NVF_THROW(...) \
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanted a throw to be able to get some better error messages in bindInputs.

csrc/utils.cpp Show resolved Hide resolved
csrc/utils.cpp Show resolved Hide resolved
csrc/utils.cpp Show resolved Hide resolved
Comment on lines 712 to 717
ss << "When trying to run the provided host program,"
<< " there was an error with the provided input " << i
<< ". Provided input was:\n ";
ss << PolymorphicValue_functions::toString(*args[i]);
ss << "\n which does not match the expected input:\n ";
ss << inputs[i]->toString() << "\n";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is e.what() worth keeping in the new exception message? It has the original source code location at least, which I think helps debugging.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this instance I didn't think it was particularly helpful, as it's the caller that we would typically want to point at, not the inside of expression evaluator which is the first to find it. I think we'd actually want the error to be thrown higher where bindInputs is called as I expect most instances that would fail are because of providing bad inputs.

WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I trust your judgement -- I haven't ran into enough errors to have a strong opinion.

AFAICT, ExpressionEvaluator::bind is a deep function that can fail at

"Could not evaluate metadata expression for ",
. While I agree with you that "most instances that would fail are because of providing bad inputs", why an input is bad might not be immediately clear to the user.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, that's why I lifted the error in the common place we bind a bunch of inputs to expression evaluator provided from someplace not generated by nvFuser (developers in tests and Thunder in integration).

csrc/utils.cpp Outdated Show resolved Hide resolved
csrc/utils.cpp Outdated Show resolved Hide resolved
csrc/executor.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@jjsjann123 jjsjann123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for removing the dead code!

csrc/executor.cpp Show resolved Hide resolved
std::vector<at::Tensor> outputs,
ExpressionEvaluator& expr_eval);

// TODO: args shouldn't come in a reference here because we will append the
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjsjann123 see the todo here. We don't ever use args as they're updated with the outputs. We always pass it back as an array of tensors. So it is different behavior with evaluateFusionOutputs, but that seems to be the behavior we should want.

Cleanup printing.

Co-authored-by: Jingyue Wu <[email protected]>
Comment on lines 712 to 717
ss << "When trying to run the provided host program,"
<< " there was an error with the provided input " << i
<< ". Provided input was:\n ";
ss << PolymorphicValue_functions::toString(*args[i]);
ss << "\n which does not match the expected input:\n ";
ss << inputs[i]->toString() << "\n";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I trust your judgement -- I haven't ran into enough errors to have a strong opinion.

AFAICT, ExpressionEvaluator::bind is a deep function that can fail at

"Could not evaluate metadata expression for ",
. While I agree with you that "most instances that would fail are because of providing bad inputs", why an input is bad might not be immediately clear to the user.

@csarofeen
Copy link
Collaborator Author

!build

@csarofeen
Copy link
Collaborator Author

!build

@csarofeen csarofeen requested a review from wujingyue August 28, 2024 21:21
@csarofeen
Copy link
Collaborator Author

I went back over the error handling, I could use another review please, @jjsjann123 @wujingyue. Focused on changes in expr_evaluator.cpp @wujingyue and @jjsjann123 because I touched the ops testing in python.

void handlePropagateError(
Fusion* fusion,
ExpressionEvaluator* expr_eval,
std::shared_ptr<VectorOfUniqueEntries<const IterDomain*>> id_set) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::shared_ptr<VectorOfUniqueEntries<const IterDomain*>> id_set) {
const VectorOfUniqueEntries<const IterDomain*>& id_set) {

Passing shared_ptr is slightly more expensive than passing in the underlying reference. The latter seems to suffice because id_set outlives this function.

csrc/expr_evaluator.cpp Outdated Show resolved Hide resolved
csrc/expr_evaluator.cpp Outdated Show resolved Hide resolved
csrc/expr_evaluator.cpp Outdated Show resolved Hide resolved
csrc/expr_evaluator.cpp Outdated Show resolved Hide resolved
csrc/expr_evaluator.cpp Outdated Show resolved Hide resolved
csrc/expr_evaluator.cpp Outdated Show resolved Hide resolved
// outputs to be able to send it to the kernel. For now none of the users are
// reconsuming the args, so it is okay. It isn't done now because changing it
// from a reference makes a call as runFusion({}) ambiguous, and that is used
// in some places in the codebase.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be really interesting to see how we can resolve this cleanly, but I don't know any way to do that without changing the call site to not use braced-init-list. This gives me headache: https://en.cppreference.com/w/cpp/language/overload_resolution#Implicit_conversion_sequence_in_list-initialization

I don't know anything that can help overload resolution? wondering if @zasdfgbnm knows any dark magic?

@csarofeen
Copy link
Collaborator Author

!build

tests/cpp/test_alias.cpp Outdated Show resolved Hide resolved
@csarofeen
Copy link
Collaborator Author

!build

@csarofeen csarofeen merged commit e9d27a2 into main Sep 8, 2024
35 of 36 checks passed
@csarofeen csarofeen deleted the executor_cleanup branch September 8, 2024 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants