
Remove truediv operation #2837

Closed
wants to merge 1 commit into from

Conversation

@rdspring1 (Collaborator) commented Aug 23, 2024

This PR removes the truediv operation, which mapped to torch.true_divide and torch.div(a, b, rounding_mode=None).

Why?

  • It is difficult to distinguish truediv from div in the Fusion IR.
  • The Thunder prim.div maps to the div operator; the other flavors of division are implemented through this prim.
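For reference, Python's own `/` operator implements the true-division behavior the removed op mapped to: it always yields a floating-point quotient, even for integer operands with an exact integer result. A minimal plain-Python illustration (not nvfuser code):

```python
# True division always promotes to float, even when the quotient is exact.
a, b = 7, 2
print(a / b)   # 3.5
print(6 / 3)   # 2.0 — a float, not the int 2
```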

csrc/ir/nodes.cpp (outdated review thread, resolved)
@jacobhinkle (Collaborator)

What exactly are the semantics of our div meant to be? For ints it is truncdiv (C++ int division). If either arg is float, then it is truediv. My point is that we do not actually implement python div (floordiv), so what is the value in separately supporting truediv?
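The distinction drawn here can be made concrete in plain Python (a sketch, not nvfuser code): C++ integer division truncates toward zero, Python's `//` floors toward negative infinity, and true division returns the exact float quotient. The three only diverge on negative operands:

```python
import math

a, b = -7, 2
trunc = math.trunc(a / b)   # C++-style int division: rounds toward zero -> -3
floor = a // b              # Python floordiv: rounds toward -inf -> -4
true_q = a / b              # true division: exact float quotient -> -3.5
print(trunc, floor, true_q)
```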

@rdspring1 (Collaborator, Author) commented Aug 23, 2024

Nvfuser's truediv is like torch.div with rounding_mode=None

NVFUSER_DEFINE_BINARY_FLOAT_OP(truediv, Div)
BinaryOpType::op_type, v1, v2, TypePromotion::float_op_config);

Are you proposing to just remove truediv?

@jacobhinkle (Collaborator)

Are you proposing to just remove truediv?

I'm not sure what I'm proposing. You're right that in the kernel our Div will cast int inputs to float, but that's not true in our evaluator. That's probably just a bug in the evaluator, but I think that behavior is also used in scalar arithmetic, e.g. in #2818.

I guess for ultimate clarity we should have all these separate nodes/ops: truediv, truncdiv, floordiv, and ceildiv and nothing just called div.
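The four flavors named above can be sketched side by side in plain Python (hypothetical helper names, for illustration only; the float-based rounding is fine for small integers):

```python
import math

def truediv(a, b):  return a / b              # exact float quotient
def truncdiv(a, b): return math.trunc(a / b)  # round toward zero (C++ int div)
def floordiv(a, b): return math.floor(a / b)  # round toward -inf (Python //)
def ceildiv(a, b):  return math.ceil(a / b)   # round toward +inf

# On -7 / 2 the four disagree: -3.5, -3, -4, -3
print(truediv(-7, 2), truncdiv(-7, 2), floordiv(-7, 2), ceildiv(-7, 2))
```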

@rdspring1 (Collaborator, Author)

I believe truediv isn't used by Thunder, so maybe we can remove it for clarity. For now, I removed truediv from expression evaluator.

@jjsjann123 (Collaborator) left a comment

This is used for translating CPP Fusion to python FusionDefinition

What's the usage here? Can you add an example to show that?

I'm also not sure we actually do the right thing here, i.e. we promote tensors to floating point before div, but do we do the same for scalars? I share the concern @jacobhinkle raised in his comment. But I don't think we have to resolve everything in this PR, hence the ask for an example of the problem we want to patch.

BTW, our python binding binds __truediv__ to div. We might want to change that as well, since I think that's also relevant here?

The division handling in nvfuser is a concerning question. We should open another issue to track it. Does that sound good to you @jacobhinkle?

csrc/type.cpp Outdated
@@ -606,6 +608,7 @@ static const char* binary_op_type_inline_op2string(BinaryOpType t) {
case BinaryOpType::Add:
return "+";
case BinaryOpType::Div:
case BinaryOpType::Truediv:
Collaborator:

but / isn't truediv though.

Collaborator:

I feel this is getting scarier here. A binary op with BinaryOpType::Truediv doesn't do truediv?!

@jacobhinkle (Collaborator)

Does that sound good to you @jacobhinkle ?

💯

@rdspring1 (Collaborator, Author)

What's the usage here? Can you add an example to show that?

It is this series of PRs.

@rdspring1 (Collaborator, Author) commented Aug 27, 2024

@jjsjann123 Why not get rid of truediv in the python-frontend? It isn't used in Thunder. Division semantics are a big headache.

@rdspring1 (Collaborator, Author)

Do we need to support truediv, truncdiv, floordiv? Thunder already handles this.

@jjsjann123 (Collaborator)

What exactly are the semantics of our div meant to be? For ints it is truncdiv (C++ int division). If either arg is float, then it is truediv. My point is that we do not actually implement python div (floordiv), so what is the value in separately supporting truediv?

Looks like the Thunder prims only have C++-style div and py_floordiv (which we are not claiming anyway).
I don't think it makes sense to have the others in the API, since we do not have correct support at this moment.

@rdspring1 changed the title from "Add truediv BinaryOpType" to "Remove truediv operation" on Sep 3, 2024
@rdspring1 requested a review from jjsjann123 on September 4, 2024
@rdspring1 (Collaborator, Author)

@jjsjann123 I changed this PR to remove the truediv operation.

@jjsjann123 (Collaborator) left a comment

I think I see why you want to have a BinaryOpType::Truediv. In #2841, this is just so we have a clean 1-to-1 translation for the truediv part?

I'm not sure how big of a deal it is to have that 1-to-1 translation. We don't have to keep a truediv binary op; having it implicitly handled by floating-point promotion followed by a C++-style div should be good enough.

Not having truediv is somewhat of an inconvenience IMHO. Thunder doesn't have to use it, but we still have our python API where I can define a fusion definition manually. So I tend to think maybe we want to keep it, but I don't have a strong opinion there either.
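The "floating-point promotion followed by a C++-style div" approach described here can be sketched in plain Python (hypothetical helpers, assuming only a raw truncating div primitive is available):

```python
import math

def raw_div(a, b):
    # Stand-in for a C++-style div: truncating for ints, exact for floats.
    if isinstance(a, int) and isinstance(b, int):
        return math.trunc(a / b)
    return a / b

def truediv_via_promotion(a, b):
    # Promote both operands to floating point first, then apply the plain div.
    return raw_div(float(a), float(b))

print(raw_div(7, 2))                # 3   — int inputs truncate
print(truediv_via_promotion(7, 2))  # 3.5 — promotion recovers true division
```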

csrc/ops/arith.h Outdated
NVF_API Val* truediv(Val* v1, Val* v2);
NVF_API TensorView* truediv(TensorView* v1, Val* v2);
NVF_API TensorView* truediv(Val* v1, TensorView* v2);
NVF_API TensorView* truediv(TensorView* v1, TensorView* v2);
Collaborator:

Naive question: I don't think we need to (or should) remove the C++ truediv API. Since these do float promotion, it is a truediv in nature, even though it's not a BinaryOpType::TrueDiv.


@@ -1443,7 +1443,6 @@ void initNvFuserPythonBindings(PyObject* module) {
NVFUSER_PYTHON_BINDING_BINARY_OP("add", add)
NVFUSER_PYTHON_BINDING_BINARY_OP("atan2", atan2)
NVFUSER_PYTHON_BINDING_BINARY_OP("div", div)
NVFUSER_PYTHON_BINDING_BINARY_OP("truediv", truediv)
Collaborator:

Note: there's also an entry for __truediv__ that is translated to div. We have a comment saying that our div is true div, which I think is a lie.

```python
import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id1(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Int32, is_cpu=False, stride_order=[1, 0])
    T1 = fd.define_tensor(shape=[-1, -1, -1], contiguity=[True, True, True], dtype=DataType.Int32, is_cpu=False, stride_order=[2, 1, 0])
    T2 = T0 / T1
    #T2 = fd.ops.truediv(T0, T1)
    fd.add_output(T2)

with FusionDefinition() as fd:
    nvfuser_fusion_id1(fd)

t0 = torch.ones((20,), dtype=torch.int32, device='cuda:0').as_strided((10, 2), (2, 1))
t1 = torch.ones((40,), dtype=torch.int32, device='cuda:0').as_strided((2, 10, 2), (20, 2, 1))
t1 += 1

inputs = [t0.clone(), t1.clone()]

o = fd.execute(inputs)[0]

print(o)
```
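The surprise the repro above demonstrates, reduced to plain Python: t0 holds ones and t1 holds twos, so element-wise this is 1 divided by 2. C++-style integer division gives 0, while the `/` operator (the `__truediv__` the binding claims to implement) should give 0.5:

```python
import math

cpp_style = math.trunc(1 / 2)  # what an integer div kernel computes -> 0
py_truediv = 1 / 2             # what __truediv__ promises -> 0.5
print(cpp_style, py_truediv)
```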

@jjsjann123 (Collaborator) left a comment

BTW, this isn't necessarily caused by your PR. I'm glad you are exposing these issues. ❤️

Two things that I do have a stronger opinion on:

  1. In this PR, the BinaryOpType::Truediv isn't doing true div, so that doesn't feel right.
  2. Not in your code, but our python binding has
// In PyTorch, __div__ (//) and __truediv__ (/) are different.
// When applied to integer-dtype arguments, they do as expected, returning
// integer and float outputs, respectively. When applied to two floating-type
// arguments, they return the floor of division for // and plain division for
// /. When applied to mixed types, the types are promoted, so the
// floating-point behavior is returned.
// Our div operator matches the __truediv__ behavior, so we do not implement
// __div__.
NVFUSER_PYTHON_BINDING_BINARY_OP_SPECIAL("__truediv__", "div", div)

This is just totally wrong and we should fix it.
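For the record, the quoted comment's parenthetical is itself mistaken: in Python 3, `//` dispatches to `__floordiv__`, not `__div__` (which no longer exists), and on floats `//` still floors rather than returning the plain quotient. A quick plain-Python check:

```python
# Python 3 operator protocol: `/` -> __truediv__, `//` -> __floordiv__.
assert hasattr(int, "__truediv__") and hasattr(int, "__floordiv__")
assert not hasattr(int, "__div__")  # __div__ existed only in Python 2

# Even on floats, // floors; it does not return the plain quotient.
print(7.0 // 2.0)  # 3.0
print(7.0 / 2.0)   # 3.5
```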

@rdspring1 closed this Dec 13, 2024
rdspring1 added a commit that referenced this pull request Dec 14, 2024
This PR updates the comments for `__truediv__` operator defined in
python bindings. The current comment does not reflect what the code
actually does.

Reference:
#2837 (review)
jacobhinkle pushed a commit that referenced this pull request Dec 16, 2024
This PR updates the comments for `__truediv__` operator defined in
python bindings. The current comment does not reflect what the code
actually does.

Reference:
#2837 (review)