OpInfo has problems testing define_tensor. #3225
Since #3222 is merged, you can now reproduce this by doing the following:
$ git checkout wjy/define
$ pytest tests/python/test_ops.py -k test_correctness_define_tensor_float32 -s
========================================================================================================================================================================================================================================= test session starts =========================================================================================================================================================================================================================================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.5.0
Test order randomisation NOT enabled. Enable with --random-order or --random-order-bucket=<bucket_type>
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /opt/pytorch/nvfuser
plugins: xdist-3.6.1, timestamper-0.0.10, hypothesis-6.112.2, cov-5.0.0, timeout-2.3.1, random-order-1.1.1, mpi-0.6, benchmark-4.0.0, shard-0.1.2, typeguard-4.3.0
collected 896 items / 895 deselected / 1 selected
Running 1 items in this shard
tests/python/test_ops.py F
============================================================================================================================================================================================================================================== FAILURES ===============================================================================================================================================================================================================================================
_______________________________________________________________________________________________________________________________________________________________________________________________________________________________ test_correctness_define_tensor_float32 ________________________________________________________________________________________________________________________________________________________________________________________________________________________________
    def test():
        # Ref: https://github.com/pytorch/pytorch/blob/aa8ea1d787a9d21b064b664c5344376265feea6c/torch/testing/_internal/common_utils.py#L2251-L2263
        # > CUDA device side error will cause subsequence test cases to fail.
        # > stop entire test suite if catches RuntimeError during torch.cuda.synchronize().
        if torch.cuda.is_initialized():
            try:
                torch.cuda.synchronize()
            except RuntimeError as rte:
                pytest.exit(
                    "TEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failure"
                )
>       return template(opinfo, dtype)
tests/python/opinfo_framework.py:30:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/python/test_ops.py:215: in test_correctness
return serde_test_fn(op, dtype)
tests/python/test_ops.py:206: in serde_test_fn
result = correctness_test_fn(op.reference_type, op, sample)
tests/python/test_ops.py:190: in correctness_test_fn
return torch_correctness_test_fn(_fd_fn, nvf_op, sample)
tests/python/test_ops.py:86: in torch_correctness_test_fn
nvfuser_result = fd.execute(inputs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self =
def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
T0 = fd.define_tensor(shape=[1, -1], contiguity=[None, Tr...ue, None], dtype=DataType.Float, is_cpu=False, stride_order=[0, 1])
T2 = fd.ops.add(T0, T1)
fd.add_output(T2)
, inputs = [tensor([[-6.7103, 5.7013]], device='cuda:0')]
def execute(
self,
inputs,
*,
device=None,
override_user_schedule=False,
capture_debug_output=False,
print_repro=False,
profile=False,
save_repro_inputs=False,
):
"""
Executes an nvFuser set of kernels for a given Fusion
The FusionDefinition will be executed on a single CUDA device.
Typically, which device to run on is determined by the devices where
the input tensors reside. However, if the Fusion is defined such that
none of the inputs are tensors, we are not able to infer a device from
the inputs. For example, the following FusionDefinition will be unable
to unambiguously infer the device of its output:
with FusionDefinition() as fd:
tv1 = fd.ops.full([5])
fd.add_output(tv1)
In that case, we default to selecting the first CUDA
device, i.e. `torch.device("cuda:0")`. This method enables selecting an
alternative preferred device.
Args:
inputs (List[Union[Tensor, Scalar]]): A list of inputs to fusion.
Kwargs:
device (Optional[Union[int, str, torch.device]]): This is a hint to run
the Fusion on the given CUDA device. This is not typically
necessary, as the device is usually inferred from the locations
of input tensors. However, for some fusion definitions, no
tensors will be input (for example when all tensors are
generated with `full` or `uniform` ops). In these cases, we
must either tell NVFuser where to run the resulting kernel, or
let it default to 0. Note that passing this option providing
and input tensors that lie on another device is an error.
override_user_schedule (bool): For a user defined schedule,
override with auto-generated schedule (default: False)
capture_debug_output (bool): Whether to capture any printed
debugging information as a string. If True, the string can be
retrieved after execution using :meth:`get_debug_output`. If False,
then that method will return None when called.
print_repro (bool): Prints a reproduction script to stdout.
profile (bool): Captures a CUPTI based profile of a fusion.
save_repro_inputs (bool): Saves the inputs for last_repro_script() to
provide a provide a reproduction script.
Returns:
List[Tensor]
"""
self.profiled = profile
if device is not None:
if not isinstance(device, torch.device):
device = torch.device(device)
assert (
device.type == "cuda"
), "If device argument is passed it must be a CUDA device"
device = device.index
# if definition is not defined by a context manager, try a child class
if self.id() is None:
self._setup_definition()
self.definition()
self._finalize_definition()
defined_multidevice_schedule = hasattr(
self, "multidevice_schedule"
) and isinstance(self.multidevice_schedule, Callable)
defined_schedule = hasattr(self, "schedule") and isinstance(
self.schedule, Callable
)
assert not (
defined_multidevice_schedule and defined_schedule
), "I haven't tested what if both are defined. We don't plan to support this use case although it may just work."
if defined_multidevice_schedule:
# Unlike `schedule`, `multidevice_schedule` is designed for inter-device
# scheduling, The scheduling is done before concretization and therefore
# before pre-segmentation. `schedule` however assumes the FusionDefinition
# has been concretized and pre-segmented, and therefore requires
# `_setup_schedule` and `_finalize_schedule` to be called before and after.
#
# Note: there's a plan to embed multidevice schedules into FusionDefinition
# as annotating nodes. This may eventually replace `multidevice_schedule`.
self.multidevice_schedule()
# If schedule is defined by child class and schedule is not defined for
# inputs, make a schedule.
if defined_schedule:
# Schedule fusion if it does not exist yet or profiling fusion
if profile or not self._exist_schedule(inputs):
self._setup_schedule(inputs, overwrite_existing_schedule=profile)
self.schedule()
self._finalize_schedule(inputs)
if save_repro_inputs:
from torch._subclasses.fake_tensor import FakeTensorMode
fake_mode = FakeTensorMode()
self.fake_inputs = [fake_mode.from_tensor(inp) for inp in inputs]
results = None
try:
> results = self._execute(
inputs,
device=device,
override_user_schedule=override_user_schedule,
capture_debug_output=capture_debug_output,
profile=profile,
)
E RuntimeError: INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/runtime/executor_utils.cpp":708, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. KernelArgumentHolder contains less argument than kernel's input.
E Exception raised from bindInputs at /opt/pytorch/nvfuser/csrc/runtime/executor_utils.cpp:708 (most recent call first):
E frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7ff7946f48e7 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x53 (0x7ff794aac533 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E frame #2: nvfuser::executor_utils::bindInputs(nvfuser::KernelArgumentHolder const&, nvfuser::Fusion*) + 0xb3a (0x7ff794d8fb3a in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E frame #3: <unknown function> + 0x7f41cc (0x7ff794da91cc in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E frame #4: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xa9 (0x7ff794daaa39 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E frame #5: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, std::optional<signed char>, bool, bool, bool) const + 0x796 (0x7ff794f195a6 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E frame #6: <unknown function> + 0x1cc00e (0x7ff79478100e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E frame #7: <unknown function> + 0x24a21f (0x7ff7947ff21f in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E frame #8: <unknown function> + 0x2df550 (0x7ff794894550 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
E frame #9: <unknown function> + 0x15cb2e (0x57fe59be9b2e in /usr/bin/python3)
E frame #10: _PyObject_MakeTpCall + 0x25b (0x57fe59be02db in /usr/bin/python3)
E frame #11: <unknown function> + 0x16b55b (0x57fe59bf855b in /usr/bin/python3)
E frame #12: _PyEval_EvalFrameDefault + 0x1983 (0x57fe59bd3b93 in /usr/bin/python3)
E frame #13: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #14: _PyEval_EvalFrameDefault + 0x8ab (0x57fe59bd2abb in /usr/bin/python3)
E frame #15: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #16: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E frame #17: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #18: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E frame #19: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #20: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E frame #21: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #22: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E frame #23: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #24: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
E frame #25: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #26: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
E frame #27: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #28: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
E frame #29: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)
E frame #30: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
E frame #31: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #32: _PyObject_FastCallDictTstate + 0x16d (0x57fe59bdf51d in /usr/bin/python3)
E frame #33: _PyObject_Call_Prepend + 0x5c (0x57fe59bf52bc in /usr/bin/python3)
E frame #34: <unknown function> + 0x2826d0 (0x57fe59d0f6d0 in /usr/bin/python3)
E frame #35: _PyObject_MakeTpCall + 0x25b (0x57fe59be02db in /usr/bin/python3)
E frame #36: _PyEval_EvalFrameDefault + 0x72ea (0x57fe59bd94fa in /usr/bin/python3)
E frame #37: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #38: _PyEval_EvalFrameDefault + 0x8ab (0x57fe59bd2abb in /usr/bin/python3)
E frame #39: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #40: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
E frame #41: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #42: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
E frame #43: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)
E frame #44: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
E frame #45: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #46: _PyObject_FastCallDictTstate + 0x16d (0x57fe59bdf51d in /usr/bin/python3)
E frame #47: _PyObject_Call_Prepend + 0x5c (0x57fe59bf52bc in /usr/bin/python3)
E frame #48: <unknown function> + 0x2826d0 (0x57fe59d0f6d0 in /usr/bin/python3)
E frame #49: PyObject_Call + 0xbb (0x57fe59bf8ebb in /usr/bin/python3)
E frame #50: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
E frame #51: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #52: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E frame #53: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)
E frame #54: _PyEval_EvalFrameDefault + 0x1983 (0x57fe59bd3b93 in /usr/bin/python3)
E frame #55: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #56: _PyEval_EvalFrameDefault + 0x6bc (0x57fe59bd28cc in /usr/bin/python3)
E frame #57: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #58: _PyEval_EvalFrameDefault + 0x1983 (0x57fe59bd3b93 in /usr/bin/python3)
E frame #59: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #60: _PyEval_EvalFrameDefault + 0x285e (0x57fe59bd4a6e in /usr/bin/python3)
E frame #61: _PyFunction_Vectorcall + 0x7c (0x57fe59bea42c in /usr/bin/python3)
E frame #62: _PyEval_EvalFrameDefault + 0x613a (0x57fe59bd834a in /usr/bin/python3)
E frame #63: <unknown function> + 0x16b281 (0x57fe59bf8281 in /usr/bin/python3)
nvfuser/__init__.py:181: RuntimeError
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Captured log call ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ERROR nvfuser:__init__.py:192 An error occurred while executing nvFuser FusionDefinition 0.
If you believe this is a bug or need assistance, please file an issue at https://github.com/NVIDIA/Fuser/issues/new
Here's a script to reproduce the error:
```python
# CUDA devices:
#  0: NVIDIA RTX 6000 Ada Generation
#  1: NVIDIA RTX 6000 Ada Generation
# torch version: 2.6.0a0+git0eba7e5
# cuda version: 12.6
# nvfuser version: 0.2.15+gitf01caf7
import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True], dtype=DataType.Float, is_cpu=False, stride_order=[1, 0])
    T1 = fd.define_tensor(shape=[1, 2], contiguity=[True, None], dtype=DataType.Float, is_cpu=False, stride_order=[0, 1])
    T2 = fd.ops.add(T0, T1)
    fd.add_output(T2)

with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)

inputs = [
    torch.testing.make_tensor((1, 2), dtype=torch.float32, device='cuda:0'),
]
fd.execute(inputs)
```
Traceback (most recent call last):
|
I am not sure from the description what the problems are. Was some new testing attempted for |
#3225 (comment) has an updated repro. So far, test_ops.py has tested ops.define_tensor only for invalid cases, to check that it throws the right error/exception. When I attempted to test ops.define_tensor for valid cases, I couldn't find a way to get the "generator" to work. That said, I'm unsure about the root cause, and the generator might be trivial to fix. |
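One possible direction for valid cases is to derive the `define_tensor` arguments from each concrete input, so that every symbolic tensor has exactly one matching input. The sketch below is a hypothetical helper, not part of `test_ops.py` or the nvFuser API. It assumes a contiguous, row-major input and that size-1 dimensions get contiguity `None`, which is what the `T0` line in the repro shows for a `(1, 2)` input.

```python
from typing import Dict, List, Optional, Sequence


def define_tensor_kwargs(concrete_shape: Sequence[int]) -> Dict[str, list]:
    """Hypothetical helper: build define_tensor-style kwargs for a
    contiguous, row-major tensor of the given concrete shape."""
    rank = len(concrete_shape)
    # Size-1 dimensions stay concrete; every other extent becomes dynamic (-1).
    shape: List[int] = [1 if s == 1 else -1 for s in concrete_shape]
    # Assumption: size-1 dimensions carry no contiguity information (None).
    contiguity: List[Optional[bool]] = [
        None if s == 1 else True for s in concrete_shape
    ]
    # Row-major layout: the outermost dimension gets the highest stride rank.
    stride_order = list(range(rank - 1, -1, -1))
    return {"shape": shape, "contiguity": contiguity, "stride_order": stride_order}


print(define_tensor_kwargs((1, 2)))
```

For a `(1, 2)` input this yields `shape=[1, -1]`, `contiguity=[None, True]`, and `stride_order=[1, 0]`, the same values as `T0` in the repro. A generator built this way would emit one input per defined tensor.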
Just by looking at the failed definition, the fusion expects two input tensors but receives only one.

    def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
        T0 = fd.define_tensor(shape=[1, -1], contiguity=[None, True], dtype=DataType.Float, is_cpu=False, stride_order=[1, 0])
        T1 = fd.define_tensor(shape=[1, 2], contiguity=[True, None], dtype=DataType.Float, is_cpu=False, stride_order=[0, 1])
        T2 = fd.ops.add(T0, T1)
        fd.add_output(T2)

    with FusionDefinition() as fd:
        nvfuser_fusion_id0(fd)

    inputs = [
        torch.testing.make_tensor((1, 2), dtype=torch.float32, device='cuda:0'),
    ]
|
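The mismatch described above can be caught before calling `fd.execute` with a small pre-flight check. This is a minimal sketch in plain Python that needs neither nvFuser nor a GPU. `find_input_mismatches` is a hypothetical name, and the shapes are copied by hand from the two `define_tensor` calls in the repro rather than read from a `FusionDefinition`.

```python
from typing import List, Sequence


def find_input_mismatches(
    defined_shapes: Sequence[Sequence[int]],
    input_shapes: Sequence[Sequence[int]],
) -> List[str]:
    """Compare symbolic shapes (-1 means dynamic) against concrete input shapes."""
    problems: List[str] = []
    if len(defined_shapes) != len(input_shapes):
        problems.append(
            f"fusion defines {len(defined_shapes)} tensor(s) but "
            f"{len(input_shapes)} input(s) were supplied"
        )
    # Check the pairs that do exist, even when the counts differ.
    for index, (symbolic, concrete) in enumerate(zip(defined_shapes, input_shapes)):
        if len(symbolic) != len(concrete):
            problems.append(f"T{index}: rank {len(concrete)}, expected {len(symbolic)}")
            continue
        for dim, (expected, actual) in enumerate(zip(symbolic, concrete)):
            if expected != -1 and expected != actual:
                problems.append(f"T{index}: dim {dim} is {actual}, expected {expected}")
    return problems


# Shapes of T0 and T1 from the repro.
defined = [[1, -1], [1, 2]]
print(find_input_mismatches(defined, [(1, 2)]))          # one input, as in the repro
print(find_input_mismatches(defined, [(1, 2), (1, 2)]))  # one input per defined tensor
```

The first call reports the count mismatch; the second returns an empty list because each defined tensor has a compatible input.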
Jingyue is trying to extend the test to valid test cases. Came across this during multi-gpu testing. |
The |
Context: https://github.com/NVIDIA/Fuser/pull/3222/files#diff-577ed6d3703dbc615028823a5113fdef10881ffb1247b9a79c7f17270650124fR11-R14
To repro, patch b0ccb48 and run
cc @jjsjann123 and @rdspring1