[DO NOT MERGE] Test codediff in CI #3311

Draft
wants to merge 2 commits into
base: main

jacobhinkle
Collaborator

No description provided.

@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle
Collaborator Author

The jit_codegen_diff_17_3/7 error is the following diff in kernel 5 of tests/python/test_ops.py::test_correctness_var_mean_float64:

--- d88dcba8
+++ 5faefc04
@@ -1,6 +1,6 @@
-__global__ void nvfuser_N(Tensor<double, 4, 4> T0, Tensor<double, 2, 2> T8, Tensor<double, 2, 2> T7) {
+__global__ void nvfuser_N(Tensor<double, 4, 4> T0, Tensor<double, 2, 2> T7, Tensor<double, 2, 2> T8) {
   alignas(16) extern __shared__ char array[];
   void* shared_mem = array;
   NVFUSER_DEFINE_MAGIC_ZERO;
   Tensor<double, 4, 4> s0;
   s0.data = T0.data;
@@ -93,11 +93,11 @@
   double T1[1];
   T1[0]
     = T9[0]
     / d28;
   if (b18) {
-    T8[i15]
+    T7[i15]
        = T1[0];
   }
   double T10[1];
   broadcast::blockBroadcast<false, true, false, true>(T10[0], T1[0], static_cast<double*>(shared_mem), true);
   double T2[1];
@@ -125,9 +125,9 @@
   double T4[1];
   T4[0]
     = T2[0]
     * d26;
   if (b18) {
-    T7[i15]
+    T8[i15]
        = T4[0];
   }
 }

This doesn't look serde-related. It might be a sign of non-determinism though. I'm investigating...
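In the diff above, the only change appears to be that T7 and T8 trade places consistently, both in the parameter list and in the two stores. A minimal sketch of how one might test whether two kernel dumps differ only by such a renaming is shown below. The helper names and the shortened kernel strings are hypothetical and are not part of nvFuser or its tooling; the regex assumes tensors are always named `T<number>`.

```python
import re

# Assumption: tensor names look like T0, T7, T10. "Tensor<...>" is not matched
# because the pattern requires digits immediately after the "T".
_TENSOR_NAME = re.compile(r"\bT(\d+)\b")


def canonicalize_tensor_names(kernel_src):
    """Rename tensors to C0, C1, ... in order of first appearance."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = "C{}".format(len(mapping))
        return mapping[name]

    return _TENSOR_NAME.sub(rename, kernel_src)


def equal_up_to_renaming(a, b):
    """True if a and b are identical after a consistent tensor renaming."""
    return canonicalize_tensor_names(a) == canonicalize_tensor_names(b)


# Shortened, hypothetical stand-ins for the two sides of the diff.
OLD_KERNEL = (
    "__global__ void nvfuser_N(Tensor<double, 4, 4> T0, "
    "Tensor<double, 2, 2> T8, Tensor<double, 2, 2> T7) {\n"
    "  T8[i15] = T1[0];\n"
    "  T7[i15] = T4[0];\n"
    "}\n"
)
NEW_KERNEL = (
    "__global__ void nvfuser_N(Tensor<double, 4, 4> T0, "
    "Tensor<double, 2, 2> T7, Tensor<double, 2, 2> T8) {\n"
    "  T7[i15] = T1[0];\n"
    "  T8[i15] = T4[0];\n"
    "}\n"
)

print(equal_up_to_renaming(OLD_KERNEL, NEW_KERNEL))
```

If a check like this passes on the full kernels, the two versions would compute the same thing and only the numbering of the output tensors would differ between runs.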

@jacobhinkle
Collaborator Author

jacobhinkle commented Oct 30, 2024

This doesn't look serde-related. It might be a sign of non-determinism though. I'm investigating...

Actually it does seem that I can repro this locally on main. Using the script from #3312, I see the following:

rm -rf /tmp/nvfuser_kernel_db
tools/check_determinism.sh -- pytest -vs tests/python/test_ops.py::test_correctness_var_mean_float64
# This fails with a message like
# 10845c10845
# < __global__ void nvfuser_inner_persistent_f7_c1_r0_g2(Tensor<double, 1, 1> T0, Tensor<double, 0, 0> T8, Tensor<double, 0, 0> T7) {
# ---
# > __global__ void nvfuser_inner_persistent_f7_c1_r0_g2(Tensor<double, 1, 1> T0, Tensor<double, 0, 0> T7, Tensor<double, 0, 0> T8) {
# 10897c10897
# <     T8[0]
# ---
# >     T7[0]
# 10923c10923
# <     T7[0]
# ---
# >     T8[0]
# Diff of __tmp_kernel_inner_persistent_f7_c1_r0_g2.cu from rep 1 to rep 5 (above) is non-zero


rm -rf /tmp/nvfuser_kernel_db
export DEBUG_SERDE=disable
tools/check_determinism.sh -- pytest -vs tests/python/test_ops.py::test_correctness_var_mean_float64
# This succeeds

So it looks like there may be non-determinism in deserialization, and we are also still not properly disabling serde in the CI diff jobs. I don't understand how that can be, since we have this line:

export DEBUG_SERDE=disable
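For readers without the script from #3312 at hand, the kind of check used in the repro above can be sketched roughly as follows. This is not the implementation of tools/check_determinism.sh, whose contents are not shown here; `find_nondeterminism` and the `flaky` generator are hypothetical names used only for illustration.

```python
import difflib


def find_nondeterminism(generate, reps=5):
    """Call generate() `reps` times and compare each output to the first.

    Returns (rep, unified_diff) for the first repetition whose output
    differs from repetition 1, or None if all outputs match.
    """
    baseline = generate()
    for rep in range(2, reps + 1):
        current = generate()
        if current != baseline:
            diff = "\n".join(
                difflib.unified_diff(
                    baseline.splitlines(),
                    current.splitlines(),
                    fromfile="rep 1",
                    tofile="rep {}".format(rep),
                    lineterm="",
                )
            )
            return rep, diff
    return None


# Hypothetical generator that swaps two parameter names on its third call.
_calls = {"n": 0}


def flaky():
    _calls["n"] += 1
    return "T7, T8" if _calls["n"] == 3 else "T8, T7"


result = find_nondeterminism(flaky, reps=5)
print(result[0])
print(result[1])
```

In practice `generate` would wrap whatever produces the kernel text, for example running the test in a subprocess and reading the dumped .cu file. Those details are left out because they depend on the local setup.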

cc @rdspring1 @xwang233

@liqiangxl
Collaborator

!build !test --diff-bench --diff

@jacobhinkle
Collaborator Author

!test --diff-bench --diff
