Legion profiler version mismatch in nightly uploads #962

Open
manopapad opened this issue Nov 13, 2024 · 4 comments
@manopapad
Contributor

@syamajala is reporting that the latest nightly upload of the Legate profiler doesn't match the corresponding nightly Legate package. I'd like us to investigate why this is the case, and make sure the packages stay aligned in the future.

These are the versions that Seshu has installed:

legate                    24.09.00.dev329 cuda12_py312_g32137a65_329_gex_gpu    legate/label/gex-experimental
legate-mpi-wrapper        1.0                 hf2740f0_15    legate/label/gex-experimental
legate-profiler           24.09.00.dev329      g4ca82553a    legate/label/experimental
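
As a quick way to check whether two packages in a listing like the one above are aligned, here is a minimal Python sketch. It assumes whitespace-separated rows in the order name, version, build, channel, as shown above; the helper names `parse_conda_list` and `versions_match` are hypothetical and are not part of Legate or conda.

```python
def parse_conda_list(text):
    """Map package name -> (version, build, channel) from listing rows."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comment lines, and rows that are too short.
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        name, version, build = fields[:3]
        channel = fields[3] if len(fields) > 3 else ""
        packages[name] = (version, build, channel)
    return packages


def versions_match(packages, a="legate", b="legate-profiler"):
    """Return True when both packages report the same version string."""
    return packages[a][0] == packages[b][0]


# The rows reported in this issue.
LISTING = """\
legate                    24.09.00.dev329 cuda12_py312_g32137a65_329_gex_gpu    legate/label/gex-experimental
legate-mpi-wrapper        1.0                 hf2740f0_15    legate/label/gex-experimental
legate-profiler           24.09.00.dev329      g4ca82553a    legate/label/experimental
"""

pkgs = parse_conda_list(LISTING)
print("versions match:", versions_match(pkgs))
print("channels:", pkgs["legate"][2], "vs", pkgs["legate-profiler"][2])
```

For the listing above, the version strings are identical while the channel labels differ (`gex-experimental` versus `experimental`). Whether that difference matters is not established in this thread.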

This is the output he sees:

Calculating critical paths
Created output directory "legion_prof"
Writing level 0 with 1 tiles
Writing level 1 with 4 tiles
Writing level 2 with 16 tiles
Writing level 3 with 64 tiles
thread '<unnamed>' panicked at src/state.rs:1086:13:
assertion failed: creation_time < entry.time_range.stop.unwrap()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Rayon: detected unexpected panic; aborting
Aborted

@eddy16112 was able to view the corresponding profiles using the latest Legion master branch.

@eddy16112

I don't think the issue is a version mismatch; it looks related to timing skew. Here is what I have seen:

Detected timing skew! Legion Prof found 204434 messages between nodes that appear to have been sent before the (meta-)task on the creating node started (which is clearly impossible because messages can't time-travel into the future). The average skew was at least 2116.69 us. Please report this case to the Legion developers along with an accompanying Legion Prof profile and a description of the machine it was run on so we can understand why the timing skew is occuring. In the meantime you can still use this profile to performance debug but you should be aware that the relative position of boxes on different nodes might not be accurate.
Node 0 appears to be 44.649 us behind node 7 for 13206 messages with standard deviation 12.004 us.
Node 1 appears to be 3971.201 us behind node 0 for 13325 messages with standard deviation 515.666 us.
Node 1 appears to be 367.478 us behind node 2 for 5927 messages with standard deviation 50.813 us.
Node 1 appears to be 336.875 us behind node 3 for 5881 messages with standard deviation 46.882 us.
Node 1 appears to be 377.175 us behind node 4 for 5961 messages with standard deviation 59.448 us.
Node 1 appears to be 1194.308 us behind node 5 for 6084 messages with standard deviation 180.450 us.
Node 1 appears to be 1825.849 us behind node 6 for 5831 messages with standard deviation 227.081 us.
Node 1 appears to be 4081.851 us behind node 7 for 5807 messages with standard deviation 465.097 us.
Node 2 appears to be 3587.756 us behind node 0 for 12929 messages with standard deviation 467.587 us.
Node 2 appears to be 3.908 us behind node 4 for 2139 messages with standard deviation 2.307 us.
Node 2 appears to be 810.340 us behind node 5 for 5084 messages with standard deviation 129.069 us.
Node 2 appears to be 1442.449 us behind node 6 for 5381 messages with standard deviation 182.415 us.
Node 2 appears to be 3705.827 us behind node 7 for 4925 messages with standard deviation 427.631 us.
Node 3 appears to be 3600.802 us behind node 0 for 12797 messages with standard deviation 470.266 us.
Node 3 appears to be 13.180 us behind node 2 for 4218 messages with standard deviation 5.290 us.
Node 3 appears to be 25.920 us behind node 4 for 5028 messages with standard deviation 8.587 us.
Node 3 appears to be 833.804 us behind node 5 for 5257 messages with standard deviation 131.616 us.
Node 3 appears to be 1466.499 us behind node 6 for 5211 messages with standard deviation 183.283 us.
Node 3 appears to be 3702.921 us behind node 7 for 5069 messages with standard deviation 440.875 us.
Node 4 appears to be 3580.962 us behind node 0 for 13928 messages with standard deviation 455.160 us.
Node 4 appears to be 804.760 us behind node 5 for 6232 messages with standard deviation 112.328 us.
Node 4 appears to be 1431.783 us behind node 6 for 5992 messages with standard deviation 174.326 us.
Node 4 appears to be 3697.010 us behind node 7 for 5592 messages with standard deviation 412.148 us.
Node 5 appears to be 2720.753 us behind node 0 for 13041 messages with standard deviation 357.716 us.
Node 5 appears to be 575.987 us behind node 6 for 5850 messages with standard deviation 77.296 us.
Node 5 appears to be 2822.179 us behind node 7 for 5243 messages with standard deviation 328.676 us.
Node 6 appears to be 2122.396 us behind node 0 for 12889 messages with standard deviation 272.582 us.
Node 6 appears to be 2220.851 us behind node 7 for 5601 messages with standard deviation 262.320 us.

In this case, the assertion is a false alarm, but I do not know why it only happens in archive mode.
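
To make a skew report like the one above easier to scan, the per-pair lines can be summarized with a short script. This is a minimal Python sketch, not part of Legion Prof: the regular expression is based only on the line format shown in the log above, and `parse_skew` and `worst_pair` are hypothetical helper names.

```python
import re

# Matches lines such as:
# "Node 1 appears to be 3971.201 us behind node 0 for 13325 messages
#  with standard deviation 515.666 us."
SKEW_RE = re.compile(
    r"Node (\d+) appears to be ([\d.]+) us behind node (\d+) "
    r"for (\d+) messages with standard deviation ([\d.]+) us\."
)


def parse_skew(text):
    """Return (behind_node, ahead_node, skew_us, messages, stddev_us) tuples."""
    rows = []
    for m in SKEW_RE.finditer(text):
        behind, skew, ahead, count, std = m.groups()
        rows.append((int(behind), int(ahead), float(skew), int(count), float(std)))
    return rows


def worst_pair(rows):
    """Return the row with the largest reported mean skew."""
    return max(rows, key=lambda r: r[2])


# A few lines copied from the report above.
SAMPLE = """\
Node 0 appears to be 44.649 us behind node 7 for 13206 messages with standard deviation 12.004 us.
Node 1 appears to be 3971.201 us behind node 0 for 13325 messages with standard deviation 515.666 us.
Node 1 appears to be 4081.851 us behind node 7 for 5807 messages with standard deviation 465.097 us.
"""

rows = parse_skew(SAMPLE)
behind, ahead, skew_us, _, _ = worst_pair(rows)
print(f"largest skew: node {behind} behind node {ahead} by {skew_us} us")
```

Running it over the full report would show which node pairs dominate the skew. It only summarizes the warning text and says nothing about the cause of the assertion failure.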

@lightsighter

The timing skew warning will not result in an assertion. The assertion is a real problem.

@syamajala

This issue does not seem to be there with 25.01.00.

@manopapad
Contributor Author

@syamajala could you please provide a small test program where, if you run it with legate 24.09.00.dev329, the resulting profile cannot be read with legate-profiler 24.09.00.dev329? Then @mag1cp1n can investigate further.
