update #2533
Closed
Conversation
Summary: This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266.

Reverts
* pytorch/pytorch#135503
* pytorch/pytorch#135502
* pytorch/pytorch#135422

With this revert, the test below passes. Earlier, the getitem would stay as a getitem in the FX graph; but now fake tensor propagation fails, saying that `.item()` is called. It seems that the torch function is not getting triggered during fake tensor propagation.

```
import torch
from torch.nn.attention.flex_attention import BlockMask, _mask_mod_signature, _score_mod_signature, flex_attention
from torch._inductor.lowering import make_pointwise, register_lowering
from torch._inductor.virtualized import ops
from torch.nn.attention.flex_attention import create_block_mask

torch.set_default_device('cuda')

flex_attention = torch.compile(flex_attention, dynamic=False)

prefix_lengths = torch.arange(8)

def prefix_lm(b, h, q, kv):
    return prefix_lengths[b] >= kv

mask = create_block_mask(prefix_lm, 8, None, 512, 512, _compile=True)
```

X-link: pytorch/pytorch#136590
Approved by: https://github.com/Chillee
Reviewed By: atalman
Differential Revision: D63431470
Pulled By: anijain2305
fbshipit-source-id: 60915b30336121b845af71f423582c22a6c65c3f

Summary: Add a new metric, `--metric nsys`, to collect an nsys trace.

Reviewed By: htyu
Differential Revision: D63274918
fbshipit-source-id: 0536310df6290ea5f5a02d85cc0ad6d342d45dbd

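For illustration, a hypothetical invocation of the new metric (assuming the `run_benchmark.py triton` entry point shown elsewhere in this log and an existing `gemm` operator):

```
python run_benchmark.py triton --op gemm --metric nsys
```
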
Summary: Pull Request resolved: #2473

Reviewed By: xuzhao9
Differential Revision: D63543625
Pulled By: bertmaher
fbshipit-source-id: 1693e15875544bda0f5f6c69daa5597fffd80509

Summary: Pull Request resolved: #2475

Reviewed By: htyu
Differential Revision: D63653081
Pulled By: xuzhao9
fbshipit-source-id: 8d840986779b6124cbccc2425c24e2b892d55ce4

Summary: We had the imports wrong for the internal port.

Reviewed By: xuzhao9, adamomainz
Differential Revision: D63643617
fbshipit-source-id: 04a49d419fede71d2681dedbfb55112a67cb4d55

Summary: We have an old Triton internally that doesn't have the cublasLt bindings.

Reviewed By: adamomainz
Differential Revision: D63643619
fbshipit-source-id: 39aece74b52f7747fe2100d7bb905bad49ba1fa0

Summary: X-link: facebookresearch/FBGEMM#301
X-link: pytorch/FBGEMM#3202

Printing warnings to stdout mucks up the output of various tools/benchmarks.

Reviewed By: xuzhao9, htyu
Differential Revision: D63643615
fbshipit-source-id: 1f34508a7fd36f5aa421e11bddd5ce77fc13038a

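A minimal sketch of the general pattern such a fix uses (not the FBGEMM diff itself; the message strings here are made up): route diagnostics to stderr so stdout stays machine-parseable.

```
import sys

def warn(msg: str) -> None:
    # Write warnings to stderr so harnesses reading stdout see only results.
    print(f"WARNING: {msg}", file=sys.stderr)

warn("cuBLASLt bindings unavailable; falling back")  # hypothetical warning
print("gemm_latency_ms=0.42")                        # stdout stays clean for tools
```
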
Summary: FBGEMM has changed how it declares its Cutlass-based blockwise gemm.

Reviewed By: htyu, sijiac, adamomainz
Differential Revision: D63643618
fbshipit-source-id: e46e3bbd2e07be0653f7c7fa6bd080b6c8db171e

Summary: We have a big list of interesting shapes for blockwise/rowwise scaled gemm, many of which are variants of llama. We might want to use them for gemm and fp8_gemm (unscaled) as well, but for now they apply only to blockwise/rowwise.

Reviewed By: xuzhao9, adamomainz
Differential Revision: D63643616
fbshipit-source-id: 328961fe8c91e66428fcd1e5b72c89813f58a5a3

Summary: We were only benchmarking `row-major x row-major` gemms (also called `TT` or `transpose-transpose`, because FORTRAN), which is actually not the common case; `nn.Linear` will use column-major layouts for weights, which means `TN` is actually much more common.

Reviewed By: adamomainz
Differential Revision: D63714661
fbshipit-source-id: 735c25c59ddeb6596afd9b19f463af92036a830b

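To make the layout point concrete, here is a minimal sketch (illustrative, not Tritonbench code) of why `nn.Linear` produces a TN gemm:

```
import torch
import torch.nn as nn

M, K, N = 128, 256, 64
x = torch.randn(M, K)                  # row-major activations
linear = nn.Linear(K, N, bias=False)   # weight is stored as (N, K), i.e. (out, in)

# F.linear computes x @ weight.T: the second operand is transposed,
# which is the "TN" case, not the row-major x row-major "TT" case.
y = linear(x)
assert y.shape == (M, N)
```
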
Summary: Pull Request resolved: #2483

Reviewed By: karthik-man
Differential Revision: D63726031
fbshipit-source-id: dc410e503f918d83362fb38005ac4a6db5dc1e68

Summary: Right now, Tritonbench still shares a codebase with Torchbench. Skip the Torchbench tests when the PR only touches Tritonbench paths.

Pull Request resolved: #2481
Reviewed By: kit1980
Differential Revision: D63695702
Pulled By: xuzhao9
fbshipit-source-id: cc88e0a987ecca1daf09d35ddeca18f07bef9077

…ging (#137139)

Summary: X-link: pytorch/pytorch#137139
Approved by: https://github.com/ezyang

Reviewed By: PaliC
Differential Revision: D63783497
Pulled By: jovianjaison
fbshipit-source-id: 5abe70d558917a9807e33be8181d42ef240c5a95

Summary: Pull Request resolved: #2484
X-link: pytorch/FBGEMM#3212
X-link: facebookresearch/FBGEMM#308

The triton_rowwise persistent kernel performs poorly on MI300 compared to the non-persistent kernel, when both are run with exhaustive AMD-specific tuning.

Reviewed By: htyu
Differential Revision: D63741099
fbshipit-source-id: c276415ddf8f5d24ffeba70b8ee6493011b393e1

Summary: Bump the transformers version to enable Liger kernels.

Pull Request resolved: #2488
Reviewed By: FindHao
Differential Revision: D63860019
Pulled By: xuzhao9
fbshipit-source-id: f607c5553169c61270e4f5271d8375d7f227bd82

Summary: Allow users to benchmark multiple ops in a single run. The ops are separated by commas, e.g. `--op fp8_gemm,addmm`.

Example output:

```
% python run_benchmark.py triton --op fp8_gemm,addmm --num-inputs 1
100%|██████████| 1/1 [00:03<00:00, 3.12s/it]
x_val               torch_fp8_gemm-gbps  torch_fp8_gemm-gbps  torch_fp8_gemm-latency  torch_fp8_gemm-tflops  triton_fp8_gemm-gbps  triton_fp8_gemm-gbps  triton_fp8_gemm-latency  triton_fp8_gemm-tflops
------------------  -------------------  -------------------  ----------------------  ---------------------  --------------------  --------------------  -----------------------  ----------------------
(1024, 1024, 1024)  462.202              462.202              0.00907462              236.647                630.43                630.43                0.00665309               322.78
100%|██████████| 1/1 [00:05<00:00, 5.90s/it]
(M, N, K)           aten_addmm-best_config  aten_addmm-gbps  aten_addmm-tflops  triton_addmm-best_config                                                                                       triton_addmm-gbps  triton_addmm-tflops  pt2_triton_matmul-best_config  pt2_triton_matmul-gbps  pt2_triton_matmul-tflops
------------------  ----------------------  ---------------  -----------------  -------------------------------------------------------------------------------------------------------------  -----------------  -------------------  -----------------------------  ----------------------  ------------------------
(20120, 512, 1536)                          818.112          247.544            {'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 64, 'GROUP_M': 8, 'num_warps': 8, 'num_ctas': 1, 'num_stages': 3}  911.569            275.823                                             889.125                 269.031
```

Pull Request resolved: #2490
Reviewed By: xuzhao9
Differential Revision: D63862548
Pulled By: FindHao
fbshipit-source-id: 9d4afa6051d4191bc2e3288f59e2820627647b91

Summary: As discussed in pytorch/pytorch#136168, I'm going to migrate implementations of operator benchmarking. This PR adds different implementations for FusedLinearCrossEntropy as a starting example.

Execution command:

```
python run_benchmark.py triton --op FusedLinearCrossEntropy
```

Example output:

```
x_val    LMHeadCE-latency    LigerLMHeadCE-latency    inductor_fused_linear_cross_entropy-latency
-------  ------------------  -----------------------  ---------------------------------------------
0        98.0041             389.87                   95.0412
1        196.12              652.619                  193.219
2        417.242             1248.75                  416.725
3        824.906             2356.25                  809.56
```

Pull Request resolved: #2485
Reviewed By: xuzhao9
Differential Revision: D63859871
Pulled By: FindHao
fbshipit-source-id: 4b73a2144702c1f8f3ae5ed15e76112d03f12b87

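For readers unfamiliar with the operator, a minimal sketch of the baseline "LM head + cross entropy" pattern being benchmarked (names and sizes here are illustrative, not the Tritonbench implementation):

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMHeadCE(nn.Module):
    """Unfused baseline: project hidden states to vocab logits, then take the loss."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        logits = self.lm_head(hidden)           # materializes a (batch, vocab) logits tensor
        return F.cross_entropy(logits, labels)  # fused variants avoid materializing logits

model = LMHeadCE(hidden_size=512, vocab_size=32000)
hidden = torch.randn(8, 512)
labels = torch.randint(0, 32000, (8,))
loss = model(hidden, labels)
```
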
Summary: Pull Request resolved: #2489

Reviewed By: xuzhao9
Differential Revision: D63898689
Pulled By: atalman
fbshipit-source-id: 3cd430911aadd5972f1393e3548ef7d52b93b661

Summary: Remove nvidia-cuda-nvcc-cu12, as it is not required; this also saves install time.

Pull Request resolved: #2493
Reviewed By: xuzhao9
Differential Revision: D63987509
Pulled By: atalman
fbshipit-source-id: 07298ddb569da7f7c3fe22d73da72a4ceab256f5

Summary: Add a PR CI job on Tritonbench that installs the latest Triton nightly package.

Pull Request resolved: #2494
Reviewed By: chenyang78
Differential Revision: D63998525
Pulled By: xuzhao9
fbshipit-source-id: a26633de040bdf324e9ae5c9b130ec1a58dfd409

Summary: X-link: pytorch/pytorch#137431

Log the current compilation id for all relevant samples for these two tables, so we can have a 1:1 analog with dynamo_compile.

ghstack-source-id: 246618821
exported-using-ghexport

Reviewed By: oulgen
Differential Revision: D63900826
fbshipit-source-id: 3f2896287777c94344960e7cad131f71aaf0210f

Summary: This PR implements tracing of `with` contexts for TorchFunction modes that have the default enter/exit behavior (i.e. pushing/popping the mode).

Typically, the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

Resume function structure:
1. enter context
2. jump
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side-effects bytecode would still be on the stack).

So for torch function modes, the structure of our output code is this:
1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception, restore stack
5. resume function

Then our resume function looks like this:
1. no-op enter torch function mode
2. jump
3. exit tf mode

To implement the no-op enter of the torch function mode, I added a torch function mode in polyfill which enters as a no-op but exits normally. This is needed because we still want to trace the `with` context in the resume function and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally, at a graph break we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as-is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack and not update the true Python torch function mode stack with the suffix bytecode.

All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side-effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.

Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
X-link: pytorch/pytorch#137114
Approved by: https://github.com/yanboliang
Reviewed By: jovianjaison
Differential Revision: D64088005
Pulled By: mlazos
fbshipit-source-id: 156b9bf38a535933f8dd966ee96ed3099d7b4be2

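For context, a minimal sketch (not from this PR) of a TorchFunction mode with the default enter/exit behavior, of the kind this change lets dynamo trace through around a graph break:

```
import torch
from torch.overrides import TorchFunctionMode

class LoggingMode(TorchFunctionMode):
    """A torch function mode with default enter/exit (push/pop) behavior."""

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Intercept every torch function call made while the mode is active.
        return func(*args, **kwargs)

def fn(x):
    with LoggingMode():              # entering pushes the mode onto the stack
        y = torch.sin(x)
        torch._dynamo.graph_break()  # exercises the handling described above
        return y + 1                 # exiting pops the mode

compiled = torch.compile(fn)
compiled(torch.randn(4))
```
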
Summary: Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
X-link: pytorch/pytorch#137115
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114

Reviewed By: jovianjaison
Differential Revision: D64088016
Pulled By: mlazos
fbshipit-source-id: 53efb5a6e689d4fb6112a6462851ee7e81b28c24

…s (#137119)

Summary: X-link: pytorch/pytorch#137119
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227

Reviewed By: jovianjaison
Differential Revision: D64088048
Pulled By: mlazos
fbshipit-source-id: 34fe09f7fa6292d89a438b780852f00e042ec950

Summary: Adding new configs for servicelab, plus logging to Scuba. A follow-up diff is coming to add aggregates (i.e. harmonic mean) into the logging.

Reviewed By: xuzhao9
Differential Revision: D64126688
fbshipit-source-id: 0c3705e82071f1399cfc53ff496d130adf237b73

Summary: On systems without dyno or dcgm installed and running without sudo, the `ncu_rep` metric will get stuck asking for a sudo password. This PR checks that the command or service exists before disabling it, to avoid getting stuck.

Pull Request resolved: #2496
Reviewed By: xuzhao9
Differential Revision: D64141793
Pulled By: FindHao
fbshipit-source-id: 8d52468f04e7e5a0e8d23f3562a14c83d4a5934c

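A minimal sketch of the guard pattern described (assumed helper, not the actual PR code): check that the tool exists before invoking it, and use `sudo -n` so sudo fails immediately instead of prompting for a password.

```
import shutil
import subprocess

def maybe_stop_service(name: str) -> None:
    # Only attempt to stop the service if systemctl is actually present;
    # `sudo -n` is non-interactive and errors out rather than blocking on a prompt.
    if shutil.which("systemctl") is not None and shutil.which("sudo") is not None:
        subprocess.run(["sudo", "-n", "systemctl", "stop", name], check=False)

maybe_stop_service("dcgm")  # hypothetical service name
```
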
Summary: Instead of having one list of configs for OSS CI and another set for servicelab, we combine both here into one common dictionary.

Reviewed By: danzimm
Differential Revision: D64183688
fbshipit-source-id: fa47780a3bf3ba8669c6e8fd406cff5542fd06e6

Summary: I put an extra `m` by mistake in one of the configs, and it is breaking in OSS.

Reviewed By: plotfi
Differential Revision: D64208508
fbshipit-source-id: f1461da0a5e883ffd4266206f5e3b737f468c3b2