Add DDPWrapper #2479
Commits on Sep 26, 2024
Revert "Trace enter/exit of TorchFunctionModes (#135422)" (#136590)
Summary: This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266. Reverts:
* pytorch/pytorch#135503
* pytorch/pytorch#135502
* pytorch/pytorch#135422

This makes the following test pass. Earlier, the getitem would stay a getitem in the FX graph, but now fake tensor propagation fails, saying that .item() is called. It seems that the torch function is not getting triggered during fake tensor propagation.

```
import torch
from torch.nn.attention.flex_attention import BlockMask, _mask_mod_signature, _score_mod_signature, flex_attention
from torch._inductor.lowering import make_pointwise, register_lowering
from torch._inductor.virtualized import ops
from torch.nn.attention.flex_attention import create_block_mask

torch.set_default_device('cuda')

flex_attention = torch.compile(flex_attention, dynamic=False)

prefix_lengths = torch.arange(8)
def prefix_lm(b, h, q, kv):
    return prefix_lengths[b] >= kv

mask = create_block_mask(prefix_lm, 8, None, 512, 512, _compile=True)
```

X-link: pytorch/pytorch#136590 Approved by: https://github.com/Chillee Reviewed By: atalman Differential Revision: D63431470 Pulled By: anijain2305 fbshipit-source-id: 60915b30336121b845af71f423582c22a6c65c3f
Summary: Add new metric `--metric nsys` to collect nsys trace. Reviewed By: htyu Differential Revision: D63274918 fbshipit-source-id: 0536310df6290ea5f5a02d85cc0ad6d342d45dbd
Commits on Sep 28, 2024
Fix bug pytorch#2458 (pytorch#2459)
Summary: pytorch#2458 Pull Request resolved: pytorch#2459 Reviewed By: xuzhao9 Differential Revision: D63476542 Pulled By: kit1980 fbshipit-source-id: 01e9db9cb03d34e82a773897417df2ccda410634
Restore FlexAttention and FlashV3 backward (pytorch#2473)
Summary: Pull Request resolved: pytorch#2473 Reviewed By: xuzhao9 Differential Revision: D63543625 Pulled By: bertmaher fbshipit-source-id: 1693e15875544bda0f5f6c69daa5597fffd80509
Commits on Oct 1, 2024
Fix hardcoded shape in low_mem_dropout benchmark (pytorch#2475)
Summary: Pull Request resolved: pytorch#2475 Reviewed By: htyu Differential Revision: D63653081 Pulled By: xuzhao9 fbshipit-source-id: 8d840986779b6124cbccc2425c24e2b892d55ce4
Summary: We had the imports wrong for the internal port. Reviewed By: xuzhao9, adamomainz Differential Revision: D63643617 fbshipit-source-id: 04a49d419fede71d2681dedbfb55112a67cb4d55
Skip loading triton.nvidia.cublas if not found
Summary: We have an old triton internally that doesn't have the cublasLt bindings Reviewed By: adamomainz Differential Revision: D63643619 fbshipit-source-id: 39aece74b52f7747fe2100d7bb905bad49ba1fa0
Print TMA benchmark info to stderr
Summary: X-link: facebookresearch/FBGEMM#301 X-link: pytorch/FBGEMM#3202 Printing warnings to stdout mucks up the output of various tools/benchmarks Reviewed By: xuzhao9, htyu Differential Revision: D63643615 fbshipit-source-id: 1f34508a7fd36f5aa421e11bddd5ce77fc13038a
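For illustration, the pattern applied here is simply routing diagnostics to stderr so that stdout stays machine-parseable; a minimal sketch (the message text is made up):

```python
import sys

# Benchmark results go to stdout; warnings and informational notices go to stderr,
# so tools that parse stdout are not polluted.
print("TMA benchmarks enabled", file=sys.stderr)
```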
Modernize cutlass call for fp8 blockwise
Summary: FBGEMM has changed how it declares its Cutlass-based blockwise gemm. Reviewed By: htyu, sijiac, adamomainz Differential Revision: D63643618 fbshipit-source-id: e46e3bbd2e07be0653f7c7fa6bd080b6c8db171e
CSV of extra shapes for gemm benchmarks
Summary: We have a big list of interesting shapes for blockwise/rowwise scaled gemm. A lot of these are variants of llama. We might want to use them for gemm and fp8_gemm (unscaled) as well, but for now just do blockwise/rowwise Reviewed By: xuzhao9, adamomainz Differential Revision: D63643616 fbshipit-source-id: 328961fe8c91e66428fcd1e5b72c89813f58a5a3
Summary: We were only benchmarking `row-major x row-major` gemms (also called `TT` or `transpose-transpose`, because FORTRAN), which is actually not the common case; `nn.Linear` will use column-major layouts for weights, which means `TN` is actually much more common. Reviewed By: adamomainz Differential Revision: D63714661 fbshipit-source-id: 735c25c59ddeb6596afd9b19f463af92036a830b
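A small PyTorch sketch (not from the commit) of why `nn.Linear` yields the `TN` case: the weight is stored as an `(out_features, in_features)` row-major tensor, so the forward pass multiplies by its transpose.

```python
import torch

M, K, N = 64, 128, 32
x = torch.randn(M, K)                  # activations, row-major

lin = torch.nn.Linear(K, N, bias=False)
w = lin.weight                         # shape (N, K), row-major storage

y_tn = x @ w.t()                       # what nn.Linear computes: the common "TN" layout
y_tt = x @ torch.randn(K, N)           # plain row-major x row-major ("TT"), the less common case
print(y_tn.shape, y_tt.shape)          # both torch.Size([64, 32])
```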
Commits on Oct 2, 2024
Enable fp8 rowwise on AMDGPU (pytorch#2483)
Summary: Pull Request resolved: pytorch#2483 Reviewed By: karthik-man Differential Revision: D63726031 fbshipit-source-id: dc410e503f918d83362fb38005ac4a6db5dc1e68
Ignore Torchbench CI on Tritonbench paths (pytorch#2481)
Summary: Right now, Tritonbench is still sharing codebase with Torchbench. Skip the Torchbench tests when the PR is on Tritonbench paths. Pull Request resolved: pytorch#2481 Reviewed By: kit1980 Differential Revision: D63695702 Pulled By: xuzhao9 fbshipit-source-id: cc88e0a987ecca1daf09d35ddeca18f07bef9077
Commits on Oct 3, 2024
Add _dynamo.config inline_inbuilt_nn_modules and specialize_float logging (#137139)
Summary: X-link: pytorch/pytorch#137139 Approved by: https://github.com/ezyang Reviewed By: PaliC Differential Revision: D63783497 Pulled By: jovianjaison fbshipit-source-id: 5abe70d558917a9807e33be8181d42ef240c5a95
Add non-persistent fp8 triton_rowwise kernel (pytorch#2484)
Summary: Pull Request resolved: pytorch#2484 X-link: pytorch/FBGEMM#3212 X-link: facebookresearch/FBGEMM#308 triton_rowwise persistent kernel performs poorly on MI300 compared to the non-persistent kernel, when both are run with exhaustive AMD-specific tuning. Reviewed By: htyu Differential Revision: D63741099 fbshipit-source-id: c276415ddf8f5d24ffeba70b8ee6493011b393e1
Commits on Oct 4, 2024
Bump transformer version (pytorch#2488)
Summary: Bump transformer version to enable liger-kernels Pull Request resolved: pytorch#2488 Reviewed By: FindHao Differential Revision: D63860019 Pulled By: xuzhao9 fbshipit-source-id: f607c5553169c61270e4f5271d8375d7f227bd82
Add multiple ops support for --op argument (pytorch#2490)
Summary: Allow users to benchmark multiple ops in a single run. The ops are separated by commas, e.g. `--op fp8_gemm,addmm`. Example output:
```
% python run_benchmark.py triton --op fp8_gemm,addmm --num-inputs 1
100%|██████████| 1/1 [00:03<00:00, 3.12s/it]
x_val               torch_fp8_gemm-gbps  torch_fp8_gemm-gbps  torch_fp8_gemm-latency  torch_fp8_gemm-tflops  triton_fp8_gemm-gbps  triton_fp8_gemm-gbps  triton_fp8_gemm-latency  triton_fp8_gemm-tflops
------------------  -------------------  -------------------  ----------------------  ---------------------  --------------------  --------------------  -----------------------  ----------------------
(1024, 1024, 1024)  462.202              462.202              0.00907462              236.647                630.43                630.43                0.00665309               322.78
100%|██████████| 1/1 [00:05<00:00, 5.90s/it]
(M, N, K)           aten_addmm-best_config  aten_addmm-gbps  aten_addmm-tflops  triton_addmm-best_config  triton_addmm-gbps  triton_addmm-tflops  pt2_triton_matmul-best_config  pt2_triton_matmul-gbps  pt2_triton_matmul-tflops
------------------  ----------------------  ---------------  -----------------  ------------------------  -----------------  -------------------  -----------------------------  ----------------------  ------------------------
(20120, 512, 1536)  818.112  247.544  {'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 64, 'GROUP_M': 8, 'num_warps': 8, 'num_ctas': 1, 'num_stages': 3}  911.569  275.823  889.125  269.031
```
Pull Request resolved: pytorch#2490 Reviewed By: xuzhao9 Differential Revision: D63862548 Pulled By: FindHao fbshipit-source-id: 9d4afa6051d4191bc2e3288f59e2820627647b91
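A hypothetical sketch of how a comma-separated `--op` flag could be expanded into per-operator runs; the argument and function names here are illustrative, not the benchmark's actual API.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--op", type=str, help="comma-separated operator names, e.g. fp8_gemm,addmm")
args, _ = parser.parse_known_args(["--op", "fp8_gemm,addmm"])

ops = [name.strip() for name in args.op.split(",") if name.strip()]
for op_name in ops:
    # placeholder for whatever runs a single operator benchmark and prints its table
    print(f"would benchmark {op_name}")
```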
Add FusedLinearCrossEntropy (pytorch#2485)
Summary: As discussed in pytorch/pytorch#136168, I'm going to migrate implementations of operator benchmarking. This PR adds different implementations for FusedLinearCrossEntropy as a starting example. Execution command:
```
python run_benchmark.py triton --op FusedLinearCrossEntropy
```
Example output:
```
x_val    LMHeadCE-latency    LigerLMHeadCE-latency    inductor_fused_linear_cross_entropy-latency
-------  ------------------  -----------------------  ---------------------------------------------
      0             98.0041                   389.87                                        95.0412
      1             196.12                    652.619                                       193.219
      2             417.242                  1248.75                                        416.725
      3             824.906                  2356.25                                        809.56
```
Pull Request resolved: pytorch#2485 Reviewed By: xuzhao9 Differential Revision: D63859871 Pulled By: FindHao fbshipit-source-id: 4b73a2144702c1f8f3ae5ed15e76112d03f12b87
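For orientation, a minimal unfused reference of what this operator computes (a linear projection into the vocabulary followed by cross entropy); this is only a sketch with made-up shapes, not any of the benchmarked implementations.

```python
import torch
import torch.nn.functional as F

batch, hidden, vocab = 8, 512, 32000
x = torch.randn(batch, hidden)
weight = torch.randn(vocab, hidden)
targets = torch.randint(0, vocab, (batch,))

# Unfused baseline: materializes the full (batch, vocab) logits tensor before the loss.
logits = F.linear(x, weight)
loss = F.cross_entropy(logits, targets)

# A fused kernel avoids materializing the logits, which is what the latency comparison probes.
print(loss.item())
```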
Commits on Oct 5, 2024
Add user release benchmark so that we can run it on pull request (pytorch#2489)
Summary: Pull Request resolved: pytorch#2489 Reviewed By: xuzhao9 Differential Revision: D63898689 Pulled By: atalman fbshipit-source-id: 3cd430911aadd5972f1393e3548ef7d52b93b661
Commits on Oct 7, 2024
Summary: Remove nvidia-cuda-nvcc-cu12, as it is not required, to reduce install time. Pull Request resolved: pytorch#2493 Reviewed By: xuzhao9 Differential Revision: D63987509 Pulled By: atalman fbshipit-source-id: 07298ddb569da7f7c3fe22d73da72a4ceab256f5
Commits on Oct 8, 2024
Add Tritonbench CI (pytorch#2494)
Summary: Add a PR CI on Tritonbench that installs the latest Triton nightly package Pull Request resolved: pytorch#2494 Reviewed By: chenyang78 Differential Revision: D63998525 Pulled By: xuzhao9 fbshipit-source-id: a26633de040bdf324e9ae5c9b130ec1a58dfd409
Log compile ids to pt2_remote_cache and pt2_compile_events
Summary: X-link: pytorch/pytorch#137431 Log the current compilation id for all relevant samples for these two tables, so we can have a 1:1 analog with dynamo_compile. ghstack-source-id: 246618821 exported-using-ghexport Reviewed By: oulgen Differential Revision: D63900826 fbshipit-source-id: 3f2896287777c94344960e7cad131f71aaf0210f
Commits on Oct 9, 2024
Trace enter/exit of TorchFunctionModes (#135422) (#137114)
Summary: This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (i.e. pushing/popping the mode).

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump ...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:
1. graph call
2. mutate tf mode stack to replay mutations
4. unsupported code
5. on exception restore stack
6. resume function

Then our resume fn looks like this:
1. no-op enter torch function mode
2. jump
3. exit tf mode

To implement the no-op enter of the torch function mode, I added a torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode.

All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.

Approved by: https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443, #135444 X-link: pytorch/pytorch#137114 Approved by: https://github.com/yanboliang Reviewed By: jovianjaison Differential Revision: D64088005 Pulled By: mlazos fbshipit-source-id: 156b9bf38a535933f8dd966ee96ed3099d7b4be2
Remove ignored modes workaround (#135502) (#137115)
Summary: Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422 X-link: pytorch/pytorch#137115 Approved by: https://github.com/yanboliang ghstack dependencies: #137114 Reviewed By: jovianjaison Differential Revision: D64088016 Pulled By: mlazos fbshipit-source-id: 53efb5a6e689d4fb6112a6462851ee7e81b28c24
Handle torch function subclass/mode dispatch on generic tensor methods (#137119)
Summary: X-link: pytorch/pytorch#137119 Approved by: https://github.com/williamwen42, https://github.com/anijain2305 ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227 Reviewed By: jovianjaison Differential Revision: D64088048 Pulled By: mlazos fbshipit-source-id: 34fe09f7fa6292d89a438b780852f00e042ec950
adding new configs for servicelab
Summary: Adding new configs for servicelab and logging to scuba. A follow-up diff is coming to add aggregates into the logging (i.e. harmonic mean). Reviewed By: xuzhao9 Differential Revision: D64126688 fbshipit-source-id: 0c3705e82071f1399cfc53ff496d130adf237b73
Improve release benchmark suites with a lower value of epoch (pytorch#2482)
Summary: pytorch#2468 Pull Request resolved: pytorch#2482 Reviewed By: xuzhao9 Differential Revision: D64139543 Pulled By: atalman fbshipit-source-id: 2d030c66d856387b6a2451b26c89fd40e79e0e53
Commits on Oct 10, 2024
Check dyno and dcgm existence before disable them (pytorch#2496)
Summary: On systems without dyno or dcgm installed, running without sudo, the `ncu_rep` metric will get stuck asking for a sudo password. This PR checks whether the command or service exists before disabling it, to avoid getting stuck. Pull Request resolved: pytorch#2496 Reviewed By: xuzhao9 Differential Revision: D64141793 Pulled By: FindHao fbshipit-source-id: 8d52468f04e7e5a0e8d23f3562a14c83d4a5934c
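A sketch of the kind of guard described above, assuming a PATH lookup is enough to decide whether the tool is installed; the command line is illustrative, not the benchmark's actual invocation.

```python
import shutil
import subprocess

def try_disable(cmd: list[str]) -> None:
    # Only run the tool if its binary exists on PATH, so a missing dyno/dcgm
    # install never falls through to an interactive sudo password prompt.
    if shutil.which(cmd[0]) is None:
        return
    subprocess.run(cmd, check=False)

try_disable(["dcgmi", "profile", "--pause"])  # illustrative command only
```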
combining CI and servicelab configs
Summary: Instead of having one list of configs for OSS CI and another set for servicelab, we combine both here into one common dictionary. Reviewed By: danzimm Differential Revision: D64183688 fbshipit-source-id: fa47780a3bf3ba8669c6e8fd406cff5542fd06e6
Summary: An extra `m` was put by mistake in one of the configs and it is breaking OSS. Reviewed By: plotfi Differential Revision: D64208508 fbshipit-source-id: f1461da0a5e883ffd4266206f5e3b737f468c3b2
Commits on Oct 11, 2024
use 3.13 multiline traceback in get_instruction_source_311 (#137617)
Summary: X-link: pytorch/pytorch#137617 Approved by: https://github.com/jansel Reviewed By: jovianjaison Differential Revision: D64202324 Pulled By: williamwen42 fbshipit-source-id: 526f32cabeb891c8c9481799f45436cfd19e7dc2
differentiating between some Fbsource only targets and OSS for CI
Summary: TSIA Reviewed By: danzimm, aakhundov Differential Revision: D64268555 fbshipit-source-id: e380f9401b08c2b7d9a48bedc6d791b9b39cd533
Commits on Oct 12, 2024
Format `.ci/`, `.github/`, `benchmarks/`, `functorch/`, `tools/`, `torchgen/` with `ruff format` (#132577)
Summary: X-link: pytorch/pytorch#132577 Approved by: https://github.com/malfet Reviewed By: jovianjaison Differential Revision: D64256966 fbshipit-source-id: e9725ccc5a814ef3b30e244e988ed9b7238b6ccb
Add AtenOp Benchmarking (pytorch#2495)
Summary: As described in pytorch/pytorch#136168, I'm trying to migrate the native PyTorch implementation comparison ([the original operatorbench](https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/microbenchmarks/operatorbench.py)) to TritonBench. This PR adds an Operator Loader which can load aten ops used in TorchBench, HuggingFace, and TIMM models. The benchmark classes are created dynamically and then benchmarked between the aten and inductor implementations. Files `torchbenchmark/operator_loader/operator_inp_utils.py`, `torchbenchmark/operator_loader/operatorbench.py`, and all config files in `torchbenchmark/operator_loader/operator_inp_logs/` are copied from the original operatorbench. Example command:
```bash
python run_benchmark.py triton --op aten._softmax.default --num-inputs 1 --operator-loader --precision fp16
```
Example output:
```
Evaluating an op name into an OpOverload: The underlying op of 'aten.upsample_nearest2d_backward' has no overload name 'vec'
Evaluating an op name into an OpOverload: '_OpNamespace' 'aten' object has no attribute 'im2col_backward'
Evaluating an op name into an OpOverload: '_OpNamespace' 'aten' object has no attribute 'col2im_backward'
Evaluating an op name into an OpOverload: '_OpNamespace' 'aten' object has no attribute 'im2col_backward'
Evaluating an op name into an OpOverload: The underlying op of 'aten.upsample_bilinear2d_backward' has no overload name 'vec'
Evaluating an op name into an OpOverload: The underlying op of 'aten.upsample_nearest2d_backward' has no overload name 'vec'
100%|██████████| 2/2 [00:02<00:00, 1.20s/it]
x_val    eager-latency    inductor-latency
-------  ---------------  ------------------
      0         0.090592            0.089632
      1         0.055808            0.038112
```
Pull Request resolved: pytorch#2495 Reviewed By: xuzhao9 Differential Revision: D64200358 Pulled By: FindHao fbshipit-source-id: f0121168b33247224bc905a1a88af69e4b13def6
change GPT2ForSequenceClassification inference accuracy tolerance (#136749)
Summary: Fixes pytorch/pytorch#123503. pytorch/pytorch#121866 makes GPT2ForSequenceClassification hit the SDPA pattern 18 and then encounter the accuracy issue. The issue only happens with BF16 inference single thread. This PR tends to increase the model tolerance from 4e-3 to 5e-3 and make the check pass. Note that the issue is due to some small implementation diff. For example, the sdpa math backend scales q, k before matmul for stability; the flash attention backend has more diffs as a new algorithm. X-link: pytorch/pytorch#136749 Approved by: https://github.com/jgong5, https://github.com/jansel Reviewed By: jovianjaison Differential Revision: D64290722 fbshipit-source-id: a3e7248f57a97cd767257354d410b3508b5e0325
Commits on Oct 14, 2024
making CI more flexible for extra data in tritonbench
Summary: TSIA Reviewed By: danzimm Differential Revision: D64334048 fbshipit-source-id: d01b20161407400d0afd28460bce8095c91d9056
Add entire _dynamo.config as a json for logging (#137216)
Summary: X-link: pytorch/pytorch#137216 Approved by: https://github.com/ezyang Reviewed By: clee2000 Differential Revision: D64290696 Pulled By: jovianjaison fbshipit-source-id: 06886bfb7e3f37895e3a8bf567366e4c4cc1d248 Co-authored-by: Aaron Gokaslan <[email protected]>
Skipping null values in scribe message
Summary: Since we have added flexibility for different sets of metrics per operator, we want to skip messages for empty metrics. Reviewed By: nmacchioni Differential Revision: D64345289 fbshipit-source-id: d5b1fff90c6acd530867d0b6ef3ea97bc6f41cf5
Commits on Oct 16, 2024
Add fbscribelogger to Dynamo benchmark runner (#137867)
Summary: Signed-off-by: Edward Z. Yang <[email protected]> X-link: pytorch/pytorch#137867 Approved by: https://github.com/bobrenjc93 Reviewed By: clee2000 Differential Revision: D64418349 Pulled By: ezyang fbshipit-source-id: 265e07753a3549e6866d45fbdb8a435b6e7dc787
Update the flash-attention submodule (pytorch#2500)
Summary: We need https://github.com/Dao-AILab/flash-attention/pull/1053/files to externally import `flash_attn_interface` for FA3. Pull Request resolved: pytorch#2500 Reviewed By: bertmaher Differential Revision: D64190441 Pulled By: xuzhao9 fbshipit-source-id: ff20f0a28514b645c828853e7f15808ed1597ae6
Add host-side Triton TMA support to Dynamo (#137677)
Summary: This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:
- Here we assume the availability of the host-side TMA API added to upstream Triton in triton-lang/triton#4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). The OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), and `TMADescriptorVariable` (for the TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for what the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors, so that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.

X-link: pytorch/pytorch#137677 Approved by: https://github.com/zou3519 Reviewed By: clee2000 Differential Revision: D64404928 Pulled By: aakhundov fbshipit-source-id: c812cea3867c55800d5fe213bf07bf21292345e3
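The user-visible chain being traced looks roughly like the sketch below; the module path and argument order of the experimental host-side helpers are assumptions on my part, so treat this as illustrative rather than the exact API.

```python
import torch
import triton.tools.experimental_descriptor as tma  # assumed location of the host-side TMA helpers

M, N, BLOCK_M, BLOCK_N = 1024, 1024, 128, 128
t = torch.randn(M, N, device="cuda", dtype=torch.float16)

# t.data_ptr() --> create_2d_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)
desc = tma.create_2d_tma_descriptor(t.data_ptr(), M, N, BLOCK_M, BLOCK_N, t.element_size())
# some_kernel[(grid,)](desc, ...)  # the kernel receives the descriptor, not the tensor itself
```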
Add ncu report analyzer (pytorch#2497)
Summary: This PR adds an ncu report analyzer to analyze the profiled ncu report. It also adds two metrics, `memory_traffic` and `arithmetic_intensity`. To avoid excessive profiling overhead, we only profile with the necessary ncu metrics. This PR is a part of the [operator benchmarking plan](pytorch/pytorch#136168). Example command:
```
python run_benchmark.py triton --op gather_gemv --num-inputs 1 --metrics memory_traffic,arithmetic_intensity --csv
```
Example output:
```
0%| | 0/1 [00:00<?, ?it/s]==PROF== Connected to process 508958 (/scratch/yhao/miniconda3/envs/pta_gil/bin/python3.10)
==PROF== Profiling "index_elementwise_kernel" - 0: 0%....50%....100% - 3 passes
==PROF== Profiling "unrolled_elementwise_kernel" - 1: 0%....50%....100% - 3 passes
==PROF== Profiling "gemv2T_kernel_val" - 2: 0%....50%....100% - 3 passes
100%|██████████| 1/1 [00:03<00:00, 3.89s/it]
x_val;test_eager-_ncu_trace_in_task
2048;success
==PROF== Disconnected from process 508958
==WARNING== No source files were imported. Check that the target application was compiled with -lineinfo.
==PROF== Report: /scratch/yhao/tmp/tritonbench/gather_gemv/ncu_traces/test_eager_0/ncu_output.ncu-rep
0%| | 0/1 [00:00<?, ?it/s]==PROF== Connected to process 509121 (/scratch/yhao/miniconda3/envs/pta_gil/bin/python3.10)
==PROF== Profiling "triton_red_fused_mv_0" - 0: 0%....50%....100% - 3 passes
100%|██████████| 1/1 [00:03<00:00, 3.79s/it]
x_val;test_0-_ncu_trace_in_task
2048;success
==PROF== Disconnected from process 509121
==PROF== Report: /scratch/yhao/tmp/tritonbench/gather_gemv/ncu_traces/test_0_0/ncu_output.ncu-rep
0%| | 0/1 [00:00<?, ?it/s]==PROF== Connected to process 509285 (/scratch/yhao/miniconda3/envs/pta_gil/bin/python3.10)
==PROF== Profiling "triton_red_fused_mv_0" - 0: 0%....50%....100% - 3 passes
==PROF== Connected to process 509433 (/scratch/yhao/miniconda3/envs/pta_gil/bin/python3.10)
100%|██████████| 1/1 [00:04<00:00, 4.07s/it]
x_val;test_inductor-_ncu_trace_in_task
2048;success
==PROF== Disconnected from process 509285
==PROF== Disconnected from process 509433
==PROF== Report: /scratch/yhao/tmp/tritonbench/gather_gemv/ncu_traces/test_inductor_0/ncu_output.ncu-rep
100%|██████████| 1/1 [00:23<00:00, 23.99s/it]
x_val;test_eager-arithmetic_intensity;test_eager-memory_traffic;test_eager-weighted_fp32_arithmetic_intensity;test_0-arithmetic_intensity;test_0-memory_traffic;test_0-weighted_fp32_arithmetic_intensity;test_inductor-arithmetic_intensity;test_inductor-memory_traffic;test_inductor-weighted_fp32_arithmetic_intensity
2048;(0.14937214493924472, 0.0);(29467392.0, 505856.0);0.14937214493924472;(4.364079147640791, 0.0);(4204544.0, 256.0);4.364079147640791;(9.97989888530182, 0.0);(4202752.0, 0.0);9.97989888530182
```
According to ncu, there can be multiple roofline charts at different granularities, such as single precision, double precision, tensorcore, and half precision. Pull Request resolved: pytorch#2497 Reviewed By: xuzhao9 Differential Revision: D64359055 Pulled By: FindHao fbshipit-source-id: a02a4ebfcac5c5209f4196aac5a8eb4f91b3de87
Change default gpu metric backend (pytorch#2501)
Summary: The current GPU memory metric backends include dcgm and nvml. They report from hardware and should be accurate. This PR adds a native torch way to collect GPU memory usage, using `torch.cuda.max_memory_allocated()`. The benefit is that it has lower overhead and is accurate on a shared GPU server where there are multiple GPU processes from other users, because we don't implement a process filter for the other two backends. Use `--metrics-gpu-backend torch` to set the backend. Pull Request resolved: pytorch#2501 Reviewed By: xuzhao9 Differential Revision: D64253410 Pulled By: FindHao fbshipit-source-id: 09b0579846a6830e0e9735e8daeba4abd88bab17
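A minimal sketch of the torch-native measurement; unlike dcgm/nvml, it only counts allocations made by the current process.

```python
import torch

def peak_gpu_mem_gb(fn) -> float:
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    # Peak memory allocated by this process only, so other users' GPU
    # processes on a shared server do not affect the number.
    return torch.cuda.max_memory_allocated() / 1e9

peak = peak_gpu_mem_gb(lambda: torch.randn(4096, 4096, device="cuda") @ torch.randn(4096, 4096, device="cuda"))
print(f"{peak:.3f} GB")
```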
Update 2.5.0.yaml (pytorch#2498)
Summary: Pull Request resolved: pytorch#2498 Reviewed By: kit1980 Differential Revision: D64407151 Pulled By: atalman fbshipit-source-id: 0637d812144f13dad41b640e70fd65619a183c67
Commits on Oct 17, 2024
Add --op-collection option (pytorch#2503)
Summary: This PR adds `--op-collection` to tritonbench. It can run multiple ops in defined operator collections. The default collection includes all ops not included in other collections. Operator collections are defined in `torchbenchmark/operators_collection/`. For each collection, you should define a `get_operators` function that returns the operators included in that collection (see the sketch below). Pull Request resolved: pytorch#2503 Reviewed By: xuzhao9 Differential Revision: D64359380 Pulled By: FindHao fbshipit-source-id: c66dd254a3c8b70c112d9b7774482813e0236789
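A hypothetical sketch of what one collection module might look like under that directory; the collection name and operator names are placeholders, not the repository's actual contents.

```python
# torchbenchmark/operators_collection/my_collection/__init__.py  (illustrative path)

def get_operators():
    # Return the operator names that belong to this collection; the harness
    # unions these lists to resolve --op-collection into concrete ops.
    return [
        "FusedLinearCrossEntropy",
        "fp8_gemm",
    ]
```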
Summary: Update imports for latest updates + silu_mul interface change Reviewed By: jianyuh Differential Revision: D64516452 fbshipit-source-id: b9b98a6eda45a093661e8b23f6b8ec300b559960
Add doc for adding custom ops (pytorch#2509)
Summary: Add documentation for adding custom ops. Pull Request resolved: pytorch#2509 Reviewed By: xuzhao9 Differential Revision: D64497281 Pulled By: FindHao fbshipit-source-id: 20f4096ebbce53c7d9a713cacbde016c521aa7c3
Summary: As the title goes. Reviewed By: bertmaher Differential Revision: D64480822 fbshipit-source-id: ec1d17be0619fb35d4d8f774eab2858e75afe2e3
Test backward pass in unit test.
Summary: In unit test, run both forward and backward pass. If the backward pass throws `NotImplementedError`, skip the test since the operator does not support backward pass. Reviewed By: int3 Differential Revision: D64471087 fbshipit-source-id: c9d0c43544314fc11305f271e8e80f7ba07b2675
Make sure all ci-enabled impls are in the output
Summary: In the CI, we will check that all registered impls are available in the output, unless they are specified as `ci=False`. We add the `ci=` flag because right now we don't have lazy imports for optional backend modules, and we want different behavior between the `enabled` and `ci` flags. For the `enabled` flag, we want "best-effort": if a module is not available (e.g. flash attention 3 is not available on A100), we should detect that and skip it automatically instead of erroring out, for the best user experience. For the `ci` flag, we want to make sure that things are already set up in fbcode CI; if flash attention 3 is not available there, it is a red flag and we have to report it in the unit test. Reviewed By: bertmaher Differential Revision: D64473609 fbshipit-source-id: 320255f73942705038d50aac1f14d318b62a4765
Commits on Oct 18, 2024
Update AOTEagerandRecordGraphs backend (#138231)
Summary: X-link: pytorch/pytorch#138231 Approved by: https://github.com/StrongerXi, https://github.com/mlazos, https://github.com/aakhundov Reviewed By: clee2000 Differential Revision: D64581452 Pulled By: anijain2305 fbshipit-source-id: 3b9ff53abf2c4e1c525d7e62a52285279d2d4109
Log is_forward field to dynamo_compile scuba table (pytorch#2511)
Summary: Pull Request resolved: pytorch#2511 X-link: pytorch/pytorch#138097 ^^ Reviewed By: ezyang Differential Revision: D64438144 fbshipit-source-id: 87a5518d4d9318132d269302c93a285bf86f3a46
Revamp PT2 Compile/chromium event logging [1/?]
Summary: X-link: pytorch/pytorch#138093 This diff is the starting steps of https://docs.google.com/document/u/2/d/1kAEBt4AyW7HTAhXHbjoz8FBFHNyyEA2Qo2mPn7v3WUQ/edit?usp=drive_web&ouid=113555078003219714709 It implements the following changes:
- Only log spans to scuba, so no start events are ever logged
- Log events as the full event name, without "START" or "END"
- Only log to scuba major phases from chromium events. These are:
  - entire_frame_compile (dynamo)
  - backend_compile (aotdispatch)
  - inductor_compile (inductor)
  - codegen (inductor codegen)

Tlparse chromium events stay basically the same, but I implemented a few changes to clean that up as well:
- When there's a phase name available, log the phase name instead of the function name as the event name. This simplifies the trace to not have two identical rows. The fn_name is available as metadata on the chromium event, if interested.
- Log new events for pre and post grad passes. These do *not* log to scuba.

By making the phases much simpler in Scuba, with only categories for major phases of PT2 Compilation, we pave the way to add **much** more metadata and information to each individual event type. Diffs for that will come later.

**IMPLEMENTATION NOTES:**
- The logic for `log_chromium_event_internal` (which is the function that logs to Scuba) lives in chromium_events for now, but in the future as we add more metadata, it may belong independently in dynamo_timed or even outside of dynamo_timed. I haven't explored in detail what the refactor will look like. Once we start logging metadata for dynamo, aotdispatch, inductor, I suspect we will call log_pt2_compile_event directly, instead of making the chromium event logger handle the pt2_compile_event logic. But that refactor is left for another PR on top of this one.
- There's an interesting space after pre grad passes within the AOT autograd logic, that is, between create_aot_dispatcher_function and pre grad passes. I'm not sure what we're spending time doing there, but I'll find out with a profile later.

ghstack-source-id: 248790387 Reviewed By: oulgen Differential Revision: D64479033 fbshipit-source-id: 1f30e734160bfed2f664063b5b2f4df1b661dfa4
Revert D64438144: Log is_forward field to dynamo_compile scuba table
Differential Revision: D64438144 Original commit changeset: 87a5518d4d93 Original Phabricator Diff: D64438144 fbshipit-source-id: 3acb559a632ce345a1c3c88edc9007c0a9e5d40c
adding aggregates to servicelab
Summary: The current aggregation does not seem to be working as expected. Adding another aggregation field before changing the previous one over. Reviewed By: xuzhao9 Differential Revision: D64616616 fbshipit-source-id: 676f09035e0d4427e9b60e9ed8f8c790782f0aec
specifying logged benchmark name for tritonBench servicelab logging
Summary: more specific logging in our logging table based on servicelab benchmark names Reviewed By: nmacchioni Differential Revision: D64627855 fbshipit-source-id: 47e250c5d8a34a912e7885e1f997a90a9dd8bc10
Commits on Oct 19, 2024
replace uses of np.ndarray with npt.NDArray
Summary: X-link: pytorch/opacus#681 X-link: pytorch/captum#1389 X-link: pytorch/botorch#2586 X-link: pytorch/audio#3846 This replaces uses of `numpy.ndarray` in type annotations with `numpy.typing.NDArray`. In Numpy-1.24.0+ `numpy.ndarray` is annotated as generic type. Without template parameters it triggers static analysis errors: ```counterexample Generic type `ndarray` expects 2 type parameters. ``` `numpy.typing.NDArray` is an alias that provides default template parameters. Reviewed By: ryanthomasjohnson Differential Revision: D64619891 fbshipit-source-id: dffc096b1ce90d11e73d475f0bbcb8867ed9ef01
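The change is mechanical; for example (the function below is illustrative, not from the diff):

```python
import numpy as np
import numpy.typing as npt

# Before (flagged under numpy>=1.24: "Generic type `ndarray` expects 2 type parameters"):
# def normalize(x: np.ndarray) -> np.ndarray: ...

# After: NDArray is an alias that supplies default template parameters.
def normalize(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    return (x - x.mean()) / x.std()
```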
Commits on Oct 21, 2024
Disable torch function compilation during guard execution and in compiled bytecode (#137669)
Summary: Fixes pytorch/pytorch#114369 X-link: pytorch/pytorch#137669 Approved by: https://github.com/anijain2305 Reviewed By: wdvr Differential Revision: D64675139 Pulled By: mlazos fbshipit-source-id: a5e4501eaa781fcbd9423c99c555949182bd9f24
fixing key error in aggregate data
Summary: For some reason OSS isn't happy with dict.get, so I'm moving to this slightly less Pythonic but more exact approach. Reviewed By: bertmaher, sfzhu93 Differential Revision: D64698791 fbshipit-source-id: 48cc4b6f7df61287efdc71c30176c2830dfde110
Commits on Oct 22, 2024
Replace __str__ with __repr__ in some places (#136316)
Summary:
## The problem
In a typical debugger, `repr()` is used to display variables and not `str()`. Several classes in Dynamo have a `__str__()` method that returns useful information and a `__repr__()` that does not. Having to call `str(x)` or `[str(i) for i in x]` in the debugger all the time is a chore. `str()` should be ["informal, nicely printable"](https://docs.python.org/3/library/stdtypes.html#str) and `repr()` should ["attempt to return a string that would yield an object with the same value when passed to eval()"](https://docs.python.org/3/library/functions.html#repr).
## The solution
In the Python object model, if there is no `__str__` method, `__repr__` is used instead (but not the other way around). So renaming `__str__` to `__repr__` in a few cases where no `__repr__` method exists now should not change observable behavior, and should make debugging easier. The specific classes changed were all in `torch._dynamo.variables`:
* `builtin.BuiltinVariable`
* `constant.ConstantVariable`
* `constant.EnumVariable`
* `functions.UserMethodVariable`
* `lazy.LazyVariableTracker`
* `lazy.LazySymNodeFormatString`
* `misc.GetAttrVariable`
* `misc.NullVariable`
* `user_defined.UserDefinedObjectVariable`

X-link: pytorch/pytorch#136316 Approved by: https://github.com/XuehaiPan, https://github.com/jansel Reviewed By: wdvr Differential Revision: D64714511 fbshipit-source-id: 322f2f0110e5b45afe6a27c52a0bcc91d91d1d6a
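For reference, the object-model fallback the summary relies on can be checked directly (toy class, not one of the Dynamo variables):

```python
class Variable:
    def __repr__(self) -> str:
        return "Variable(debug-friendly info)"

v = Variable()
print(repr(v))  # Variable(debug-friendly info)  <- what debuggers display
print(str(v))   # Variable(debug-friendly info)  <- str() falls back to __repr__
# Defining only __str__ would not help repr(): it would stay the default
# "<__main__.Variable object at 0x...>" form.
```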
Update requirements.txt (pytorch#2523)
Summary: attempt to fix dependencies - this is no longer compatible with the latest huggingface_hub, see failing test at https://github.com/pytorch/pytorch/actions/runs/11445304501/job/31843081598 Pull Request resolved: pytorch#2523 Reviewed By: huydhn Differential Revision: D64711662 Pulled By: wdvr fbshipit-source-id: eed9143e6e0531840a53ba5ab3fad04894727272
Fixes to prep for weights_only default flip (pytorch#2514)
Summary: Some fixes for pytorch/pytorch#137602 Pull Request resolved: pytorch#2514 Reviewed By: xuzhao9 Differential Revision: D64628614 Pulled By: mikaylagawarecki fbshipit-source-id: edebf25cc6648919d5673a3baeaffdac26e5b91f
typing compile_fx.py (#138033)
Summary: Type annotations for compile_fx. - Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed. - There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py. X-link: pytorch/pytorch#138033 Approved by: https://github.com/Skylion007 Reviewed By: wdvr Differential Revision: D64714765 Pulled By: aorenste fbshipit-source-id: 262f5cb9b2171e96ce9f895772bd5778ddb4ebe0
Add metadata to events in progress, new `dynamo` event
Summary: X-link: pytorch/pytorch#138477 This diff does a few things:

## Add metadata to events in progress
Adds the ability to add extra metadata to Chromium Events via `add_event_data`. Metadata can only be added to chromium events that have started, but not ended (so, in progress events).
- When you add the data, the metadata is appended to the metadata when you call log_event_end().
- The metadata appears in chromium events in tlparse. It also gets logged to scuba.

## New `dynamo` chromium event
We add a new `dynamo` chromium event to the top of the stack, where we collect various metadata found in dynamo_compile. So the new order of events goes:
```
__start__ -> dynamo (dynamo compile metrics) -> entire_frame_compile (compile.inner) -> backend_compile (i.e. aotdispatch) -> create_aot_dispatch_function -> inductor_compile -> ...
```
BackwardCompilationMetrics doesn't have any dynamo specific information (as it's mostly inductor timings). So we don't include that here.

*FAQ: Why can't we use `entire_frame_compile` as the event?* This is mostly due to backward compatibility with `dynamo_compile`. `dynamo_compile` collects CompilationMetrics outside of `compile.compile_inner`, and uses `dynamo_timed` to grab timings from phases of the compiler, including `entire_frame_compile`. So we don't have a CompilationMetric object until after an `entire_frame_compile` event ends! Separately, `dynamo` as a name for all of dynamo compile is more descriptive than `entire_frame_compile`, imo.

## Log metadata as separate columns (Meta only)
Separately, this also changes the `metadata` column in PT2 Compile Events. Instead of logging a single metadata column in JSON, it separates the JSON into separate columns. This is much better for data analysis. Now that this table is more mature, I think logging keys to separate columns is a better system.

ghstack-source-id: 249373269 Reviewed By: aorenste Differential Revision: D64696287 fbshipit-source-id: 441f57e2d1c0210e81c06eb86d4482e95bed4971
Commits on Oct 23, 2024
Log is_forward field to dynamo_compile scuba table (#138505)
Summary: X-link: pytorch/pytorch#138505 Approved by: https://github.com/oulgen Reviewed By: oulgen Differential Revision: D64711721 Pulled By: masnesral fbshipit-source-id: 488dd527d0b9179644ae5d6d45d88bdab0224032
Compiled autograd configs in TLS (#137821)
Summary: Multithreaded doesn't work yet, this adds python side TLS only for the python side state X-link: pytorch/pytorch#137821 Approved by: https://github.com/jansel, https://github.com/yf225 ghstack dependencies: #137953 Reviewed By: wdvr Differential Revision: D64796212 Pulled By: xmfan fbshipit-source-id: aa1d9ef8f6e61207dfb352866e37d5e7cc98df42
Summary: X-link: pytorch/pytorch#138061 Approved by: https://github.com/yf225 ghstack dependencies: #137953, #137821 Reviewed By: wdvr Differential Revision: D64796226 Pulled By: xmfan fbshipit-source-id: 9bf80c1492d7a800a308cb1e99fac63c4752fc52
adding fp32 strict and tf32x3 benchmarks for gemm
Summary: TSIA draft diff while I move this to its own op Reviewed By: danzimm Differential Revision: D64781204 fbshipit-source-id: c3ddd956230c1e4c8166867f03b5a28e8d6586e9
Commits on Oct 24, 2024
Support range_iterator as a function input (#138657)
Summary: Fixes pytorch/pytorch#138654 X-link: pytorch/pytorch#138657 Approved by: https://github.com/williamwen42, https://github.com/jansel Reviewed By: wdvr Differential Revision: D64881833 Pulled By: anijain2305 fbshipit-source-id: 46bcffa12ef2bec0ff47a1b60323aacbb3a90872
Support overridden __call__ on nn modules (#138619)
Summary: X-link: pytorch/pytorch#138619 Approved by: https://github.com/williamwen42 ghstack dependencies: #138657 Reviewed By: wdvr Differential Revision: D64881836 Pulled By: anijain2305 fbshipit-source-id: 1974dbc228618e8597eb6ab293272ee985964f52
updating hardware and device columns
Summary: currently device and hardware are flipped in logging table due to args mismatch Reviewed By: xuzhao9 Differential Revision: D64911847 fbshipit-source-id: 2d75b17046eae2eed0d83f86140ad88dae26de29
Release 2.5.1.yaml perf test (pytorch#2525)
Summary: Pull Request resolved: pytorch#2525 Reviewed By: kit1980 Differential Revision: D64912654 Pulled By: atalman fbshipit-source-id: 74cf57574c7ed5e1b6a4fee4b9c2de745deb21c0
Commits on Oct 25, 2024
Account for older numpy versions in pytorch#2514 (pytorch#2524)
Summary: Pull Request resolved: pytorch#2524 Reviewed By: kit1980 Differential Revision: D64771621 Pulled By: mikaylagawarecki fbshipit-source-id: 545f3d528cfbe2668c8d37e98e99423cd77a8e8e
Summary: getting gemm operator to work for amd Reviewed By: danzimm, xuzhao9 Differential Revision: D64976612 fbshipit-source-id: 20aaf30732211848996a3575ca7356f514ed912c
Add logger logging for remote fx graph cache get + put (pytorch#2512)
Summary: Pull Request resolved: pytorch#2512 X-link: pytorch/pytorch#138164 Capture the timing for the remote fx graph cache get and put operations and add them to the logger logging. Reviewed By: ezyang, oulgen Differential Revision: D64484025 fbshipit-source-id: 3ac8dad8f7083d7eefaa6f092d7703488a8bc41f
Commits on Oct 26, 2024
Reviewed By: xuzhao9 Differential Revision: D64683154 fbshipit-source-id: 70d359538572947c15184255fe5b2e69f61ab04a
Reviewed By: xuzhao9 Differential Revision: D64683332 fbshipit-source-id: f132eda07a1cde19116ce18f5b400d896df53612
Commits on Oct 27, 2024
Update Typeguard to TypeIs for better type inference (#133814)
Summary: Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/ X-link: pytorch/pytorch#133814 Approved by: https://github.com/ezyang Reviewed By: wdvr Differential Revision: D65030974 fbshipit-source-id: 6e04f555c9ac4a60d7f53ab23ad3b60b82de5d48
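The practical difference is that `TypeIs` also narrows the negative branch; a minimal example (using `typing_extensions`, which backports it before Python 3.13):

```python
from typing_extensions import TypeIs

def is_str(x: str | int) -> TypeIs[str]:
    return isinstance(x, str)

def handle(x: str | int) -> None:
    if is_str(x):
        print(x.upper())   # narrowed to str
    else:
        print(x + 1)       # narrowed to int; with TypeGuard this branch stays str | int
```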
Use guard_manager consistently instead of check_fn (#138896)
Summary: X-link: pytorch/pytorch#138896 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #138512 Reviewed By: wdvr Differential Revision: D65030963 Pulled By: anijain2305 fbshipit-source-id: 7423473e4c3613aea42e13a64eae9c417c876964
Commits on Oct 28, 2024
Fix naming for AMD in fp8 rowwise fbgemm
Summary: Select CK or Cutlass based on the arch. Reviewed By: xuzhao9 Differential Revision: D65060122 fbshipit-source-id: 3406e4852efe30883474d4bbb2315ffe4c54e211
Back out "tls access helpers (#138061)" and Back out "[compiled autog…
…rad] Compiled autograd configs in TLS (#137821)" Summary: X-link: pytorch/pytorch#139086 Original commit changeset: 9bf80c1492d7 Original Phabricator Diff: D64796226 Original commit changeset: aa1d9ef8f6e6 Original Phabricator Diff: D64796212 Reviewed By: malfet, kflu Differential Revision: D65072644 fbshipit-source-id: 50ad138fc216653987a80ea6ae3efeaf5c04f949
Commits on Oct 29, 2024
-
Switch times to us in CompilationMetrics and improvements (#138975)
Summary: Companion logger diff: https://www.internalfb.com/diff/D65012523
* Using float seconds for timestamps is bad because our internal system defaults to float32 precision, and you don't even get second precision for timestamps in float32 (see the numeric sketch after this list).
* We use microseconds instead of milliseconds because with millisecond granularity you can end up with identical timestamps when compilation happens very quickly; it's much better to force non-overlapping spans.
* Because there are so many new fields and I don't feel like reimplementing each on BwdCompilationMetrics, BwdCompilationMetrics is no more; everything in CompilationMetrics is now optional instead.
* The actual frame compile time collection is not modified (still float) to reduce blast radius, so I just convert to microseconds before making the record. At float64 precision (Python's default) you get about microsecond precision on timestamps, so this shouldn't be a data problem (https://www.leebutterman.com/2021/02/01/store-your-unix-epoch-times-as-float64.html).
* I rename some entries for clarity. In particular, whenever a timing contains all of its lower phases (e.g., how Inductor also contains Triton compilation), we put "cumulative" in its name; if something doesn't happen at compile time but is delayed until we have actual real inputs, we put "runtime" in its name.
X-link: pytorch/pytorch#138975 Approved by: https://github.com/masnesral Reviewed By: huydhn Differential Revision: D65088198 Pulled By: ezyang fbshipit-source-id: 0b901357ab649f052a3553fe8d0cc37fba80e197
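A quick numeric check of the float32 point above, purely as an illustration (not code from this change):

```python
import time
import numpy as np

now_s = time.time()                  # float64 seconds since the epoch, ~1.7e9
gap = np.spacing(np.float32(now_s))  # ~128.0: adjacent float32 values are ~2 minutes apart
now_us = int(now_s * 1e6)            # integer microseconds keep full precision
print(gap, now_us)
```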
Commit: 4ad2712
-
add some cpython debugging methods (#138030)
Summary: This PR enables you to inspect PyObjects in C using `INSPECT(...)` without requiring https://docs.python.org/3/howto/gdb_helpers.html. `torch._dynamo.eval_frame.raise_sigtrap` can also be used to set gdb breakpoints while running Python code, e.g.

```python
x = x + 1
torch._dynamo.eval_frame.raise_sigtrap()  # can breakpoint on ceval.c:CALL to breakpoint the `sin` call in C
x = torch.sin(x)
```

X-link: pytorch/pytorch#138030 Approved by: https://github.com/jansel Reviewed By: huydhn Differential Revision: D65104659 Pulled By: williamwen42 fbshipit-source-id: aa2f3f9c34a1ee15160ccc82bf61c740b3ac6d20
Commit: 438f82b
-
Set use_cuda_graphs in fp8_gemm_rowwise
Summary: The default value for use_cuda_graphs was changed to False in D64471087 and this caused slowdowns in triton/ck kernels for fp8_gemm_rowwise. Reviewed By: danzimm Differential Revision: D65140285 fbshipit-source-id: 4ab77537afeb9108dab7cdef6cac34aaa39d7d73
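For context on why the flag matters, here is a minimal capture-and-replay sketch using PyTorch's public CUDA graphs API (assumes a CUDA device; this is not the benchmark's actual harness). Graph replay removes most per-launch CPU overhead, which is exactly what small fp8 rowwise kernels are sensitive to.

```python
import torch

x = torch.randn(1024, device="cuda")
g = torch.cuda.CUDAGraph()

# Warm up on a side stream before capture, as the CUDA graphs API requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    y = x * 2.0
torch.cuda.current_stream().wait_stream(s)

with torch.cuda.graph(g):
    y = x * 2.0

g.replay()  # re-launches the captured kernel with minimal CPU overhead
```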
Commit: 870be9b
-
Remove hammer/generative_recommenders (pytorch#2526)
Summary: Pull Request resolved: pytorch#2526 X-link: pytorch-labs/tritonbench#19 As title Reviewed By: xuzhao9, LinjianMa Differential Revision: D65069124 fbshipit-source-id: 1ee736396fecc76d606e637fee7a8127603d9d7e
Commit: 4d6e0fa
Commits on Oct 31, 2024
-
Fix type for "--iter" flag (pytorch#2528)
Summary: Pull Request resolved: pytorch#2528 Reviewed By: xuzhao9 Differential Revision: D64935089 fbshipit-source-id: 8b0aa81513a3c6a58e8876475ec63041d362d42a
Commit: a0890b0
-
Add start event metadata to collected metadata for PT2 Compile Events
Summary: X-link: pytorch/pytorch#139289 We should be logging metadata from event starts to PT2 Compile Events too. ghstack-source-id: 250444771 Reviewed By: oulgen Differential Revision: D65070086 fbshipit-source-id: 63b934bff4254871e15a615e5aa47112b032b143
Commit: 0c8a0f6
Commits on Nov 1, 2024
-
Optimize PT2 Compile Events ingestion and column formats
Summary: X-link: pytorch/pytorch#139309 Per discussion from https://fb.workplace.com/groups/1286739428954016/posts/1360522894909002

This diff considerably changes the column format of PT2 Compile Events. We only log to scuba the set of dynamo_timed() events that we actually care about aggregating. To do so, we add a boolean to dynamo_timed() that decides whether or not to log a pt2_compile_event. We'll always log a chromium_event for every dynamo_timed(), but only log a subset of those to scuba.

Logging all metadata into a single metadata column saves space and ingestion, because new rows for different events don't each add N empty column markers. It comes at the cost of having to create new derived columns in the Scuba UI to use the extra metadata we care about, but that's a tradeoff we're willing to make here, considering that other tables like dynamo_compile exist.

ghstack-source-id: 251214365 exported-using-ghexport Reviewed By: oulgen Differential Revision: D65225598 fbshipit-source-id: 01569a79174ed3699063dbd8bb26b883c6a7b0c4
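A toy sketch of the resulting row shape (names invented; the real pipeline writes to Scuba): event-specific fields ride in one JSON metadata column instead of one column per key, so a new key doesn't add an empty column to every unrelated row.

```python
import json
import time

def make_pt2_compile_event_row(event_name: str, metadata: dict) -> dict:
    # Hypothetical row builder: a small fixed set of columns plus a single
    # free-form "metadata" column holding everything event-specific.
    return {
        "event": event_name,
        "time_us": int(time.time() * 1e6),
        "metadata": json.dumps(metadata),
    }

row = make_pt2_compile_event_row("inductor_compile", {"cache_hit": True, "graph_id": 3})
```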
Commit: a66ce04
-
Summary: When benchmarking across multiple operators, we can optionally isolate each operator run in a child process. Reviewed By: FindHao Differential Revision: D65154665 fbshipit-source-id: 9c9a21a76897084b061374cb3f7d8524a4aaac9b
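A rough sketch of that isolation using only the standard library (the per-operator body is a stand-in for the real benchmark): each operator runs in its own spawned child process, so allocator state, CUDA context, and crashes cannot leak across operators.

```python
import multiprocessing as mp

def _run_operator(op_name, out_queue):
    # Stand-in for the real per-operator benchmark body.
    out_queue.put((op_name, f"ran {op_name}"))

def run_isolated(op_names):
    ctx = mp.get_context("spawn")  # spawn avoids inheriting a live CUDA context
    results = {}
    for name in op_names:
        q = ctx.Queue()
        p = ctx.Process(target=_run_operator, args=(name, q))
        p.start()
        p.join()
        if not q.empty():
            op, res = q.get()
            results[op] = res
    return results

if __name__ == "__main__":
    print(run_isolated(["gemm", "addmm"]))
```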
Commit: cc094df
Commits on Nov 4, 2024
-
Classify miss-inplaced tensors in logs.
Summary: X-link: pytorch/pytorch#139240 Use signpost logs; a follow-up is to remove the field possibly_missed_reinplacing_opportunities from the dynamo compile table. Reviewed By: zou3519 Differential Revision: D65180194 fbshipit-source-id: 20fe80f209a15573b2184e4cf7ed2be3c2a4ab94
Commit: 86a366e
Commits on Nov 5, 2024
-
Switch OSS dashboard to use aoti_compile_and_package (#139597)
Summary: Reland pytorch/pytorch#139154 X-link: pytorch/pytorch#139597 Approved by: https://github.com/angelayi Reviewed By: ZainRizvi Differential Revision: D65455707 Pulled By: desertfire fbshipit-source-id: 691882e606754fc04cb826a14bdfe94cb465ece8
Commit: 4a42e06
Commits on Nov 6, 2024
-
Specialize symfloats that flow through is_integer (#139572)
Summary: Fixes `python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_is_integer_num_type6_dynamic_shapes` when specialize_float = False X-link: pytorch/pytorch#139572 Approved by: https://github.com/ezyang ghstack dependencies: #139569, #139457, #139568 Reviewed By: ZainRizvi Differential Revision: D65492888 Pulled By: bobrenjc93 fbshipit-source-id: 9a9881caa5905686c44d8508ce5edab46ab03f28
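A repro-style sketch of the code shape involved, under the assumption that specialize_float=False traces Python floats symbolically (the actual test exercises a different function):

```python
import torch

@torch.compile(dynamic=True)
def scale(x, factor: float):
    # Calling a float method such as is_integer() on a traced float makes
    # dynamo specialize it to the concrete value seen at compile time.
    if factor.is_integer():
        return x * int(factor)
    return x * factor

scale(torch.randn(4), 2.0)
```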
Commit: 3d3b7bb
-
Commit: 5790e68
-
Commit: 645c043
-
Commit: a78431f
Commits on Nov 11, 2024
-
Commit: 79bc6af
-
Commit: efb4b07
-
Commit: 4b5c733
-
Commit: 2ccc92a