xccl-p2p #3 · Open
Chao1Han wants to merge 2,367 commits into xccl-group from xccl-p2p
Conversation
This is a first step towards removing the build dependency on conda. Currently we build magma as a conda package in a pytorch conda channel, implemented in https://github.com/pytorch/builder/tree/a1b372dbda2e9e3bd946cf135aa3b0137dfdf052/magma. This commit adapts the logic from pytorch/builder as follows:
- use pytorch/manylinux-cuda<cuda-version> as the base image
- apply patches and invoke the build.sh script directly (no longer through conda build)
- store license and build files along with the built artifact, in an info subfolder
- create a tarball file which resembles the one created by conda, without any conda-specific metadata

A new matrix workflow is added, which runs the build for each supported cuda version and uploads the binaries to the pytorch S3 bucket. For the upload, define an upload.sh script, which will also be used by the magma Windows job, to upload to `s3://ossci-*` buckets. The build runs on PR and push; the upload runs in DRY_RUN mode in case of a PR.

Fixes pytorch#139397
Pull Request resolved: pytorch#139888
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/seemethere
Fixes pytorch#139182

In Triton 3.2, num_stages=0 will be deprecated with Triton's AMD backend. Let's query the default num_stages from the relevant Triton backend instead.

Pull Request resolved: pytorch#139881
Approved by: https://github.com/bertmaher
…nels (pytorch#139851)

This PR adds support for the `restore_value` argument of `@triton.autotune` for user-defined Triton kernels in PT2. The `kernel.restore_idx` are extracted in `ir.UserDefinedTritonKernel` and the corresponding arg names are placed into `triton_meta["restore_value"]`. From there, those are added to the existing `mutated_arg_names` in the caching autotuner infra, which leads to the listed args being cloned. This achieves an effect equivalent to the native `restore_value`.

Pull Request resolved: pytorch#139851
Approved by: https://github.com/oulgen
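As a rough illustration of the user-facing side, here is a minimal sketch of a user-defined Triton kernel that mutates its input in place and marks it with `restore_value`; the kernel body, config values, and wrapper function are made up for this example and are not taken from the PR.

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({"BLOCK": 256}), triton.Config({"BLOCK": 1024})],
    key=["n"],
    restore_value=["x_ptr"],  # x is mutated in place; restore it between autotuning runs
)
@triton.jit
def _inplace_scale(x_ptr, n, scale, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(x_ptr + offs, x * scale, mask=mask)

@torch.compile(fullgraph=True)
def scale_(x, s):
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    _inplace_scale[grid](x, n, s)
    return x
```

Per the description above, the names listed in `restore_value` end up in `triton_meta["restore_value"]` and are handled like `mutated_arg_names`, i.e. cloned before each benchmarking run.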
Fixes pytorch#126278

- Change the return type description of `torch.chunk` to tuple
- Add types for the input parameters

**Before**
![image](https://github.com/user-attachments/assets/087b6cfa-0815-443b-a69a-785ca4b421d7)

**After**
![image](https://github.com/user-attachments/assets/19532553-6004-4246-a6cf-f7f685f5775c)

Pull Request resolved: pytorch#140089
Approved by: https://github.com/awgu
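For reference, the documented behavior being described, with an illustrative input:

```python
import torch

chunks = torch.chunk(torch.arange(6), 3)
print(type(chunks))  # <class 'tuple'> -- a tuple of tensors, not a list
print(chunks)        # (tensor([0, 1]), tensor([2, 3]), tensor([4, 5]))
```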
…ted PRs (pytorch#138623)

It should help with triaging ROCm-inductor-related breakages and surfacing them in the PRs themselves.

Pull Request resolved: pytorch#138623
Approved by: https://github.com/huydhn
The configure-aws-credentials action requires special permissions: https://github.com/aws-actions/configure-aws-credentials?tab=readme-ov-file#oidc

Give "id-token: write" permission to the job that sets the AWS credentials to upload to the S3 bucket.

Fixes pytorch#139397
Pull Request resolved: pytorch#140141
Approved by: https://github.com/atalman
This PR removes the warning message on Windows on Arm64, which was triggered by an issue in one of the DLLs, to improve the user experience:

`Microsoft Visual C++ Redistributable is not installed, this may lead to the DLL load failure. It can be downloaded at https://aka.ms/vs/16/release/vc_redist.x64.exe`

The issue is being tracked here: https://developercommunity.visualstudio.com/t/VCRUNTIME140_1DLL-Miscompiled-for-Arm64/10781635?

Pull Request resolved: pytorch#139746
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <[email protected]>
Copied from pytorch#139652 Pull Request resolved: pytorch#140137 Approved by: https://github.com/huydhn, https://github.com/malfet
Previously the split decomp would return the input when there were no splits. This errors in torch.compile (or FakeTensorMode) with:

> RuntimeError: View operation returned a tensor that is the same as the input base tensor. This is no longer allowed; you must explicitly create a new tensor (e.g., using .detach()). As a user, you could have made a mistake implementing __torch_dispatch__ or a Python operator decomposition or meta registration; if that's not the case, please report a bug to PyTorch or the backend you are using.

Fix for pytorch#133394

Differential Revision: [D65635070](https://our.internmc.facebook.com/intern/diff/D65635070)
Pull Request resolved: pytorch#140065
Approved by: https://github.com/bdhirsh
…tor-related PRs (pytorch#138623)" This reverts commit ee7c3db. Reverted pytorch#138623 on behalf of https://github.com/huydhn due to I think the link failure is legit, it complains about the wrong concurrency setting in the workflow ([comment](pytorch#138623 (comment)))
Summary: Tighten the AOTIModelContainerRunner::run interface to take a const vector of at::Tensor, which 1) makes it clear that the runner will not modify the input tensor vector, and 2) allows the runner to take a temporary vector of tensors as the input.

Pull Request resolved: pytorch#139955
Approved by: https://github.com/chenyang78
…se class (pytorch#139084)

This is in preparation for adding NEON Vectorized<BFloat16>, which will be simplified by sharing this stuff.

Differential Revision: [D64997744](https://our.internmc.facebook.com/intern/diff/D64997744/)
Pull Request resolved: pytorch#139084
Approved by: https://github.com/malfet
When we have hardware support, we can use it. When we don't have hardware support, we can still do better than vec_base.h. I'm not sure to what extent we're set up to properly test both `defined(__ARM_FEATURE_BF16)` and `!defined(__ARM_FEATURE_BF16)` builds; feedback especially welcome there.

Testing: vec_test_all_types should cover correctness. For perf, seems clear that using vectorized intrinsics should be better than vec_base?

Differential Revision: [D64997747](https://our.internmc.facebook.com/intern/diff/D64997747/)
Pull Request resolved: pytorch#139090
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: pytorch#139084
…rch#140109) Fix pytorch#140095 Pull Request resolved: pytorch#140109 Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/eellison
Fix typo with Associative_Scan tests Pull Request resolved: pytorch#139929 Approved by: https://github.com/ydwu4
Part of implementing pytorch#93753 Pull Request resolved: pytorch#139950 Approved by: https://github.com/jansel ghstack dependencies: pytorch#139921
…h#140037) Part of implementing pytorch#93753. Next step will be to use a lower overhead data structure over `py::dict`. Pull Request resolved: pytorch#140037 Approved by: https://github.com/jansel ghstack dependencies: pytorch#139921, pytorch#139950
…ts (pytorch#139953) Pull Request resolved: pytorch#139953 Approved by: https://github.com/jansel
This should remove the pybind noise from the profiling. Pull Request resolved: pytorch#140110 Approved by: https://github.com/jansel ghstack dependencies: pytorch#139953
…evice code (pytorch#139966)" This reverts commit ca7fdfe. Reverted pytorch#139966 on behalf of https://github.com/malfet due to This approach will prevent one from using get_type_index from device code ([comment](pytorch#139966 (comment)))
Summary: Update the OSS tutorial to use the new aoti_compile_and_package and aoti_load_package APIs. Pull Request resolved: pytorch#139956 Approved by: https://github.com/angelayi ghstack dependencies: pytorch#139955
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#140166 Approved by: https://github.com/Skylion007
Summary: This diff reverts D65490202, which is causing tests to fail in open source. See distributed/test_c10d_logger.py::C10dErrorLoggerTest::test_exception_logger [GH job link](https://github.com/pytorch/pytorch/actions/runs/11736922614/job/32697709457) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/ba9645f6e51bb98b39ca8b351dd7fee786083372)

Test Plan: NA

Differential Revision: D65663063
Pull Request resolved: pytorch#140142
Approved by: https://github.com/malfet, https://github.com/huydhn
This reverts commit 7fa94f0. Reverted pytorch#139929 on behalf of https://github.com/ZainRizvi due to This test is breaking in trunk somehow, which is really weird. functorch/test_control_flow.py::AssociativeScanTests::test_associative_scan_binary_operator_compile_mode_compile_dynamic_shape_combine_mode_pointwise_reverse_False_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11747748990/job/32732254909) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/7fa94f03635709a30ef85c6955dcdd5051e72e71) ([comment](pytorch#139929 (comment)))
…h#136526)" This reverts commit c03324d. Reverted pytorch#136526 on behalf of https://github.com/ZainRizvi due to This fails to build internally. See D65604944 for more details ([comment](pytorch#136526 (comment)))
…#139595)" This reverts commit d72a308. Reverted pytorch#139595 on behalf of https://github.com/ZainRizvi due to Sorry but the newly added tests in test_mkldnn_pattern_matcher.py fail internally. See D65661038 for more details ([comment](pytorch#139595 (comment)))
…rch#136827)" This reverts commit cf0bb6c. Reverted pytorch#136827 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally. See D65605094 for more details ([comment](pytorch#136827 (comment)))
resolves: pytorch#138721

Summary: Delete the uses of deleted nodes. The double for-loop is icky here, but N should be pretty small, and removing it requires refactoring the data structures involved, which is a bigger endeavor.

Test Plan: Normal test coverage should be sufficient. There were a couple of spots in the scheduler code that didn't check for users being deleted, so I'll run a perf test to see what impact that has, and to make sure the N^2 doesn't affect compile times.

Perf: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2029%20Oct%202024%2017%3A41%3A36%20GMT&stopTime=Tue%2C%2005%20Nov%202024%2018%3A41%3A36%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=exclamaforte/prune-deleted-users&lCommit=5cb1aa6f7d8a52acdae0c7cf36b8c2d536d7f0d1&rBranch=main&rCommit=f4ee5a243dbb31e6310e5632b1c87898b299df2c (off of the Nov 4 nightly)

Pull Request resolved: pytorch#139447
Approved by: https://github.com/eellison
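A minimal sketch of the pruning idea, using made-up node objects rather than Inductor's actual scheduler data structures, showing the double loop mentioned above:

```python
def prune_deleted_users(nodes, deleted):
    """Drop references to deleted nodes from every remaining node's user list."""
    deleted = set(deleted)
    for node in nodes:  # O(N) surviving nodes ...
        # ... each scanning its own user list, hence the N^2 worry for compile time
        node.users = [u for u in node.users if u not in deleted]
```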
As manylinuxaarch64-builder already comes pre-built with all versions of the Python runtime, refactor the logic for setting the path to DESIRED_PYTHON from `manywheel/build_common` into `set_desired_python.sh` and call it from aarch64_ci_setup.sh.

In follow-up PRs, move the scons and ninja installation into the base docker image.

Pull Request resolved: pytorch#140093
Approved by: https://github.com/atalman
Summary: It seems like this issue is due to leftover CUPTI events from warmup staying persistent in the queue during profiling. These events start before our actual time window and therefore have a timestamp lower than our base time. This makes the delta become negative, which results in unsigned overflow. That creates a large number which later gets added as a signed value, creating the signed overflow.

Solution: If a raw timestamp is less than the base timestamp, just mark the processed timestamp as -1 so we can mark these events as "to ignore". In Kineto, add a special case to ignore timestamps that are negative.

Test Plan: Test with ASAN

Differential Revision: D65835650
Pull Request resolved: pytorch#140441
Approved by: https://github.com/davidberard98
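A small sketch of the described fix; this is illustrative only (the function name and units are assumptions), the real change lives in the profiler/Kineto C++ code:

```python
def to_profile_time(raw_ts_ns: int, base_ts_ns: int) -> int:
    # Leftover warmup-era CUPTI events can carry timestamps earlier than the
    # profile's base time; subtracting would underflow an unsigned delta.
    if raw_ts_ns < base_ts_ns:
        return -1  # sentinel: downstream code ignores events marked this way
    return raw_ts_ns - base_ts_ns
```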
Export the number of devices so that it can be used in ut. Pull Request resolved: pytorch#140492 Approved by: https://github.com/ezyang
…40577) Differential Revision: [D65895743](https://our.internmc.facebook.com/intern/diff/D65895743) Pull Request resolved: pytorch#140577 Approved by: https://github.com/zou3519
Summary: Output nodes may be eliminated into the input nodes if only a subset of the output nodes is specified. Add an option to check results for all output nodes in the partitioned graph.

Test Plan: see D65367305

Reviewed By: qcyuan

Differential Revision: D65367305
Pull Request resolved: pytorch#139774
Approved by: https://github.com/jfix71
Re make of pytorch#140587 Pull Request resolved: pytorch#140578 Approved by: https://github.com/huydhn
This PR adds a native implementation of unfold_backward as a Metal shader, mostly a copy-n-paste of the algorithms used in the CUDA and CPU implementations. That is, considering `out = in.unfold(dim, size, step)`, the following holds true:
* `out.shape[dim] == (in.shape[dim] - size) / step + 1`
* `out.shape[-1] == size`
* `out.ndim == in.ndim + 1`

The `unfold_backward` Metal kernel receives `grad_in` and returns `grad_out` such that:
* `grad_in.shape == out.shape`
* `grad_out.shape == in.shape`

For each index in `grad_out`, find the elements contributing to it and sum them up. Such an algorithm requires no synchronization between threads. That is, `grad_out[...,out_dim_idx,...]` accumulates all values `grad_in[...,in_dim_idx,...,in_last_idx]`, where `in_dim_idx` ranges over [`(out_dim_idx - size) / step`, `out_dim_idx / step`] clamped to (0, `in_dim_size`) and `in_last_idx` equals `out_dim_idx - in_dim_idx * step`. The accumulation step is skipped if `in_last_idx` is outside the [0, size] range. A Python reference of this index arithmetic is sketched below.

This operator has been requested 16 times on pytorch#77764.

Pull Request resolved: pytorch#135411
Approved by: https://github.com/manuelcandales
Co-authored-by: Manuel Candales <[email protected]>
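Purely for illustration, here is a small Python reference of the accumulation described above, specialized to `dim=0`; the helper name and loop structure are this note's own sketch, not the Metal kernel:

```python
import torch

def unfold_backward_ref(grad_in, in_dim_size, size, step):
    # grad_in has the shape of `in.unfold(0, size, step)`; the result has the
    # shape of `in` (dim 0 of length in_dim_size, remaining dims unchanged).
    out_shape = (in_dim_size,) + grad_in.shape[1:-1]
    grad_out = torch.zeros(out_shape, dtype=grad_in.dtype)
    n_windows = grad_in.shape[0]
    for g in range(in_dim_size):
        # windows that may contain input position g
        lo = max((g - size) // step, 0)
        hi = min(g // step, n_windows - 1)
        for w in range(lo, hi + 1):
            k = g - w * step          # position of g inside window w
            if 0 <= k < size:         # skip contributions outside [0, size)
                grad_out[g] += grad_in[w, ..., k]
    return grad_out
```

For small inputs this can be sanity-checked against autograd of `x.unfold(0, size, step)`.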
Fixes pytorch#140422 Pull Request resolved: pytorch#140558 Approved by: https://github.com/davidberard98
Fixes pytorch#140420 Pull Request resolved: pytorch#140489 Approved by: https://github.com/davidberard98
Summary: Removes print statements and implements logging via the logging library. Hopefully this will allow more control over the level of logging when running models.

Test Plan:
```
AOT_PARTITIONER_DEBUG=1 buck2 run @mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2
```
Resulting output paste: P1674535630
* Full logs paste: P1674535621
```
pastry P1674535621 | grep "functorch/partitioners.py" | pastry
```
Logging results: P1674549514

Differential Revision: D61678215
Pull Request resolved: pytorch#139782
Approved by: https://github.com/paryxyt, https://github.com/jansel
Update the torch-xpu-ops commit to [01f4e29](intel/torch-xpu-ops@01f4e29), which includes:
- Improve XPU operator coverage
- Fix `Werror=comments`-related build issues

Pull Request resolved: pytorch#140277
Approved by: https://github.com/EikanWang, https://github.com/atalman
…ch#140571)

When investigating the burst of 429 rate-limit failures from docker.io yesterday, I found out that `pytorch-linux-jammy-py3.12-triton-cpu` hadn't been added to the docker build workflow at all. The bad effect is that the image is rebuilt on every job: https://github.com/pytorch/pytorch/actions/runs/11808772774/job/32900628381

Pull Request resolved: pytorch#140571
Approved by: https://github.com/seemethere, https://github.com/wdvr
Remove the contiguous patch because it is no longer needed. Pull Request resolved: pytorch#140428 Approved by: https://github.com/titaiwangms
Fixes pytorch#123649

Use Manylinux 2_28 Docker builds for PyTorch nightly builds. This moves the wheels to a Docker image that uses `quay.io/pypa/manylinux_2_28_x86_64` as a base rather than `centos:7`, which is EOL on June 30, 2024.

Information: https://github.com/pypa/manylinux#manylinux_2_28-almalinux-8-based
manylinux_2_28 (AlmaLinux 8 based)
Toolchain: GCC 13

Built wheels are also expected to be compatible with other distros using glibc 2.28 or later, including:
- Debian 10+
- Ubuntu 18.10+
- Fedora 29+
- CentOS/RHEL 8+

This migration should enable us to migrate to the latest CUDNN version and land this PR: pytorch#137978

Pull Request resolved: pytorch#138732
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn
…al studio build tool is only needed for Windows (pytorch#140442)

I created no issue since the suggested change is actually very small. This is my very first PR, so partly I am creating it just to dip my toes in the water. In fact, I would understand if the change does not get accepted since it's a simple modification to part of the wording in the README. The wording as it currently stands is probably clear enough for most people, but I still missed the fact that the Visual Studio build tool must only be installed for Windows (even though that is stated there), and I thought that by adding some parentheses this might become even more clear, especially since elsewhere in the README the formatting makes it more explicit that some steps must only be run for Windows/Linux/macOS.

As I said, it's a trivial change, so I'd understand if it's not accepted, and I am looking forward to making more meaningful contributions as time goes on.

Fixes #ISSUE_NUMBER
Pull Request resolved: pytorch#140442
Approved by: https://github.com/soulitzer
…if_needed (pytorch#140447) Summary: Rename generate_extern_kernel_alloc_and_find_schema_if_needed to better reflect its meaning. Pull Request resolved: pytorch#140447 Approved by: https://github.com/chenyang78
…0448)

Summary: pytorch#139895 added data_ptr(), but there was a remaining place in cpp_wrapper_gpu.py that didn't switch over. Also moved a few AtenTensorHandle-related utility functions from arrayref_tensor.h to utils.h.

Pull Request resolved: pytorch#140448
Approved by: https://github.com/chenyang78
ghstack dependencies: pytorch#140447
…0659)

Because there is `linux-focal-cuda12_1-py3_10-gcc9-bazel-test`, I'm not sure what the purpose is of testing against 2 CUDA versions, as only very basic things are tested right now.

Pull Request resolved: pytorch#140659
Approved by: https://github.com/atalman, https://github.com/huydhn
…39588) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: pytorch#139588 Approved by: https://github.com/pytorchbot, https://github.com/huydhn Co-authored-by: Huy Do <[email protected]>
…rs (pytorch#138770)

Prior to this PR, in torchlib TracedONNXFunction, the inputs could be Python constants even if the annotation is set to TensorTypes.

Pull Request resolved: pytorch#138770
Approved by: https://github.com/justinchuby
…ve not enabled decompose_mem_bound_mm for XPU. (pytorch#140517) Pull Request resolved: pytorch#140517 Approved by: https://github.com/EikanWang, https://github.com/jansel
…ses (pytorch#140540) Fix pytorch#140537 Pull Request resolved: pytorch#140540 Approved by: https://github.com/EikanWang, https://github.com/jansel ghstack dependencies: pytorch#140517
Summary: https://fb.workplace.com/groups/1286739428954016/posts/1370274947267130

Test Plan:
```
hg up b5b5adce34
vizard_projects/ml_depth/scripts/run_mld.sh
```
This used to crash; it no longer crashes.

Differential Revision: D65913100
Pull Request resolved: pytorch#140626
Approved by: https://github.com/ezyang
Related to pytorch#107302

We saw `test_float_to_int_conversion_nonfinite` fail as we upgraded to NumPy 2. It is caused by the undefined behavior of `numpy` casting `inf`, `-inf` and `nan` from `np.float32` to other dtypes. The test is using NumPy as the reference for the ground truth (see lines 1013-1015). However, these behaviors are undefined in NumPy. If you do `np.array([float("inf")]).astype(np.uint8, casting="safe")`, it results in an error: `TypeError: Cannot cast array data from dtype('float64') to dtype('uint8') according to the rule 'safe'`. The undefined behaviors are always subject to change.

This PR addresses the issue by passing concrete values as the ground truth references. In the future, even if NumPy changes its behavior, the test would still remain stable.

Pull Request resolved: pytorch#138131
Approved by: https://github.com/drisspg
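The refused cast mentioned above can be reproduced directly (the exact NumPy error text may vary between versions):

```python
import numpy as np

# float64 -> uint8 is not a "safe" cast, so NumPy raises rather than defining
# what inf / -inf / nan should become:
try:
    np.array([float("inf")]).astype(np.uint8, casting="safe")
except TypeError as e:
    print(e)
```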
sccache-0.2.14 can query it through IMDSv1 and sccache-0.8.2 can do it through v2 (or maybe just use trust relationships between host and bucket).

Pull Request resolved: pytorch#140611
Approved by: https://github.com/wdvr
**About this PR**
This PR adds the following ops for `linear_dynamic_fp16` in the onednn namespace. These ops are intended for PT2E quantization eager mode.
- `onednn::linear_prepack_fp16`: packs an fp32 weight into an fp16 MkldnnCPU tensor.
- `onednn::linear_dynamic_fp16`: takes an fp32 CPU tensor and an fp16 MkldnnCPU tensor and computes linear in fp32.
- `onednn::linear_relu_dynamic_fp16`: similar to the former, and applies relu to the output.

**Test plan**
`python test/test_quantization.py -k test_linear_dynamic_fp16_onednn`

**Implementation**
These ops call the oneDNN library under the hood. It's worth noting that oneDNN does not support f32 * f16 -> f32 computation, so we have to convert the fp16 weight to fp32 before computation. The weight is still in plain format after packing.

**Correctness and performance**
Correctness is guaranteed by the UT. Performance of the new ops may be better than the FBGEMM implementation when the weight shape is small but worse when the weight shape is large, because the weight dtype conversion and the computation are not fused. For example, I ran benchmarks on an Intel(R) Xeon(R) Platinum 8490H machine with different core counts and shapes. When using 1 core per instance, the new implementation is generally faster for weight shapes < 1024 * 1024. When using more cores, the threshold increases.

Pull Request resolved: pytorch#140376
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
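The numerical contract described above (pack the fp32 weight as fp16, then convert back to fp32 before the matmul, optionally followed by relu) is equivalent to this small reference; it is a sketch of the semantics only, and the function name and signature are not the ops' actual schemas:

```python
import torch

def linear_dynamic_fp16_ref(x_fp32, w_fp32, bias=None, relu=False):
    # "prepack": the weight is stored in fp16 (losing precision once, up front)
    w_fp16 = w_fp32.to(torch.float16)
    # "compute": oneDNN has no f32 * f16 -> f32 kernel, so the weight is widened
    # back to fp32 and a plain fp32 linear is run
    y = torch.nn.functional.linear(x_fp32, w_fp16.to(torch.float32), bias)
    return torch.relu(y) if relu else y
```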
Chao1Han pushed a commit that referenced this pull request on Nov 14, 2024:
…ytorch#139659)

### Motivation
Today, watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with the user's program or library stack.

### This PR
This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior.

The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito). In `ProcessGroupNCCL`, we are only tracking / reporting the Python part so that it fits most PyTorch users.

### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).
```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
`TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder.

Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```

From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks diverge.

Pull Request resolved: pytorch#139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Chao1Han pushed a commit that referenced this pull request on Dec 16, 2024:
See pytorch#140725 (comment)

Running `torch.mps.synchronize()` after a metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`:
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: pytorch#141296
Approved by: https://github.com/huydhn