Xccl group #2

Closed
wants to merge 1,085 commits into from
Changes from 1 commit (1,085 commits)
062681a
[Profiler] Torch Profiler distributed info is not JSON serializable (…
sraikund16 Sep 13, 2024
1c04cbf
[BE] Use `C10_UNUSED` (#135914)
malfet Sep 13, 2024
e6b6835
Fix xpu memory stats error (#135818)
guangyey Sep 12, 2024
0ad5677
support allgather_into_tensor_coalesced
Chao1Han Sep 13, 2024
6cdc70b
[ROCm] skip test_fp8_cast_and_t on non-MI300 machines (#135917)
pragupta Sep 13, 2024
0cdc6a8
[DSD] Fix distributed state dict full_state_dict option hang during s…
wz337 Sep 12, 2024
6df91b5
real tensor prop for composite ops (#135717)
pianpwk Sep 13, 2024
eea5e6f
[DCP][DSD] Add a test case to demonstrate the workaround to load full…
wz337 Sep 12, 2024
e54b559
[inductor] More fixes on the keys of `constants` and `signature` dict…
Jokeren Sep 13, 2024
b38be72
[Inductor UT] Generalize inductor UT for intel GPU (Part 2) (#134556)
hoshibara Sep 13, 2024
9fd54d7
[Inductor UT] Generalize device-bias code in test_triton_kernels.py i…
etaf Sep 12, 2024
009e334
support reduce_scatter
Chao1Han Sep 13, 2024
7dc1788
[inductor] Remove the batch fusion passes from being a default (#135922)
anijain2305 Sep 13, 2024
ecbd989
refine test cases
Chao1Han Sep 13, 2024
a23ffb2
update ut
Chao1Han Sep 13, 2024
1d02dfe
add mpi check
Chao1Han Sep 13, 2024
c485bd8
update datatype map
Chao1Han Sep 13, 2024
2d1ae87
update
Chao1Han Sep 13, 2024
04226de
Merge branch 'xccl' into xccl-group
Chao1Han Sep 13, 2024
6184261
update
Chao1Han Sep 13, 2024
b346e99
remove fast_flush arguments (#135387)
int3 Sep 13, 2024
e504fb7
[Dynamo] Use custom backend to reenter metadata tf mode when tracing …
mlazos Sep 11, 2024
fafdd58
[Dynamo] Trace torch function modes entered outside of torch.compile …
mlazos Sep 12, 2024
30b007b
[Dynamo] Support thread local setattr (#135443)
mlazos Sep 12, 2024
0c080cb
[Dynamo] Simplify torch function mode stack guard (#135444)
mlazos Sep 12, 2024
2af3b8f
[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
mlazos Sep 12, 2024
7d5e0dd
[Dynamo] Remove ignored modes workaround (#135502)
mlazos Sep 12, 2024
c56728b
[Dynamo] Remove ignored modes from torch function mode stack guard (#…
mlazos Sep 12, 2024
31007cf
[Distributed] add FP8 support to NaN checker (#135891)
kwen2501 Sep 13, 2024
91d26d9
update
Chao1Han Sep 13, 2024
2f53d57
Update document for autocast on CPU (#135299)
CaoE Sep 13, 2024
ea2ecab
[AOTI][reland] Fix assert_function call in cpu autotune template (#13…
desertfire Sep 13, 2024
b5c52e9
Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
pytorchmergebot Sep 13, 2024
1cdf658
Revert "[PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to reso…
pytorchmergebot Sep 13, 2024
dc71e7a
Revert "[Dynamo] Remove ignored modes from torch function mode stack …
pytorchmergebot Sep 13, 2024
fca58bf
Revert "[Dynamo] Remove ignored modes workaround (#135502)"
pytorchmergebot Sep 13, 2024
ac16979
Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)"
pytorchmergebot Sep 13, 2024
4734e35
Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
pytorchmergebot Sep 13, 2024
3f30360
Revert "[Dynamo] Support thread local setattr (#135443)"
pytorchmergebot Sep 13, 2024
eb7dd91
Revert "[Dynamo] Trace torch function modes entered outside of torch.…
pytorchmergebot Sep 13, 2024
7ed0563
Revert "[Dynamo] Use custom backend to reenter metadata tf mode when …
pytorchmergebot Sep 13, 2024
ba6e0f3
Remove cycle dependency by localizing the import. (#135926)
laithsakka Sep 13, 2024
2519e5a
[CUDA][FP8] Skip rowwise scaling test on sm89 (#135718)
eqy Sep 13, 2024
21ffa18
Fix "expand: SymIntArrayRef expected to contain only concrete integer…
ezyang Sep 13, 2024
ad2f0e9
Add remote cache time saved to compilation metrics (#135490)
jamesjwu Sep 13, 2024
ae02d66
[FlexAttention] Fix output layout (#135882)
drisspg Sep 13, 2024
564d00f
Revert "Fix clang-tidy warnings in Caffe2 code (#134935)"
pytorchmergebot Sep 13, 2024
a157745
[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#1…
jataylo Sep 13, 2024
6ef49fe
Revert "Pass ideep:lowp_kind to matmul_forward::compute on cache miss…
pytorchmergebot Sep 13, 2024
7834c0b
[AOTI][Tooling] Add stats summary (mean/min/max, etc) for jit inducto…
YUNQIUGUO Sep 13, 2024
bc0f330
[trymerge] Manually close merged PR when Github fails (#135890)
clee2000 Sep 13, 2024
18f9331
Revert "[aoti] Fix workspace generation for triton (#135552)"
pytorchmergebot Sep 13, 2024
3f69410
[gpu-profiler] Expose active and repeat in os env var (#135757)
dshi7 Sep 13, 2024
deee21c
Revert "[Inductor] Rename `cpp_wrapper_cuda.py` as `cpp_wrapper_gpu.p…
pytorchmergebot Sep 13, 2024
b6d6aa4
Revert "Validate input types for `torch.nn.Linear` and `torch.nn.Bili…
pytorchmergebot Sep 13, 2024
835e7bb
fix requirements.txt installation failure issue on Windows (#134567)
jingxu10 Sep 13, 2024
b856f35
Fix script name in the comments (#135507)
kit1980 Sep 13, 2024
4312794
[reland][export] fix re-export custom metadata (#135720)
yiming0416 Sep 13, 2024
a3d827a
Use python 3.11 for Large Wheel build (#136042)
atalman Sep 13, 2024
2e461e5
Add gpu and gpu_dynamic versions of add_loop (#135809)
laithsakka Sep 13, 2024
4f407c1
Only measure compile time instruction count for sum_floordiv benchmar…
laithsakka Sep 13, 2024
46935c8
Reduce default iterations to 5 . (#135773)
laithsakka Sep 13, 2024
a30d5ba
Fix bug in split-build workflows codegen (#136043)
malfet Sep 13, 2024
db5e1b4
Fix inductor-micro-benchmark results upload (take 2) (#136052)
huydhn Sep 13, 2024
baff86d
[MTIA tensor] allow shallow copy between CPU and MTIA tensors (#135871)
jvandebon Sep 13, 2024
e2d3af4
[ONNX] Remove logging apis from public (#133825)
justinchuby Sep 13, 2024
3c5d44d
Cleanup unused runner variants (#136058)
ZainRizvi Sep 13, 2024
aad556a
[PT2][Inductor][Optimus] Fix a corner case in remove_split_with_size_…
mengluy0125 Sep 13, 2024
b8eef50
Fix attr check for quantization spec (#135736)
jerryzh168 Sep 12, 2024
b608ff3
[Easy] Dont match to mm_plus_mm if not in max autotune (#135929)
eellison Sep 13, 2024
a00faf4
[3.13] fix 3.13 pickle error in serialization.py (#136034)
williamwen42 Sep 13, 2024
4237592
[Distributed] add pack-check method for float8_e4m3fn (#135961)
kwen2501 Sep 13, 2024
081c4a9
[BE] Use squeeze/unsqueeze in im2col (#136006)
malfet Sep 14, 2024
06bc717
Fix sum() forward for NJT (#131945)
jbschlosser Sep 13, 2024
2a83d68
update
Chao1Han Sep 14, 2024
7f62b86
update
Chao1Han Sep 14, 2024
5de4cb8
[Inductor UT] Generalize inductor UT for intel GPU (Part 3) (#135827)
hoshibara Sep 14, 2024
95496e4
[CI] Check that PyTorch is built with OpenMP (#136060)
malfet Sep 14, 2024
2e8d431
Fix tensor.data_ptr() representation overflow (#135567)
guangyey Sep 10, 2024
c48f5eb
Support reduce_scatter_base
Chao1Han Sep 14, 2024
51c5206
Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of Shard…
CaoE Sep 14, 2024
1786a17
Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads …
pytorchmergebot Sep 14, 2024
731b178
[Dynamo] Use custom backend to reenter metadata tf mode when tracing …
mlazos Sep 13, 2024
4528777
[Dynamo] Trace torch function modes entered outside of torch.compile …
mlazos Sep 13, 2024
149d0b7
[Dynamo] Support thread local setattr (#135443)
mlazos Sep 13, 2024
ce3c74f
[Dynamo] Simplify torch function mode stack guard (#135444)
mlazos Sep 13, 2024
7743149
[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
mlazos Sep 13, 2024
5c67cf1
[Dynamo] Remove ignored modes workaround (#135502)
mlazos Sep 13, 2024
e77bd0e
[Dynamo] Remove ignored modes from torch function mode stack guard (#…
mlazos Sep 13, 2024
9b17dc4
Support reduce_scatter_tensor_coalesced
Chao1Han Sep 14, 2024
6cb3227
support barrier
Chao1Han Sep 14, 2024
911a43f
[TCPStore] Remove deprecated constructor (#136004)
fduwjj Sep 13, 2024
b9b6094
[ROCm] Skip pointwise associative scan tests due to regression (#135995)
jataylo Sep 14, 2024
e59f051
Merge branch 'xccl' into xccl-group
Chao1Han Sep 14, 2024
1a67e2b
[MPS] Add native im2col (#135706)
malfet Sep 14, 2024
d858c81
update
Chao1Han Sep 14, 2024
fea20f5
update
Chao1Han Sep 14, 2024
44dd218
Disable garbage collection during compile_time_instructions count in …
laithsakka Sep 13, 2024
a9bef85
[CI] Increase open file handles limit to 16K on MacOS (#136061)
malfet Sep 14, 2024
5a2be19
[Traceable FSDP2] Don't register RegisterPostBackwardFunction if user…
yf225 Sep 13, 2024
3352c9a
Add higher order operator name to the cache bypass exception (#135876)
oulgen Sep 13, 2024
e0e27f3
update
Chao1Han Sep 14, 2024
a815611
[Traceable FSDP2][Partitioner] Must save AC output if output has a ba…
yf225 Sep 14, 2024
f96a073
Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of Shard…
CaoE Sep 14, 2024
41b58a1
OpenReg: Fix issue when copying on the same device (#135956)
Zhenbin-8 Sep 14, 2024
72b868d
Revert "[Dynamo] Remove ignored modes from torch function mode stack …
pytorchmergebot Sep 14, 2024
838c912
Revert "[Dynamo] Remove ignored modes workaround (#135502)"
pytorchmergebot Sep 14, 2024
f3180f0
Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)"
pytorchmergebot Sep 14, 2024
7975ec3
Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
pytorchmergebot Sep 14, 2024
46f5037
Revert "[Dynamo] Support thread local setattr (#135443)"
pytorchmergebot Sep 14, 2024
8c8a308
Revert "[Dynamo] Trace torch function modes entered outside of torch.…
pytorchmergebot Sep 14, 2024
23dec79
Revert "[Dynamo] Use custom backend to reenter metadata tf mode when …
pytorchmergebot Sep 14, 2024
db393fb
Add Half support for reflection and replication padding on CPU (#135931)
CaoE Sep 14, 2024
f97cccf
[3.13] fix 3.13 pickle error in torch/package (#136049)
williamwen42 Sep 13, 2024
b863750
[Pytorch] Consolidate Strobelight compile time profiler between OSS a…
kollasb Sep 14, 2024
b82122b
Only keep ListOfLinears module in basic_modules_benchmarks and add gp…
laithsakka Sep 13, 2024
b4c84c3
[AOTI] Fix a fallback op returning None issue (#135997)
desertfire Sep 13, 2024
228760b
[Dynamo] Use custom backend to reenter metadata tf mode when tracing …
mlazos Sep 14, 2024
5c5c33a
[Dynamo] Trace torch function modes entered outside of torch.compile …
mlazos Sep 14, 2024
14cabdf
[Dynamo] Support thread local setattr (#135443)
mlazos Sep 14, 2024
06caa2d
[Dynamo] Simplify torch function mode stack guard (#135444)
mlazos Sep 14, 2024
1b9daeb
[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
mlazos Sep 14, 2024
860838e
[Dynamo] Remove ignored modes workaround (#135502)
mlazos Sep 14, 2024
8df01c8
[Dynamo] Remove ignored modes from torch function mode stack guard (#…
mlazos Sep 14, 2024
7f5abb4
[BE][Ez]: Update pybind11 to 2.13.6. Exposes new conduit cross-compat…
Skylion007 Sep 14, 2024
c64ae60
[dynamo] Fix support for classmethod(property(...)) (#134968)
jansel Sep 14, 2024
55299cf
[BE]: Update mypy to 1.11.2 (#133816)
Skylion007 Sep 14, 2024
e498b02
Add Triton CPU as an Inductor backend (#133408)
int3 Sep 13, 2024
426580a
Add CI for Triton CPU backend (#135342)
int3 Sep 13, 2024
5b21d91
Fix dividing Mul by factor (#136079)
isuruf Sep 14, 2024
391f2d6
use a fast expand algorithm (#135999)
isuruf Sep 13, 2024
a5eb43d
Add TensorReferenceAnalysis and some tests (#135886)
bobrenjc93 Sep 14, 2024
a1a57a4
Optimize dict reconstruct to not codegen untouched values (#134876)
guilhermeleobas Sep 12, 2024
8072ebc
SKIP llama for dynamic size testing (#135960)
leslie-fang-intel Sep 13, 2024
386884e
[Traceable FSDP2] Ignore FSDP2 forward hook side-effects in AC; Suppo…
yf225 Sep 14, 2024
e1abd34
[audio hash update] update the pinned audio hash (#136106)
pytorchupdatebot Sep 15, 2024
31e42a4
Fix redundant move warnings by g++ (#134987)
cyyever Sep 15, 2024
357b7fb
Revert "[Pytorch] Consolidate Strobelight compile time profiler betwe…
pytorchmergebot Sep 15, 2024
382fad5
Deprecate _preserve_ops and consolidate with decomp_table (#135080)
tugsbayasgalan Sep 14, 2024
1904b09
Create export_for_inference API and expose core_aten as public facing…
tugsbayasgalan Sep 15, 2024
dec3403
Add some doc for export_for_training (#135918)
tugsbayasgalan Sep 15, 2024
a141c6b
[pytorch][monitoring] Dynamic backend for WaitCounter (#135967)
andriigrynenko Sep 15, 2024
ab9a7ea
Add decomposition for permute_copy (#130944)
rec Sep 10, 2024
e501ed7
Update link in distributed.tensor.parallel.rst (#136103)
H-Huang Sep 15, 2024
d2207c5
[Distributed] add pack-check method for float8_e5m2 (#136115)
kwen2501 Sep 15, 2024
9961aaa
[dynamo] simplify implementation for `functools.reduce` (#133778)
XuehaiPan Sep 14, 2024
951c21d
[dynamo] simplify implementation for `builtins.sum` (#133779)
XuehaiPan Sep 14, 2024
3117f2c
Revert "[BE]: Update mypy to 1.11.2 (#133816)"
pytorchmergebot Sep 16, 2024
bbc3fdb
Add python 3.13.0t build to Docker images (#136001)
atalman Sep 16, 2024
a803cb0
[AOTI] Refactor how cpp_wrapper specific options are set (#136035)
desertfire Sep 14, 2024
d833f49
[reland][Inductor] Rename `cpp_wrapper_cuda.py` as `cpp_wrapper_gpu.p…
desertfire Sep 16, 2024
13bd125
Delete stable prototype (#135911)
bigfootjon Sep 16, 2024
c33b058
Add decomposition for squeeze_copy (#130941)
rec Sep 16, 2024
090046b
[effects] Turn off dtype promotion for with_effects lowering (#136039)
IvanKobzarev Sep 13, 2024
0aa41eb
[ONNX] Run type promotion test in CI and update the table (#135915)
justinchuby Sep 16, 2024
b491e29
[BE][Ez]: Add full half/bfloat16 dtype for `unique` and `isin` (#136114)
Skylion007 Sep 16, 2024
0199fd4
Revert "[inductor] More fixes on the keys of `constants` and `signatu…
pytorchmergebot Sep 16, 2024
5193f23
[Pytorch] Cleanup Strobelight URL and shorten for readability (#136102)
kollasb Sep 16, 2024
23c0d26
[BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091)
Skylion007 Sep 16, 2024
7fe004f
Revert "Add CI for Triton CPU backend (#135342)"
pytorchmergebot Sep 16, 2024
d0cebed
Revert "Add Triton CPU as an Inductor backend (#133408)"
pytorchmergebot Sep 16, 2024
d3647d1
Remove accidentally committed code (#136154)
malfet Sep 16, 2024
f89ce4d
`torch.nn.MultiheadAttention`: docs: improvement (#136111)
kuraga Sep 16, 2024
717fca2
Drop outdated section 'Running clang-tidy' in CONTRIBUTING.md (#136146)
eugenekoran Sep 16, 2024
c977bb7
[Distributed] fix FileSystemWriter __init__ (#136135)
kwen2501 Sep 16, 2024
38caf10
[EZ] Fix spelling typo (#136157)
malfet Sep 16, 2024
31715be
[BE]: Update mypy to 1.11.2 (#133816)
Skylion007 Sep 16, 2024
7537f74
Refactor FxGraphCache.load into separate functions, so that AOTAutogr…
jamesjwu Sep 16, 2024
a0c7029
[c10d][Reland] Remove Option for ProcessGroup and Expose backend Opti…
fduwjj Sep 11, 2024
abd16a8
[torch/multiprocessing] Use multiprocessing.reduction.register Forki…
kiukchung Sep 16, 2024
3c97b0a
Use ncclAlltoAllv and ncclAlltoAll API when supported (#134499)
dsjohns2 Sep 16, 2024
bfbcdf4
Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
pytorchmergebot Sep 16, 2024
b76d1b7
Add scaling arguments to bsr_dense_addmm (#136104)
pearu Sep 16, 2024
c12536b
[ONNX] Treat CompositeImplicitAutograd ops as normal ops in decomp (#…
justinchuby Sep 16, 2024
071da87
use csv extention for test report in order for it to be uploaded to s…
laithsakka Sep 16, 2024
37a08b3
Revert "fix compiled_autograd deadlock throw (#135795)"
pytorchmergebot Sep 16, 2024
3f74310
Back out "Flip triton kernel default layout constraint to "needs_fixe…
tissue3 Sep 17, 2024
d463a81
inductor: dont use default_dtype during rng functionalization (#136041)
bdhirsh Sep 14, 2024
dc82d27
make view.dtype always return an alias (#136074)
bdhirsh Sep 14, 2024
408fe41
[DSD][EZ] Minor update in _state_dict_utils.py (#136165)
wz337 Sep 16, 2024
e248c1d
Update real device in FSDP state_dict_utils (#134994)
ankurneog Sep 17, 2024
3b5e268
Revert "Optimize dict reconstruct to not codegen untouched values (#1…
pytorchmergebot Sep 17, 2024
2c4ae81
Revert "Add decomposition for squeeze_copy (#130941)"
pytorchmergebot Sep 17, 2024
462b727
Revert "Add decomposition for permute_copy (#130944)"
pytorchmergebot Sep 17, 2024
913f97e
Don't run reshape pattern match on dynamic shape size tensor (#136100)
ezyang Sep 17, 2024
ece8267
Add back optim type hints that were lost when *.pyi files were remove…
mauvilsa Sep 17, 2024
67b14ce
[ONNX] Fix numpy method to return the correct type (#136162)
justinchuby Sep 17, 2024
63dc5df
[Fix]: Update CPUINFO submodule to fix support for NON-SVE ARM Hardwa…
ng-05 Sep 17, 2024
8e5bb35
[PT2] Port merge_concats_pass to PT2 pre_grad passes (#135527)
huxintong Sep 17, 2024
cc365fd
[MTIA] Support torch.cuda.get_device_capability equivalent API on MTI…
ttrung149 Sep 17, 2024
785e987
Delete links to non-existing `run_plan_mpi.cc` (#136204)
malfet Sep 17, 2024
a838284
Support rms_norm() for NJT (#135872)
jbschlosser Sep 17, 2024
ea10c07
[export] Deserialize args with python keyword names (#136036)
angelayi Sep 17, 2024
a4e9a1c
[TorchRec][PT2 IR][APF] short circuit the flatten/unflatten between E…
TroyGarden Sep 17, 2024
e3aa5e2
[NCCL] Don't override `waitUntilInitialized`'s setting of `comm->init…
eqy Sep 17, 2024
48d18fb
[PyTorch CUDA Allocator] Allow reuse of non-split blocks with better …
banitag1 Sep 17, 2024
a575ce0
[PyTorch Pinned Allocator] Add support of background thread to proces…
banitag1 Sep 17, 2024
f6f1504
[MPS] Fix 5D+ reductions over negative dimentions (#136198)
malfet Sep 17, 2024
cccf500
[c10d] remove sleep from watchdogHandler (#135760)
c-p-i-o Sep 18, 2024
b18ba94
[AO][Inductor] Enable WOQ fusion pattern with permute (#135928)
leslie-fang-intel Sep 13, 2024
6682327
[BE] Make `NestedTensorTransformerFunctions.cu` compilable without wa…
malfet Sep 18, 2024
029026d
add ut
Chao1Han Sep 18, 2024
8895f69
[torch/numpy][numpy2.0 compat] Additional changes for tests to run u…
kiukchung Sep 18, 2024
9aa22ea
[CI] Make linux-aarch64 shards actually running different tests (#136…
malfet Sep 18, 2024
a0207c8
[dynamo] Fix support for classmethod(property(...)) (#134968)
jansel Sep 17, 2024
083c914
Reland D62220158 (#136213)
mengluy0125 Sep 18, 2024
b5be4d8
Fix ROCm skip decorator for test_ddp_tp and multiprocess UTs (#136161)
pragupta Sep 18, 2024
701ba52
[Inductor] Increase multiplier to 3 for Inductor AMP FP16 benchmark c…
jiayisunx Sep 13, 2024
c8d152c
Fix fast_expand recursion error (#136163)
isuruf Sep 16, 2024
6a6f5b2
Add _addmm_activation to lower precision cast policy on AutocastCPU (…
CaoE Sep 18, 2024
605f2d8
[PyTorch] Remove unnecessary include of c10/util/Exception.h in irang…
swolchok Sep 17, 2024
3efaa01
[c10d] Make test compatible for new pytest (#136158)
fduwjj Sep 17, 2024
bad6904
[ROCm] upgrade ROCm CI builds to py3.10 (#134108)
jataylo Sep 18, 2024
5a6ddbc
Extending the Pytorch vec backend for SVE (ARM) (#119571)
maajidkhann Sep 18, 2024
68a7246
[cuDNN][conv][A100] Bump tolerances for `vmap_autograd_grad` `conv2d`…
eqy Sep 18, 2024
aae68e2
Add wait counter for nccl abort (#136067)
atuljangra Sep 18, 2024
1a86d8a
Fix calling Add._from_args and Mul._from_args (#136143)
isuruf Sep 16, 2024
bc9597b
[Traceable FSDP2] Minor refactor to traceable FSDP2 unit tests (#136219)
yf225 Sep 18, 2024
f1ad680
[dynamo]Remove stream hardcoding in dynamo VariableBuilder (#131763)
siju-samuel Sep 18, 2024
b9a197d
[BE][MPS] Delete duplicated code in `View.mm` (#136295)
malfet Sep 18, 2024
068c80e
[BE][MPS] Fix deprecation warnings on MacOS 15.0 (#136292)
malfet Sep 18, 2024
f2b0fc8
Add uint16 support for observer (#136238)
jerryzh168 Sep 18, 2024
e037bb3
[dynamo] fix crash in InspectSignatureVariable (#136010)
williamwen42 Sep 17, 2024
7755176
Add type checks for Tensor.add_ (#135864)
DuyguA Sep 19, 2024
001dac2
use lintrunner format code
Chao1Han Sep 19, 2024
db80b98
XFAIL test_segfault (#136252)
huydhn Sep 19, 2024
f13b449
rm allgatherv align with nccl
Chao1Han Sep 19, 2024
156c2ac
update
Chao1Han Sep 19, 2024
908a568
Return unsafe_view instead of view from matmul when folding occurs (#…
jwieczorekhabana Sep 19, 2024
bce52d0
[CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annot…
igorsugak Sep 19, 2024
4ea741d
Revert "Reland D62220158 (#136213)"
pytorchmergebot Sep 19, 2024
65df26f
[FSDP2] Fixed 2D mismatched grad placements (#136237)
awgu Sep 18, 2024
803ce50
Log structured logging overhead to dynamo compile (kinda) (#136142)
jamesjwu Sep 19, 2024
8d9c427
Type _sympy/functions.py [1/n] (#136205)
bobrenjc93 Sep 19, 2024
ccca3de
[ROCm] Enable Flex attention tests on AMD gpus (#136245)
jerrymannil Sep 19, 2024
49723a8
fix stride compare failed when size value equal to one in ForeachUtil…
Shan19900305 Sep 19, 2024
8cba0ec
[AOTI][Tooling][8/n] Add option to pinpoint kernel names in debug pri…
YUNQIUGUO Sep 19, 2024
b71802f
add basic_modules_ListOfLinears_inductor_gpu_force_shape_pad (#136175)
laithsakka Sep 17, 2024
7bbdf87
[22/N] Fix clang-tidy warnings in jit (#134829)
cyyever Sep 19, 2024
172ecf7
DTensor: dont hash symint tensor input in propagate_tensor_meta (#136…
bdhirsh Sep 18, 2024
9b424aa
[CI][CUSPARSELT] Extend cusparselt installation script to support cud…
nWEIdia Sep 19, 2024
79fd17e
Merge branch 'xccl' into xccl-group
Chao1Han Sep 20, 2024
bebf530
TCPStoreLibUvBackend: trace operations (#136320)
d4l3k Sep 20, 2024
1dfa07e
passing FileTimerRequests.to_json() to log_debug_info_for_expired_tim…
felixsu2006 Sep 20, 2024
d45b015
Add deterministic path for CUDA `cumsum` (#136224)
kurtamohler Sep 20, 2024
fe0e9fb
Fix flaky SIGSEGV crash in test_profile_memory (#136304)
huydhn Sep 20, 2024
652da01
Xccl process group for Pytorch
Chao1Han Aug 29, 2024
0cb0016
Merge remote-tracking branch 'upstream/main' into xccl-bak
Chao1Han Sep 20, 2024
a71d69a
Align latest
Chao1Han Sep 20, 2024
a1c2d6b
Merge branch 'xccl-bak' into xccl-group
Chao1Han Sep 20, 2024
[Traceable FSDP2] Ignore FSDP2 forward hook side-effects in AC; Support FSDP2 + AC (pytorch#134997)

> Ignore FSDP2 forward hook side-effects in AC

Under AC, FSDP2 does not rely on the forward hook to all-gather weights for recomputation; instead, it relies on the pre-backward hook to do this job:
https://github.com/pytorch/pytorch/blob/451eaf0ff247090ca5a9648fd1e17c3c011737e1/torch/distributed/_composable/fsdp/_fsdp_state.py#L219-L220

So when we use `speculate_subgraph` to trace the utils.checkpoint AC region, we don't actually need to worry about the FSDP2 forward hook's side effects and can safely ignore them, because we do not (and do not expect to) re-run the FSDP2 forward hook during backward recomputation.
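
For orientation, a minimal sketch (not taken from this PR) of the scenario being enabled: an FSDP2-sharded model whose blocks wrap their forward in `torch.utils.checkpoint` and are compiled with `torch.compile`. It assumes a multi-GPU run already launched via `torchrun` with an NCCL process group; the `Block` module and dimensions are illustrative.

```python
# Sketch only: FSDP2 + activation checkpointing + torch.compile.
# Assumes torchrun has initialized a NCCL process group and CUDA is available.
import torch
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # AC region: recomputed in backward. FSDP2 re-gathers the sharded
        # weights via its pre-backward hook, not by re-running the forward hook.
        return checkpoint(self.ff, x, use_reentrant=False)


def build_compiled_model(dim: int = 128, n_blocks: int = 3):
    model = nn.Sequential(*[Block(dim) for _ in range(n_blocks)]).cuda()
    for block in model:
        fully_shard(block)  # shard each block (FSDP2 composable API)
    fully_shard(model)      # root wrapping
    return torch.compile(model, backend="inductor", fullgraph=True)
```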

----

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`

Pull Request resolved: pytorch#134997
Approved by: https://github.com/zou3519
ghstack dependencies: pytorch#135727
yf225 authored and pytorchmergebot committed Sep 15, 2024
commit 386884e5534bc812e4f90dcc94d420e148f20f2b
106 changes: 79 additions & 27 deletions test/distributed/_composable/fsdp/test_fully_shard_compile.py
@@ -154,25 +154,64 @@ def f(x):
torch.compile(f, backend="aot_eager")(x)
self.assertEqual(x, ref_x)

def _assert_no_aliased_graph_inputs(self, graph: torch.fx.Graph) -> None:
def _assert_no_aliased_unsharded_params_in_graph_inputs(
self, model, graph: torch.fx.Graph
) -> None:
# FSDP2 unsharded params are mutated in the graph without going through functionalization.
# Therefore, we want to make sure they don't have aliases in the graph inputs, to make it easier
# for us to do the replacement of unsharded params with the all-gathered temporary buffer directly
# in downstream users in the graph.
storage_id_to_graph_inputs = defaultdict(list)
unsharded_param_graph_inputs = set()
for node in graph.nodes:
if node.op == "placeholder" and isinstance(
node.meta.get("val", None), torch.Tensor
if (
node.op == "call_function"
and node.target
in [
torch.ops.inductor.resize_storage_bytes_.default,
torch.ops.fsdp.copy_.default,
]
and node.args[0].op == "placeholder"
):
storage_id_to_graph_inputs[
id(node.meta["val"].untyped_storage())
].append(node)
no_aliased_graph_inputs = True
unsharded_param_graph_inputs.add(node.args[0])
assert len(unsharded_param_graph_inputs) > 0
assert len(unsharded_param_graph_inputs) == len(
list(model.parameters())
), """\
Expected all model parameters to be wrapped by FSDP2 and
have their unsharded version as graph input, but it's not true!
"""
no_aliased_unsharded_params_in_graph_inputs = True
err_msg = ""
for aliased_graph_inputs in storage_id_to_graph_inputs.values():
if len(aliased_graph_inputs) > 1:
no_aliased_graph_inputs = False
if len(aliased_graph_inputs) > 1 and any(
x in unsharded_param_graph_inputs for x in aliased_graph_inputs
):
no_aliased_unsharded_params_in_graph_inputs = False
err_msg += f"""\n
Found aliased graph inputs: {aliased_graph_inputs},
Found aliased unsharded param in graph inputs: {aliased_graph_inputs},
val.shape: {[node.meta['val'].shape for node in aliased_graph_inputs]},
"""
self.assertTrue(no_aliased_graph_inputs, err_msg)
self.assertTrue(no_aliased_unsharded_params_in_graph_inputs, err_msg)

def _remove_fsdp2_unsharded_param_graph_input_usage_with_optional_checks(
self, model, fullgraph
):
def _run_with_checks(graph, orig_fn):
self._assert_no_aliased_unsharded_params_in_graph_inputs(model, graph)
orig_fn(graph)

if fullgraph:
return mock.patch.object(
comms,
"remove_fsdp2_unsharded_param_graph_input_usage",
functools.partial(
_run_with_checks,
orig_fn=comms.remove_fsdp2_unsharded_param_graph_input_usage,
),
)
else:
return contextlib.nullcontext()

def _check_fsdp_copy_and_resize_ops_count_in_graph(
self,
@@ -359,7 +398,11 @@ def inductor_code_check_fsdp_reduce_scatter(
return file_check

def _test_traceable_fsdp(
self, model_init_fn, input_creation_fn, backend, fullgraph
self,
model_init_fn,
input_creation_fn,
backend,
fullgraph,
):
def compiler_fn(compiled_autograd_backend):
def _fn(gm):
@@ -401,13 +444,18 @@ def test_compiled():
# FSDP2 does lazy init using 1st run, so run it once to init using eager mode
run_iters(model, optim, n_iter=1)

model_compiled = torch.compile(model, backend=backend, fullgraph=fullgraph)
res = run_iters(
model_compiled,
optim,
compiled_autograd_backend=backend,
)
return res
with self._remove_fsdp2_unsharded_param_graph_input_usage_with_optional_checks(
model, fullgraph
):
model_compiled = torch.compile(
model, backend=backend, fullgraph=fullgraph
)
res = run_iters(
model_compiled,
optim,
compiled_autograd_backend=backend,
)
return res

def test_eager():
model, optim = model_init_fn()
@@ -421,17 +469,15 @@ def test_eager():
inline_inbuilt_nn_modules=True,
skip_fsdp_hooks=False,
), torch._functorch.config.patch(
recompute_views=True, cse=False
recompute_views=True,
cse=False,
), torch._inductor.config.patch(
reorder_for_compute_comm_overlap=True,
reorder_for_compute_comm_overlap_passes=[
"sink_waits",
"raise_comms",
"reorder_compute_for_overlap",
],
post_grad_custom_pre_pass=self._assert_no_aliased_graph_inputs
if fullgraph
else None,
):
losses_compiled = test_compiled()
losses_eager = test_eager()
@@ -677,7 +723,9 @@ def test_nested_fully_shard_backend_inductor(self):
"Expected at least 3 separate lowerings to Triton code, which means at least 1 graph break in FWD graph",
)

def _create_transformer_factory_fns(self, all_requires_grad):
def _create_transformer_factory_fns(
self, all_requires_grad, *, activation_checkpoint=False
):
seq_len = 16
vocab_size = 8
n_layers = 3
@@ -689,6 +737,7 @@ def model_init_fn():
model_args = ModelArgs(
vocab_size=vocab_size,
n_layers=n_layers,
checkpoint_activations=activation_checkpoint,
)
model = Transformer(model_args)
if not all_requires_grad:
@@ -775,9 +824,11 @@ def test_transformer_backend_aot_eager_decomp_partition(self):
@torch._inductor.config.patch(fallback_random=True)
def test_transformer_backend_inductor(self):
# TODO: enable fullgraph=False case
for fullgraph, all_requires_grad in itertools.product([True], [True, False]):
for fullgraph, all_requires_grad, activation_checkpoint in itertools.product(
[True], [True, False], [True, False]
):
log.warning(
f"fullgraph={fullgraph}, all_requires_grad={all_requires_grad}" # noqa: G004, G001
f"fullgraph={fullgraph}, all_requires_grad={all_requires_grad}, activation_checkpoint={activation_checkpoint}" # noqa: G004, G001
)
with self._maybe_add_graph_break_to_sdpa(
fullgraph
@@ -802,7 +853,8 @@ def test_transformer_backend_inductor(self):
_, triton_codes = run_and_get_code(
lambda: self._test_traceable_fsdp(
*self._create_transformer_factory_fns(
all_requires_grad=all_requires_grad
all_requires_grad=all_requires_grad,
activation_checkpoint=activation_checkpoint,
),
"inductor",
fullgraph=fullgraph,
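
The `_assert_no_aliased_unsharded_params_in_graph_inputs` helper above groups graph placeholders by the id of their `untyped_storage()` to detect aliasing. A standalone sketch of the same idea on plain tensors (the `group_aliased` helper is hypothetical, not part of the test):

```python
# Sketch: detect aliasing by grouping tensors that share an untyped storage.
from collections import defaultdict

import torch


def group_aliased(tensors):
    groups = defaultdict(list)
    for idx, t in enumerate(tensors):
        groups[id(t.untyped_storage())].append(idx)
    return [idxs for idxs in groups.values() if len(idxs) > 1]


base = torch.randn(4, 4)
view = base[:2]        # shares storage with base
clone = base.clone()   # owns its own storage
print(group_aliased([base, view, clone]))  # [[0, 1]]
```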
10 changes: 9 additions & 1 deletion torch/_dynamo/output_graph.py
@@ -326,7 +326,7 @@ def __init__(
] = collections.defaultdict(list)
# Stores the full fqn of a param or buffer to the relevant source.
self.param_name_to_source: Optional[Dict[str, Source]] = {}
self.side_effects = SideEffects()
self.side_effects = SideEffects(self)
# Cached variable trackers. This makes symbolic analysis of LOAD_GLOBAL
# and LOAD_ATTR for same python objects free.
self.variable_tracker_cache = VariableTrackerCache()
@@ -1834,6 +1834,14 @@ def __init__(
# Dicts maintain the order of args for the HigherOrderOperator call.
self.lifted_freevars = {}
self.prev_inst = None
# True if this tracer is currently tracing into torch.utils.checkpoint
# as part of speculate_subgraph.
self.under_activation_checkpoint = False
# True if we want to allow side-effects (doesn't throw error on their existence)
# during this tracer's tracing of torch.utils.checkpoint (via speculate_subgraph).
# Only safe if we know for sure that *NOT* replaying these side-effects during
# backward recomputation of the checkpoint region doesn't affect its correctness.
self.allow_side_effects_under_checkpoint = False

self._cur_code = None
self._orig_gm_meta = None
26 changes: 26 additions & 0 deletions torch/_dynamo/side_effects.py
@@ -1,7 +1,9 @@
# mypy: allow-untyped-defs
import contextlib
import functools
import inspect
import warnings
import weakref
from collections.abc import MutableMapping
from typing import Any, Dict, List, Optional, Type, Union

@@ -79,13 +81,15 @@ class SideEffects:

def __init__(
self,
output_graph,
id_to_variable=None,
store_attr_mutations=None,
keepalive=None,
save_for_backward=None,
tensor_hooks=None,
):
super().__init__()
self.output_graph_weakref = weakref.ref(output_graph)
self.id_to_variable = id_to_variable or {}
self.store_attr_mutations = store_attr_mutations or {}
self.keepalive = keepalive or []
@@ -130,6 +134,7 @@ def diff(self, other: "SideEffects") -> Optional[str]:
def clone(self):
"""Create a shallow copy"""
return self.__class__(
output_graph=self.output_graph_weakref(),
id_to_variable=dict(self.id_to_variable),
store_attr_mutations={
k: dict(v) for k, v in self.store_attr_mutations.items()
@@ -145,13 +150,23 @@ def __contains__(self, item):
def __getitem__(self, item):
return self.id_to_variable[id(item)]

def should_allow_side_effects_under_checkpoint(self):
output_graph = self.output_graph_weakref()
return (
output_graph
and output_graph.current_tx.output.current_tracer.under_activation_checkpoint
and output_graph.current_tx.output.current_tracer.allow_side_effects_under_checkpoint
)

def check_allowed_side_effect(self, item):
from torch._dynamo.variables.misc import AutogradFunctionContextVariable

# People do things like self.dim = dim inside autograd.Function.
# These are benign.
if isinstance(item, AutogradFunctionContextVariable):
return True
if self.should_allow_side_effects_under_checkpoint():
return True
if not is_side_effect_safe(item.mutable_local):
unimplemented(
"HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)"
@@ -725,3 +740,14 @@ def is_empty(self):
def clear(self):
self.keepalive.clear()
self.id_to_variable.clear()


@contextlib.contextmanager
def allow_side_effects_under_checkpoint(tx: "InstructionTranslator"): # type: ignore[name-defined] # noqa: F821
assert tx.output.current_tracer.under_activation_checkpoint
orig_val = tx.output.current_tracer.allow_side_effects_under_checkpoint
try:
tx.output.current_tracer.allow_side_effects_under_checkpoint = True
yield
finally:
tx.output.current_tracer.allow_side_effects_under_checkpoint = orig_val
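
`SideEffects` now keeps only a weak reference to its `OutputGraph` and dereferences it in `should_allow_side_effects_under_checkpoint` before consulting the tracer flags. A minimal sketch of that weakref pattern, using stand-in classes rather than the real Dynamo types:

```python
import weakref


class Graph:  # stand-in for OutputGraph
    pass


class Effects:  # stand-in for SideEffects
    def __init__(self, output_graph):
        # Weak reference: Effects does not keep the graph alive.
        self.output_graph_weakref = weakref.ref(output_graph)

    def graph(self):
        return self.output_graph_weakref()  # None once the graph is gone


g = Graph()
e = Effects(g)
assert e.graph() is g
del g                    # CPython frees the object once its refcount hits 0
assert e.graph() is None
```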
15 changes: 14 additions & 1 deletion torch/_dynamo/variables/functions.py
@@ -322,7 +322,20 @@ def call_function(
return invoke_and_store_as_constant(
tx, self.fn, self.get_name(), args, kwargs
)

if (
tx.output.current_tracer.under_activation_checkpoint
and not tx.output.current_tracer.allow_side_effects_under_checkpoint
):
try:
from torch.distributed._composable.fsdp._fsdp_state import FSDPState
except Exception:
FSDPState = None
if FSDPState is not None and self.fn in [
FSDPState._pre_forward,
FSDPState._post_forward,
]:
with torch._dynamo.side_effects.allow_side_effects_under_checkpoint(tx):
return super().call_function(tx, args, kwargs)
return super().call_function(tx, args, kwargs)
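
The branch above lifts the side-effect restriction only for the two FSDP2 hook functions; everything else still goes through the normal check. A small sketch of that shape, guarded import plus function-identity whitelist, with an illustrative helper name:

```python
# Sketch: optional-dependency guard plus function-identity whitelist,
# mirroring the branch above (is_fsdp2_hook is illustrative).
try:
    from torch.distributed._composable.fsdp._fsdp_state import FSDPState
except Exception:  # distributed may be unavailable in this build
    FSDPState = None


def is_fsdp2_hook(fn) -> bool:
    if FSDPState is None:
        return False
    # Compare by identity against the unbound class attributes.
    return fn in (FSDPState._pre_forward, FSDPState._post_forward)
```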


32 changes: 29 additions & 3 deletions torch/_dynamo/variables/higher_order_ops.py
@@ -71,6 +71,16 @@ def dynamo_enable_grad(tx: "InstructionTranslator", enable=True):
GradModeVariable.create(tx, org_value, initialized=True)


@contextlib.contextmanager
def dynamo_under_activation_checkpoint(tx: "InstructionTranslator"):
orig_val = tx.output.current_tracer.under_activation_checkpoint
try:
tx.output.current_tracer.under_activation_checkpoint = True
yield
finally:
tx.output.current_tracer.under_activation_checkpoint = orig_val


def only_consist_of(var, types, allow_none=False):
if isinstance(var, types):
return True
@@ -388,6 +398,7 @@ def speculate_subgraph(
set_subgraph_inputs="automatic",
restore_side_effects=True,
should_flatten_outputs=False,
under_activation_checkpoint=False,
# Pass in an originating tracer - this is needed for preserving context
# across fwd-bwd for autograd.Function
tracer=None,
@@ -439,6 +450,11 @@ def speculate_subgraph(
if enable_grad is not None
else contextlib.nullcontext()
)
checkpoint_ctx = (
dynamo_under_activation_checkpoint(tx)
if under_activation_checkpoint
else contextlib.nullcontext()
)

# For handling side effects, we can make an argument that we don't
# have to do anything here. The side effects infra does a good job
@@ -458,7 +474,7 @@ def speculate_subgraph(
if restore_side_effects:
prev_side_effects = tx.output.side_effects.clone()

with autograd_ctx:
with autograd_ctx, checkpoint_ctx:
output = f.call_function(tx, args, sub_kwargs)

if restore_side_effects:
@@ -1504,7 +1520,12 @@ def call_function(

class WrapHigherOrderVariable(TorchHigherOrderOperatorVariable):
def create_wrapped_node(
self, tx: "InstructionTranslator", args, kwargs, description
self,
tx: "InstructionTranslator",
args,
kwargs,
description,
under_activation_checkpoint=False,
):
# See NOTE [HigherOrderOperator tracing design] for more details

@@ -1520,6 +1541,7 @@ def create_wrapped_node(
description,
source_target=self.value,
should_flatten_outputs=True,
under_activation_checkpoint=under_activation_checkpoint,
)

body_gmod = torch.fx.GraphModule(tx.output.nn_modules, body_graph)
@@ -1856,7 +1878,11 @@ def call_function(
treespec,
checkpointed_gmod,
) = self.create_wrapped_node(
tx, args, gmod_kwargs, "torch.utils.checkpoint.checkpoint"
tx,
args,
gmod_kwargs,
"torch.utils.checkpoint.checkpoint",
under_activation_checkpoint=True,
)
if context_fn is not None:
checkpointed_gmod.meta["_checkpoint_context_fn"] = context_fn
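
The `checkpoint_ctx` added above follows the same pattern as the existing `autograd_ctx`: an optional context manager that degrades to `contextlib.nullcontext()` when the feature is off and restores the tracer flag on exit. A self-contained sketch of that pattern with illustrative names (not Dynamo internals):

```python
import contextlib


class Tracer:
    under_activation_checkpoint = False


@contextlib.contextmanager
def under_checkpoint(tracer):
    prev = tracer.under_activation_checkpoint
    try:
        tracer.under_activation_checkpoint = True
        yield
    finally:
        tracer.under_activation_checkpoint = prev  # restore even on error


def speculate(tracer, fn, *, under_activation_checkpoint=False):
    ctx = (
        under_checkpoint(tracer)
        if under_activation_checkpoint
        else contextlib.nullcontext()
    )
    with ctx:
        return fn()


t = Tracer()
print(speculate(t, lambda: t.under_activation_checkpoint))  # False
print(speculate(t, lambda: t.under_activation_checkpoint, under_activation_checkpoint=True))  # True
print(t.under_activation_checkpoint)  # False (restored after tracing)
```

Restoring the previous value in `finally` keeps the flag correct even if tracing raises, and makes nested uses safe.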