Releases: huggingface/accelerate
v0.26.0 - MS-AMP Support, Critical Regression Fixes, and More
Support for MS-AMP
This release adds support for MS-AMP (Microsoft Automatic Mixed Precision Library) to Accelerate as an alternative backend for FP8 training on appropriate hardware. It is the default backend of choice. Read more in the docs here. Introduced in #2232 by @muellerzr
Core
In the prior release a new sampler for the DataLoader was introduced. While it shows no statistical differences in results across seeds, repeating the same seed would result in a different end accuracy, which was scary to some users. We have now disabled this behavior by default as it required some additional setup, and brought back the original implementation. To use the new sampling technique (which can provide more accurate repeated results), pass use_seedable_sampler=True to the Accelerator. We will be propagating this up to the Trainer soon.
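The idea behind a seedable sampler can be sketched in plain Python. This is only an illustration and not Accelerate's implementation: ToySeedableSampler is a hypothetical class, and deriving the shuffle from seed + epoch is an assumption made for the example.

```python
import random


class ToySeedableSampler:
    """Hypothetical sampler: the shuffle order depends only on (seed, epoch)."""

    def __init__(self, num_samples: int, seed: int = 0):
        self.num_samples = num_samples
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch

    def __iter__(self):
        # A private RNG keeps the order independent of global random state.
        rng = random.Random(self.seed + self.epoch)
        indices = list(range(self.num_samples))
        rng.shuffle(indices)
        return iter(indices)


sampler = ToySeedableSampler(num_samples=8, seed=42)
first_run = list(sampler)
second_run = list(sampler)  # same seed and epoch: same order
sampler.set_epoch(1)
next_epoch = list(sampler)  # still a permutation, reseeded for the new epoch
```

Because the order is a pure function of the seed and the epoch, rerunning with the same seed replays the same batches.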
Big Model Inference
- NPU support was added thanks to @statelesshz in #2222
- When generating an automatic device_map we've made it possible to not return grouped key results if desired in #2233
- We now handle corner cases better when users pass device_map="cuda" etc thanks to @younesbelkada in #2254
FSDP and DeepSpeed
- Many improvements to the docs have been made thanks to @stas00. Along with this we've made it easier to adjust the config for the sharding strategy and other config values thanks to @pacman100 in #2288
- A regression introduced in Accelerate 0.23.0 made learning much slower on multi-GPU setups compared to a single GPU. #2304 has now fixed this thanks to @pacman100
- The DeepSpeed integration now also handles auto values better when making a configuration in #2313
Bits and Bytes
Device Agnostic Testing
For developers, we've made it much easier to run the tests on different devices with no change to the code thanks to @statelesshz in #2123 and #2235
Bug Fixes
- Check notebook launcher for 3090+ by @muellerzr in #2212
- Fix dtype bug when offload_state_dict=True and dtype is specified by @fxmarty in #2116
- fix tqdm wrapper to print when process id ==0 by @kashif in #2223
- fix BFloat16 is not supported on MPS (#2226) by @jxysoft in #2227
- Fix MpDeviceLoaderWrapper not having attribute batch_sampler by @vanbasten23 in #2242
- [deepspeed] fix setting auto values for comm buffers by @stas00 in #2295
- Fix infer_auto_device_map when tied weights share the same prefix name by @fxmarty in #2324
- Fixes bug in swapping weights when replacing with Transformer-Engine layers by @sudhakarsingh27 in #2305
- Fix breakpoint API in test_script.py on TPU. by @vanbasten23 in #2263
- Bring old seed technique back by @muellerzr in #2319
Major Contributors
- @statelesshz for their work on device-agnostic testing and NPU support
- @stas00 for many docfixes when it comes to DeepSpeed and FSDP
General Changelog
- add missing whitespace by @stas00 in #2206
- MNT Delete the delete doc workflows by @BenjaminBossan in #2217
- Update docker images by @muellerzr in #2213
- Add allgather check for xpu by @abhilash1910 in #2199
- Check notebook launcher for 3090+ by @muellerzr in #2212
- Fix dtype bug when offload_state_dict=True and dtype is specified by @fxmarty in #2116
- fix tqdm wrapper to print when process id ==0 by @kashif in #2223
- [data_loader] expand the error message by @stas00 in #2221
- Update the 'Frameworks using Accelerate' section to include Amphion by @RMSnow in #2225
- [Docs] Add doc for cpu/disk offload by @SunMarc in #2231
- device agnostic testing by @statelesshz in #2123
- Make cleaning optional for device map by @muellerzr in #2233
- Add npu support to big model inference by @statelesshz in #2222
- fix the DS failing test by @pacman100 in #2237
- Fix nb tests by @muellerzr in #2230
- fix BFloat16 is not supported on MPS (#2226) by @jxysoft in #2227
- Fix MpDeviceLoaderWrapper not having attribute batch_sampler by @vanbasten23 in #2242
- [Big-Modeling] Harmonize device check to handle corner cases by @younesbelkada in #2254
- Support log_images for aim tracker by @Justin900429 in #2257
- Integrate MS-AMP Support for FP8 as a seperate backend by @muellerzr in #2232
- refactor deepspeed dataloader prepare logic by @pacman100 in #2238
- device agnostic deepspeed&fsdp testing by @statelesshz in #2235
- Solve CUDA issues by @muellerzr in #2272
- Uninstall DVC in the Trainer tests by @muellerzr in #2271
- Rm DVCLive from test reqs as latest version causes failures by @muellerzr in #2279
- typo fix by @stas00 in #2276
- Add condition before using check_tied_parameters_on_same_device by @SunMarc in #2218
- [doc] FSDP improvements by @stas00 in #2274
- [deepspeed docs] auto-values aren't being covered by @stas00 in #2286
- Improve FSDP config usability by @pacman100 in #2288
- [doc] language fixes by @stas00 in #2292
- Bump tj-actions/changed-files from 22.2 to 41 in /.github/workflows by @dependabot in #2300
- add back dvclive to tests by @dberenbaum in #2280
- Fixes bug in swapping weights when replacing with Transformer-Engine layers by @sudhakarsingh27 in #2305
- Fix breakpoint API in test_script.py on TPU. by @vanbasten23 in #2263
- make test_state_checkpointing device agnostic by @statelesshz in #2290
- [deepspeed] documentation by @stas00 in #2296
- Add more missing items by @muellerzr in #2309
- Update docs: Add warning for device_map=None for load_checkpoint_and_dispatch by @PhilJd in #2308
- [deepspeed] fix setting auto values for comm buffers by @stas00 in #2295
- DeepSpeed refactoring by @pacman100 in #2313
- Fix DeepSpeed related regression by @pacman100 in #2304
- Update test_deepspeed.py by @pacman100 in #2323
- Bring old seed technique back by @muellerzr in #2319
- Fix batch_size sanity check in prepare_data_loader by @izhx in #2310
- Params4bit added to bnb classes in set_module_tensor_to_device() by @poedator in #2315
- Fix infer_auto_device_map when tied weights share the same prefix name by @fxmarty in #2324
New Contributors
- @fxmarty made their first contribution in #2116
- @RMSnow made their first contribution in #2225
- @jxysoft made their first contribution in #2227
- @vanbasten23 made their first contribution in #2242
- @Justin900429 made their first contribution in #2257
- @dependabot made their first contribution in #2300
- @sudhakarsingh27 ma...
v0.25.0: safetensors by default, new trackers, and plenty of bug fixes
Safetensors default
As of this release, safetensors will be the default format saved when applicable! To read more about safetensors and why it's best to use it for safety (and not pickle/torch.save), check it out here
New Experiment Trackers
This release has two new experiment trackers, ClearML and DVCLive!
To use them, just pass clear_ml or dvclive to log_with in the Accelerator init. h/t to @eugen-ajechiloae-clearml and @dberenbaum
DeepSpeed
- Accelerate's DeepSpeed integration now supports NPU devices, h/t to @statelesshz
- DeepSpeed can now be launched via accelerate on single GPU setups
FSDP
FSDP had a huge refactoring so that the interface when using FSDP is the exact same as every other scenario when using accelerate. No more needing to call accelerator.prepare() twice!
Other useful enhancements
- We now detect and try to disable P2P communications on consumer GPUs from the 3090 series onward. Without this users were seeing timeout issues and the like as NVIDIA dropped P2P support. If using accelerate launch we will disable it automatically, and if we sense that it is still enabled on distributed setups using 3090s or newer, we will raise an error.
- When doing .gather(), if tensors are on different devices we will explicitly raise an error (for now only valid on CUDA)
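For anyone launching processes by hand rather than through accelerate launch, a minimal sketch of turning P2P off could look like the following. It relies on NCCL's NCCL_P2P_DISABLE environment variable; that Accelerate uses exactly this mechanism is an assumption here, and env_with_p2p_disabled is a hypothetical helper.

```python
import os


def env_with_p2p_disabled(base_env=None):
    """Return a copy of the environment with NCCL peer-to-peer transport turned off."""
    env = dict(os.environ if base_env is None else base_env)
    # NCCL_P2P_DISABLE=1 tells NCCL not to use direct GPU-to-GPU transfers.
    env["NCCL_P2P_DISABLE"] = "1"
    return env


launch_env = env_with_p2p_disabled({"PATH": "/usr/bin"})
```

The resulting dictionary can be handed to a process launcher, for example as the env argument of subprocess.run.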
Bug fixes
- Fixed a bug that caused dataloaders to not shuffle despite shuffle=True when using multiple GPUs and the new SeedableRandomSampler.
General Changelog
- Add logs offloading by @SunMarc in #2075
- Add ClearML tracker by @eugen-ajechiloae-clearml in #2034
- CRITICAL: fix failing ci by @muellerzr in #2088
- Fix flag typo by @kuza55 in #2090
- Fix batch sampler by @muellerzr in #2097
- fixed ip address typo by @Fluder-Paradyne in #2099
- Fix memory leak in fp8 causing OOM (and potentially 3x vRAM usage) by @muellerzr in #2089
- fix warning when offload by @SunMarc in #2105
- Always use SeedableRandomSampler by @muellerzr in #2110
- Fix issue with tests by @muellerzr in #2111
- Make SeedableRandomSampler the default always by @muellerzr in #2117
- Use "and" instead of comma in Bibtex citation by @qgallouedec in #2119
- Add explicit error if empty batch received by @YuryYakhno in #2115
- Allow for ACCELERATE_SEED env var by @muellerzr in #2126
- add DeepSpeed support for NPU by @statelesshz in #2054
- Sync states for npu fsdp by @jq460494839 in #2113
- Fix import error when torch>=2.0.1 and torch.distributed is disabled by @natsukium in #2121
- Make safetensors the default by @muellerzr in #2120
- Raise error when saving with param on meta device by @SunMarc in #2132
- Leave native save as False by @muellerzr in #2138
- fix retie_parameters by @SunMarc in #2137
- Deal with shared memory scenarios by @muellerzr in #2136
- specify config file path on README by @kwonmha in #2140
- Fix safetensors contiguous by @SunMarc in #2145
- Fix more tests by @muellerzr in #2146
- [docs] fixed a couple of broken links by @MKhalusova in #2147
- [docs] troubleshooting guide by @MKhalusova in #2133
- [Docs] fix doc typos by @kashif in #2150
- Add note about GradientState being in-sync with the dataloader by default by @muellerzr in #2134
- Deprecated runner stuff by @muellerzr in #2152
- Add examples to tests by @muellerzr in #2131
- Disable pypi for merge workflows + fix trainer tests by @muellerzr in #2153
- Adds dvclive tracker by @dberenbaum in #2139
- check port availability only in main deepspeed/torchrun launcher by @Jingru in #2078
- Do not attempt to pad nested tensors by @frankier in #2041
- Add warning for problematic libraries by @muellerzr in #2151
- Add ZeRO++ to DeepSpeed usage docs by @SumanthRH in #2166
- Fix Megatron-LM Arguments Bug by @yuanenming in #2168
- Fix non persistant buffer dispatch by @SunMarc in #1941
- Updated torchrun instructions by @TJ-Solergibert in #2096
- New CI Runners by @muellerzr in #2087
- Revert "New CI Runners" by @muellerzr in #2172
- [Working again] New CI by @muellerzr in #2173
- fsdp refactoring by @pacman100 in #2177
- Pin DVC by @muellerzr in #2196
- Apply DVC warning to Accelerate by @muellerzr in #2197
- Explicitly disable P2P using launch, and pick up in state if a user will face issues. by @muellerzr in #2195
- Better error when device mismatches when calling gather() on CUDA by @muellerzr in #2180
- unpins dvc by @dberenbaum in #2200
- Assemble state dictionary for offloaded models by @blbadger in #2156
- Allow deepspeed without distributed launcher by @pacman100 in #2204
New Contributors
- @eugen-ajechiloae-clearml made their first contribution in #2034
- @kuza55 made their first contribution in #2090
- @Fluder-Paradyne made their first contribution in #2099
- @YuryYakhno made their first contribution in #2115
- @jq460494839 made their first contribution in #2113
- @kwonmha made their first contribution in #2140
- @dberenbaum made their first contribution in #2139
- @Jingru made their first contribution in #2078
- @frankier made their first contribution in #2041
- @yuanenming made their first contribution in #2168
- @TJ-Solergibert made their first contribution in #2096
- @blbadger made their first contribution in #2156
Full Changelog: v0.24.1...v0.25.0
v0.24.1: Patch Release for Samplers
- Fixes #2091 by changing how checking for custom samplers is done
v0.24.0: Improved Reproducibility, Bug fixes, and other Small Improvements
Improved Reproducibility
One critical issue with Accelerate was that training runs were different when using an iterable dataset, no matter what seeds were set. v0.24.0 introduces the dataloader.set_epoch() function to all Accelerate DataLoaders, where if the underlying dataset (or sampler) has the ability to set the epoch for reproducibility it will do so. This is similar to the implementation already existing in transformers. To use:
dataloader = accelerator.prepare(dataloader)
# Say we want to resume at epoch/iteration 2
dataloader.set_epoch(2)
For more information see this PR. We will update the docs in a subsequent release with more information on this API.
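The "forward it if supported" behavior described above can be sketched without any dependencies. EpochAwareLoader and ShardedStream are hypothetical stand-ins, not Accelerate classes; the sketch only shows how an epoch could be passed on to whichever of the sampler or dataset defines set_epoch.

```python
class EpochAwareLoader:
    """Toy wrapper that forwards set_epoch to whatever supports it."""

    def __init__(self, dataset, sampler=None):
        self.dataset = dataset
        self.sampler = sampler

    def set_epoch(self, epoch: int) -> None:
        # Only call set_epoch on objects that actually define it.
        for target in (self.sampler, self.dataset):
            if hasattr(target, "set_epoch"):
                target.set_epoch(epoch)


class ShardedStream:
    """Hypothetical iterable dataset that reseeds per epoch."""

    def __init__(self):
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch


stream = ShardedStream()
loader = EpochAwareLoader(dataset=stream)  # no sampler: it is skipped
loader.set_epoch(2)
plain = EpochAwareLoader(dataset=[1, 2, 3])
plain.set_epoch(2)  # a plain list has no set_epoch, so this is a no-op
```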
Documentation
- The quick tour docs have gotten a complete makeover thanks to @MKhalusova. Take a look here
- We also now have documentation on how to perform multinode training, see the launch docs
Internal structure
- Shared file systems are now supported under save and save_state via the ProjectConfiguration dataclass. See #1953 for more info.
- FSDP can now be used for bfloat16 mixed precision via torch.autocast
- all_gather_into_tensor is now used as the main gather operation, reducing memory in the cases of big tensors
- Specifying drop_last=True will now properly have the desired effect when performing Accelerator().gather_for_metrics()
What's Changed
- Update big_modeling.md by @kli-casia in #1976
- Fix model copy after dispatch_model by @austinapatel in #1971
- FIX: Automatic checkpoint path inference issue by @BenjaminBossan in #1989
- Fix skip first batch for deepspeed example by @SumanthRH in #2001
- [docs] Quick tour refactor by @MKhalusova in #2008
- Add basic documentation for multi node training by @SumanthRH in #1988
- update torch_dynamo backends by @SunMarc in #1992
- Sync states for xpu fsdp by @abhilash1910 in #2005
- update fsdp docs by @pacman100 in #2026
- Enable shared file system with save and save_state via ProjectConfiguration by @muellerzr in #1953
- Fix save on each node by @muellerzr in #2036
- Allow FSDP to use with torch.autocast for bfloat16 mixed precision by @brcps12 in #2033
- Fix DeepSpeed version to <0.11 by @BenjaminBossan in #2043
- Unpin deepspeed by @muellerzr in #2044
- Reduce memory by using all_gather_into_tensor by @muellerzr in #1968
- Safely end training even if trackers weren't initialized by @Ben-Epstein in #1994
- Fix integration CI by @muellerzr in #2047
- Make fsdp ram efficient loading optional by @pacman100 in #2037
- Let drop_last modify gather_for_metrics by @muellerzr in #2048
- fix docstring by @zhangsibo1129 in #2053
- Fix stalebot by @muellerzr in #2052
- Add space to docs by @muellerzr in #2055
- Fix the error when the "train_batch_size" is absent in DeepSpeed config by @LZHgrla in #2060
- remove unused constants by @statelesshz in #2045
- fix: remove useless token by @rtrompier in #2069
- DOC: Fix broken link to designing a device map by @BenjaminBossan in #2073
- Let iterable dataset shard have a length if implemented by @muellerzr in #2066
- Allow for samplers to be seedable and reproducable by @muellerzr in #2057
- Fix docstring typo by @qgallouedec in #2072
- Warn when kernel version is too low on Linux by @BenjaminBossan in #2077
New Contributors
- @kli-casia made their first contribution in #1976
- @MKhalusova made their first contribution in #2008
- @brcps12 made their first contribution in #2033
- @Ben-Epstein made their first contribution in #1994
- @zhangsibo1129 made their first contribution in #2053
- @LZHgrla made their first contribution in #2060
- @rtrompier made their first contribution in #2069
- @qgallouedec made their first contribution in #2072
Full Changelog: v0.23.0...v0.24.0
v0.23.0: Model Memory Estimation tool, Breakpoint API, Multi-Node Notebook Launcher Support, and more!
Model Memory Estimator
A new model estimation tool to help calculate how much memory is needed for inference has been added. This does not download the pretrained weights, and utilizes init_empty_weights to stay memory efficient during the calculation.
Usage directions:
accelerate estimate-memory {model_name} --library {library_name} --dtypes fp16 int8
Or:
from accelerate.commands.estimate import estimate_command_parser, estimate_command, gather_data
parser = estimate_command_parser()
args = parser.parse_args(["bert-base-cased", "--dtypes", "float32"])
output = gather_data(args)
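As a rough mental model of what such an estimate involves, the memory for the weights alone is the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope calculation, not the tool's actual logic; the parameter count is a made-up example and weight_memory_mib is a hypothetical helper.

```python
# Bytes used by one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "fp16": 2, "int8": 1}


def weight_memory_mib(num_parameters: int, dtype: str) -> float:
    """Memory needed to hold the weights alone, in MiB."""
    return num_parameters * BYTES_PER_PARAM[dtype] / (1024 ** 2)


# Hypothetical model with 100 million parameters.
num_parameters = 100_000_000
estimates = {dtype: weight_memory_mib(num_parameters, dtype) for dtype in BYTES_PER_PARAM}
```

Halving the bytes per parameter halves the weight memory, which is why lower-precision dtypes are worth comparing side by side.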
🤗 Hub is a first-class citizen
We've made the huggingface_hub library a first-class citizen of the framework! While this is mainly for the model estimation tool, this opens the doors for further integrations should they be wanted
Accelerator
Enhancements:
- gather_for_metrics will now also de-dupe for non-tensor objects. See #1937
- mixed_precision="bf16" support on NPU devices. See #1949
- New breakpoint API to help when dealing with trying to break from a condition on a single process. See #1940
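The problem the breakpoint API addresses is that a condition met on one process must stop every process, otherwise the others hang. A minimal simulation of that idea, with a list standing in for the processes and a sum standing in for an all-reduce, might look like this (any_rank_triggered and run_step are hypothetical names, not the Accelerate API):

```python
def any_rank_triggered(flags):
    """Mimic summing a 0/1 flag over all processes (an all-reduce)."""
    return sum(flags) > 0


def run_step(losses, threshold):
    # Each entry of `losses` stands in for one process's local loss.
    flags = [1 if loss < threshold else 0 for loss in losses]
    # Every process sees the same reduced value, so all of them stop together.
    stop = any_rank_triggered(flags)
    return [stop for _ in losses]


decisions = run_step(losses=[0.9, 0.05, 0.7], threshold=0.1)
```

Here only the second simulated process meets the condition, yet all three receive the same decision to stop.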
Notebook Launcher Enhancements:
- The notebook launcher now supports launching across multiple nodes! See #1913
FSDP Enhancements:
- Activation checkpointing is now natively supported in the framework. See #1891
- torch.compile support was fixed. See #1919
DeepSpeed Enhancements:
- XPU/ccl support (#1827)
- Easier gradient accumulation support, simply set gradient_accumulation_steps to "auto" in your deepspeed config, and Accelerate will use the one passed to Accelerator instead (#1901)
- Support for custom schedulers and deepspeed optimizers (#1909)
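To make the "auto" behavior concrete, here is a small sketch of how a placeholder in a DeepSpeed-style config could be filled from a value supplied elsewhere. resolve_auto is a hypothetical helper written for illustration, and it only handles top-level keys; it is not how Accelerate resolves the config internally.

```python
def resolve_auto(ds_config: dict, overrides: dict) -> dict:
    """Replace top-level "auto" entries with values supplied by the caller."""
    resolved = dict(ds_config)
    for key, value in ds_config.items():
        if value == "auto" and key in overrides:
            resolved[key] = overrides[key]
    return resolved


ds_config = {
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": 2},
}
# The override plays the role of the value passed to the Accelerator.
resolved = resolve_auto(ds_config, {"gradient_accumulation_steps": 4})
```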
What's Changed
- Update release instructions by @sgugger in #1877
- fix detach_hook by @SunMarc in #1880
- Enable power users to bypass device_map="auto" training block by @muellerzr in #1881
- Introduce model memory estimator by @muellerzr in #1876
- Update with new url for explore by @muellerzr in #1884
- Enable a token to be used by @muellerzr in #1886
- Add doc on model memory usage by @muellerzr in #1887
- Add hub as core dep by @muellerzr in #1885
- update import of deepspeed integration from transformers by @pacman100 in #1894
- Final nits on model util by @muellerzr in #1896
- Fix nb launcher test by @muellerzr in #1899
- Add FSDP activation checkpointing feature by @arde171 in #1891
- Solve at least one failing test by @muellerzr in #1898
- Deepspeed integration for XPU/ccl by @abhilash1910 in #1827
- Add PR template by @muellerzr in #1906
- deepspeed grad_acc_steps fixes by @pacman100 in #1901
- Skip pypi transformers until release by @muellerzr in #1911
- Fix docker images by @muellerzr in #1910
- Use hosted CI runners for building docker images by @muellerzr in #1915
- fix: add debug argument to sagemaker configuration by @maximegmd in #1904
- improve help info when run accelerate config on npu by @statelesshz in #1895
- support logging with mlflow in case of mlflow-skinny installed by @ghtaro in #1874
- More CI fun - run all test parts always by @muellerzr in #1916
- Expose auto in dataclass by @muellerzr in #1914
- Add support for deepspeed optimizer and custom scheduler by @pacman100 in #1909
- reduce gradient first for XLA when unscaling the gradients in mixed precision training with AMP. by @statelesshz in #1926
- Check for invalid keys by @muellerzr in #1935
- clean num devices by @SunMarc in #1936
- Bring back pypi to runners by @muellerzr in #1939
- Support multi-node notebook launching by @ggaaooppeenngg in #1913
- fix the fsdp docs by @pacman100 in #1947
- Fix docs by @ggaaooppeenngg in #1951
- Protect tensorflow dependency by @SunMarc in #1959
- fix safetensor saving by @SunMarc in #1954
- FIX: patch_environment restores pre-existing environment variables when finished by @BenjaminBossan in #1960
- Better guards for slow imports by @muellerzr in #1963
- [Tests] Finish all todos by @younesbelkada in #1957
- Rm strtobool by @muellerzr in #1964
- Implementing gather_for_metrics with dedup for non tensor objects by @Lorenzobattistela in #1937
- add bf16 mixed precision support for NPU by @statelesshz in #1949
- Introduce breakpoint API by @muellerzr in #1940
- fix torch compile with FSDP by @pacman100 in #1919
- Add force_hooks to dispatch_model by @austinapatel in #1969
- update FSDP and DeepSpeed docs by @pacman100 in #1973
- Flex fix patch for accelerate by @abhilash1910 in #1972
- Remove checkpoints only on main process by @Kepnu4 in #1974
New Contributors
- @arde171 made their first contribution in #1891
- @maximegmd made their first contribution in #1904
- @ghtaro made their first contribution in #1874
- @ggaaooppeenngg made their first contribution in #1913
- @Lorenzobattistela made their first contribution in #1937
- @austinapatel made their first contribution in #1969
- @Kepnu4 made their first contribution in #1974
Full Changelog: v0.22.0...v0.23.0
v0.22.0: Distributed operation framework, Gradient Accumulation enhancements, FSDP enhancements, and more!
Experimental distributed operations checking framework
A new framework has been introduced which can help catch timeout errors caused by distributed operations failing before they occur. As this adds a tiny bit of overhead, it is an opt-in scenario. Simply run your code with ACCELERATE_DEBUG_MODE="1" to enable this. Read more in the docs, introduced via #1756
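One way to picture this kind of pre-check: before a collective operation runs, compare what every process is about to send and fail fast on a mismatch. The sketch below simulates that with a list of per-process shapes; check_shapes_before_gather is a hypothetical helper, and treating shape comparison as the check performed is an assumption.

```python
def check_shapes_before_gather(shapes_per_rank):
    """Raise early if processes would feed mismatched shapes to a collective."""
    reference = shapes_per_rank[0]
    for rank, shape in enumerate(shapes_per_rank):
        if shape != reference:
            raise ValueError(
                f"rank {rank} has shape {shape}, but rank 0 has shape {reference}"
            )


# Matching shapes pass silently.
check_shapes_before_gather([(4, 8), (4, 8)])

# A mismatch is reported immediately instead of hanging until a timeout.
try:
    check_shapes_before_gather([(4, 8), (3, 8)])
    error_message = None
except ValueError as err:
    error_message = str(err)
```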
Accelerator.load_state can now load the most recent checkpoint automatically
If a ProjectConfiguration has been made, using accelerator.load_state() (without any arguments passed) can now automatically find and load the latest checkpoint used, introduced via #1741
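Finding "the latest checkpoint" in a folder can be sketched as below. The checkpoint_<n> folder naming is an assumption made for the example, and latest_checkpoint is a hypothetical helper rather than the lookup Accelerate performs.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(checkpoints_dir: Path):
    """Return the checkpoint_<n> folder with the highest n, or None."""
    numbered = []
    for child in checkpoints_dir.iterdir():
        match = re.fullmatch(r"checkpoint_(\d+)", child.name)
        if child.is_dir() and match:
            numbered.append((int(match.group(1)), child))
    # Compare by the integer suffix so checkpoint_10 beats checkpoint_9.
    return max(numbered, key=lambda item: item[0])[1] if numbered else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for index in (0, 9, 10):
        (root / f"checkpoint_{index}").mkdir()
    newest_name = latest_checkpoint(root).name
```

Sorting by the parsed integer rather than by name matters, since a plain string sort would rank checkpoint_9 above checkpoint_10.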
Multiple enhancements to gradient accumulation
In this release multiple new enhancements to distributed gradient accumulation have been added.
- accelerator.accumulate() now supports passing in multiple models introduced via #1708
- A util has been introduced to perform multiple forwards, then multiple backwards, and finally sync the gradients only on the last .backward() via #1726
FSDP Changes
- FSDP support has been added for NPU and XPU devices via #1803 and #1806
- A new method for supporting RAM-efficient loading of models with FSDP has been added via #1777
DataLoader Changes
- Custom slice functions are now supported in the DataLoaderDispatcher added via #1846
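To illustrate what a custom slice function changes, the sketch below splits one global batch across processes with two different rules. The function signature used here is a hypothetical one chosen for the example and may not match the one DataLoaderDispatcher expects; lists stand in for tensors.

```python
def default_slice(batch, batch_slice, process_index=None, num_processes=None):
    """Give each process the contiguous chunk described by `batch_slice`."""
    return batch[batch_slice]


def strided_slice(batch, batch_slice, process_index=None, num_processes=None):
    """Custom rule: interleave samples across processes instead."""
    return batch[process_index::num_processes]


def dispatch(batch, num_processes, slice_fn):
    # Split one global batch into per-process pieces using the given rule.
    per_process = len(batch) // num_processes
    return [
        slice_fn(batch, slice(rank * per_process, (rank + 1) * per_process), rank, num_processes)
        for rank in range(num_processes)
    ]


global_batch = [0, 1, 2, 3, 4, 5]
contiguous = dispatch(global_batch, 2, default_slice)
interleaved = dispatch(global_batch, 2, strided_slice)
```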
What's New?
- fix failing test on 8GPU by @statelesshz in #1724
- Better control over DDP's no_sync by @NouamaneTazi in #1726
- Get rid of calling get_scale() by patching the step method of optimizer. by @yuxinyuan in #1720
- fix the bug in npu by @statelesshz in #1728
- Adding a shape check for set_module_tensor_to_device. by @Narsil in #1731
- Fix errors when optimizer is not a Pytorch optimizer. by @yuxinyuan in #1733
- Make balanced memory able to work with non contiguous GPUs ids by @thomwolf in #1734
- Fixed typo in __repr__ of AlignDevicesHook by @KacperWyrwal in #1735
- Update docs by @muellerzr in #1736
- Fixed the bug that split dict incorrectly by @yuangpeng in #1742
- Let load_state automatically grab the latest save by @muellerzr in #1741
- fix KwargsHandler.to_kwargs not working with os.environ initialization in __post_init__ by @CyCle1024 in #1738
- fix typo by @cauyxy in #1747
- Check for misconfiguration of single node & single GPU by @muellerzr in #1746
- Remove unused constant by @muellerzr in #1749
- Rework new constant for operations by @muellerzr in #1748
- Expose autocast kwargs and simplify autocast wrapper by @muellerzr in #1740
- Fix FSDP related issues by @pacman100 in #1745
- FSDP enhancements and fixes by @pacman100 in #1753
- Fix check failure in Accelerator.save_state using multi-gpu by @CyCle1024 in #1760
- Fix error when max_memory argument is in unexpected order by @ranchlai in #1759
- Fix offload on disk when executing on CPU by @sgugger in #1762
- Change is_aim_available() function to not match aim >= 4.0.0 by @alberttorosyan in #1769
- Introduce an experimental distributed operations framework by @muellerzr in #1756
- Support wrapping multiple models in Accelerator.accumulate() by @yuxinyuan in #1708
- Contigous on gather by @muellerzr in #1771
- [FSDP] Fix load_fsdp_optimizer by @awgu in #1755
- simplify and correct the deepspeed example by @pacman100 in #1775
- Set ipex default in state by @muellerzr in #1776
- Fix import error when torch>=2.0.1 and torch.distributed is disabled by @natsukium in #1800
- reserve 10% GPU in get_balanced_memory to avoid OOM by @ranchlai in #1798
- add support of float memory size in convert_file_size_to_int by @ranchlai in #1799
- Allow users to resume from previous wandb runs with allow_val_change by @SumanthRH in #1796
- Add FSDP for XPU by @abhilash1910 in #1803
- Add FSDP for NPU by @statelesshz in #1806
- Fix pytest import by @muellerzr in #1808
- More specific logging in gather_for_metrics by @dleve123 in #1784
- Detect device map auto and raise a helpful error when trying to not use model parallelism by @muellerzr in #1810
- Typo fix by @muellerzr in #1812
- Expand device-map warning by @muellerzr in #1819
- Update bibtex to reflect team growth by @muellerzr in #1820
- Improve docs on grad accumulation by @vwxyzjn in #1817
- add warning when using to and cuda by @SunMarc in #1790
- Fix bnb import by @muellerzr in #1813
- Update docs and docstrings to match load_and_quantize_model arg by @JonathanRayner in #1822
- Expose a bit of args/docstring fixup by @muellerzr in #1824
- Better test by @muellerzr in #1825
- Minor idiomatic change for fp8 check. by @float-trip in #1829
- Use device as context manager for init_on_device by @shingjan in #1826
- Ipex bug fix for device properties in modelling by @abhilash1910 in #1834
- FIX: Bug with unwrap_model and keep_fp32_wrapper=False by @BenjaminBossan in #1838
- Fix verify_device_map by @Rexhaif in #1842
- Change CUDA check by @muellerzr in #1833
- Fix the noneffective parameter: gpu_ids (Rel. Issue #1848) by @devymex in #1850
- support for ram efficient loading of model with FSDP by @pacman100 in #1777
- Loading logic safetensors by @SunMarc in #1853
- fix dispatch for quantized model by @SunMarc in #1855
- Update fsdp_with_peak_mem_tracking.py by @pacman100 in #1856
- Add env variable for init_on_device by @shingjan in #1852
- remove casting to FP32 when saving state dict by @pacman100 in #1868
- support custom slice function in DataLoaderDispatcher by @thevasudevgupta in #1846
- Include a note to the forums in the bug report by @muellerzr in #1871
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @yuxinyuan
- @NouamaneTazi
  - Better control over DDP's no_sync (#1726)
- @abhilash1910
- @statelesshz
- @thevasudevgupta
  - support custom slice function in DataLoaderDispatcher (#1846)
Full Changelog: v0.21.0...v0.22.0
v0.21.0: Model quantization and NPUs
Model quantization with bitsandbytes
You can now quantize any model (not just Transformer models) using Accelerate. This is mainly for models having a lot of linear layers. See the documentation for more information!
Support for Ascend NPUs
Accelerate now supports Ascend NPUs.
- Add Ascend NPU accelerator support by @statelesshz in #1676
What's new?
Accelerate now requires Python 3.8+ and PyTorch 1.10+:
- 🚨🚨🚨 Spring cleaning: Python 3.8 🚨🚨🚨 by @muellerzr in #1661
- 🚨🚨🚨 Spring cleaning: PyTorch 1.10 🚨🚨🚨 by @muellerzr in #1662
- Update launch.mdx by @LiamSwayne in #1553
- Avoid double wrapping of all accelerate.prepare objects by @muellerzr in #1555
- Update README.md by @LiamSwayne in #1556
- Fix load_state_dict when there is one device and disk by @sgugger in #1557
- Fix tests not being ran on multi-GPU nightly by @muellerzr in #1558
- fix the typo when setting the "_accelerator_prepared" attribute by @Yura52 in #1560
- [core] Fix possibility to pass NoneType objects in prepare by @younesbelkada in #1561
- Reset dataloader end_of_datalaoder at each iter by @sgugger in #1562
- Update big_modeling.mdx by @LiamSwayne in #1564
- [bnb] Fix failing int8 tests by @younesbelkada in #1567
- Update gradient sync docs to reflect importance of optimizer.step() by @dleve123 in #1565
- Update mixed precision integrations in README by @sgugger in #1569
- Raise error instead of warn by @muellerzr in #1568
- Introduce listify, fix tensorboard silently failing by @muellerzr in #1570
- Check for bak and expand docs on directory structure by @muellerzr in #1571
- Perminant solution by @muellerzr in #1577
- fix the bug in xpu by @mingxiaoh in #1508
- Make sure that we only set is_accelerator_prepared on items accelerate actually prepares by @muellerzr in #1578
- Expand prepare() doc by @muellerzr in #1580
- Get Torch version using importlib instead of pkg_resources by @catwell in #1585
- improve oob performance when use mpirun to start DDP finetune without accelerate launch by @sywangyi in #1575
- Update training_tpu.mdx by @LiamSwayne in #1582
- Return false if CUDA available by @muellerzr in #1581
- Fix test by @muellerzr in #1586
- Update checkpoint.mdx by @LiamSwayne in #1587
- FSDP updates by @pacman100 in #1576
- Integration tests by @muellerzr in #1593
- Add triggers for CI workflow by @muellerzr in #1597
- Remove asking xpu plugin for non xpu devices by @abhilash1910 in #1594
- reset end_of_dataloader for dataloader_dispatcher by @megavaz in #1609
- fix for arc gpus by @abhilash1910 in #1615
- Ignore low_zero option when only device is available by @sgugger in #1617
- Fix failing multinode tests by @muellerzr in #1616
- Fix tb issue by @muellerzr in #1623
- Fix workflow by @muellerzr in #1625
- Fix transformers sync bug with accumulate by @muellerzr in #1624
- fix: Megatron is not installed. please build it from source. by @yuanwu2017 in #1636
- deepspeed z2/z1 state_dict bloating fix by @pacman100 in #1638
- Swap disable rich by @muellerzr in #1640
- fix autocasting bug by @pacman100 in #1637
- fix modeling low zero by @abhilash1910 in #1634
- Add skorch to runners by @muellerzr in #1646
- Change dispatch_model when we have only one device by @SunMarc in #1648
- Check for port usage before launch by @muellerzr in #1656
- [BigModeling] Add missing check for quantized models by @younesbelkada in #1652
- Bump integration by @muellerzr in #1658
- TIL by @muellerzr in #1657
- docker cpu py version by @muellerzr in #1659
- [BigModeling] Final fix for dispatch int8 and fp4 models by @younesbelkada in #1660
- remove safetensor dep on shard_checkpoint by @SunMarc in #1664
- change the import place to avoid import error by @pacman100 in #1653
- Update broken Runhouse link in examples/README.md by @dongreenberg in #1668
- Add docs for saving Transformers models by @deppen8 in #1671
- Fix workflow CI by @muellerzr in #1690
- update readme in examples by @statelesshz in #1678
- Fix nightly tests by @muellerzr in #1696
- Fixup docs by @muellerzr in #1697
- Improve quality errors by @muellerzr in #1698
- Move mixed precision wrapping ahead of DDP/FSDP wrapping by @ChenWu98 in #1682
- Deepcopy on Accelerator to return self by @muellerzr in #1694
- Skip tests when bnb isn't available by @muellerzr in #1706
- Fix launcher validation by @abhilash1910 in #1705
- Fixes for issue #1683: failed to run accelerate config in colab by @Erickrus in #1692
- Fix the bug where DataLoaderDispatcher gets stuck in an infinite wait when the dataset is an IterDataPipe during multi-process training. by @yuxinyuan in #1709
- Keep old behavior by @muellerzr in #1716
- Optimize get_scale to reduce async calls by @muellerzr in #1718
- Remove duplicate code by @muellerzr in #1717
- New tactic by @muellerzr in #1719
- add Comfy-UI by @pacman100 in #1723
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @LiamSwayne
- @mingxiaoh
  - fix the bug in xpu (#1508)
- @statelesshz
- @ChenWu98
  - Move mixed precision wrapping ahead of DDP/FSDP wrapping (#1682)
v0.20.3: Patch release
v0.20.2: Patch release
- fix the typo when setting the "_accelerator_prepared" attribute in #1560 by @Yura52
- [core] Fix possibility to pass NoneType objects in prepare in #1561 by @younesbelkada
v0.20.1: Patch release
- Avoid double wrapping of all accelerate.prepare objects by @muellerzr in #1555
- Fix load_state_dict when there is one device and disk by @sgugger in #1557