Releases: huggingface/accelerate

v1.3.0 Bug fixes + Require torch 2.0

17 Jan 15:56

Torch 2.0

It has been roughly two years since torch 2.0 was first released, so we now require it as the minimum version for Accelerate, matching what transformers did in its last release.

Core

  • [docs] no hard-coding cuda by @faaany in #3270
  • fix load_state_dict for npu by @ji-huazhong in #3211
  • Add keep_torch_compile param to unwrap_model and extract_model_from_parallel for distributed compiled model. by @ggoggam in #3282
  • [tests] make cuda-only test case device-agnostic by @faaany in #3340
  • latest bnb no longer has optim_args attribute on optimizer by @winglian in #3311
  • add torchdata version check to avoid "in_order" error by @faaany in #3344
  • [docs] fix typo, change "backoff_filter" to "backoff_factor" by @suchot in #3296
  • dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in #3346
  • feat(tpu): remove nprocs from xla.spawn by @tengomucho in #3324

Big Modeling

Examples

  • Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in #3193

Full Changelog: v1.2.1...v1.3.0

v1.2.1: Patchfix

13 Dec 18:56
  • fix: add max_memory to _init_infer_auto_device_map's return statement in #3279 by @Nech-C
  • fix load_state_dict for npu in #3211 by @statelesshz

Full Changelog: v1.2.0...v1.2.1

v1.2.0: Bug Squashing & Fixes across the board

13 Dec 18:47

Core

  • enable find_executable_batch_size on XPU by @faaany in #3236
  • Use numpy._core instead of numpy.core by @qgallouedec in #3247
  • Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in #3066
  • Allow for full dynamo config passed to Accelerator by @muellerzr in #3251
  • [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in #3252
  • [data_loader] Optionally also propagate set_epoch to batch sampler by @tomaarsen in #3246
  • use XPU instead of GPU in the accelerate config prompt text by @faaany in #3268

Big Modeling

  • Fix align_module_device, ensure only cpu tensors for get_state_dict_offloaded_model by @kylesayrs in #3217
  • Remove hook for bnb 4-bit by @SunMarc in #3223
  • [docs] add instruction to install bnb on non-cuda devices by @faaany in #3227
  • Take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in #3226
  • Update deferring_execution.md by @max-yue in #3262
  • Revert default behavior of get_state_dict_from_offload by @kylesayrs in #3253
  • Fix: Resolve #3060, preload_module_classes is lost for nested modules by @wejoncy in #3248

DeepSpeed

  • Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in #3255
  • support for wrapped schedulefree optimizer when using deepspeed by @winglian in #3266

Documentation

Full Changelog

  • Fix align_module_device, ensure only cpu tensors for get_state_dict_offloaded_model by @kylesayrs in #3217
  • remove hook for bnb 4-bit by @SunMarc in #3223
  • enable find_executable_batch_size on XPU by @faaany in #3236
  • take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in #3226
  • [docs] update code in tracking documentation by @faaany in #3235
  • Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in #3066
  • [data_loader] Optionally also propagate set_epoch to batch sampler by @tomaarsen in #3246
  • [docs] add instruction to install bnb on non-cuda devices by @faaany in #3227
  • Use numpy._core instead of numpy.core by @qgallouedec in #3247
  • Allow for full dynamo config passed to Accelerator by @muellerzr in #3251
  • [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in #3252
  • use XPU instead of GPU in the accelerate config prompt text by @faaany in #3268
  • support for wrapped schedulefree optimizer when using deepspeed by @winglian in #3266
  • Update deferring_execution.md by @max-yue in #3262
  • Fix: Resolve #3257 by @as12138 in #3261
  • Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in #3259
  • Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in #3255
  • Revert default behavior of get_state_dict_from_offload by @kylesayrs in #3253
  • Fix: Resolve #3060, preload_module_classes is lost for nested modules by @wejoncy in #3248
  • [docs] update set-seed by @faaany in #3228
  • [docs] fix typo by @faaany in #3221
  • [docs] use real path for checkpoint by @faaany in #3220
  • Fixed multiple typos for Tutorials and Guides docs by @henryhmko in #3274

Code Diff

Release diff: v1.1.1...v1.2.0

v1.1.0: Python 3.9 minimum, torch dynamo deepspeed support, and bug fixes

01 Nov 15:30

Internals:

  • Allow for a data_seed argument in #3150
  • Trigger weights_only=True by default for all compatible objects when checkpointing and saving with torch.save in #3036
  • Handle negative values for dim input in pad_across_processes in #3114
  • Enable cpu bnb distributed lora finetune in #3159

DeepSpeed

  • Support torch dynamo for deepspeed>=0.14.4 in #3069

Megatron

  • update Megatron-LM plugin code to version 0.8.0 or higher in #3174

Big Model Inference

  • New has_offloaded_params utility added in #3188

Examples

  • Florence2 distributed inference example in #3123

Full Changelog: v1.0.1...v1.1.0

v1.0.1: Bugfix

12 Oct 03:01

Bugfixes

  • Fixes an issue where the auto values were no longer being parsed when using deepspeed
  • Fixes a broken test in the deepspeed tests related to the auto values

Full Changelog: v1.0.0...v1.0.1

Accelerate 1.0.0 is here!

07 Oct 15:42

🚀 Accelerate 1.0 🚀

With Accelerate 1.0, we are officially declaring the core parts of the API "stable" and ready for whatever the future of distributed training and PyTorch brings. These release notes focus first on the major breaking changes, so you can get your code fixed, and then on what is new between 0.34.0 and 1.0.

To read more, check out our official blog here

Migration assistance

  • Passing in dispatch_batches, split_batches, even_batches, and use_seedable_sampler to the Accelerator() should now be handled by creating an accelerate.utils.DataLoaderConfiguration() and passing this to the Accelerator() instead (Accelerator(dataloader_config=DataLoaderConfiguration(...)))
  • Accelerator().use_fp16 and AcceleratorState().use_fp16 have been removed; this should be replaced by checking accelerator.mixed_precision == "fp16"
  • Accelerator().autocast() no longer accepts a cache_enabled argument. Instead, create an AutocastKwargs() instance, which handles this flag (among others), and pass it to the Accelerator (Accelerator(kwargs_handlers=[AutocastKwargs(cache_enabled=True)]))
  • accelerate.utils.is_tpu_available should be replaced with accelerate.utils.is_torch_xla_available
  • accelerate.utils.modeling.shard_checkpoint should be replaced with split_torch_state_dict_into_shards from the huggingface_hub library
  • accelerate.tqdm.tqdm() no longer accepts True/False as the first argument, and instead, main_process_only should be passed in as a named argument
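
The first migration item above can be sketched in plain Python as follows. This is only an illustration: `split_legacy_kwargs` and the `legacy` dictionary are hypothetical names, not part of Accelerate, and the commented lines at the end assume accelerate is installed and that the remaining keyword arguments are still accepted by Accelerator().

```python
# Keyword arguments that, per the migration notes above, moved from
# Accelerator() to accelerate.utils.DataLoaderConfiguration().
DATALOADER_KEYS = (
    "dispatch_batches",
    "split_batches",
    "even_batches",
    "use_seedable_sampler",
)


def split_legacy_kwargs(legacy_kwargs):
    """Hypothetical helper: separate dataloader options from other kwargs."""
    dataloader_kwargs = {
        key: value for key, value in legacy_kwargs.items() if key in DATALOADER_KEYS
    }
    accelerator_kwargs = {
        key: value for key, value in legacy_kwargs.items() if key not in DATALOADER_KEYS
    }
    return dataloader_kwargs, accelerator_kwargs


# Example of arguments that older code might have passed to Accelerator().
legacy = {"mixed_precision": "fp16", "split_batches": True, "even_batches": False}
dataloader_kwargs, accelerator_kwargs = split_legacy_kwargs(legacy)

# With accelerate installed, the two dictionaries would be used roughly like this:
#   from accelerate import Accelerator
#   from accelerate.utils import DataLoaderConfiguration
#   accelerator = Accelerator(
#       dataloader_config=DataLoaderConfiguration(**dataloader_kwargs),
#       **accelerator_kwargs,
#   )
#   uses_fp16 = accelerator.mixed_precision == "fp16"  # replaces use_fp16
```

Keeping the split in one small function makes it easy to apply the same change at every call site that still passes the old arguments.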

Multiple Model DeepSpeed Support

After many requests, we finally have multiple-model DeepSpeed support in Accelerate (though it is still quite early). Read the full tutorial here; in essence:

When using multiple models, a DeepSpeed plugin should be created for each model (and, as a result, a separate config). A few examples are below:

Knowledge distillation

(Where only one model, the student, is trained with ZeRO-2, and another, the teacher, is used for inference with ZeRO-3)

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
zero3_plugin = DeepSpeedPlugin(hf_ds_config="zero3_config.json")

deepspeed_plugins = {"student": zero2_plugin, "teacher": zero3_plugin}


accelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)

To select which plugin is used at a given time (i.e. when calling prepare), call accelerator.state.select_deepspeed_plugin("name"). The first plugin is active by default:

from transformers import AutoModel

# Select the student plugin (ZeRO-2) before preparing the trainable objects
accelerator.state.select_deepspeed_plugin("student")
student_model, optimizer, scheduler = ...
student_model, optimizer, scheduler, train_dataloader = accelerator.prepare(student_model, optimizer, scheduler, train_dataloader)

# Switch to the teacher plugin (ZeRO-3) before loading the inference-only model
accelerator.state.select_deepspeed_plugin("teacher") # This will automatically enable zero init
teacher_model = AutoModel.from_pretrained(...)
teacher_model = accelerator.prepare(teacher_model)

Multiple disjoint models

For disjoint models, separate accelerators should be used for each model, and their own .backward() should be called later:

for batch in dl:
    outputs1 = first_model(**batch)
    first_accelerator.backward(outputs1.loss)
    first_optimizer.step()
    first_scheduler.step()
    first_optimizer.zero_grad()
    
    outputs2 = second_model(**batch)
    second_accelerator.backward(outputs2.loss)
    second_optimizer.step()
    second_scheduler.step()
    second_optimizer.zero_grad()

FP8

We've enabled MS-AMP support up to, but not including, FSDP. At this time we are not going forward with FSDP support for MS-AMP, because design differences between the two libraries make them hard to interoperate.

FSDP

  • Fixed FSDP auto_wrap using characters instead of full str for layers
  • Re-enable setting state dict type manually

Big Modeling

  • Removed cpu restriction for bnb training

Full Changelog: v0.34.2...v1.0.0

v0.34.1 Patchfix

05 Sep 15:36

Bug fixes

  • Fixes an issue where processed DataLoaders could no longer be pickled in #3074 thanks to @byi8220
  • Fixes an issue when using FSDP where default_transformers_cls_names_to_wrap would separate _no_split_modules by characters instead of keeping it as a list of layer names in #3075
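
The second fix above concerns a common Python pitfall: iterating over a string yields its characters, so a single layer name treated as a sequence falls apart into letters. The following is a minimal sketch of that pitfall in plain Python. It is not Accelerate's actual code, and `normalize_layer_names` as well as the example layer names are hypothetical, chosen only for illustration.

```python
def normalize_layer_names(no_split_modules):
    """Hypothetical helper: always return a list of full layer names."""
    if isinstance(no_split_modules, str):
        # A bare string is one layer name, not a sequence of names.
        return [no_split_modules]
    return list(no_split_modules)


# Pitfall: converting a string to a list splits it into characters.
by_characters = list("T5Block")

# Intended behaviour: each name is kept whole.
by_name = normalize_layer_names("T5Block")
also_by_name = normalize_layer_names(["T5Block", "T5LayerFF"])

print(by_characters)
print(by_name)
```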

Full Changelog: v0.34.0...v0.34.1

v0.34.0: StatefulDataLoader Support, FP8 Improvements, and PyTorch Updates!

03 Sep 14:58

Dependency Changes

  • Updated Safetensors Requirement: The library now requires safetensors version 0.4.3.
  • Added support for Numpy 2.0: The library now fully supports numpy 2.0.0

Core

New Script Behavior Changes

  • Process Group Management: PyTorch now requires users to destroy process groups after training. The accelerate library will handle this automatically with accelerator.end_training(), or you can do it manually using PartialState().destroy_process_group().
  • MLU Device Support: Added support for saving and loading RNG states on MLU devices by @huismiling
  • NPU Support: Corrected backend and distributed settings when using transfer_to_npu, ensuring better performance and compatibility.

DataLoader Enhancements

  • StatefulDataLoader: We are excited to announce early support for the StatefulDataLoader from torchdata, which allows better handling of data loading state. Enable it by passing use_stateful_dataloader=True to the DataLoaderConfiguration; when you call load_state(), the DataLoader is automatically resumed from its last step, so you no longer have to iterate through batches that were already consumed.
  • Decoupled Data Loader Preparation: The prepare_data_loader() function is now independent of the Accelerator, giving you more flexibility towards which API levels you would like to use.
  • XLA Compatibility: Added support for skipping initial batches when using XLA.
  • Improved State Management: Bug fixes and enhancements for saving/loading DataLoader states, ensuring smoother training sessions.
  • Epoch Setting: Introduced the set_epoch function for MpDeviceLoaderWrapper.
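
To make the idea of a stateful data loader concrete, here is a toy sketch in plain Python. It is not torchdata's StatefulDataLoader and not Accelerate's implementation; `ToyStatefulLoader` is a hypothetical class that only shows why saving and restoring a position avoids replaying earlier batches.

```python
class ToyStatefulLoader:
    """Toy stand-in for a stateful data loader (illustration only)."""

    def __init__(self, samples, batch_size):
        self.samples = list(samples)
        self.batch_size = batch_size
        self._next_index = 0  # position of the next batch to yield

    def __iter__(self):
        while self._next_index < len(self.samples):
            start = self._next_index
            self._next_index = start + self.batch_size
            yield self.samples[start:start + self.batch_size]
        self._next_index = 0  # start from the beginning on the next epoch

    def state_dict(self):
        # Everything needed to resume: where the next batch starts.
        return {"next_index": self._next_index}

    def load_state_dict(self, state):
        self._next_index = state["next_index"]


# Consume two batches, then checkpoint the loader's position.
loader = ToyStatefulLoader(range(10), batch_size=2)
iterator = iter(loader)
first = next(iterator)
second = next(iterator)
checkpoint = loader.state_dict()

# A fresh loader restored from the checkpoint continues with the third batch,
# without iterating through the first two again.
resumed = ToyStatefulLoader(range(10), batch_size=2)
resumed.load_state_dict(checkpoint)
remaining = list(resumed)
```

Without the saved position, resuming would require drawing and discarding the first two batches before training could continue.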

FP8 Training Improvements

  • Enhanced FP8 Training: Fully Sharded Data Parallelism (FSDP) and DeepSpeed support now work seamlessly with TransformerEngine FP8 training, including better defaults for the quantized FP8 weights.
  • Integration baseline: We've added a new suite of examples and benchmarks to ensure that our TransformerEngine integration works exactly as intended. Each script runs one half using 🤗 Accelerate's integration and the other with raw TransformerEngine, giving users a clear example of what Accelerate does under the hood and a good sanity check that nothing breaks over time. Find them here
  • Import Fixes: Resolved issues with the import checks for TransformerEngine that caused downstream problems.
  • FP8 Docker Images: We've added new docker images for TransformerEngine and accelerate as well. Use docker pull huggingface/accelerate:gpu-fp8-transformerengine to quickly get an environment going.

torchpippy no more, long live torch.distributed.pipelining

  • With the latest PyTorch release, torchpippy is now fully integrated into torch core, and as a result we are exclusively supporting the PyTorch implementation from now on
  • This shift brings breaking changes, including to the examples. Namely:
    • Tracing of inputs is done with a shape each GPU will see, rather than the size of the total batch. So for 2 GPUs, one should pass in an input of [1, n, n] rather than [2, n, n] as before.
    • We no longer support Encoder/Decoder models. PyTorch tracing for pipelining no longer supports encoder/decoder models, so the t5 example has been removed.
    • Computer vision model support currently does not work: there are some tracing issues with ResNets that we are actively looking into.
  • If any of these changes is too disruptive, we recommend pinning your accelerate version. If the lack of encoder/decoder model support is actively blocking your inference with pippy, please open an issue and let us know. We can look into restoring the old torchpippy support if needed.
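
The tracing-shape change described above can be sketched as a small calculation. This is an illustrative helper only: `per_gpu_trace_shape` is a hypothetical name, and it assumes the leading dimension is the batch dimension and that the total batch divides evenly across the GPUs.

```python
def per_gpu_trace_shape(total_batch_shape, num_gpus):
    """Hypothetical helper: shape a single GPU sees for a given total batch."""
    batch, *rest = total_batch_shape
    if batch % num_gpus != 0:
        raise ValueError("total batch size must divide evenly across GPUs")
    return [batch // num_gpus, *rest]


n = 128  # arbitrary example size for the remaining dimensions

# Before: the example input for tracing covered the total batch.
old_trace_shape = [2, n, n]

# Now: the example input should have the shape each of the 2 GPUs will see.
new_trace_shape = per_gpu_trace_shape(old_trace_shape, num_gpus=2)
print(new_trace_shape)
```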

Fully Sharded Data Parallelism (FSDP)

  • Environment Flexibility: Environment variables are now fully optional for FSDP, simplifying configuration. You can now create a FullyShardedDataParallelPlugin entirely by hand, with no need for environment patching:
from accelerate import FullyShardedDataParallelPlugin
fsdp_plugin = FullyShardedDataParallelPlugin(...)
  • FSDP RAM efficient loading: Added a utility to enable RAM-efficient model loading (by setting the proper environment variable). This is generally needed when you are not using accelerate launch and need to ensure the environment variables are set up properly for model loading:
from accelerate.utils import enable_fsdp_ram_efficient_loading, disable_fsdp_ram_efficient_loading
enable_fsdp_ram_efficient_loading()
  • Model State Dict Management: Enhanced support for unwrapping model state dicts in FSDP, making it easier to manage distributed models.

New Examples

Bug Fixes

v0.33.0: MUSA backend support and bugfixes

08 Aug 12:57

MUSA backend support and bugfixes

Small release this month, focused mainly on added backend support and bug fixes:

Full Changelog: v0.32.1...v0.33.0

v0.32.0: Profilers, new hooks, speedups, and more!

03 Jul 17:44

Core

  • Utilize shard saving from the huggingface_hub rather than our own implementation (#2795)
  • Refactor logging to use logger in dispatch_model (#2855)
  • The Accelerator.step number is now restored when using save_state and load_state (#2765)
  • A new profiler has been added, allowing users to collect performance metrics during model training and inference, including detailed analysis of execution time and memory consumption. The results can then be viewed in Chrome's tracing tool. Read more about it here (#2883)
  • Reduced import times for import accelerate and any other major core import by 68%; it should now take only slightly longer than import torch (#2845)
  • Fixed a bug in get_backend and added a clear_device_cache utility (#2857)

Distributed Data Parallelism

  • Introduce DDP communication hooks to have more flexibility in how gradients are communicated across workers, overriding the standard allreduce. (#2841)
  • Make log_line_prefix_template optional in the notebook_launcher (#2888)

FSDP

  • If the output directory doesn't exist when using accelerate merge-weights, one will be automatically created (#2854)
  • When merging weights, the default is now .safetensors (#2853)

XPU

  • Migrate to pytorch's native XPU backend on torch>=2.4 (#2825)
  • Add @require_triton test decorator and enable test_dynamo to work on XPU (#2878)
  • Fixed load_state_dict not working on xpu and refine xpu safetensors version check (#2879)

XLA

  • Added support for XLA Dynamo backends for both training and inference (#2892)

Examples

  • Added a new multi-cpu SLURM example using accelerate launch (#2902)

Full Changelog: v0.31.0...v0.32.0