Releases: awslabs/slapo
v0.0.3
This release mainly includes the following improvements:
- Fix some fidelity issues.
- Refactor schedule primitives, and add the `.fork_rng()`, `.annotate()`, and `.replace_all()` primitives.
- Other bug fixes.
If any of the following cases apply to an existing schedule written against v0.0.2, you need to update it as described to support v0.0.3.
- Tagging parameters for the DeepSpeed pipeline runtime to perform an additional all-reduce over the TP group. For example, you may have the following code snippet that tags LayerNorm parameters:
```python
def tag_layernorm(sch):
    for m in sch.mod.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters(recurse=False):
                p.replicated_param = True
```
This can be changed to the following in v0.0.3:
```python
def annotate_layernorm_and_bias(sch):
    for sub_sch in sch.child.values():
        if isinstance(sub_sch.mod, nn.LayerNorm):
            for name, _ in sub_sch.mod.named_parameters(recurse=False):
                sub_sch.annotate(name, "replicated_param", True)
        if issubclass(sub_sch.mod.__class__, LinearWithSyncFunc):
            sub_sch.annotate("bias", "replicated_param", True)
        annotate_layernorm_and_bias(sub_sch)
```
Reference: https://github.com/awslabs/slapo/blob/main/slapo/model_schedule/gpt2.py#L529
- RNG control can now be done easily with the newly introduced schedule primitive `.fork_rng()`. Accordingly, the old `slapo.op.AttentionOpWithRNG` is removed. If you have the following code snippet:
```python
new_op = AttentionOpWithRNG(
    sub_sch["module"]["attn_op"].mod.attn_op_name,
    sub_sch["module"]["attn_op"].mod.apply_causal_mask,
    sub_sch["module"]["attn_op"].mod.scale,
)
sub_sch["module"]["attn_op"].replace(new_op)
```
It has to be changed to:
```python
sub_sch["module"]["attn_op"].fork_rng()
```
- The primitive `.trace_for_pipeline()` has been renamed to `.trace_until()`. Since the arguments remain the same, you could simply replace all occurrences.
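A minimal before/after sketch of the rename follows; the schedule key and keyword arguments below are placeholders rather than code from a real model schedule, so keep whatever arguments your v0.0.2 schedule already passes:
```python
# v0.0.2 (placeholder key and arguments, shown only to illustrate the rename):
# sch["transformer"].trace_for_pipeline("h", tracer="huggingface")

# v0.0.3: identical arguments, only the primitive name changes.
sch["transformer"].trace_until("h", tracer="huggingface")
```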
- If you use `slapo.op.FusedMLP` with sharding, you need to update your schedule to reflect the change in the FusedMLP implementation: the standalone `"act"` submodule no longer exists, so only the two linear submodules are scheduled. For example:
```python
fc_names = ["fc_in", "act", "fc_out"]
sub_sch[fc_names[0]].shard("weight", axis=0)
sub_sch[fc_names[1]].shard("bias", axis=0)
sub_sch[fc_names[2]].shard("weight", axis=1)
sub_sch[fc_names[0]].sync(mode="bwd_post", sync_op_or_fn="all_reduce")
sub_sch[fc_names[2]].sync(mode="fwd_post", sync_op_or_fn="all_reduce")
```
changes to:
```python
fc_names = ["fc_in", "fc_out"]
sub_sch[fc_names[0]].shard("weight", axis=0)
sub_sch[fc_names[0]].shard("bias", axis=0)
sub_sch[fc_names[1]].shard("weight", axis=1)
sub_sch[fc_names[0]].sync(mode="bwd_post", sync_op_or_fn="all_reduce")
sub_sch[fc_names[1]].sync(mode="fwd_post", sync_op_or_fn="all_reduce")
```
What's Changed
- [Action] Fix release flow by @comaniac in #69
- [Refactor] Schedule primitives by @comaniac in #68
- [Primitive] .fork_rng() by @comaniac in #70
- [Primitive] .annotate() and .trace_until() by @comaniac in #71
- [CI] Update CI rules for docs by @chhzh123 in #72
- [Op] Fuse bias+dropout in FusedMLP by @comaniac in #73
- [Refactor] Modulize sharding methods by @comaniac in #74
- [CI] Quick fix by @chhzh123 in #75
- [Primitive][fork_rng] Do not replace module by @comaniac in #76
- [Bugfix] Include other custom LinearWithXX by @comaniac in #77
- [Primitive] Add fallback fusion by @chhzh123 in #78
- [examples] Refactor dataloader to support BERT by @chhzh123 in #79
- [Bugfix] Shard embedding hooks by @comaniac in #80
- [Version] Refactor version updating logic by @comaniac in #82
- [Op] Print by @comaniac in #81
- [Primitive] Add .replace_all() by @chhzh123 in #85
- [Version] Update version to v0.0.3 by @chhzh123 in #84
Full Changelog: v0.0.2...v0.0.3
v0.0.2
This release mainly includes the following improvements:
- More unit tests.
- Add `.fuse` and related primitives.
- Improve overall training efficiency of GPT models by adding sequence parallelism, tied-weight support, etc.
- Documentation and tutorials.
- Bug fixes.
What's Changed
- [Release] Setup wheel and release scripts by @comaniac in #18
- [Pipeline] Drop last batch in DeepSpeed scripts by @comaniac in #19
- [Examples] Add disable_flash_attn by @chhzh123 in #22
- [Bugfix] Fix sequence parallelism by @szhengac in #20
- [Schedule][replace] Transfer hooks when replacing modules by @comaniac in #27
- [Bugfix] Fix GPT script by @szhengac in #26
- [Bugfix] Transfer hooks in pipeline modules by @comaniac in #28
- [Tracer] Add `flatten` argument to .trace() by @chhzh123 in #29
- [Benchmark] Fix ZeRO-3 step log by @comaniac in #31
- [Bugfix] Fix for sharding TP only by @zarzen in #32
- [Primitive][shard] Use autograd function for all sync ops by @comaniac in #33
- [Bugfix] Using None for mpu when PP > 1 by @zarzen in #34
- [Bugfix] Fix GPT script by @szhengac in #36
- [Schedule] Refactor subgraph matching by @chhzh123 in #35
- [Schedule] Add .fuse() primitive by @chhzh123 in #25
- [Setup] Fix dependency by @chhzh123 in #39
- [Random] Random state management by @comaniac in #38
- [GPT] Use flash-attention and enable dropout by @comaniac in #40
- [Op] Add attention and bias_gelu ops by @comaniac in #41
- [Tracer] Remove SelfAttention renaming by @chhzh123 in #44
- [Model] Add HuggingFace GPT-2 by @comaniac in #45
- [Op] Refactor qkv processing by @comaniac in #46
- Add num_workers to GPT dataloader by @szhengac in #48
- [Op] Add flash-attention CUDA kernel by @comaniac in #49
- [Bugfix] Fix tensor device by @szhengac in #50
- [Example] Use .fuse() primitive when possible by @chhzh123 in #42
- [Refactor] model_dialect -> framework_dialect by @comaniac in #51
- [Test] Add default initialization test by @chhzh123 in #54
- [Schedule] Create subschedule for subgraph replacement by @chhzh123 in #52
- [Schedule] Support partial checkpointing by @chhzh123 in #55
- [DeepSpeed] Support TP=nGPU and PP=DP=1 by @comaniac in #56
- [Examples] Move examples to slapo.model_schedule by @chhzh123 in #53
- [Bugfix] Support tree-like subgraph matching by @chhzh123 in #58
- [Bugfix] Consolidate params with orig size by @comaniac in #59
- [Bugfix] Fix a small device bug by @szhengac in #57
- [README] Temporary remove paper info by @comaniac in #60
- Add param_name to shard infer type and fix consolidate by @comaniac in #62
- [Feature] Layernorm Tag by @szhengac in #61
- [Docs] Add initial documentations by @chhzh123 in #63
- Enable launch training with torchrun by @zarzen in #64
- [Examples] Enable launch with torchrun by @comaniac in #65
New Contributors
- @zarzen made their first contribution in #32
Full Changelog: v0.0.1...v0.0.2
First release of v0.0.1
What's Changed
- [Lint] Fix almost all linting errors by @comaniac in #1
- [CI] Setup CI by @comaniac in #3
- [Lint] Fix rest linting errors by @comaniac in #2
- [Bugfix] Fix batch size in slapo-deepspeed by @chhzh123 in #7
- Fix transformers import order in megatron scripts by @szhengac in #5
- [Pipeline] Tie weight analysis by @comaniac in #8
- [Bugfix] fix initialization by @szhengac in #4
- [Bugfix] Reproduce experimental results in docker image by @chhzh123 in #9
- [Schedule] Support sequence parallelism by @comaniac in #6
- [Test] Add end-to-end tests by @chhzh123 in #14
- [Pipeline] Register tie weights by @comaniac in #15
- [Bugfix] Fix schedule and dockerfile by @comaniac in #17
- [Test] Add tracer unit tests by @chhzh123 in #16
New Contributors
- @comaniac made their first contribution in #1
- @chhzh123 made their first contribution in #7
- @szhengac made their first contribution in #5
Full Changelog: https://github.com/awslabs/slapo/commits/v0.0.1