cover
images
01.introduction.pdf
01.introduction.pptx
02.architecture.pdf
02.architecture.pptx
03.communication.pdf
03.communication.pptx
04.primitive.pdf
04.primitive.pptx
05.system.pdf
05.system.pptx
06.challenge.pdf
06.challenge.pptx
07.algorithm_arch.pdf
07.algorithm_arch.pptx
08.algorithm_sota.pdf
08.algorithm_sota.pptx
README.md

README.md

Distributed Training

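What is a large model? A large model has so many parameters that training it requires distributed parallel training to speed up the process. Distributed parallelism runs on large-scale AI clusters, and accelerating it calls for hardware-software co-design: beyond solving communication-topology and cluster-networking problems, you also need to understand upper-layer algorithms such as MoE and Transformer. Analyzing these algorithms leads to parallel modes and communication synchronization modes such as model parallelism, data parallelism, and optimizer parallelism, which accelerate distributed training. Within the smallest single-machine execution unit, techniques such as mixed precision and gradient accumulation are applied to large models to squeeze even more compute out of the cluster!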
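As a rough illustration of the data parallelism and gradient accumulation mentioned above, the following is a minimal sketch in plain Python, not the course's implementation. It assumes a toy one-parameter model `y_hat = w * x` with mean squared error loss; the helper names (`grad`, `all_reduce_mean`, `data_parallel_step`, `accumulated_step`) are hypothetical and stand in for what a real framework and a real collective communication library would provide. With equally sized shards, averaging per-worker gradients (data parallelism) or per-micro-batch gradients (gradient accumulation) yields the same update as one full-batch step.

```python
# Toy model (assumption for illustration): y_hat = w * x, loss = mean((w*x - y)^2).

def grad(w, batch):
    """Gradient of the mean squared error with respect to w for one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for an all-reduce collective: every worker receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def data_parallel_step(w, batch, num_workers, lr):
    """One SGD step where each simulated worker handles an equal shard."""
    shard = len(batch) // num_workers
    shards = [batch[i * shard:(i + 1) * shard] for i in range(num_workers)]
    local_grads = [grad(w, s) for s in shards]   # computed independently per worker
    synced = all_reduce_mean(local_grads)        # communication step
    return w - lr * synced[0]                    # every replica applies the same update

def accumulated_step(w, batch, micro_batches, lr):
    """Same update on one device: accumulate micro-batch gradients, step once."""
    size = len(batch) // micro_batches
    acc = 0.0
    for i in range(micro_batches):
        acc += grad(w, batch[i * size:(i + 1) * size]) / micro_batches
    return w - lr * acc

data = [(float(x), 3.0 * x) for x in range(1, 9)]   # 8 samples drawn from y = 3x
w0, lr = 0.0, 0.01
full = w0 - lr * grad(w0, data)                      # single-device, full-batch step
dp = data_parallel_step(w0, data, num_workers=4, lr=lr)
ga = accumulated_step(w0, data, micro_batches=4, lr=lr)
print(full, dp, ga)
```

The equivalence relies on equal shard sizes; with uneven shards the per-worker gradients would need to be weighted by sample count before averaging.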

Outline

We recommend downloading or using the PDF versions first; the PPT versions may render poorly because of missing fonts and similar issues.

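| No. | Category | Title | Notes |
| --- | --- | --- | --- |
|  | Distributed Cluster | 01 Introduction | slide, video |
|  | Distributed Cluster | 02 AI Cluster Server Architecture | slide, video |
|  | Distributed Cluster | 03 AI Cluster Hardware and Software Communication | slide, video |
|  | Distributed Cluster | 04 Collective Communication Primitives | slide, video |
|  | Distributed Algorithms | 05 Distributed Features of AI Frameworks | slide, video |
| 5 | Distributed Algorithms | 06 Challenges of Large-Model Training | slide, video |
|  | Distributed Algorithms | 07 Algorithms: Large-Model Architectures | slide, video |
|  | Distributed Algorithms | 08 Algorithms: SOTA Models at the Hundred-Million-Parameter Scale | slide, video |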
