7_gpu_pp1.log
31_gpu_pp4.log
When running llama2 70B (with the number of layers reduced), the loss descent trends for PP=1 and PP=4 differ. The logs and loss curves are attached above; the launch script is as follows:
```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=192.167.5.2
MASTER_PORT=29501
NUM_NODES=4
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH='/data/zhangling21/ckpts/'
TENSORBOARD_LOGS_PATH='/data/zhangling21/tensorboard_logs/'
TOKENIZER_PATH='/data/zhangling21/llama_00_text_document/tokenizer/tokenizer.model'
DATA_PATH='/data/zhangling21/llama_00_text_document/llama_00_text_document'

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NUM_NODES
    --node_rank $NODE_RANK
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
)

# --tokenizer-type LLaMASentencePieceTokenizer
# --rmsnorm-epsilon 1e-5
LLAMA_MODEL_ARGS=(
    --num-layers 8
    --hidden-size 8192
    --ffn-hidden-size 28672
    --num-attention-heads 64
    --seq-length 4096
    --max-position-embeddings 4096
    --group-query-attention
    --num-query-groups 8
    --tokenizer-type Llama2Tokenizer
    --tokenizer-model $TOKENIZER_PATH
    --swiglu
    --normalization RMSNorm
    --use-rotary-position-embeddings
    --no-position-embedding
    --disable-bias-linear
)

# --optimizer adam
# --adam-eps 1e-05
# --no-contiguous-buffers-in-local-ddp
# --recompute-method uniform
# --no-async-tensor-model-parallel-allreduce
# --embedding-dropout 0
# --multi-query-attention
# --multi-query-group-num 8
# --ffn-dim-multiplier 1.3
# --recompute-granularity full
# --distribute-saved-activations
# --recompute-num-layers 1
# --memory-saving
# --fp16
TRAINING_ARGS=(
    --micro-batch-size 1
    --global-batch-size 44
    --train-samples 24414
    --weight-decay 1e-2
    --optimizer adam
    --clip-grad 1.0
    --lr 0.00015
    --lr-decay-style cosine
    --min-lr 1.0e-5
    --lr-warmup-fraction .01
    --adam-beta1 0.9
    --adam-beta2 0.95
    --attention-dropout 0.0
    --hidden-dropout 0.0
    --untie-embeddings-and-output-weights
    --multiple-of 4096
    --no-gradient-accumulation-fusion
    --recompute-granularity 'full'
    --recompute-num-layers 1
    --recompute-method 'uniform'
    --no-async-tensor-model-parallel-allreduce
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 8
    --pipeline-model-parallel-size 4
)

DATA_ARGS=(
    --data-path $DATA_PATH
    --split 1
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 1
    --init-method-std 0.02
    --seed 1234
    --eval-iters 0
    --use-cpu-initialization
)

# --load "/data/zhangling21/llama_00_text_document/ckpt0227_8L"
# --no-load-rng
# --save "/data/zhangling21/llama_00_text_document/ckpt0227_8L"
# --save-interval 1

cmd="torchrun ${DISTRIBUTED_ARGS[@]} pretrain_llama.py \
    ${LLAMA_MODEL_ARGS[@]} \
    ${TRAINING_ARGS[@]} \
    ${MODEL_PARALLEL_ARGS[@]} \
    ${DATA_ARGS[@]} \
    ${EVAL_AND_LOGGING_ARGS[@]}"
echo $cmd
eval $cmd
```
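For context, a minimal sketch of how the PP=1 baseline presumably differs from the script above (an assumption on my part, not taken from the report: only the parallel layout and node count change), plus one way to pull the per-iteration loss out of the two attached logs for plotting, assuming they follow the Megatron-style `lm loss: <value>` log format:

```bash
# Hypothetical PP=1 counterpart of the run above: only the pipeline
# layout changes, so any divergence in the loss curve should come from
# the pipeline schedule rather than the model or data configuration.
MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 8
    --pipeline-model-parallel-size 1
)
# With TP=8 and PP=1, a single 8-GPU node covers the whole model:
NUM_NODES=1

# Extract the loss series from both logs for a side-by-side plot
# (assumes Megatron-style "lm loss: <value>" lines; adjust the
# pattern if pretrain_llama.py logs a different field name):
grep -oP 'lm loss: \K[0-9.eE+-]+' 7_gpu_pp1.log  > pp1_loss.txt
grep -oP 'lm loss: \K[0-9.eE+-]+' 31_gpu_pp4.log > pp4_loss.txt
```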
@zhaoyinglia Hello, could you please take a look at this issue? @aoyulong has been quite busy recently.