Hi, I'm experiencing performance degradation when using multi-node training with `pretrain.py`.
I followed the continual pretraining tutorial using TinyLlama on the OpenWebMath 14B dataset.
I'm working with a bare-bones multi-node setup, where each node has 8 GPUs. For each node, I used the following commands:
According to the wandb logs, the total number of training tokens is the same across runs, but the number of iterations decreases in proportion to the number of nodes.
However, the final results show lower performance as the number of nodes increases.
|        | gsm8k | math | svamp | asdiv | mawps | tabmwp | mathqa | mmlu_stemm | sat_math | avg  |
|--------|-------|------|-------|-------|-------|--------|--------|------------|----------|------|
| 32node | 2.9   | 3.2  | 15.1  | 22.1  | 27.9  | 15.3   | 12.1   | 14.2       | 18.8     | 14.6 |
| 2node  | 4.1   | 3.6  | 17.9  | 29.7  | 38.7  | 15.9   | 12.3   | 15.8       | 18.8     | 17.4 |
| 1node  | 4.1   | 3.0  | 19.6  | 29.9  | 39.4  | 15.7   | 9.8    | 16.5       | 31.2     | 18.8 |
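For reference, here is the back-of-the-envelope arithmetic behind the iteration counts I mentioned above. It's a rough sketch only; the variable names and values are illustrative and not taken from `pretrain.py` or my actual config.

```python
# Rough sanity-check arithmetic (illustrative values; not the actual
# variable names or hyperparameters from pretrain.py).
seq_len = 2048                  # assumed block size
global_batch_size = 512         # assumed samples per optimizer step
total_tokens = 14_000_000_000   # ~14B tokens (OpenWebMath)

tokens_per_iter = global_batch_size * seq_len
expected_iters = total_tokens // tokens_per_iter
print(f"expected iterations: {expected_iters:,}")

# With total_tokens and global_batch_size both fixed, expected_iters does
# not depend on the node count. The fact that the logged iteration count
# drops roughly in proportion to the number of nodes is exactly what I
# can't reconcile with a fixed global_batch_size.
```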
In wandb, the loss seems to be logged only from rank 0, so I understand why the loss curves might look different across runs.
However, I can't figure out why the final evaluation results get worse.
To clarify, all runs use the same learning rate and the same `global_batch_size`.
I'd appreciate any advice on what might be causing this issue and what adjustments I should consider.
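Regarding the rank-0 loss logging above, this is roughly what I mean by averaging the loss across ranks before logging. It's a minimal sketch assuming a standard `torch.distributed` process group, not the existing `pretrain.py` logging code:

```python
import torch
import torch.distributed as dist

def global_mean_loss(local_loss: torch.Tensor) -> torch.Tensor:
    """Average a scalar loss across all ranks so the logged value reflects
    the whole global batch rather than only rank 0's shard."""
    if dist.is_available() and dist.is_initialized():
        reduced = local_loss.detach().clone()
        dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
        return reduced / dist.get_world_size()
    return local_loss.detach()

# Usage in the training loop (only rank 0 actually calls wandb.log,
# but every rank must enter the all_reduce):
# logged = global_mean_loss(loss)
# if dist.get_rank() == 0:
#     wandb.log({"train/loss": logged.item()})
```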