This directory contains the pretraining scripts for Megatron-protein and the distributed training script.
You need to modify the path variables, such as DATA_PATH (the path prefix of your processed_pfam_documents.bin/idx files), iupac_vocab.txt, etc.
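For example, the path variables in a pretraining script might look like the sketch below; the variable names (other than DATA_PATH) and all paths are placeholders, so adapt them to your own script and data layout:

```bash
# Placeholder paths -- point them to your own processed data and vocab.
DATA_PATH=/data/pfam/processed_pfam_documents   # prefix of the .bin/.idx pair
VOCAB_FILE=/data/pfam/iupac_vocab.txt           # IUPAC amino-acid vocabulary
CHECKPOINT_PATH=/checkpoints/megatron-protein   # where checkpoints are saved
```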
You can also adjust the model's hyperparameters, such as the hidden size and the number of attention heads, in the script.
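In a Megatron-style launch script these hyperparameters are usually passed as command-line flags. The sketch below is illustrative only: the entry-point name is a placeholder, the flag names follow Megatron-LM conventions (they may differ slightly in your version), and the values match the configuration listed later in this README:

```bash
# Model-size arguments in the pretraining script; pretrain_protein.py is a
# placeholder for the actual entry point used by this repo.
python pretrain_protein.py \
       --num-layers 16 \
       --hidden-size 1024 \
       --num-attention-heads 16
```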
We trained the model on 4 machines, each with 8 Tesla V100 (32 GB) GPUs.
For parallel training, you need mpirun (we used Open MPI 4.0.5); you can download Open MPI from its official website.
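A minimal launch sketch, assuming 4 nodes with 8 GPUs each (32 ranks in total); the hostfile, host aliases, port, and the wrapped script name are placeholders:

```bash
# hostfile lists one node per line, e.g. "node1 slots=8".
# node1, the port, and pretrain.sh are placeholders -- adapt to your cluster.
mpirun -np 32 \
       --hostfile hostfile \
       -x MASTER_ADDR=node1 -x MASTER_PORT=6000 \
       bash pretrain.sh
```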
For communication among the nodes, you need to set host aliases in /etc/hosts if your servers are Linux-based.
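For example, the /etc/hosts entries on each node might look like this (the IP addresses and aliases below are placeholders):

```
# /etc/hosts on every node: map each node's IP to its alias.
192.168.1.101  node1
192.168.1.102  node2
192.168.1.103  node3
192.168.1.104  node4
```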
If you encounter timeout problems in distributed training, you can try setting the timeout parameter of torch.distributed.init_process_group to a longer duration (the default is 30 minutes) in initialize.py, as in the snippet below.
```python
from datetime import timedelta
# ...
torch.distributed.init_process_group(
    backend=args.distributed_backend,
    world_size=args.world_size, rank=args.rank,
    init_method=init_method,
    timeout=timedelta(hours=8))
```
Model configuration:

- hidden size = 1024
- number of layers = 16
- number of attention heads = 16
The model checkpoint can be downloaded from Google Drive or Tsinghua Cloud.
Pretraining was carried out on 4 machines, each with 8 Tesla V100 (32 GB) GPUs, for about 5 days.
Typically, each iteration takes about 420 ms for the forward pass and about 1000 ms for the backward pass.
The protein model trained in the Megatron-LM framework reached a perplexity (ppl) of 5.8.