I downloaded the LLaMA 7B model, and it comes with only one model file ending in .pth. But according to the model loading code in llama_model.py (shown below), if I want to train the model on multiple GPUs, I apparently need to split the model into as many checkpoint files as there are GPUs. May I ask how to do that, or is there something I have misunderstood?
import json
from pathlib import Path
from typing import Tuple

import torch


def load_checkpoints(
    ckpt_dir: str, local_rank: int, world_size: int
) -> Tuple[dict, dict]:
    # One .pth shard per model-parallel rank is expected in ckpt_dir
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    assert world_size == len(checkpoints), (
        f"Loading a checkpoint for MP={len(checkpoints)} but world "  # world size means the number of GPUs used, right?
        f"size is {world_size}"
    )
    # Each rank loads only the shard that matches its local rank
    ckpt_path = checkpoints[local_rank]
    print("Loading")
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    with open(Path(ckpt_dir) / "params.json", "r") as f:
        params = json.loads(f.read())
    return checkpoint, params
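For context, here is a minimal check on my side (the directory path is just my example): the 7B download contains a single shard, so the assertion above seems to require world_size == 1.

    from pathlib import Path

    ckpt_dir = Path("llama/7B")              # example path to the downloaded 7B weights
    shards = sorted(ckpt_dir.glob("*.pth"))  # for me this is just consolidated.00.pth
    print(f"checkpoint shards found: {len(shards)}")
    # load_checkpoints() asserts world_size == len(shards), so with one shard
    # the launcher would have to start exactly one model-parallel process.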