ERROR:composer.cli.launcher:Global rank 0 (PID 208865) exited with code -11 #1501

Open
AndrewHYC opened this issue Aug 30, 2024 · 1 comment
Labels
question Further information is requested

Comments

@AndrewHYC

When fine-tuning Llama 3.1, the following error occurs and I can't locate the exact cause. How can I fix it?

Running environment:

Python 3.11.0rc1
GPU: 2xA100
CUDA Version: 12.2

Full log:

2024-08-30 11:05:27.361223: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-30 11:05:31,112: rank0[208865][MainThread]: DEBUG: llmfoundry.command_utils.train: Initializing dist with device...
2024-08-30 11:05:31,291: rank0[208865][MainThread]: DEBUG: llmfoundry.command_utils.train: Testing barrier with device...
2024-08-30 11:05:32,167: rank0[208865][MainThread]: DEBUG: llmfoundry.command_utils.train: Barrier test passed with device.
/Workspace/Shared/Groups/a100-shared-group/andreihuang/llm-foundry/llmfoundry/utils/config_utils.py:527: UserWarning: Setting `sync_module_states = True` for FSDP. This is required when using mixed initialization.
  warnings.warn((
2024-08-30 11:05:32,217: rank0[208865][MainThread]: INFO: llmfoundry.command_utils.train: Building tokenizer...
2024-08-30 11:05:32,927: rank0[208865][MainThread]: INFO: llmfoundry.command_utils.train: Building train loader...
2024-08-30 11:05:32,929: rank0[208865][MainThread]: INFO: llmfoundry.data.finetuning.tasks: No preprocessor was supplied and no preprocessing function is registered for dataset name "json". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
/Workspace/Shared/Groups/a100-shared-group/andreihuang/llm-foundry/llmfoundry/data/finetuning/tasks.py:991: UserWarning: Dropped 338 examples where the prompt was longer than 4096, the prompt or response was empty, or the response was all padding tokens.
  warnings.warn(
2024-08-30 11:05:42,479: rank0[208865][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: Local rank 0 finished data prep
2024-08-30 11:05:50,227: rank0[208865][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: All ranks finished data prep
2024-08-30 11:05:50,229: rank0[208865][MainThread]: INFO: llmfoundry.command_utils.train: Building eval loader...
2024-08-30 11:05:50,230: rank0[208865][MainThread]: INFO: llmfoundry.data.finetuning.tasks: No preprocessor was supplied and no preprocessing function is registered for dataset name "json". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
2024-08-30 11:05:59,704: rank0[208865][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: Local rank 0 finished data prep
2024-08-30 11:06:07,431: rank0[208865][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: All ranks finished data prep
2024-08-30 11:06:07,433: rank0[208865][MainThread]: INFO: llmfoundry.command_utils.train: Initializing model...
/local_disk0/.ephemeral_nfs/envs/pythonEnv-8060316e-1051-4bfa-9ae2-70ff8b1ada0d/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py:957: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
/local_disk0/.ephemeral_nfs/envs/pythonEnv-8060316e-1051-4bfa-9ae2-70ff8b1ada0d/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:924: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
/local_disk0/.ephemeral_nfs/envs/pythonEnv-8060316e-1051-4bfa-9ae2-70ff8b1ada0d/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py:469: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  3.72it/s]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-8060316e-1051-4bfa-9ae2-70ff8b1ada0d/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py:469: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.38it/s]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-8060316e-1051-4bfa-9ae2-70ff8b1ada0d/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:924: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
2024-08-30 11:06:12,920: rank0[208865][MainThread]: INFO: llmfoundry.command_utils.train: Building trainer...
/local_disk0/.ephemeral_nfs/envs/pythonEnv-8060316e-1051-4bfa-9ae2-70ff8b1ada0d/lib/python3.11/site-packages/composer/trainer/trainer.py:256: UserWarning: `device_train_microbatch_size='auto'` may potentially fail with unexpected CUDA errors. Auto microbatching attempts to catch CUDA Out of Memory errors and adjust the batch size, but it is possible CUDA will be put into an irrecoverable state due to PyTorch bugs, e.g. integer overflow. In this case, please manually set device_train_microbatch_size explicitly to an integer instead.
  warnings.warn((
2024-08-30 11:06:12,922: rank0[208865][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2024-08-30 11:06:12,928: rank0[208865][MainThread]: INFO: composer.trainer.trainer: Run name: llm
2024-08-30 11:06:12,930: rank0[208865][MainThread]: INFO: composer.core.state: Automatically setting data_parallel_shard to have parallelization degree 2.
/local_disk0/.ephemeral_nfs/envs/pythonEnv-8060316e-1051-4bfa-9ae2-70ff8b1ada0d/lib/python3.11/site-packages/composer/callbacks/memory_monitor.py:137: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
  warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')
2024-08-30 11:06:14,203: rank0[208865][MainThread]: INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
2024-08-30 11:06:14,233: rank0[208865][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2024-08-30 11:06:22,784: rank0[208865][MainThread]: DEBUG: composer.utils.reproducibility: Restoring the RNG state
2024-08-30 11:06:22,785: rank0[208865][MainThread]: INFO: composer.trainer.trainer: Setting seed to 17
2024-08-30 11:06:22,785: rank0[208865][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2024-08-30 11:06:22,790: rank0[208865][MainThread]: INFO: llmfoundry.command_utils.train: Logging config
variables:
  tokenizer_name: /local_disk0/models/Meta-Llama-3.1-8B-Instruct
  global_seed: 17
  max_seq_len: 4096
max_seq_len: 4096
max_duration: 3ep
eval_first: false
eval_interval: 1ep
eval_subset_num_batches: -1
global_train_batch_size: 8
run_name: null
max_split_size_mb: 512
model:
  name: hf_causal_lm
  init_device: mixed
  pretrained_model_name_or_path: /local_disk0/models/Meta-Llama-3.1-8B-Instruct
  pretrained: true
  use_auth_token: false
  use_flash_attention_2: true
tokenizer:
  name: /local_disk0/models/Meta-Llama-3.1-8B-Instruct
  kwargs:
    model_max_length: 4096
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: ./data/
    split: train
    max_seq_len: 4096
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
eval_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: ./data/
    split: test
    max_seq_len: 4096
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: false
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1
optimizer:
  name: decoupled_lionw
  lr: 5.0e-07
  betas:
  - 0.9
  - 0.95
  weight_decay: 0.0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
seed: 17
device_eval_batch_size: 2
device_train_microbatch_size: auto
precision: amp_bf16
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  sync_module_states: true
  load_monolith_rank0_only: true
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
load_weights_only: true
save_interval: 1ep
save_num_checkpoints_to_keep: 3
save_folder: /Volumes/tencent-ml/default/external_volume/andreihuang/{run_name}/checkpoints
n_gpus: 2
device_train_batch_size: 4
device_train_grad_accum: auto
merge: true
n_params: 8030261248
n_active_params: 8030261248
n_trainable_params: 8030261248

2024-08-30 11:06:23,098: rank0[208865][MainThread]: INFO: llmfoundry.command_utils.train: Starting training...
2024-08-30 11:06:23,098: rank0[208865][MainThread]: INFO: composer.trainer.trainer: Using precision Precision.AMP_BF16
[2024-08-30 11:06:23,240] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
ERROR:composer.cli.launcher:Rank 1 crashed with exit code -11.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 208865) exited with code -11
Global rank 1 (PID 208866) exited with code -11
----------Begin global rank 1 STDOUT----------
variables:
  tokenizer_name: /local_disk0/models/Meta-Llama-3.1-8B-Instruct
  global_seed: 17
  max_seq_len: 4096
max_seq_len: 4096
max_duration: 3ep
eval_first: false
eval_interval: 1ep
eval_subset_num_batches: -1
global_train_batch_size: 8
run_name: null
max_split_size_mb: 512
model:
  name: hf_causal_lm
  init_device: mixed
  pretrained_model_name_or_path: /local_disk0/models/Meta-Llama-3.1-8B-Instruct
  pretrained: true
  use_auth_token: false
  use_flash_attention_2: true
tokenizer:
  name: /local_disk0/models/Meta-Llama-3.1-8B-Instruct
  kwargs:
    model_max_length: 4096
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: ./data/
    split: train
    max_seq_len: 4096
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
eval_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: ./data/
    split: test
    max_seq_len: 4096
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: false
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1
optimizer:
  name: decoupled_lionw
  lr: 5.0e-07
  betas:
  - 0.9
  - 0.95
  weight_decay: 0.0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
seed: 17
device_eval_batch_size: 2
device_train_microbatch_size: auto
precision: amp_bf16
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  sync_module_states: true
  load_monolith_rank0_only: true
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
load_weights_only: true
save_interval: 1ep
save_num_checkpoints_to_keep: 3
save_folder: ./{run_name}/checkpoints
n_gpus: 2
device_train_batch_size: 4
device_train_grad_accum: auto
merge: true
n_params: 8030261248
n_active_params: 8030261248
n_trainable_params: 8030261248

[2024-08-30 11:06:23,240] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)

----------End global rank 1 STDOUT----------
----------Begin global rank 1 STDERR----------
2024-08-30 11:05:27.438049: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-30 11:05:31,077: rank1[208866][MainThread]: DEBUG: llmfoundry.command_utils.train: Initializing dist with device...
2024-08-30 11:05:31,287: rank1[208866][MainThread]: DEBUG: llmfoundry.command_utils.train: Testing barrier with device...
2024-08-30 11:05:32,166: rank1[208866][MainThread]: DEBUG: llmfoundry.command_utils.train: Barrier test passed with device.
/Workspace/Shared/Groups/a100-shared-group/andreihuang/llm-foundry/llmfoundry/utils/config_utils.py:527: UserWarning: Setting `sync_module_states = True` for FSDP. This is required when using mixed initialization.
  warnings.warn((
2024-08-30 11:05:32,217: rank1[208866][MainThread]: INFO: llmfoundry.command_utils.train: Building tokenizer...
2024-08-30 11:05:32,927: rank1[208866][MainThread]: INFO: llmfoundry.command_utils.train: Building train loader...
2024-08-30 11:05:32,929: rank1[208866][MainThread]: INFO: llmfoundry.data.finetuning.tasks: No preprocessor was supplied and no preprocessing function is registered for dataset name "json". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
2024-08-30 11:05:32,929: rank1[208866][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: Waiting for local_rank 0 to finish data prep
/Workspace/Shared/Groups/a100-shared-group/andreihuang/llm-foundry/llmfoundry/data/finetuning/tasks.py:991: UserWarning: Dropped 338 examples where the prompt was longer than 4096, the prompt or response was empty, or the response was all padding tokens.
  warnings.warn(
2024-08-30 11:05:50,227: rank1[208866][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: All ranks finished data prep
2024-08-30 11:05:50,228: rank1[208866][MainThread]: INFO: llmfoundry.command_utils.train: Building eval loader...
2024-08-30 11:05:50,229: rank1[208866][MainThread]: INFO: llmfoundry.data.finetuning.tasks: No preprocessor was supplied and no preprocessing function is registered for dataset name "json". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
2024-08-30 11:05:50,229: rank1[208866][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: Waiting for local_rank 0 to finish data prep
2024-08-30 11:06:07,430: rank1[208866][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: All ranks finished data prep
2024-08-30 11:06:07,431: rank1[208866][MainThread]: INFO: llmfoundry.command_utils.train: Initializing model...
/local_disk0/.ephemeral_nfs/envs/pythonEnv-8060316e-1051-4bfa-9ae2-70ff8b1ada0d/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py:957: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
/local_disk0/.ephemeral_nfs/envs/pythonEnv-8060316e-1051-4bfa-9ae2-70ff8b1ada0d/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:924: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
2024-08-30 11:06:12,911: rank1[208866][MainThread]: INFO: llmfoundry.command_utils.train: Building trainer...
/local_disk0/.ephemeral_nfs/envs/pythonEnv-8060316e-1051-4bfa-9ae2-70ff8b1ada0d/lib/python3.11/site-packages/composer/trainer/trainer.py:256: UserWarning: `device_train_microbatch_size='auto'` may potentially fail with unexpected CUDA errors. Auto microbatching attempts to catch CUDA Out of Memory errors and adjust the batch size, but it is possible CUDA will be put into an irrecoverable state due to PyTorch bugs, e.g. integer overflow. In this case, please manually set device_train_microbatch_size explicitly to an integer instead.
  warnings.warn((
2024-08-30 11:06:12,922: rank1[208866][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 18
2024-08-30 11:06:12,928: rank1[208866][MainThread]: INFO: composer.trainer.trainer: Run name: llm
2024-08-30 11:06:12,930: rank1[208866][MainThread]: INFO: composer.core.state: Automatically setting data_parallel_shard to have parallelization degree 2.
2024-08-30 11:06:14,219: rank1[208866][MainThread]: INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
2024-08-30 11:06:14,233: rank1[208866][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2024-08-30 11:06:21,766: rank1[208866][MainThread]: DEBUG: composer.utils.reproducibility: Restoring the RNG state
2024-08-30 11:06:21,767: rank1[208866][MainThread]: INFO: composer.trainer.trainer: Setting seed to 18
2024-08-30 11:06:21,767: rank1[208866][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 18
2024-08-30 11:06:21,773: rank1[208866][MainThread]: INFO: llmfoundry.command_utils.train: Logging config
2024-08-30 11:06:23,198: rank1[208866][MainThread]: INFO: llmfoundry.command_utils.train: Starting training...
2024-08-30 11:06:23,198: rank1[208866][MainThread]: INFO: composer.trainer.trainer: Using precision Precision.AMP_BF16

ERROR:composer.cli.launcher:Global rank 0 (PID 208865) exited with code -11
----------End global rank 1 STDERR----------
@sahilempire

The log suggests the issue is related to how the model and precision settings are configured for the GPU environment. Here are some steps you can try to resolve it:

  1. Check precision settings:
    The log warns that Flash Attention 2 only supports torch.float16 and torch.bfloat16, but the model is currently in torch.float32. Ensure the model is loaded in a supported dtype by passing torch_dtype=torch.float16 (or torch.bfloat16) when loading it, for example:

    model = AutoModel.from_pretrained("openai/whisper-tiny", torch_dtype=torch.float16)
  2. Move the model to GPU:
    It looks like the model might not be fully moved to the GPU. After loading it, call model.to("cuda") to make sure it is on the GPU.

  3. FSDP sync setting:
    The warning about sync_module_states indicates that sync_module_states=True is required when using mixed initialization with FSDP. Check that this setting is explicitly defined and set to true in your fsdp_config.

  4. Set device_train_microbatch_size explicitly:
    The log warns that device_train_microbatch_size="auto" can fail with unexpected CUDA errors. Manually set it to a fixed integer (e.g. 2 or 4) to see if that stabilizes training:

    device_train_microbatch_size: 4
  5. Rule out CUDA issues:
    Check the environment for potential CUDA problems by running a small batch on the GPU before starting the full training process (see the sketch after this list). Also make sure your PyTorch and Transformers versions are compatible with CUDA 12.2.

  6. System logs for the segmentation fault (-11):
    Exit code -11 is a segmentation fault and is often memory related. Check the system logs (e.g. dmesg) and monitor GPU memory usage closely to ensure no other processes are consuming significant GPU resources.
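
To make steps 1, 2, and 5 concrete, here is a minimal standalone Python sketch. It is an illustration only, not the llm-foundry training path: it assumes the local checkpoint path from the logged config above, that flash-attn is installed, and that a single GPU has room for the 8B model in bf16.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Step 5: quick CUDA sanity check -- a small allocation and matmul on every visible GPU.
    assert torch.cuda.is_available(), "CUDA is not visible to PyTorch"
    for i in range(torch.cuda.device_count()):
        x = torch.randn(1024, 1024, device=f"cuda:{i}", dtype=torch.bfloat16)
        _ = x @ x
        torch.cuda.synchronize(i)
        print(f"cuda:{i} OK")

    # Steps 1-2: load the model in a Flash-Attention-compatible dtype, then move it to the GPU.
    # The path below is taken from the config in this issue and is an assumption here.
    model_path = "/local_disk0/models/Meta-Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,               # fp16/bf16 is required by Flash Attention 2
        attn_implementation="flash_attention_2",
    ).to("cuda")

    # One tiny forward pass to confirm the stack works end to end before a full FSDP run.
    inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model(**inputs)
    print(out.logits.shape)

If this standalone load and forward pass also exits with a segmentation fault, the problem is more likely in the environment (driver, CUDA, or flash-attn build) than in the training config.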
