Unable to bootstrap cluster stacks using custom AMIs built from the ParallelCluster-blessed base ubuntu2204 image #6552

Open
rmarable-flaretx opened this issue Nov 8, 2024 · 1 comment
@rmarable-flaretx

We are experiencing issues bootstrapping ParallelCluster stacks when using a custom AMI built from the ParallelCluster-blessed ubuntu2204 image.

The image was built using this guidance from the public AWS documentation (we used ami-0b12c07d044901fda).
https://docs.aws.amazon.com/parallelcluster/latest/ug/building-custom-ami-v3.html#modify-an-aws-parallelcluster-ami-v3

The image builds successfully, but attempts to launch a stack fail due to a HeadNodeBootstrapFailure.

<snipped>

  "cloudFormationStackStatus": "CREATE_FAILED",
  "clusterName": "carlotta",
  "computeFleetStatus": "UNKNOWN",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-2:*:stack/carlotta/cd9d05b0-9deb-11ef-9935-06c9041a1ea1",
  "lastUpdatedTime": "2024-11-08T16:09:16.575Z",
  "region": "us-east-2",
  "clusterStatus": "CREATE_FAILED",
  "scheduler": {
    "type": "slurm"
  },
  "failures": [
    {
      "failureCode": "HeadNodeBootstrapFailure",
      "failureReason": "Failed to set up the head node."
    }
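For anyone triaging the same failure, the failure entries can be pulled out of status output like the one above programmatically. This is a minimal sketch, assuming the complete JSON (snipped above) has been saved and parsed; the helper name `failure_codes` is hypothetical, and it relies only on the `failures`, `failureCode`, and `failureReason` keys that appear in the excerpt.

```python
import json


def failure_codes(cluster_description: dict) -> list[tuple[str, str]]:
    """Return (failureCode, failureReason) pairs from a cluster status dict.

    Hypothetical helper: only the keys quoted in this issue are assumed.
    """
    return [
        (item.get("failureCode", ""), item.get("failureReason", ""))
        for item in cluster_description.get("failures", [])
    ]


# Sample data mirrors the fields quoted in this issue; everything else is omitted.
sample = json.loads("""
{
  "clusterStatus": "CREATE_FAILED",
  "failures": [
    {"failureCode": "HeadNodeBootstrapFailure",
     "failureReason": "Failed to set up the head node."}
  ]
}
""")

for code, reason in failure_codes(sample):
    print(f"{code}: {reason}")
```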

Here is the stacktrace file referenced in the chef-client log:

Generated at 2024-11-08 16:22:56 +0000
Mixlib::ShellOut::ShellCommandFailed: execute[check if clustermgtd heartbeat is available] (aws-parallelcluster-slurm::finalize_head_node line 19) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of cat /opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat ----
STDOUT:
STDERR: cat: /opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat: No such file or directory
---- End output of cat /opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat ----
Ran cat /opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat returned 1
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/mixlib-shellout-3.2.7/lib/mixlib/shellout.rb:300:in `invalid!'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/mixlib-shellout-3.2.7/lib/mixlib/shellout.rb:287:in `error!'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/mixlib-shellout-3.2.7/lib/mixlib/shellout/helper.rb:130:in `shell_out_compacted!'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/mixlib-shellout-3.2.7/lib/mixlib/shellout/helper.rb:54:in `shell_out!'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/provider/execute.rb:52:in `block (2 levels) in <class:Execute>'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/mixin/why_run.rb:51:in `add_action'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/provider.rb:293:in `converge_by'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/provider/execute.rb:50:in `block in <class:Execute>'
(eval):2:in `block in action_run'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/provider.rb:304:in `instance_eval'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/provider.rb:304:in `compile_and_converge_action'
(eval):2:in `action_run'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/provider.rb:245:in `run_action'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/resource.rb:601:in `block in run_action'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/resource.rb:628:in `with_umask'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/resource.rb:600:in `run_action'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/runner.rb:74:in `run_action'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/runner.rb:108:in `block in run_all_actions'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/runner.rb:108:in `each'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/runner.rb:108:in `run_all_actions'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/runner.rb:132:in `block in converge'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/resource_collection/resource_list.rb:96:in `block in execute_each_resource'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/resource_collection/stepable_iterator.rb:114:in `call_iterator_block'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/resource_collection/stepable_iterator.rb:85:in `step'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/resource_collection/stepable_iterator.rb:103:in `iterate'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/resource_collection/stepable_iterator.rb:54:in `each_with_index'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/resource_collection/resource_list.rb:94:in `execute_each_resource'
/opt/cinc/embedded/lib/ruby/3.1.0/forwardable.rb:238:in `execute_each_resource'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/runner.rb:130:in `converge'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/client.rb:869:in `block in converge'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/client.rb:864:in `catch'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/client.rb:864:in `converge'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/client.rb:888:in `converge_and_save'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/client.rb:298:in `run'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/application.rb:305:in `run_with_graceful_exit_option'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/application.rb:281:in `block in run_chef_client'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/local_mode.rb:42:in `with_server_connectivity'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/application.rb:264:in `run_chef_client'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/application/base.rb:354:in `run_application'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-18.4.12/lib/chef/application.rb:67:in `run'
/opt/cinc/embedded/lib/ruby/gems/3.1.0/gems/chef-bin-18.4.12/bin/cinc-client:25:in `<top (required)>'
/bin/cinc-client:183:in `load'
/bin/cinc-client:183:in `<main>'
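The trace shows the recipe failing because `cat` could not find the heartbeat file. A rough way to reproduce that check by hand on the head node is sketched below. The path is taken from the trace; the function name `heartbeat_age_seconds` is hypothetical, and using the file's modification time as a freshness signal is an assumption of this sketch, not something the recipe is known to do.

```python
import os
import time
from typing import Optional

# Path copied from the stacktrace above.
HEARTBEAT = "/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat"


def heartbeat_age_seconds(path: str = HEARTBEAT) -> Optional[float]:
    """Return seconds since the file was last modified, or None if it does not exist."""
    try:
        return time.time() - os.stat(path).st_mtime
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    age = heartbeat_age_seconds()
    if age is None:
        # Same condition the failing `cat` hit: the file was never written.
        print(f"missing: {HEARTBEAT}")
    else:
        print(f"heartbeat last written {age:.0f}s ago")
```

A missing file suggests clustermgtd never started or could not write to that directory, which is why its own log is the next place to look.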

This is from the cfn-init log:

2024-11-08 16:22:56,137 [ERROR] Error encountered during build of chefFinalize: Command chef failed
Traceback (most recent call last):
  File "/opt/parallelcluster/pyenv/versions/3.9.20/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py", line 579, in run_config
    CloudFormationCarpenter(config, self._auth_config, self.strict_mode).build(worklog)
  File "/opt/parallelcluster/pyenv/versions/3.9.20/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/construction.py", line 277, in build
    changes['commands'] = CommandTool().apply(
  File "/opt/parallelcluster/pyenv/versions/3.9.20/envs/cfn_bootstrap_virtualenv/lib/python3.9/site-packages/cfnbootstrap/command_tool.py", line 127, in apply
    raise ToolError(u"Command %s failed" % name)
2024-11-08T16:22:56.137Z cfnbootstrap.construction_errors.ToolError: Command chef failed
2024-11-08 16:22:56,139 [ERROR] -----------------------BUILD FAILED!------------------------

Please make sure to add the following data in order to facilitate the root cause detection.

Required Info:

  • AWS ParallelCluster version [e.g. 3.1.1]: 3.11.1
  • Full cluster configuration without any credentials or personal data.
Region: us-east-2
CustomS3Bucket: SomeBucket

Imds:
 ImdsSupport: v1.0

Image:
 Os: ubuntu2204
 CustomAmi: ami-03a34b992c89aa978

HeadNode:
 InstanceType: c5.2xlarge
 Networking:
   SubnetId: subnet-08a9c7be86a271297
   AdditionalSecurityGroups:
     - sg-0200e80b8c8a8011a
 Ssh:
   KeyName: pcluster_compchem_ubuntu2204_us-east-2
 Iam:
   AdditionalIamPolicies:
     - Policy: arn:aws:iam::2xxxxxxxxxx5:policy/pcluster_munge_secretsmanager_policy_us-east-2
     - Policy: arn:aws:iam::2xxxxxxxxxx5:policy/pcluster_rds_secretsmanager_policy_us-east-2
 LocalStorage:
   RootVolume:
     Size: 100
     VolumeType: gp3
     Iops: 5000
     Throughput: 500
 SharedStorageType: Efs
 CustomActions:
   OnNodeStart:
     Sequence:
       - Script: s3://some-bucket/pcluster-scripts/carlotta/fix-etc-dir.sh
   OnNodeConfigured:
     Sequence:
       - Script: s3://some-bucket/pcluster-scripts/carlotta/configure-pcluster-stack.sh

Scheduling:
 Scheduler: slurm
 SlurmSettings:
   ScaledownIdletime: 10
   MungeKeySecretArn: arn:aws:secretsmanager:us-east-2:2xxxxxxxxxx5:secret:pcluster-default-munge-key-YJ7gIG
   Database:
     Uri: pcluster-slurmdb.feefifoofum.us-east-2.rds.amazonaws.com:3306
     UserName: slurmdb_admin
     PasswordSecretArn: arn:aws:secretsmanager:us-east-2:2xxxxxxxxxx5:secret:pcluster_slurmdb_admin-VGiLEd
     DatabaseName: carlotta_slurm_acct_db
 SlurmQueues:
   - Name: main
     ComputeSettings:
       LocalStorage:
         RootVolume:
           Size: 100
           VolumeType: gp3
           Iops: 5000
           Throughput: 500
         EphemeralVolume:
           MountDir: /local_scratch
     Networking:
       SubnetIds:
       - subnet-08a9c7be86a271297
       AdditionalSecurityGroups:
         - sg-0200e80b8c8a8011a
       PlacementGroup:
         Enabled: true
     ComputeResources:
       - Name: main-compute
         Instances:
         - InstanceType: c5d.8xlarge
         MinCount: 0
         MaxCount: 24
         Efa:
           Enabled: false
     CapacityType: SPOT
     CustomActions:
       OnNodeStart:
         Sequence:
           - Script: s3://some-bucket/pcluster-scripts/carlotta/fix-etc-dir.sh
       OnNodeConfigured:
         Sequence:
           - Script: s3://some-bucket/pcluster-scripts/carlotta/configure-flaretx-pcluster-stack.sh
     Iam:
       AdditionalIamPolicies:
         - Policy: arn:aws:iam::2xxxxxxxxxx5:policy/pcluster_munge_secretsmanager_policy_us-east-2
         - Policy: arn:aws:iam::2xxxxxxxxxx5:policy/pcluster_rds_secretsmanager_policy_us-east-2

Any suggestions on how to get around this would be welcome.

Thanks,
Rodney

@rmarable-flaretx changed the title from "Unable to bootstrap ubuntu2204 using custom AMIs built from the ParallelCluster-blessed base ubuntu2204 image" to "Unable to bootstrap cluster stacks using custom AMIs built from the ParallelCluster-blessed base ubuntu2204 image" on Nov 8, 2024

@hgreebe
Contributor

hgreebe commented Nov 11, 2024

Hello,
What changes did you make to the base AMI, and what do your custom actions do?
Could you also supply the clustermgtd log?

Thank you
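One way to grab the requested log excerpt from the head node is sketched here. The log location `/var/log/parallelcluster/clustermgtd` is an assumption about the default install and should be checked on the instance; the `tail` helper and the 200-line default are hypothetical choices for illustration.

```python
from collections import deque

# Assumed default location of the clustermgtd log; verify on the head node.
CLUSTERMGTD_LOG = "/var/log/parallelcluster/clustermgtd"


def tail(path: str, lines: int = 200) -> list[str]:
    """Return the last `lines` lines of a text file, without trailing newlines."""
    with open(path, "r", errors="replace") as fh:
        # deque with maxlen keeps only the most recent lines while streaming the file.
        return [line.rstrip("\n") for line in deque(fh, maxlen=lines)]


if __name__ == "__main__":
    try:
        print("\n".join(tail(CLUSTERMGTD_LOG)))
    except FileNotFoundError:
        # An absent log is itself useful to report: the daemon may never have started.
        print(f"no log found at {CLUSTERMGTD_LOG}")
```

Remember to redact account IDs and secret ARNs from the output before posting it.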
