diff --git a/404.html b/404.html index 5fb6c662..12143129 100644 --- a/404.html +++ b/404.html @@ -12,7 +12,7 @@ - + diff --git a/design/index.html b/design/index.html index 8655524b..94cf516b 100644 --- a/design/index.html +++ b/design/index.html @@ -16,7 +16,7 @@ - + diff --git a/developers/index.html b/developers/index.html index a3b72951..69d5e376 100644 --- a/developers/index.html +++ b/developers/index.html @@ -16,7 +16,7 @@ - + diff --git a/index.html b/index.html index 30abd715..1ed92887 100644 --- a/index.html +++ b/index.html @@ -14,7 +14,7 @@ - + @@ -3097,6 +3097,7 @@
shard
: total number of shards on the node. Sharding allows sharing the same GPU among multiple jobs. The total number of shards is evenly distributed across all GPUs on the node. For some cloud providers, it is possible to define additional attributes. The following sections present the available attributes per provider.
diff --git a/matrix/index.html b/matrix/index.html index cbca6677..2ea579f4 100644 --- a/matrix/index.html +++ b/matrix/index.html @@ -16,7 +16,7 @@ - + diff --git a/search/search_index.json b/search/search_index.json index bb9bcd61..156858a4 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Magic Castle Documentation","text":""},{"location":"#1-setup","title":"1. Setup","text":"To use Magic Castle you will need:
To install Terraform, follow the tutorial or go directly to the Terraform download page.
You can verify Terraform was properly installed by looking at the version in a terminal:
terraform version\n
"},{"location":"#12-authentication","title":"1.2 Authentication","text":""},{"location":"#121-amazon-web-services-aws","title":"1.2.1 Amazon Web Services (AWS)","text":"AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
, environment variables, representing your AWS Access Key and AWS Secret Key: export AWS_ACCESS_KEY_ID=\"an-access-key\"\nexport AWS_SECRET_ACCESS_KEY=\"a-secret-key\"\n
Reference: AWS Provider - Environment Variables
"},{"location":"#122-google-cloud","title":"1.2.2 Google Cloud","text":"gcloud auth application-default login
az login
Reference : Azure Provider: Authenticating using the Azure CLI
"},{"location":"#124-openstack-ovh","title":"1.2.4 OpenStack / OVH","text":"Download your OpenStack Open RC file. It is project-specific and contains the credentials used by Terraform to communicate with OpenStack API. To download, using OpenStack web page go to: Project \u2192 API Access, then click on Download OpenStack RC File then right-click on OpenStack RC File (Identity API v3), Save Link as..., and save the file.
In a terminal located in the same folder as your OpenStack RC file, source the OpenStack RC file:
source *-openrc.sh\n
This command will ask for a password; enter your OpenStack password. Once you are authenticated with your cloud provider, you should be able to communicate with its API. The following sections provide instructions to test this for each provider.
"},{"location":"#131-aws","title":"1.3.1 AWS","text":"test_aws.tf
with the following content: provider \"aws\" {\n region = \"us-east-1\"\n}\n\ndata \"aws_ec2_instance_type\" \"example\" {\n instance_type = \"t2.micro\"\n}\n
terraform init\n
terraform plan\n
If everything is configured properly, terraform will output: No changes. Your infrastructure matches the configuration.\n
Otherwise, it will output: Error: error configuring Terraform AWS Provider: no valid credential sources for Terraform AWS Provider found.\n
In a terminal, enter:
gcloud projects list\n
It should output a table with 3 columns PROJECT_ID NAME PROJECT_NUMBER\n
Take note of the project_id
of the Google Cloud project you want to use; you will need it later.
In a terminal, enter:
az account show\n
It should output a JSON dictionary similar to this: {\n \"environmentName\": \"AzureCloud\",\n \"homeTenantId\": \"98467e3b-33c2-4a34-928b-ed254db26890\",\n \"id\": \"4dda857e-1d61-457f-b0f0-e8c784d1fb20\",\n \"isDefault\": true,\n \"managedByTenants\": [],\n \"name\": \"Pay-As-You-Go\",\n \"state\": \"Enabled\",\n \"tenantId\": \"495fc59f-96d9-4c3f-9c78-7a7b5f33d962\",\n \"user\": {\n \"name\": \"user@example.com\",\n \"type\": \"user\"\n }\n}\n
"},{"location":"#134-openstack-ovh","title":"1.3.4 OpenStack / OVH","text":"test_os.tf
with the following content: terraform {\n required_providers {\n openstack = {\n source = \"terraform-provider-openstack/openstack\"\n }\n }\n}\ndata \"openstack_identity_auth_scope_v3\" \"scope\" {\n name = \"my_scope\"\n}\n
terraform init\n
terraform plan\n
If everything is configured properly, terraform will output: No changes. Your infrastructure matches the configuration.\n
Otherwise, it will output: Error: Error creating OpenStack identity client:\n
if the OpenStack cloud API cannot be reached. The default quotas set by Amazon are sufficient to build the Magic Castle AWS examples. To increase the limits, or to request access to special resources like GPUs or high-performance network interfaces, refer to Amazon EC2 service quotas.
"},{"location":"#142-google-cloud","title":"1.4.2 Google Cloud","text":"The default quotas set by Google Cloud are sufficient to build the Magic Castle GCP examples. To increase the limits, or request access to special resources like GPUs, refer to Google Compute Engine Resource quotas.
"},{"location":"#143-microsoft-azure","title":"1.4.3 Microsoft Azure","text":"The default quotas set by Microsoft Azure are sufficient to build the Magic Castle Azure examples. To increase the limits, or request access to special resources like GPUs or high performance network interface, refer to Azure subscription and service limits, quotas, and constraints.
"},{"location":"#144-openstack","title":"1.4.4 OpenStack","text":"Minimum project requirements:
Note 1: Magic Castle assumes the OpenStack project comes with a network, a subnet and a router already initialized. If any of these components is missing, you will need to create them manually before launching Terraform.
The default quotas set by OVH are sufficient to build the Magic Castle OVH examples. To increase the limits, or request access to special resources like GPUs, refer to OVHcloud - Increasing Public Cloud quotas.
"},{"location":"#2-cloud-cluster-architecture-overview","title":"2. Cloud Cluster Architecture Overview","text":""},{"location":"#3-initialization","title":"3. Initialization","text":""},{"location":"#31-main-file","title":"3.1 Main File","text":"tar xvf magic_castle*.tar.gz
mv magic_castle* hulk
cd hulk
The file main.tf
contains Terraform modules and outputs. Modules are files that define a set of resources that will be configured based on the inputs provided in the module block. Outputs are used to tell Terraform which variables of our module we would like to be shown on the screen once the resources have been instantiated.
This file will be our main canvas to design our new clusters. As long as the module block parameters suffice for our needs, we will be able to limit our configuration to this single file. Further customization will be addressed during the second part of the workshop.
"},{"location":"#32-terraform","title":"3.2 Terraform","text":"Terraform fetches the plugins required to interact with the cloud provider defined by our main.tf
once when we initialize. To initialize, enter the following command:
terraform init\n
The initialization is specific to the folder where you are currently located. The initialization process looks at all .tf
files and fetches the plugins required to build the resources defined in these files. If you replace some or all .tf
files inside a folder that has already been initialized, just call the command again to make sure you have all plugins.
The initialization process creates a .terraform
folder at the root of your current folder. You do not need to look at its content for now.
Once Terraform folder has been initialized, it is possible to fetch the newest version of the modules used by calling:
terraform init -upgrade\n
"},{"location":"#4-configuration","title":"4. Configuration","text":"In the main.tf
file, there is a module named after your cloud provider, i.e.: module \"openstack\"
. This module corresponds to the high-level infrastructure of your cluster.
The following sections describe each variable that can be used to customize the deployed infrastructure and its configuration. Optional variables can be absent from the example module. The order of the variables does not matter, but the following sections are ordered as the variables appear in the examples.
"},{"location":"#41-source","title":"4.1 source","text":"The first line of the module block indicates to Terraform where it can find the files that define the resources that will compose your cluster. In the releases, this variable is a relative path to the cloud provider folder (i.e.: ./aws
).
Requirement: Must be a path to a local folder containing the Magic Castle Terraform files for the cloud provider of your choice. It can also be a git repository. Refer to Terraform documentation on module source for more information.
Post build modification effect: terraform init
will have to be called again and the next terraform apply
might propose changes if the infrastructure described by the new module is different.
Magic Castle configuration management is handled by Puppet. The Puppet configuration files are stored in a git repository. This is typically ComputeCanada/puppet-magic_castle repository on GitHub.
Leave this variable to its current value to deploy a vanilla Magic Castle cluster.
If you wish to customize the instances' role assignment, add services, or develop new features for Magic Castle, fork the ComputeCanada/puppet-magic_castle and point this variable to your fork's URL. For more information on Magic Castle puppet configuration customization, refer to MC developer documentation.
Requirement: Must be a valid HTTPS URL to a git repository describing a Puppet environment compatible with Magic Castle. If the repo is private, generate an access token with a permission to read the repo content, and provide the token in the config_git_url
like this:
config_git_url = \"https://oauth2:${oauth-key-goes-here}@domain.com/username/repo.git\"\n
This works for GitHub and GitLab (including community edition). Post build modification effect: no effect. To change the Puppet configuration source, destroy the cluster or change it manually on the Puppet server.
"},{"location":"#43-config_version","title":"4.3 config_version","text":"Since Magic Cluster configuration is managed with git, it is possible to specify which version of the configuration you wish to use. Typically, it will match the version number of the release you have downloaded (i.e: 9.3
).
Requirement: Must refer to a git commit, tag or branch existing in the git repository pointed by config_git_url
.
Post build modification effect: none. To change the Puppet configuration version, destroy the cluster or change it manually on the Puppet server.
"},{"location":"#44-cluster_name","title":"4.4 cluster_name","text":"Defines the ClusterName
variable in slurm.conf
and the name of the cluster in the Slurm accounting database (see slurm.conf
documentation).
Requirement: Must be lowercase alphanumeric characters and start with a letter. It can include dashes. cluster_name must be 40 characters or less.
Post build modification effect: destroy and re-create all instances at next terraform apply
.
Defines
resolv.conf
search domain as int.{cluster_name}.{domain}
Optional modules following the current module in the example main.tf
can be used to register DNS records in relation to your cluster if the DNS zone of this domain is administered by one of the supported providers. Refer to section 6. DNS Configuration for more details.
Requirements:
[a-z]([-a-z0-9]*[a-z0-9])
, concatenated with periods.*.domain. IN A x.x.x.x
exists for that domain. You can verify no such record exist with dig
: dig +short '*.${domain}'\n
Post build modification effect: destroy and re-create all instances at next terraform apply
.
Defines the name of the image that will be used as the base image for the cluster nodes.
You can use a custom image if you wish, but configuration management should be mainly done through Puppet. Image customization is mostly envisioned as a way to accelerate the configuration process by applying the security patches and OS updates in advance.
To specify a different image for an instance type, use the image
instance attribute
Requirements: the operating system on the image must be from the RedHat family. This includes CentOS (8, 9), Rocky Linux (8, 9), and AlmaLinux (8, 9).
Post build modification effect: none. If this variable is modified, existing instances will ignore the change and future instances will use the new value.
"},{"location":"#461-aws","title":"4.6.1 AWS","text":"The image field needs to correspond to the Amazon Machine Image (AMI) ID. AMI IDs are specific to regions and architectures. Make sure to use the right ID for the region and CPU architecture you are using (i.e: x86_64).
To find out which AMI ID you need to use, refer to - AlmaLinux OS Amazon Web Services AMIs - CentOS list of official images available on the AWS Marketplace - Rocky Linux
Note: Before you can use the AMI, you will need to accept the usage terms and subscribe to the image on AWS Marketplace. On your first deployment, you will be presented an error similar to this one:
\u2502 Error: Error launching source instance: OptInRequired: In order to use this AWS Marketplace product you need to accept terms and subscribe. To do so please visit https://aws.amazon.com/marketplace/pp?sku=cvugziknvmxgqna9noibqnnsy\n\u2502 status code: 401, request id: 1f04a85a-f16a-41c6-82b5-342dc3dd6a3d\n\u2502\n\u2502 on aws/infrastructure.tf line 67, in resource \"aws_instance\" \"instances\":\n\u2502 67: resource \"aws_instance\" \"instances\" {\n
To accept the terms and fix the error, visit the link provided in the error output, then click on the Click to Subscribe
yellow button."},{"location":"#462-microsoft-azure","title":"4.6.2 Microsoft Azure","text":"The image field for Azure can either be a string or a map.
A string image specification will correspond to the image id. Image ids can be retrieved using the following command-line:
az image builder list\n
A map image specification needs to contain the following fields publisher
, offer
sku
, and optionally version
. The map is used to specify images found in Azure Marketplace. Here is an example:
{\n publisher = \"OpenLogic\",\n offer = \"CentOS-CI\",\n sku = \"7-CI\"\n}\n
"},{"location":"#463-openstack","title":"4.6.3 OpenStack","text":"The image name can be a regular expression. If more than one image is returned by the query to OpenStack, the most recent is selected.
"},{"location":"#47-instances","title":"4.7 instances","text":"The instances
variable is a map that defines the virtual machines that will form the cluster. The map' keys define the hostnames and the values are the attributes of the virtual machines.
Each instance is identified by a unique hostname. An instance's hostname is written as the key followed by its index (1-based). The following map:
instances = {\n mgmt = { type = \"p2-4gb\", tags = [...] },\n login = { type = \"p2-4gb\", count = 1, tags = [...] },\n node = { type = \"c2-15gb-31\", count = 2, tags = [...] },\n gpu-node = { type = \"gpu2.large\", count = 3, tags = [...] },\n}\n
will spawn instances with the following hostnames: mgmt1\nlogin1\nnode1\nnode2\ngpu-node1\ngpu-node2\ngpu-node3\n
Hostnames must follow a set of rules, from hostname
man page:
Valid characters for hostnames are ASCII letters from a to z, the digits from 0 to 9, and the hyphen (-). A hostname may not start with a hyphen.
Two attributes are expected to be defined for each instance: 1. type
: name for varying combinations of CPU, memory, GPU, etc. (i.e: t2.medium
); 2. tags
: list of labels that defines the role of the instance.
Tags are used in the Terraform code to identify if devices (volume, network) need to be attached to an instance, while in Puppet code tags are used to identify roles of the instances.
Terraform tags:
login
: identify instances accessible with SSH from Internet and pointed by the domain name A recordspool
: identify instances created only when their hostname appears in the var.pool
list.proxy
: identify instances accessible with HTTP/HTTPS and pointed by the vhost A recordspublic
: identify instances that need to have a public ip address reachable from Internetpuppet
: identify instances configured as Puppet serversspot
: identify instances that are to be spawned as spot/preemptible instances. This tag is supported in AWS, Azure and GCP. It is ignored by OpenStack and OVH.efa
: attach an Elastic Fabric Adapter network interface to the instance. This tag is supported in AWS.Puppet tags expected by the puppet-magic_castle environment.
login
: identify a login instance (minimum: 2 CPUs, 2GB RAM)mgmt
: identify a management instance i.e: FreeIPA server, Slurm controller, Slurm DB (minimum: 2 CPUs, 6GB RAM)nfs
: identify the instance that acts as an NFS server.node
: identify a compute node instance (minimum: 1 CPUs, 2GB RAM)pool
: when combined with node
, it identifies compute nodes that Slurm can resume/suspend to meet workload demand.proxy
: identify the instance that executes the Caddy reverse proxy and JupyterHub.In the Magic Castle Puppet environment, an instance cannot be tagged as mgmt
and proxy
.
You are free to define your own additional tags.
"},{"location":"#472-optional-attributes","title":"4.7.2 Optional attributes","text":"Optional attributes can be defined:
count
: number of virtual machines with this combination of hostname prefix, type and tags to create (default: 1).image
: specification of the image to use for this instance type. (default: global image
value). Refer to section 10.12 - Create a compute node image to learn how this attribute can be leveraged to accelerate compute node configuration.disk_type
: type of the instance's root disk (default: see the next table).
disk_type
disk_size
(GiB) Azure Premium_LRS
30 AWS gp2
10 GCP pd-ssd
20 OpenStack null
10 OVH null
10 disk_size
: size in gibibytes (GiB) of the instance's root disk containing the operating system and service software (default: see the previous table).
mig
: map of NVIDIA Multi-Instance GPU (MIG) short profile names and count used to partition the instances' GPU, example for an A100: mig = { \"1g.5gb\" = 2, \"2g.10gb\" = 1, \"3g.20gb\" = 1 }\n
This is only functional with MIG-supported GPUs, and with x86-64 processors (see NVIDIA/mig-parted issue #30). For some cloud providers, it is possible to define additional attributes. The following sections present the available attributes per provider.
"},{"location":"#aws","title":"AWS","text":"For instances with the spot
tags, these attributes can also be set:
wait_for_fulfillment
(default: true)spot_type
(default: permanent)instance_interruption_behavior
(default: stop)spot_price
(default: not set)block_duration_minutes
(default: not set) [note 1] For more information on these attributes, refer to aws_spot_instance_request
argument referenceNote 1: block_duration_minutes
is not available to new AWS accounts or accounts without billing history - AWS EC2 Spot Instance requests. When not available, its usage can trigger quota errors like this:
Error requesting spot instances: MaxSpotInstanceCountExceeded: Max spot instance count exceeded\n
"},{"location":"#azure","title":"Azure","text":"For instances with the spot
tags, these attributes can also be set:
max_bid_price
(default: not set)eviction_policy
(default: Deallocate
) For more information on these attributes, refer to azurerm_linux_virtual_machine
argument referencegpu_type
: name of the GPU model to attach to the instance. Refer to Google Cloud documentation for the list of available models per regiongpu_count
: number of GPUs of the gpu_type
model to attach to the instanceModifying any part of the map after the cluster is built will only affect the type of instances associated with what was modified at the next terraform apply
.
The volumes
variable is a map that defines the block devices that should be attached to instances that have the corresponding key in their list of tags. To each instance with the tag, unique block devices are attached, no multi-instance attachment is supported.
Each volume in map is defined a key corresponding to its and a map of attributes:
size
: size of the block device in GB.type
(optional): type of volume to use. Default value per provider:Premium_LRS
gp2
pd-ssd
null
null
Volumes with a tag that have no corresponding instance will not be created.
In the following example:
instances = {\u00a0\n server = { type = \"p4-6gb\", tags = [\"nfs\"] }\n}\nvolumes = {\n nfs = {\n home = { size = 100 }\n project = { size = 100 }\n scratch = { size = 100 }\n }\n mds = {\n oss1 = { size = 500 }\n oss2 = { size = 500 }\n }\n}\n
The instance server1
will have three volumes attached to it. The volumes tagged mds
are not created since no instances have the corresponding tag.
To define an infrastructure with no volumes, set the volumes
variable to an empty map:
volumes = {}\n
Post build modification effect: destruction of the corresponding volumes and attachments, and creation of new empty volumes and attachments. If an no instance with a corresponding tag exist following modifications, the volumes will be deleted.
"},{"location":"#49-public_keys","title":"4.9 public_keys","text":"List of SSH public keys that will have access to your cluster sudoer account.
Post build modification effect: trigger scp of hieradata files at next terraform apply
. The sudoer account authorized_keys
file will be updated by each instance's Puppet agent following the copy of the hieradata files.
default value: 0
Defines how many guest user accounts will be created in FreeIPA. Each user account shares the same randomly generated password. The usernames are defined as userX
where X
is a number between 1 and the value of nb_users
(zero-padded, i.e.: user01 if X < 100
, user1 if X < 10
).
If an NFS NFS home
volume is defined, each user will have a home folder on a shared NFS storage hosted on the NFS server node.
User accounts do not have sudoer privileges. If you wish to use sudo
, you will have to login using the sudoer account and the SSH keys listed in public_keys
.
If you would like to add a user account after the cluster is built, refer to section 10.3 and 10.4.
Requirement: Must be an integer, minimum value is 0.
Post build modification effect: trigger scp of hieradata files at next terraform apply
. If nb_users
is increased, new guest accounts will be created during the following Puppet run on mgmt1
. If nb_users
is decreased, it will have no effect: the guest accounts already created will be left intact.
default value: 4 random words separated by dots
Defines the password for the guest user accounts instead of using a randomly generated one.
Requirement: Minimum length 8 characters.
The password can be provided in a PKCS7 encrypted form. Refer to sub-section 4.15 eyaml_key for instructions on how to encrypt the password.
Post build modification effect: trigger scp of hieradata files at next terraform apply
. Password of all guest accounts will be changed to match the new password value.
default value: centos
Defines the username of the account with sudo privileges. The account ssh authorized keys are configured with the SSH public keys with public_keys
.
Post build modification effect: none. To change sudoer username, destroy the cluster or redefine the value of profile::base::sudoer_username
in hieradata
.
default value: empty string
Defines custom variable values that are injected in the Puppet hieradata file. Useful to override common configuration of Puppet classes.
List of useful examples:
profile::base::admin_email: \"me@example.org\"\n
profile::fail2ban::ignore_ip: ['132.203.0.0/16', '8.8.8.8']\n
jupyterhub::enable_otp_auth: false\n
prometheus::alertmanager::route:\n group_by:\n - 'alertname'\n - 'cluster'\n - 'service'\n group_wait: '5s'\n group_interval: '5m'\n repeat_interval: '3h'\n receiver: 'slack'\n\nprometheus::alertmanager::receivers:\n - name: 'slack'\n slack_configs:\n - api_url: 'https://hooks.slack.com/services/ABCDEFG123456'\n channel: \"#channel\"\n send_resolved: true\n username: 'username'\n
Refer to the following Puppet modules' documentation to know more about the key-values that can be defined:
The file created from this string can be found on the Puppet server as /etc/puppetlabs/data/user_data.yaml
Requirement: The string needs to respect the YAML syntax.
Post build modification effect: trigger scp of hieradata files at next terraform apply
. Each instance's Puppet agent will be reloaded following the copy of the hieradata files.
default_value: Empty string
Defines the path to a directory containing a hierarchy of YAML data files. The hierarchy is copied on the Puppet server in /etc/puppetlabs/data/user_data
.
Hierarchy structure:
<dir>/hostnames/<hostname>/*.yaml
<dir>/hostnames/<hostname>.yaml
<dir>/prefixes/<prefix>/*.yaml
<dir>/prefixes/<prefix>.yaml
<dir>/*.yaml
For more information on hieradata, refer to section 4.13 hieradata (optional).
Post build modification effect: trigger scp of hieradata files at next terraform apply
. Each instance's Puppet agent will be reloaded following the copy of the hieradata files.
default value: empty string
Defines the private RSA key required to decrypt the values encrypted with hiera-eyaml PKCS7. This key will be copied on the Puppet server.
Post build modification effect: trigger scp of private key file at next terraform apply
.
If you plan to track the cluster configuration files in git (i.e:main.tf
, user_data.yaml
), it would be a good idea to encrypt the sensitive property values.
Magic Castle uses hiera-eyaml to provide a per-value encryption of sensitive properties to be used by Puppet.
The private key and its corresponding public key wrapped in a X509 certificate can be generated with openssl
:
openssl req -x509 -nodes -newkey rsa:2048 -keyout private_key.pkcs7.pem -out public_key.pkcs7.pem -batch\n
or with eyaml
:
eyaml createkeys --pkcs7-public-key=public_key.pkcs7.pem --pkcs7-private-key=private_key.pkcs7.pem\n
"},{"location":"#4152-encrypting-sensitive-properties","title":"4.15.2 Encrypting sensitive properties","text":"To encrypt a sensitive property with openssl:
echo -n 'your-secret' | openssl smime -encrypt -aes-256-cbc -outform der public_key.pkcs7.pem | openssl base64 -A | xargs printf \"ENC[PKCS7,%s]\\n\"\n
To encrypt a sensitive property with eyaml:
eyaml encrypt -s 'your-secret' --pkcs7-public-key=public_key.pkcs7.pem -o string\n
"},{"location":"#4153-terraform-cloud","title":"4.15.3 Terraform cloud","text":"To provide the value of this variable via Terraform Cloud, encode the private key content with base64:
openssl base64 -A -in private_key.pkcs7.pem\n
Define a variable in your main.tf:
variable \"tfc_eyaml_key\" {}\nmodule \"openstack\" {\n ...\n}\n
Then make sure to decode it before passing it to the cloud provider module:
variable \"tfc_eyaml_key\" {}\nmodule \"openstack\" {\n ...\n eyaml_key = base64decode(var.tfc_eyaml_key)\n ...\n}\n
"},{"location":"#416-firewall_rules-optional","title":"4.16 firewall_rules (optional)","text":"default value:
{\n ssh = { \"from_port\" = 22, \"to_port\" = 22, tag = \"login\", \"protocol\" = \"tcp\", \"cidr\" = \"0.0.0.0/0\" },\n http = { \"from_port\" = 80, \"to_port\" = 80, tag = \"proxy\", \"protocol\" = \"tcp\", \"cidr\" = \"0.0.0.0/0\" },\n https = { \"from_port\" = 443, \"to_port\" = 443, tag = \"proxy\", \"protocol\" = \"tcp\", \"cidr\" = \"0.0.0.0/0\" },\n globus = { \"from_port\" = 2811, \"to_port\" = 2811, tag = \"dtn\", \"protocol\" = \"tcp\", \"cidr\" = \"54.237.254.192/29\" },\n myproxy = { \"from_port\" = 7512, \"to_port\" = 7512, tag = \"dtn\", \"protocol\" = \"tcp\", \"cidr\" = \"0.0.0.0/0\" },\n gridftp = { \"from_port\" = 50000, \"to_port\" = 51000, tag = \"dtn\", \"protocol\" = \"tcp\", \"cidr\" = \"0.0.0.0/0\" }\n}\n
Defines a map of firewall rules that control external traffic to the public nodes. Each rule is defined as a map of key-value pairs and has to be assigned a unique name:
from_port
(req.): the lower part of the allowed port range, valid integer value needs to be between 1 and 65535.to_port
(req.): the higher part of the allowed port range, valid integer value needs to be between 1 and 65535.tag
(req.): instances with this tag will be assigned this firewall rule.ethertype
(opt. default: \"IPv4\"
): the layer 3 protocol type (\"IPv4\"
or \"IPv6\"
).protocol
(opt. default: \"tcp\"
): the layer 4 protocol type.cidr
(opt. default: \"0.0.0.0/0\"
): the remote CIDR, the value needs to be a valid CIDR (i.e. 192.168.0.0/16
).If you would like Magic Castle to be able to transfer files and update the state of the cluster in Puppet, make sure there exists at least one effective firewall rule where from_port <= 22 <= to_port
and for which the external IP address of the machine that executes Terraform is in the CIDR range (i.e: cidr = \"0.0.0.0/0\"
being the most permissive). This corresponds to the ssh
rule in the default firewall rule map. This guarantees that Terraform will be able to use SSH to connect to the cluster from anywhere. For more information about this requirement, refer to Magic Castle's bastion tag computation code.
Post build modification effect: modify the cloud provider firewall rules at next terraform apply
.
default_value: \"alliance\"
Defines the scientific software environment that users have access when they login. Possible values are:
\"alliance\"
/ \"computecanada\"
: Digital Alliance Research Alliance of Canada scientific software environment (previously Compute Canada environment)\"eessi\"
: European Environment for Scientific Software Installation (EESSI)null
/ \"\"
: no scientific software environmentPost build modification effect: trigger scp of hieradata files at next terraform apply
.
default_value: []
Defines a list of hostnames with the tag \"pool\"
that have to be online. This variable is typically managed by the workload scheduler through Terraform API. For more information, refer to Enable Magic Castle Autoscaling
Post build modification effect: pool
tagged hosts with name present in the list will be instantiated, others will stay uninstantiated or will be destroyed if previously instantiated.
default_value = false
If true, the base image packages will not be upgraded during the first boot. By default, all packages are upgraded.
Post build modification effect: No effect on currently built instances. Ones created after the modification will take into consideration the new value of the parameter to determine whether they should upgrade the base image packages or not.
"},{"location":"#421-puppetfile-optional","title":"4.21 puppetfile (optional)","text":"default_value = \"\"
Defines a second Puppetfile used to install complementary modules with r10k.
Post build modification effect: trigger scp of Puppetfile at next terraform apply
. Each instance's Puppet agent will be reloaded following the installation of the new modules.
Defines the label of the AWS EC2 region where the cluster will be created (i.e.: us-east-2
).
Requirement: Must be in the list of available EC2 regions.
Post build modification effect: rebuild of all resources at next terraform apply
.
default value: None
Defines the label of the data center inside the AWS region where the cluster will be created (i.e.: us-east-2a
). If left blank, it chosen at random amongst the availability zones of the selected region.
Requirement: Must be in a valid availability zone for the selected region. Refer to AWS documentation to find out how list the availability zones.
"},{"location":"#52-microsoft-azure","title":"5.2 Microsoft Azure","text":""},{"location":"#521-location","title":"5.2.1 location","text":"Defines the label of the Azure location where the cluster will be created (i.e.: eastus
).
Requirement: Must be a valid Azure location. To get the list of available location, you can use Azure CLI : az account list-locations -o table
.
Post build modification effect: rebuild of all resources at next terraform apply
.
default value: None
Defines the name of an already created resource group to use. Terraform will no longer attempt to manage a resource group for Magic Castle if this variable is defined and will instead create all resources within the provided resource group. Define this if you wish to use an already created resource group or you do not have a subscription-level access to create and destroy resource groups.
Post build modification effect: rebuild of all instances at next terraform apply
.
default value:
{\n name = null\n product = null\n publisher = null\n}\n
Purchase plan information for Azure Marketplace image. Certain images from Azure Marketplace requires a terms acceptance or a fee to be used. When using this kind of image, you must supply the plan details.
For example, to use the official AlmaLinux image, you have to first add it to your account. Then to use it with Magic Castle, you must supply the following plan information:
plan = {\n name = \"8_7\"\n product = \"almalinux\"\n publisher = \"almalinux\"\n}\n
"},{"location":"#53-google-cloud","title":"5.3 Google Cloud","text":""},{"location":"#531-project","title":"5.3.1 project","text":"Defines the label of the unique identifier associated with the Google Cloud project in which the resources will be created. It needs to corresponds to GCP project ID, which is composed of the project name and a randomly assigned number.
Requirement: Must be a valid Google Cloud project ID.
Post build modification effect: rebuild of all resources at next terraform apply
.
Defines the name of the specific geographical location where the cluster resources will be hosted.
Requirement: Must be a valid Google Cloud region. Refer to Google Cloud documentation for the list of available regions and their characteristics.
"},{"location":"#533-zone-optional","title":"5.3.3 zone (optional)","text":"default value: None
Defines the name of the zone within the region where the cluster resources will be hosted.
Requirement: Must be a valid Google Cloud zone. Refer to Google Cloud documentation for the list of available zones and their characteristics.
"},{"location":"#54-openstack-and-ovh","title":"5.4 OpenStack and OVH","text":""},{"location":"#541-os_floating_ips-optional","title":"5.4.1 os_floating_ips (optional)","text":"default value: {}
Defines a map as an association of instance names (key) to pre-allocated floating ip addresses (value). Example:
os_floating_ips = {\n login1 = \"132.213.13.59\"\n login2 = \"132.213.13.25\"\n }\n
This variable can be useful if you manage your DNS manually and you would like the keep the same domain name for your cluster at each build.
Post build modification effect: change the floating ips assigned to the public instances.
"},{"location":"#542-os_ext_network-optional","title":"5.4.2 os_ext_network (optional)","text":"default value: None
Defines the name of the external network that provides the floating ips. Define this only if your OpenStack cloud provides multiple external networks, otherwise, Terraform can find it automatically.
Post build modification effect: change the floating ips assigned to the public nodes.
"},{"location":"#544-subnet_id-optional","title":"5.4.4 subnet_id (optional)","text":"default value: None
Defines the ID of the internal IPV4 subnet to which the instances are connected. Define this if you have or intend to have more than one subnets defined in your OpenStack project. Otherwise, Terraform can find it automatically. Can be used to force a v4 subnet when both v4 and v6 exist.
Post build modification effect: rebuild of all instances at next terraform apply
.
Some functionalities in Magic Castle require the registration of DNS records under the cluster name in the selected domain. This includes web services like JupyterHub, Mokey and FreeIPA web portal.
If your domain DNS records are managed by one of the supported providers, follow the instructions in the corresponding sections to have the cluster's DNS records created and tracked by Magic Castle.
If your DNS provider is not supported, you can manually create the records. Refer to the subsection 6.3 for more details.
"},{"location":"#61-cloudflare","title":"6.1 Cloudflare","text":"dns
module for Cloudflare in your main.tf
.output \"hostnames\"
block.terraform init
.CLOUDFLARE_EMAIL
and CLOUDFLARE_API_KEY
, where CLOUDFLARE_EMAIL
is your Cloudflare account email address and CLOUDFLARE_API_KEY
is your account Global API Key available in your Cloudflare profile.If you prefer using an API token instead of the global API key, you will need to configure a token with the following four permissions with the Cloudflare API Token interface.
Section Subsection Permission Zone DNS EditInstead of step 5, export only CLOUDFLARE_API_TOKEN
, CLOUDFLARE_ZONE_API_TOKEN
, and CLOUDFLARE_DNS_API_TOKEN
equal to the API token generated previously.
requirement: Install the Google Cloud SDK
gcloud auth application-default login
dns
module for Google Cloud in your main.tf
.output \"hostnames\"
block.main.tf
's dns
module, configure the variables project
and zone_name
with their respective values as defined by your Google Cloud project.terraform init
.If your DNS provider is not currently supported by Magic Castle, you can create the DNS records manually.
Magic Castle provides a module that creates a text file with the DNS records that can then be imported manually in your DNS zone. To use this module, add the following snippet to your main.tf
:
module \"dns\" {\n source = \"./dns/txt\"\n name = module.openstack.cluster_name\n domain = module.openstack.domain\n public_instances = module.openstack.public_instances\n}\n
Find and replace openstack
in the previous snippet by your cloud provider of choice if not OpenStack (i.e: aws
, gcp
, etc.).
The file will be created after the terraform apply
in the same folder as your main.tf
and will be named as ${name}.${domain}.txt
.
Magic Castle DNS module creates SSHFP records for all instances with a public ip address. These records can be used by SSH clients to verify the SSH host keys of the server. If DNSSEC is enabled for the domain and the SSH client is correctly configured, no host key confirmation will be prompted when connecting to the server.
For more information on how to activate DNSSEC, refer to your DNS provider documentation:
To setup an SSH client to use SSHFP records, add
VerifyHostKeyDNS yes\n
to its configuration file (i.e.: ~/.ssh/config
)."},{"location":"#7-planning","title":"7. Planning","text":"Once your initial cluster configuration is done, you can initiate a planning phase where you will ask Terraform to communicate with your cloud provider and verify that your cluster can be built as it is described by the main.tf
configuration file.
Terraform should now be able to communicate with your cloud provider. To test your configuration file, enter the following command
terraform plan\n
This command will validate the syntax of your configuration file and communicate with the provider, but it will not create new resources. It is only a dry-run. If Terraform does not report any error, you can move to the next step. Otherwise, read the errors and fix your configuration file accordingly.
"},{"location":"#8-deployment","title":"8. Deployment","text":"To create the resources defined by your main, enter the following command
terraform apply\n
The command will produce the same output as the plan
command, but after the output it will ask for a confirmation to perform the proposed actions. Enter yes
.
Terraform will then proceed to create the resources defined by the configuration file. It should take a few minutes. Once the creation process is completed, Terraform will output the guest account usernames and password, the sudoer username and the floating ip of the login node.
Warning: although the instance creation process is finished once Terraform outputs the connection information, you will not be able to connect and use the cluster immediately. The instance creation is only the first phase of the cluster-building process. The configuration: the creation of the user accounts, installation of FreeIPA, Slurm, configuration of JupyterHub, etc.; takes around 15 minutes after the instances are created.
Once it is booted, you can follow an instance configuration process by looking at:
/var/log/cloud-init-output.log
journalctl -u puppet
If unexpected problems occur during configuration, you can provide these logs to the authors of Magic Castle to help you debug.
"},{"location":"#81-deployment-customization","title":"8.1 Deployment Customization","text":"You can modify the main.tf
at any point of your cluster's life and apply the modifications while it is running.
Warning: Depending on the variables you modify, Terraform might destroy some or all resources, and create new ones. The effects of modifying each variable are detailed in the subsections of Configuration.
For example, to increase the number of computes nodes by one. Open main.tf
, add 1 to node
's count
, save the document and call
terraform apply\n
Terraform will analyze the difference between the current state and the future state, and plan the creation of a single new instance. If you accept the action plan, the instance will be created, provisioned and eventually automatically add to the Slurm cluster configuration.
You could do the opposite and reduce the number of compute nodes to 0.
"},{"location":"#9-destruction","title":"9. Destruction","text":"Once you're done working with your cluster and you would like to recover the resources, in the same folder as main.tf
, enter:
terraform destroy -refresh=false\n
The -refresh=false
\u00a0flag is to avoid an issue where one or many of the data sources return no results and stall the cluster destruction with a message like the following:
Error: Your query returned no results. Please change your search criteria and try again.\n
This type of error happens when for example the specified image no longer exists (see issue #40). As for apply
, Terraform will output a plan that you will have to confirm by entering yes
.
Warning: once the cluster is destroyed, nothing will be left, even the shared storage will be erased.
"},{"location":"#91-instance-destruction","title":"9.1 Instance Destruction","text":"It is possible to destroy only the instances and keep the rest of the infrastructure like the floating ip, the volumes, the generated SSH host key, etc. To do so, set the count value of the instance type you wish to destroy to 0.
"},{"location":"#92-reset","title":"9.2 Reset","text":"On some occasions, it is desirable to rebuild some of the instances from scratch. Using terraform taint
, you can designate resources that will be rebuilt at next application of the plan.
To rebuild the first login node :
terraform taint 'module.openstack.openstack_compute_instance_v2.instances[\"login1\"]'\nterraform apply\n
"},{"location":"#10-customize-cluster-software-configuration","title":"10. Customize Cluster Software Configuration","text":"Once the cluster is online and configured, you can modify its configuration as you see fit. We list here how to do most commonly asked for customizations.
Some customizations are done from the Puppet server instance (puppet
). To connect to the puppet server, follow these steps:
ssh -A centos@cluster_ip
. Replace centos
by the value of sudoer_username
if it is different.ssh puppet
Note on Google Cloud: In GCP, OS Login lets you use Compute Engine IAM roles to manage SSH access to Linux instances. This feature is incompatible with Magic Castle. Therefore, it is turned off in the instances metadata (enable-oslogin=\"FALSE\"
). The only account with sudoer rights that can log in the cluster is configured by the variable sudoer_username
(default: centos
).
If you plan to modify configuration files manually, you will need to disable Puppet. Otherwise, you might find out that your modifications have disappeared in a 30-minute window.
Puppet executes a run every 30 minutes and at reboot. To disable puppet:
sudo puppet agent --disable \"<MESSAGE>\"\n
"},{"location":"#102-replace-the-guest-account-password","title":"10.2 Replace the Guest Account Password","text":"Refer to section 4.11.
"},{"location":"#103-add-ldap-users","title":"10.3 Add LDAP Users","text":"Users can be added to Magic Castle LDAP database (FreeIPA) with either one of the following methods: hieradata, command-line, and Mokey web-portal. Each method is presented in the following subsections.
New LDAP users are automatically assigned a home folder on NFS.
Magic Castle determines if an LDAP user should be member of a Slurm account based on its POSIX groups. When a user is added to a POSIX group, a daemon try to match the group name to the following regular expression:
(ctb|def|rpp|rrg)-[a-z0-9_-]*\n
If there is a match, the user will be added to a Slurm account with the same name, and will gain access to the corresponding project folder under /project
.
Note: The regular expression represents how Compute Canada names its resources allocation. The regular expression can be redefined, see profile::accounts:::project_regex
Using the hieradata variable in the main.tf
, it is possible to define LDAP users.
Examples of LDAP user definition with hieradata are provided in puppet-magic_castle documentation.
"},{"location":"#1032-command-line","title":"10.3.2 Command-Line","text":"To add a user account after the cluster is built, log in mgmt1
and call:
kinit admin\nIPA_GUEST_PASSWD=<new_user_passwd> /sbin/ipa_create_user.py <username> [--group <group_name>]\nkdestroy\n
"},{"location":"#1033-mokey","title":"10.3.3 Mokey","text":"If user sign-up with Mokey is enabled, users can create their own account at
https://mokey.yourcluster.domain.tld/auth/signup\n
It is possible that an administrator is required to enable the account with Mokey. You can access the administrative panel of FreeIPA at :
https://ipa.yourcluster.domain.tld/\n
The FreeIPA administrator credentials can be retrieved from an encrypted file on the Puppet server. Refer to section 10.14 to know how.
"},{"location":"#104-increase-the-number-of-guest-accounts","title":"10.4 Increase the Number of Guest Accounts","text":"To increase the number of guest accounts after creating the cluster with Terraform, simply increase the value of nb_users
, then call :
terraform apply\n
Each instance's Puppet agent will be reloaded following the copy of the hieradata files, and the new accounts will be created.
"},{"location":"#105-restrict-ssh-access","title":"10.5 Restrict SSH Access","text":"By default, instances tagged login
have their port 22 opened to entire world. If you know the range of ip addresses that will connect to your cluster, we strongly recommend that you limit the access to port 22 to this range.
To limit the access to port 22, refer to section 4.14 firewall_rules, and replace the cidr
of the ssh
rule to match the range of ip addresses that have be the allowed to connect to the cluster. If there are more than one range, create multiple rules with distinct names.
The default Python kernel corresponds to the Python installed in /opt/ipython-kernel
. Each compute node has its own copy of the environment. To add packages to this environment, add the following lines to hieradata
in main.tf
:
jupyterhub::kernel::venv::packages:\n - package_A\n - package_B\n - package_C\n
and replace package_*
by the packages you need to install. Then call:
terraform apply\n
"},{"location":"#107-activate-globus-endpoint","title":"10.7 Activate Globus Endpoint","text":"No longer supported
"},{"location":"#108-recovering-from-puppet-rebuild","title":"10.8 Recovering from puppet rebuild","text":"The modifications of some of the parameters in the main.tf
file can trigger the rebuild of the puppet
instance. This instance hosts the Puppet Server on which depends the Puppet agent of the other instances. When puppet
is rebuilt, the other Puppet agents cease to recognize Puppet Server identity since the Puppet Server identity and certificates have been regenerated.
To fix the Puppet agents, you will need to apply the following commands on each instance other than puppet
once puppet
is rebuilt:
sudo systemctl stop puppet\nsudo rm -rf /etc/puppetlabs/puppet/ssl/\nsudo systemctl start puppet\n
Then, on puppet
, you will need to sign the new certificate requests made by the instances. First, you can list the requests:
sudo /opt/puppetlabs/bin/puppetserver ca list\n
Then, if every instance is listed, you can sign all requests:
sudo /opt/puppetlabs/bin/puppetserver ca sign --all\n
If you prefer, you can sign individual request by specifying their name:
sudo /opt/puppetlabs/bin/puppetserver ca sign --certname NAME[,NAME]\n
"},{"location":"#109-dealing-with-banned-ip-addresses-fail2ban","title":"10.9 Dealing with banned ip addresses (fail2ban)","text":"Login nodes run fail2ban, an intrusion prevention software that protects login nodes from brute-force attacks. fail2ban is configured to ban ip addresses that attempted to login 20 times and failed in a window of 60 minutes. The ban time is 24 hours.
In the context of a workshop with SSH novices, the 20-attempt rule might be triggered, resulting in participants banned and puzzled, which is a bad start for a workshop. There are solutions to mitigate this problem.
"},{"location":"#1091-define-a-list-of-ip-addresses-that-can-never-be-banned","title":"10.9.1 Define a list of ip addresses that can never be banned","text":"fail2ban keeps a list of ip addresses that are allowed to fail to login without risking jail time. To add an ip address to that list, add the following lines to the variable hieradata
\u00a0in main.tf
:
profile::fail2ban::ignoreip:\n - x.x.x.x\n - y.y.y.y\n
where x.x.x.x
and y.y.y.y
are ip addresses you want to add to the ignore list. The ip addresses can be written using CIDR notations. The ignore ip list on Magic Castle already includes 127.0.0.1/8
and the cluster subnet CIDR. Once the line is added, call:
terraform apply\n
"},{"location":"#1092-remove-fail2ban-ssh-route-jail","title":"10.9.2 Remove fail2ban ssh-route jail","text":"fail2ban rule that banned ip addresses that failed to connect with SSH can be disabled. To do so, add the following line to the variable hieradata
\u00a0in main.tf
:
fail2ban::jails: ['ssh-ban-root']\n
This will keep the jail that automatically ban any ip that tries to login as root, and remove the ssh failed password jail. Once the line is added, call:
terraform apply\n
"},{"location":"#1093-unban-ip-addresses","title":"10.9.3 Unban ip addresses","text":"fail2ban ban ip addresses by adding rules to iptables. To remove these rules, you need to tell fail2ban to unban the ips.
To list the ip addresses that are banned, execute the following command:
sudo fail2ban-client status ssh-route\n
To unban ip addresses, enter the following command followed by the ip addresses you want to unban:
sudo fail2ban-client set ssh-route unbanip\n
"},{"location":"#1094-disable-fail2ban","title":"10.9.4 Disable fail2ban","text":"While this is not recommended, fail2ban can be completely disabled. To do so, add the following line to the variable hieradata
\u00a0in main.tf
:
fail2ban::service_ensure: 'stopped'\n
then call :
terraform apply\n
"},{"location":"#1011-set-selinux-in-permissive-mode","title":"10.11 Set SELinux in permissive mode","text":"SELinux can be set in permissive mode to debug new workflows that would be prevented by SELinux from working properly. To do so, add the following line to the variable hieradata
\u00a0in main.tf
:
selinux::mode: 'permissive'\n
"},{"location":"#1012-create-a-compute-node-image","title":"10.12 Create a compute node image","text":"When scaling the compute node pool, either manually by changing the count or automatically with Slurm autoscale, it can become beneficial to reduce the time spent configuring the machine when it boots for the first time, hence reducing the time requires before it becomes available in Slurm. One way to achieve this is to clone the root disk of a fully configured compute node and use it as the base image of future compute nodes.
This process has three steps:
The following subsection explains how to accomplish each step.
Warning: While it will work in most cases, avoid re-using the compute node image of a previous deployment. The preparation steps cleans most of the deployment specific configuration and secrets, but there is no guarantee that the configuration will be entirely compatible with a different deployment.
"},{"location":"#10121-prepare-the-volume-for-cloning","title":"10.12.1 Prepare the volume for cloning","text":"The environment puppet-magic_castle installs a script that prepares the volume for cloning named prepare4image.sh
.
To make sure a node is ready for cloning, open its puppet agent log and validate the catalog was successfully applied at least once:
journalctl -u puppet | grep \"Applied catalog\"\n
To prepare the volume for cloning, execute the following line while connected to the compute node:
sudo /usr/sbin/prepare4image.sh\n
Be aware that, since it is preferable for the instance to be powered off when cloning its volume, the script halts the machine once it is completed. Therefore, after executing prepare4image.sh
, you will be disconnected from the instance.
The script prepare4image.sh
executes the following steps in order:
slurmd
)/etc
/etc/fstab
/var/log/message
contentOnce the instance is powered off, access your cloud provider dashboard, find the instance and follow the provider's instructions to create the image.
Note down the name/id of the image you created, it will be needed during the next step.
"},{"location":"#10123-configure-magic-castle-terraform-code-to-use-the-new-image","title":"10.12.3 Configure Magic Castle Terraform code to use the new image","text":"Edit your main.tf
and add image = \"name-or-id-of-your-image\"
to the dictionary defining the instance. The instance previously powered off will be powered on and future non-instantiated machines will use the image at the next execution of terraform apply
.
If the cluster is composed of heterogeneous compute nodes, it is possible to create an image for each type of compute nodes. Here is an example with Google Cloud
instances = {\n mgmt = { type = \"n2-standard-2\", tags = [\"puppet\", \"mgmt\", \"nfs\"], count = 1 }\n login = { type = \"n2-standard-2\", tags = [\"login\", \"public\", \"proxy\"], count = 1 }\n node = {\n type = \"n2-standard-2\"\n tags = [\"node\", \"pool\"]\n count = 10\n image = \"rocky-mc-cpu-node\"\n }\n gpu = {\n type = \"n1-standard-2\"\n tags = [\"node\", \"pool\"]\n count = 10\n gpu_type = \"nvidia-tesla-t4\"\n gpu_count = 1\n image = \"rocky-mc-gpu-node\"\n }\n}\n
"},{"location":"#1013-read-and-edit-secret-values-generated-at-boot","title":"10.13 Read and edit secret values generated at boot","text":"During the cloud-init initialization phase, bootstrap.sh
script is executed. This script generates a set of encrypted secret values that are required by the Magic Castle Puppet environment:
profile::consul::acl_api_token
profile::freeipa::mokey::password
profile::freeipa::server::admin_password
profile::freeipa::server::ds_password
profile::slurm::accounting::password
profile::slurm::base::munge_key
To read or change the value of one of these keys, use eyaml edit
command on the puppet
host, like this:
sudo /opt/puppetlabs/puppet/bin/eyaml edit \\\n --pkcs7-private-key /etc/puppetlabs/puppet/eyaml/boot_private_key.pkcs7.pem \\\n --pkcs7-public-key /etc/puppetlabs/puppet/eyaml/boot_public_key.pkcs7.pem \\\n /etc/puppetlabs/code/environments/production/data/bootstrap.yaml\n
It is also possible to redefine the values of these keys by adding the key-value pair to the hieradata configuration file. Refer to section 4.13 hieradata. User defined values take precedence over boot generated values in the Magic Castle Puppet data hierarchy.
"},{"location":"#1014-expand-a-volume","title":"10.14 Expand a volume","text":"Volumes defined in the volumes
map can be expanded at will. To enable online extension of a volume, add enable_resize = true
to its specs map. You can then increase the size at will. The corresponding volume will be expanded by the cloud provider and the filesystem will be extended by Puppet.
You can modify the Terraform module files in the folder named after your cloud provider (e.g: gcp
, openstack
, aws
, etc.)
Figure 1 (below) illustrates how Magic Castle is structured to provide a unified interface between multiple cloud providers. Each blue block is a file or a module, while white blocks are variables or resources. Arrows indicate variables or resources that contribute to the definition of the linked variables or resources. The figure can be read as a flow-chart from top to bottom. Some resources and variables have been left out of the chart to avoid cluttering it further.
Figure 1. Magic Castle Terraform Project Structure
main.tf
: User provides the instances and volumes structures they want as maps. instances = {\n mgmt = { type = \"p4-7.5gb\", tags = [\"puppet\", \"mgmt\", \"nfs\"] }\n login = { type = \"p2-3.75gb\", tags = [\"login\", \"public\", \"proxy\"] }\n node = { type = \"p2-3.75gb\", tags = [\"node\"], count = 2 }\n}\n\nvolumes = {\n nfs = {\n home = { size = 100 }\n project = { size = 500 }\n scratch = { size = 500 }\n }\n}\n
common/design
:
instances
map is expanded to form a new map where each entry represents a single host. instances = {\n mgmt1 = {\n type = \"p2-3.75gb\"\n tags = [\"puppet\", \"mgmt\", \"nfs\"]\n }\n login1 = {\n type = \"p2-3.75gb\"\n tags = [\"login\", \"public\", \"proxy\"]\n }\n node1 = {\n type = \"p2-3.75gb\"\n tags = [\"node\"]\n }\n node2 = {\n type = \"p2-3.75gb\"\n tags = [\"node\"]\n }\n}\n
volumes
map is expanded to form a new map where each entry represents a single volume. volumes = {\n mgmt1-nfs-home = { size = 100 }\n mgmt1-nfs-project = { size = 500 }\n mgmt1-nfs-scratch = { size = 500 }\n}\n
network.tf
: the instances
map from common/design
is used to generate a network interface (nic) for each host, and a public ip address for each host with the public
tag.
resource \"provider_network_interface\" \"nic\" {\n for_each = module.design.instances\n ...\n}\n
common/configuration
: for each host in instances
, a cloud-init yaml config that includes puppetservers
is generated. These configs are outputted to a user_data
map where the keys are the hostnames.
user_data = {\n for key, values in var.instances :\n key => templatefile(\"${path.module}/puppet.yaml\", { ... })\n}\n
infrastructure.tf
: for each host in instances
, an instance resource as defined by the selected cloud provider is generated. Each instance is initially configured by its user_data
cloud-init yaml config.
resource \"provider_instance\" \"instances\" {\n for_each = module.design.instance\n user_data = module.instance_config.user_data[each.key]\n ...\n}\n
infrastructure.tf
: for each volume in volumes
, a block device as defined by the selected cloud provider is generated and attached to its matching instance using an attachment
resource.
resource \"provider_volume\" \"volumes\" {\n for_each = module.design.volumes\n size = each.value.size\n ...\n}\nresource \"provider_attachment\" \"attachments\" {\n for_each = module.design.volumes\n instance_id = provider_instance.instances[each.value.instance].id\n volume_id = provider_volume.volumes[each.key].id\n ...\n}\n
infrastructure.tf
: the created instances' information is consolidated in a map named inventory
.
inventory = {\n mgmt1 = {\n public_ip = \"\"\n local_ip = \"10.0.0.1\"\n id = \"abc1213-123-1231\"\n tags = [\"mgmt\", \"puppet\", \"nfs\"]\n }\n ...\n}\n
common/provision
: the information from created instances is consolidated and written in a YAML file named terraform_data.yaml
that is uploaded to the Puppet server as part of the hieradata.
resource \"terraform_data\" \"deploy_puppetserver_files\" {\n ...\n provisioner \"file\" {\n content = var.terraform_data\n destination = \"terraform_data.yaml\"\n }\n ...\n}\n
outputs.tf
: the information of all instances that have a public address is output as a map named public_instances
.
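A sketch of what this output can look like, assuming public_instances is consolidated as a local value in infrastructure.tf as shown later in this document:
output \"public_instances\" {\n  value = local.public_instances\n}\n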
In the previous section, we used generic resource names when writing the HCL code that defines these resources. The following table indicates which resource is used for each provider based on its role in the cluster.
Resource AWS Azure Google Cloud Platform OpenStack OVH network aws_vpc azurerm_virtual_network google_compute_network prebuilt openstack_networking_network_v2 subnet aws_subnet azurerm_subnet google_compute_subnetwork prebuilt openstack_networking_subnet_v2 router aws_route not used google_compute_router built-in not used nat aws_internet_gateway not used google_compute_router_nat built-in not used firewall aws_security_group azurerm_network_security_group google_compute_firewall openstack_compute_secgroup_v2 openstack_compute_secgroup_v2 nic aws_network_interface azurerm_network_interface google_compute_address openstack_networking_port_v2 openstack_networking_port_v2 public ip aws_eip azurerm_public_ip google_compute_address openstack_networking_floatingip_v2 openstack_networking_network_v2 instance aws_instance azurerm_linux_virtual_machine google_compute_instance openstack_compute_instance_v2 openstack_compute_instance_v2 volume aws_ebs_volume azurerm_managed_disk google_compute_disk openstack_blockstorage_volume_v3 openstack_blockstorage_volume_v3 attachment aws_volume_attachment azurerm_virtual_machine_data_disk_attachment google_compute_attached_disk openstack_compute_volume_attach_v2 openstack_compute_volume_attach_v2"},{"location":"design/#using-reference-design-to-extend-for-a-new-cloud-provider","title":"Using reference design to extend for a new cloud provider","text":"Magic Castle currently supports five cloud providers, but its design makes it easy to add new providers. This section presents a step-by-step guide to add a new cloud provider support to Magic Castle.
Identify the resources. Using the Resource per provider table, read the cloud provider Terraform documentation, and identify the name for each resource in the table.
Check minimum requirements. Once all resources have been identified, you should be able to determine if the cloud provider can be used to deploy Magic Castle. If you found a name for each resource listed in the table, the cloud provider can be supported. If some resources are missing, you will need to read the provider's documentation to determine if the absence of the resource can be compensated for somehow.
Initialize the provider folder. Create a folder named after the provider. In this folder, create two symlinks, one pointing to common/variables.tf
and the other to common/outputs.tf
. These files define the interface common to all providers supported by Magic Castle.
Define cloud provider specific variables. Create a file named after your provider provider_name.tf
and define variables that are required by the provider but not common to all providers, for example the availability zone or the region. In this file, define two local variables named cloud_provider
and cloud_region
.
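A minimal sketch of these provider-specific definitions (the region variable is only an example; adapt it to whatever your provider requires):
variable \"region\" { }\n\nlocals {\n  cloud_provider = \"provider_name\"\n  cloud_region   = var.region\n}\n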
Initialize the infrastructure. Create a file named infrastructure.tf
. In this file:
provider \"provider_name\" {\n region = var.region\n}\n
module \"design\" {\n source = \"../common/design\"\n cluster_name = var.cluster_name\n domain = var.domain\n instances = var.instances\n pool = var.pool\n volumes = var.volumes\n}\n
Create the networking infrastructure. Create a file named network.tf
and define the network, subnet, router, nat, firewall, nic and public ip resources using the module.design.instances
map.
Create the volumes. In infrastructure.tf
, define the volumes
resource using module.design.volumes
.
Consolidate the instances' information. In infrastructure.tf
, define a local variable named inventory
that will be a map containing the following keys for each instance: public_ip
, local_ip
, prefix
, tags
, and specs
(#cpu, #gpus, ram, volumes). For the volumes, you need to provide the paths under which the volumes will be found on the instances to which they are attached. This is typically derived from the volume id. Here is an example:
volumes = contains(keys(module.design.volume_per_instance), x) ? {\n for pv_key, pv_values in var.volumes:\n pv_key => {\n for name, specs in pv_values:\n name => [\"/dev/disk/by-id/*${substr(provider.volumes[\"${x}-${pv_key}-${name}\"].id, 0, 20)}\"]\n } if contains(values.tags, pv_key)\n } : {}\n
Create the instance configurations. In infrastructure.tf
, include the common/configuration
module like this:
module \"configuration\" {\n source = \"../common/configuration\"\n inventory = local.inventory\n config_git_url = var.config_git_url\n config_version = var.config_version\n sudoer_username = var.sudoer_username\n public_keys = var.public_keys\n domain_name = module.design.domain_name\n cluster_name = var.cluster_name\n guest_passwd = var.guest_passwd\n nb_users = var.nb_users\n software_stack = var.software_stack\n cloud_provider = local.cloud_provider\n cloud_region = local.cloud_region\n}\n
Create the instances. In infrastructure.tf
, define the instances
resource using module.design.instances_to_build
for the instance attributes and module.configuration.user_data
for the initial configuration.
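A sketch of this step, reusing the generic provider_ placeholder naming from the reference design figure:
resource \"provider_instance\" \"instances\" {\n  for_each  = module.design.instances_to_build\n  user_data = module.configuration.user_data[each.key]\n  ...\n}\n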
Attach the volumes. In infrastructure.tf
, define the attachments
resource using module.design.volumes
and refer to the attribute each.value.instance
to retrieve the instance's id to which the volume needs to be attached.
Identify the public instances. In infrastructure.tf
, define a local variable named public_instances
that contains the attributes of instances that are publicly accessible from the Internet and their ids.
locals {\n public_instances = { for host in keys(module.design.instances_to_build):\n host => merge(module.configuration.inventory[host], {id=cloud_provider_instance_resource.instances[host].id})\n if contains(module.configuration.inventory[host].tags, \"public\")\n }\n}\n
Include the provision module to transmit Terraform data to the Puppet server. In infrastructure.tf
, include the common/provision
module like this
module \"provision\" {\n source = \"../common/provision\"\n bastions = local.public_instances\n puppetservers = module.configuration.puppetservers\n tf_ssh_key = module.configuration.ssh_key\n terraform_data = module.configuration.terraform_data\n terraform_facts = module.configuration.terraform_facts\n hieradata = var.hieradata\n sudoer_username = var.sudoer_username\n}\n
Identify the resources. For Digital Ocean, Oracle Cloud and Alibaba Cloud, we get the following resource mapping: | Resource | Digital Ocean | Oracle Cloud | Alibaba Cloud | | ----------- | :-------------------- | :-------------------- | :-------------------- | | network | digitalocean_vpc | oci_core_vcn | alicloud_vpc | | subnet | built in vpc | oci_subnet | alicloud_vswitch | | router | n/a | oci_core_route_table | built in vpc | | nat | n/a | oci_core_internet_gateway | alicloud_nat_gateway | | firewall | digitalocean_firewall | oci_core_security_list | alicloud_security_group | | nic | n/a | built in instance | alicloud_network_interface | | public ip | digitalocean_floating_ip | built in instance | alicloud_eip | | instance | digitalocean_droplet | oci_core_instance | alicloud_instance | | volume | digitalocean_volume | oci_core_volume | alicloud_disk | | attachment | digitalocean_volume_attachment | oci_core_volume_attachment | alicloud_disk_attachment |
Check minimum requirements. In the preceding table, we can see Digital Ocean does not have the ability to define a network interface. The documentation also leads us to conclude that it is not possible to define the private ip address of the instances before creating them. Because the Puppet server ip address is required before generating the cloud-init YAML config for all instances, including the Puppet server itself, this makes it impossible to use Digital Ocean to spawn a Magic Castle cluster. Oracle Cloud presents the same issue; however, after reading the instance documentation, we find that it is possible to define a static ip address as a string in the instance attributes. It would therefore be possible to create a data structure in Terraform that associates each instance hostname with an ip address in the subnet CIDR (see the sketch below). Alibaba Cloud has an answer for each resource, so we will use this provider in the following steps.
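As a hedged illustration of the Oracle Cloud workaround mentioned above (the subnet_cidr variable and the address offset are hypothetical), such an association could be built with Terraform's cidrhost function:
locals {\n  host_ips = { for index, host in keys(module.design.instances) : host => cidrhost(var.subnet_cidr, index + 10) }\n}\n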
Initialize the provider folder. In a terminal:
git clone https://github.com/ComputeCanada/magic_castle.git\ncd magic_castle\nmkdir alicloud\ncd alicloud\nln -s ../common/{variables,outputs}.tf .\n
Define cloud provider specific variables. Add the following to a new file alicloud.tf
:
variable \"region\" { }\nlocals {\n cloud_provider = \"alicloud\"\n cloud_region = var.region\n}\n
Initialize the infrastructure. Add the following to a new file infrastructure.tf
:
provider \"alicloud\" {\n region = var.region\n}\n\nmodule \"design\" {\n source = \"../common/design\"\n cluster_name = var.cluster_name\n domain = var.domain\n instances = var.instances\n pool = var.pool\n volumes = var.volumes\n}\n
Create the networking infrastructure. network.tf
base template:
resource \"alicloud_vpc\" \"network\" { }\nresource \"alicloud_vswitch\" \"subnet\" { }\nresource \"alicloud_nat_gateway\" \"nat\" { }\nresource \"alicloud_security_group\" \"firewall\" { }\nresource \"alicloud_security_group_rule\" \"allow_in_services\" { }\nresource \"alicloud_security_group\" \"allow_any_inside_vpc\" { }\nresource \"alicloud_security_group_rule\" \"allow_ingress_inside_vpc\" { }\nresource \"alicloud_security_group_rule\" \"allow_egress_inside_vpc\" { }\nresource \"alicloud_network_interface\" \"nic\" { }\nresource \"alicloud_eip\" \"public_ip\" { }\nresource \"alicloud_eip_association\" \"eip_asso\" { }\n
Create the volumes. Add and complete the following snippet to infrastructure.tf
:
resource \"alicloud_disk\" \"volumes\" {\n for_each = module.design.volumes\n}\n
Consolidate the instances' information. Add the following snippet to infrastructure.tf
:
locals {\n inventory = { for x, values in module.design.instances :\n x => {\n public_ip = contains(values[\"tags\"], \"public\") ? alicloud_eip.public_ip[x].public_ip : \"\"\n local_ip = alicloud_network_interface.nic[x].private_ip\n tags = values[\"tags\"]\n id = alicloud_instance.instances[x].id\n specs = {\n cpus = ...\n gpus = ...\n ram = ...\n volumes = contains(keys(module.design.volume_per_instance), x) ? {\n for pv_key, pv_values in var.volumes:\n pv_key => {\n for name, specs in pv_values:\n name => [\"/dev/disk/by-id/virtio-${replace(alicloud_disk.volumes[\"${x}-${pv_key}-${name}\"].id, \"d-\", \"\")}\"]\n } if contains(values.tags, pv_key)\n } : {}\n }\n }\n }\n}\n
Create the instance configurations. In infrastructure.tf
, include the common/configuration
module like this:
module \"configuration\" {\n source = \"../common/configuration\"\n inventory = local.inventory\n config_git_url = var.config_git_url\n config_version = var.config_version\n sudoer_username = var.sudoer_username\n public_keys = var.public_keys\n domain_name = module.design.domain_name\n cluster_name = var.cluster_name\n guest_passwd = var.guest_passwd\n nb_users = var.nb_users\n software_stack = var.software_stack\n cloud_provider = local.cloud_provider\n cloud_region = local.cloud_region\n}\n
Create the instances. Add and complete the following snippet to infrastructure.tf
:
resource \"alicloud_instance\" \"instances\" {\n for_each = module.design.instances\n}\n
Attach the volumes. Add and complete the following snippet to infrastructure.tf
:
resource \"alicloud_disk_attachment\" \"attachments\" {\n for_each = module.design.volumes\n}\n
Identify the public instances. In infrastructure.tf
, define a local variable named public_instances
that contains the attributes of instances that are publicly accessible from the Internet and their ids.
locals {\n public_instances = { for host in keys(module.design.instances_to_build):\n host => merge(module.configuration.inventory[host], {id=alicloud_instance.instances[host].id})\n if contains(module.configuration.inventory[host].tags, \"public\")\n }\n}\n
Include the provision module to transmit Terraform data to the Puppet server. In infrastructure.tf
, include the common/provision
module like this
module \"provision\" {\n source = \"../common/provision\"\n bastions = local.public_instances\n puppetservers = module.configuration.puppetservers\n tf_ssh_key = module.configuration.ssh_key\n terraform_data = module.configuration.terraform_data\n terraform_facts = module.configuration.terraform_facts\n hieradata = var.hieradata\n}\n
Once your new provider is written, you can write an example that will use the module to spawn a Magic Castle cluster with that provider.
module \"alicloud\" {\n source = \"./alicloud\"\n config_git_url = \"https://github.com/ComputeCanada/puppet-magic_castle.git\"\n config_version = \"main\"\n\n cluster_name = \"new\"\n domain = \"my.cloud\"\n image = \"centos_7_9_x64_20G_alibase_20210318.vhd\"\n nb_users = 10\n\n instances = {\n mgmt = { type = \"ecs.g6.large\", tags = [\"puppet\", \"mgmt\", \"nfs\"] }\n login = { type = \"ecs.g6.large\", tags = [\"login\", \"public\", \"proxy\"] }\n node = { type = \"ecs.g6.large\", tags = [\"node\"], count = 1 }\n }\n\n volumes = {\n nfs = {\n home = { size = 10 }\n project = { size = 50 }\n scratch = { size = 50 }\n }\n }\n\n public_keys = [file(\"~/.ssh/id_rsa.pub\")]\n\n # Alicloud specifics\n region = \"us-west-1\"\n}\n
"},{"location":"developers/","title":"Magic Castle Developer Documentation","text":""},{"location":"developers/#table-of-content","title":"Table of Content","text":"To develop for Magic Castle you will need: * Terraform (>= 1.4.0) * git * Access to a Cloud (e.g.: Compute Canada Arbutus) * Ability to communicate with the cloud provider API from your computer * A cloud project with enough room for the resource described in section Magic Caslte Doc 1.1. * [optional] Puppet Development Kit (PDK)
"},{"location":"developers/#2-where-to-start","title":"2. Where to start","text":"The Magic Castle project is defined by Terraform infrastructure-as-code component that is responsible of generating a cluster architecture in a cloud and a Puppet environment component that configures the cluster instances based on their role.
If you wish to add a device, an instance, a new networking interface or a filesystem, you will most likely need to develop some Terraform code. The project structure for Terraform code is described in the reference design document. The document also describes how one could work with the current Magic Castle code to add support for another cloud provider.
If you wish to add a service to one of the Puppet environments, install new software, or modify an instance's configuration or role, you will most likely need to develop some Puppet code. The following section provides more details on the Puppet environments available and how to develop them.
"},{"location":"developers/#3-puppet-environment","title":"3. Puppet environment","text":"Magic Castle Terraform code initialized every instances to be a Puppet agent and an instance with the tag puppet
as the Puppet main server. On the Puppet main server, there is a folder containing the configuration code for the instances of the cluster, this folder is called a Puppet environment and it is pulled from GitHub during the initial configuration of the Puppet main server.
The source of that environment is provided to Terraform using the variable config_git_url
.
A repository describing a Magic Castle Puppet environment must contain at least the following files and folders:
config_git_repo\n\u2523 Puppetfile\n\u2523 environment.conf\n\u2523 hiera.yaml\n\u2517 data\n \u2517 common.yaml\n\u2517 manifests/\n \u2517 site.pp\n
Puppetfile
specifies the Puppet modules that need to be installed in the environment.environment.conf
overrides the primary server default settings for the environment.hiera.yaml
configures an ordered list of YAML file data sources.data/common.yaml
is the common data source for the instances, part of the hierarchy defined by hiera.yaml
.manifests/site.pp
defines how each instance will be configured based on their hostname and/or tags.An example of a bare-bone Magic Castle Puppet environment is available on GitHub: MagicCastle/puppet-environment, while the Puppet environment that replicates a Compute Canada HPC cluster is named ComputeCanada/puppet-magic_castle.
"},{"location":"developers/#terraform_datayaml-a-bridge-between-terraform-and-puppet","title":"terraform_data.yaml: a bridge between Terraform and Puppet","text":"To provide information on the deployed resources and the value of the input parameters, Magic Castle Terraform code uploads to the Puppet main server a file named terraform_data.yaml
, in the folder /etc/puppetlabs/data/
. There is also a symlink created in /etc/puppetlabs/code/environment/production/data/
to ease its usage inside the Puppet environment.
When included in the data hierarchy (hiera.yaml
), terraform_data.yaml
can provide information about the instances, the volumes and the variables set by the user through the main.tf
file. The file has the following structure:
---\nterraform:\n data:\n cluster_name: \"\"\n domain_name: \"\"\n guest_passwd: \"\"\n nb_users: \"\"\n public_keys: []\n sudoer_username: \"\"\n instances:\n host1:\n hostkeys:\n rsa: \"\"\n ed25519: \"\"\n local_ip: \"x.x.x.x\"\n prefix: \"host\"\n public_ip: \"\"\n specs:\n \"cpus\": 0\n \"gpus\": 0\n \"ram\": 0\n tags:\n - \"tag_1\"\n - \"tag_2\"\n tag_ip:\n tag_1:\n - x.x.x.x\n tag_2:\n - x.x.x.x\n volumes:\n volume_tag1:\n volume_1:\n - \"/dev/disk/by-id/123-*\"\n volume_2:\n - \"/dev/disk/by-id/123-abc-*\"\n
The values provided by terraform_data.yaml
can be accessed in Puppet by using the lookup()
function. For example, to access an instance's list of tags:
lookup(\"terraform.instances.${::hostname}.tags\")\n
The data source can also be used to define a key in another data source YAML file by using the alias()
function. For example, to define the number of guest accounts using the value of nb_users
, we could add this to common.yaml
profile::accounts::guests::nb_accounts: \"%{alias('terraform.data.nb_users')}\"\n
"},{"location":"developers/#configuring-instances-sitepp-and-classes","title":"Configuring instances: site.pp and classes","text":"The configuration of each instance is defined in manifests/site.pp
file of the Puppet environment. In this file, it is possible to define a configuration based on an instance hostname
node \"mgmt1\" { }\n
or using the instance tags by defining the configuration for the default
node : node default {\n    $instance_tags = lookup(\"terraform.instances.${::hostname}.tags\")\n    if 'tag_1' in $instance_tags { }\n}\n
It is possible to define Puppet resources directly in site.pp
. However, above a certain level of complexity, which can be reached fairly quickly, it is preferable to define classes and include these classes in site.pp
based on the node hostname or tags.
Classes can be defined in the Puppet environment under the following path: site/profile/manifests
. These classes are named profile classes, and the philosophy behind them is explained in the Puppet documentation. Because these classes are defined in site/profile
, their name has to start with the prefix profile::
.
It is also possible to include classes defined externally and installed using the Puppetfile
. These classes installed by r10k can be found in the modules
folder of the Puppet environment.
To test new additions to puppet.yaml
, it is possible to execute cloud-init phases manually. There are four steps that can be executed sequentially: init local, init, modules config, and modules final. Here are the corresponding commands to execute each step:
cloud-init init --local\ncloud-init init\ncloud-init modules --mode=config\ncloud-init modules --mode=final\n
It is also possible to clean a cloud-init execution and have it execute again at next reboot. To do so, enter the following command:
cloud-init clean\n
Add -r
to the previous command to reboot the instance once cloud-init has finished cleaning."},{"location":"developers/#42-selinux","title":"4.2 SELinux","text":"SELinux is enabled on every instance of a Magic Castle cluster. Some applications do not provide SELinux policies, which can lead to their malfunctioning when SELinux is enabled. It is possible to track down the reasons why SELinux is preventing an application from working properly using the command-line tool ausearch
.
If you suspect application app-a
to be denied by SELinux to work properly, run the following command as root:
ausearch -c app-a --raw | grep denied\n
To see all requests denied by SELinux:
ausearch --raw | grep denied\n
Sometimes, the denials are hidden from regular logging. To display all denials, run the following command as root:
semodule --disable_dontaudit --build\n
then re-execute the application that is not working properly. Once you have found the denials that are the cause of the problem, you can create a new policy to allow the requests that were previously denied with the following command:
ausearch -c app-a --raw | grep denied | audit2allow -a -M app-a\n
Finally, you can install the generated policy using the command provided by audit2allow
.
If you need to tweak an existing enforcement file and you want to recompile the policy package, you can with the following commands:
checkmodule -M -m -o my_policy.mod my_policy.te\nsemodule_package -o my_policy.pp -m my_policy.mod\n
"},{"location":"developers/#references","title":"References","text":"To build a release, use the script release.sh
located at the root of Magic Castle git repo.
Usage: release.sh VERSION [provider ...]\n
The script creates a folder named releases
where it was called. The VERSION
argument is expected to correspond to a git tag in the puppet-magic_castle
repo. It could also be a branch name or a commit. If the provider optional argument is left blank, release files will be built for all providers currently supported by Magic Castle.
Examples:
$ ./release.sh main openstack\n
$ ./release.sh 5.8 gcp
$ ./release.sh 5.7 azure ovh\n
* The documentation provides instructions on how to add support for other cloud providers.
"},{"location":"matrix/#supported-operating-systems","title":"Supported operating systems","text":"Name CentOS 7 CentOS 8 Rocky Linux 8 AlmaLinux 8 Debian 10 Ubuntu 18 Ubuntu 20 Windows 10 AWS ParallelCluster yes yes yes yes yes no yes no Azure CycleCloud yes yes yes yes yes no yes - Azure HPC On-Demand Platform yes no no yes no yes no yes Google HPC-Toolkit yes no no no no no no no Cluster in the Cloud no yes no no no no no no ElastiCluster yes yes yes yes no no no no Magic Castle no yes yes yes no no no no On-Demand Data Centre - - - - - - - - Slurm on GCP yes no yes no yes no yes no"},{"location":"matrix/#supported-job-schedulers","title":"Supported job schedulers","text":"Name AwsBatch Grid Engine HTCondor Moab Open PBS PBS Pro Slurm AWS ParallelCluster yes no no no no no yes Azure CycleCloud no yes yes no no yes yes Azure HPC On-Demand Platform no no no no yes no yes Google HPC-Toolkit no no no no no no yes Cluster in the Cloud no no no no no no yes ElastiCluster no yes no no no no yes Magic Castle no no no no no no yes On-Demand Data Centre no no no yes no no no Slurm on GCP no no no no no no yes"},{"location":"matrix/#technologies","title":"Technologies","text":"Name Infrastructure configuration Programming languages Configuration management Scientific software AWS ParallelCluster CLI generating YAML Python Chef Spack Azure CycleCloud WebUI or CLI + templates Python Chef Bring your own Azure HPC On-Demand Platform YAML files + shell scripts Shell, Terraform Ansible, Packer CVMFS Cluster in the Cloud CLI generating Terraform code Python, Terraform Ansible, Packer EESSI ElastiCluster CLI interpreting an INI file Python, Shell Ansible Bring your own Google HPC-Toolkit CLI generating Terraform code Go, Terraform Ansible, Packer Spack Magic Castle Terraform modules Terraform Puppet CC-CVMFS, EESSI On-Demand Data Centre - - - - Slurm GCP Terraform modules Terraform Ansible, Packer Spack"},{"location":"sequence/","title":"Magic Castle Sequence Diagrams","text":"The following sequence diagrams illustrate the inner working of Magic Castle once terraform apply
is called. Some details were left out of the diagrams, but every diagram is followed by references to the code files that were used to build it.
puppet-magic_castle.git
does not have to refer to ComputeCanada/puppet-magic_castle.git
repo. Users can use their own fork. See the developer documentation for more details.magic_castle:/common/design/main.tf
magic_castle:/openstack/network-1.tf
magic_castle:/openstack/network-2.tf
magic_castle:/common/configuration/main.tf
magic_castle:/openstack/infrastructure.tf
magic_castle:/common/provision/main.tf
magic_castle:/dns/cloudflare/main.tf
config_git_url repo
does not have to refer to ComputeCanada/puppet-magic_castle.git
repo. Users can use their own fork. See the developer documentation for more details.magic_castle:/common/configuration/puppet.yaml
puppet-magic_castle:/manifests/site.pp
puppet-magic_castle:/profile/manifests/base.pp
puppet-magic_castle:/profile/manifests/consul.pp
puppet-magic_castle:/profile/manifests/freeipa.pp
puppet-magic_castle:/profile/manifests/consul.pp
puppet-magic_castle:/profile/manifests/cvmfs.pp
puppet-magic_castle:/profile/manifests/slurm.pp
This document explains how to use Magic Castle with Terraform Cloud.
"},{"location":"terraform_cloud/#what-is-terraform-cloud","title":"What is Terraform Cloud?","text":"Terraform Cloud is HashiCorp\u2019s managed service that allows to provision infrastructure using a web browser or a REST API instead of the command-line. This also means that the provisioned infrastructure parameters can be modified by a team and the state is stored in the cloud instead of a local machine.
When provisioning in commercial cloud, Terraform Cloud can also provide a cost estimate of the resources.
"},{"location":"terraform_cloud/#getting-started-with-terraform-cloud","title":"Getting started with Terraform Cloud","text":"main.tf
available for the cloud of your choicemain.tf
You will be redirected automatically to your new workspace.
"},{"location":"terraform_cloud/#providing-cloud-provider-credentials-to-terraform-cloud","title":"Providing cloud provider credentials to Terraform Cloud","text":"Terraform Cloud will invoke Terraform command-line in a remote virtual environment. For the CLI to be able to communicate with your cloud provider API, we need to define environment variables that Terraform will use to authenticate. The next sections explain which environment variables to define for each cloud provider and how to retrieve the values of the variable from the provider.
If you plan on using these environment variables with multiple workspaces, it is recommended to create a credential variable set in Terraform Cloud.
"},{"location":"terraform_cloud/#aws","title":"AWS","text":"You need to define these environment variables: - AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
(sensitive)
The value of these variables can either correspond to the value of access key created on the AWS Security Credentials - Access keys page, or you can add user dedicated to Terraform Cloud in AWS IAM Users, and use its access key.
"},{"location":"terraform_cloud/#azure","title":"Azure","text":"You need to define these environment variables: - ARM_CLIENT_ID
- ARM_CLIENT_SECRET
(sensitive) - ARM_SUBSCRIPTION_ID
- ARM_TENANT_ID
Refer to Terraform Azure Provider - Creating a Service Principal to know how to create a Service Principal and retrieve the values for these environment variables.
"},{"location":"terraform_cloud/#google-cloud","title":"Google Cloud","text":"You need to define this environment variable: - GOOGLE_CLOUD_KEYFILE_JSON
(sensitive)
The value of the variable will be the content of a Google Cloud service account JSON key file expressed a single line string. Example:
{\"type\": \"service_account\",\"project_id\": \"project-id-1234\",\"private_key_id\": \"abcd1234\",...}\n
You can use jq
to format the string from the JSON file provided by Google:
jq . -c project-name-123456-abcdefjg.json\n
"},{"location":"terraform_cloud/#openstack-ovh","title":"OpenStack / OVH","text":"You need to define these environment variables: - OS_AUTH_URL
- OS_PROJECT_ID
- OS_REGION_NAME
- OS_INTERFACE
- OS_IDENTITY_API_VERSION
- OS_USER_DOMAIN_NAME
- OS_USERNAME
- OS_PASSWORD
(sensitive)
Apart from OS_PASSWORD
, the values for these variables are available in OpenStack RC file provided for your project.
If you prefer to use OpenStack application credentials, you need to define at least these variables: - OS_AUTH_TYPE
\u00a0 - OS_AUTH_URL
- OS_APPLICATION_CREDENTIAL_ID
- OS_APPLICATION_CREDENTIAL_SECRET
and potentially these too: - OS_IDENTITY_API_VERSION
\u00a0 - OS_REGION_NAME
- OS_INTERFACE
The values for these variables are available in OpenStack RC file provided when creating the application credentials.
"},{"location":"terraform_cloud/#providing-dns-provider-credentials-to-terraform-cloud","title":"Providing DNS provider credentials to Terraform Cloud","text":"Terraform Cloud will invoke Terraform command-line in a remote virtual environment. For the CLI to be able to communicate with your DNS provider API, we need to define environment variables that Terraform will use to authenticate. The next sections explain which environment variables to define for each DNS provider and how to retrieve the values of the variable from the provider.
"},{"location":"terraform_cloud/#cloudflare","title":"CloudFlare","text":"Refer to DNS - CloudFlare section of Magic Castle main documentation to determine which environment variables needs to be set.
"},{"location":"terraform_cloud/#google-cloud-dns","title":"Google Cloud DNS","text":"Refer to DNS - Google Cloud section of Magic Castle main documentation to determine which environment variables needs to be set.
"},{"location":"terraform_cloud/#managing-magic-castle-variables-with-terraform-cloud-ui","title":"Managing Magic Castle variables with Terraform Cloud UI","text":"It is possible to use Terraform Cloud web interface to define variable values in your main.tf
. For example, you could want to define a guest password without writing it directly in main.tf
to avoid displaying publicly.
To manage a variable with Terraform Cloud: 1. edit your main.tf
to define the variables you want to manage. In the following example, we want to manage the number of nodes and the guest password.
Add the variables at the beginning of the `main.tf`:\n ```hcl\n variable \"nb_nodes\" {}\n variable \"password\" {}\n ```\n\nThen replace the static value by the variable in our `main.tf`,\n\ncompute node count\n ```hcl\n node = { type = \"p2-3gb\", tags = [\"node\"], count = var.nb_nodes }\n ```\nguest password\n ```hcl\n guest_passwd = var.password\n ```\n
main.tf
. Check \"Sensitive\" if the variable content should not never be shown in the UI or the API.You may edit the variables at any point of your cluster lifetime.
"},{"location":"terraform_cloud/#applying-changes","title":"Applying changes","text":"To create your cluster, apply changes made to your main.tf
or the variables, you will need to queue a plan. When you push to the default branch of the linked git repository, a plan will be automatically created. You can also create a plan manually. To do so, click on the \"Queue plan manually\" button inside your workspace, then \"Queue plan\".
Once the plan has been successfully created, you can apply it using the \"Runs\" section. Click on the latest queued plan, then on the \"Apply plan\" button at the bottom of the plan page.
"},{"location":"terraform_cloud/#auto-apply","title":"Auto apply","text":"It is possible to apply automatically a successful plan. Go in the \"Settings\" section, and under \"Apply method\" select \"Auto apply\". Any following successful plan will then be automatically applied.
"},{"location":"terraform_cloud/#magic-castle-terraform-cloud-and-the-cli","title":"Magic Castle, Terraform Cloud and the CLI","text":"Terraform cloud only allows to apply or destroy the plan as stated in the main.tf, but sometimes it can be useful to run some other terraform commands that are only available through the command-line interface, for example terraform taint
.
It is possible to import the terraform state of a cluster on your local computer and then use the CLI on it.
Log in Terraform cloud:
terraform login\n
Create a folder where the terraform state will be stored:
mkdir my-cluster-1\n
Create a file named cloud.tf
with the following content in your cluster folder:
terraform {\n cloud {\n organization = \"REPLACE-BY-YOUR-TF-CLOUD-ORG\"\n workspaces {\n name = \"REPLACE-BY-THE-NAME-OF-YOUR-WORKSPACE\"\n }\n }\n}\n
replace the values of organization
and name
with the appropriate value for your cluster. Initialize the folder and retrieve the state:
terraform init\n
To confirm the workspace has been properly imported locally, you can list the resources using:
terraform state list\n
"},{"location":"terraform_cloud/#enable-magic-castle-autoscaling","title":"Enable Magic Castle Autoscaling","text":"Magic Castle in combination with Terraform Cloud (TFE) can be configured to give Slurm the ability to create and destroy instances based on the job queue content.
To enable this feature: 1. Create a TFE API Token and save it somewhere safe.
1.1. If you subscribe to Terraform Cloud Team & Governance plan, you can generate\na [Team API Token](https://www.terraform.io/cloud-docs/users-teams-organizations/api-tokens#team-api-tokens).\nThe team associated with this token requires no access to organization and can be secret.\nIt does not have to include any member. Team API token is preferable as its permissions can be\nrestricted to the minimum required for autoscale purpose.\n
Create a workspace in TFE
2.1. Make sure the repo is private as it will contain the API token.
2.2. If you generated a Team API Token in 1, provide access to the workspace to the team:
2.3 In Configure settings, under Advanced options, for Apply method, select Auto apply.
Create the environment variables of the cloud provider credentials in TFE
pool
in TFE. Set value to []
and check HCL.data.yaml
in your git repo with the following content: yaml\u00a0 --- profile::slurm::controller::tfe_token: <TFE API token> profile::slurm::controller::tfe_workspace: <TFE workspace id>
Complete the file by replacing <TFE API TOKEN>
with the token generated at step 1 and <TFE workspace id>
(i.e.: ws-...
) by the id of the workspace created at step 2. It is recommended to encrypt the TFE API token before committing data.yaml
in git. Refer to section 4.15 of README.md to know how to encrypt the token.data.yaml
in git and push.Modify main.tf
:
main.tf
.variable \"pool\" { description = \"Slurm pool of compute nodes\" }\n
instances
with the tags pool
and node
. These are the nodes that Slurm will able to create and destroy.pool = var.pool\n
public_keys =
, replace [file(\"~/.ssh/id_rsa.pub\")]
by a list of SSH public keys that will have admin access to the cluster.public_keys = ...
, add hieradata = file(\"data.yaml\")
.Go to your workspace in TFE, click on Actions -> Start a new run -> Plan and apply -> Start run. Then, click on \"Confirm & Apply\" and \"Confirm Plan\".
To reduce the time required for compute nodes to become available in Slurm, consider creating a compute node image.
JupyterHub will time out by default after 300 seconds if a node is not spawned yet. Since it may take longer than this to spawn a node, even with an image created, consider increasing the timeout by adding the following to your YAML configuration file:
jupyterhub::jupyterhub_config_hash:\n SlurmFormSpawner:\n start_timeout: 900\n
Slurm 23 adds the possibility for sinfo
to report nodes that are not yet spawned. This is useful if you want JupyterHub to be aware of those nodes, for example if you want to allow to use GPU nodes without keeping them online at all time. To use that version of Slurm, add the following to your YAML configuration file:
profile::slurm::base::slurm_version: '23.02'\n
"},{"location":"terraform_cloud/#troubleshoot-autoscaling-with-terraform-cloud","title":"Troubleshoot autoscaling with Terraform Cloud","text":"If after enabling autoscaling with Terraform Cloud for your Magic Castle cluster, the number of nodes does not increase when submitting jobs, verify the following points:
profile::slurm::controller::tfe_workspace
in data.yaml
.squeue
on the cluster, and verify the reasons why jobs are still in the queue. If under the column (Reason)
, there is the keyword ReqNodeNotAvail
, it implies Slurm tried to boot the listed nodes, but they would not show up before the timeout, therefore Slurm marked them as down. It can happen if your cloud provider is slow to build the instances, or following a configuration problem like in 2. When Slurm marks a node as down, a trace is left in slurmctld's log - using zgrep on the slurm controller node (typically mgmt1
): sudo zgrep \"marking down\" /var/log/slurm/slurmctld.log*\n
To tell Slurm these nodes are available again, enter the following command: sudo /opt/software/slurm/bin/scontrol update nodename=node[Y-Z] state=IDLE\n
Replace node[Y-Z]
by the hostname range listed next to ReqNodeNotAvail
in squeue
.mgmt1:/var/log/slurm
, look for errors in the file slurm_resume.log
.To use Magic Castle you will need:
To install Terraform, follow the tutorial or go directly on Terraform download page.
You can verify Terraform was properly installed by looking at the version in a terminal:
terraform version\n
"},{"location":"#12-authentication","title":"1.2 Authentication","text":""},{"location":"#121-amazon-web-services-aws","title":"1.2.1 Amazon Web Services (AWS)","text":"AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
, environment variables, representing your AWS Access Key and AWS Secret Key: export AWS_ACCESS_KEY_ID=\"an-access-key\"\nexport AWS_SECRET_ACCESS_KEY=\"a-secret-key\"\n
Reference: AWS Provider - Environment Variables
"},{"location":"#122-google-cloud","title":"1.2.2 Google Cloud","text":"gcloud auth application-default login
az login
Reference : Azure Provider: Authenticating using the Azure CLI
"},{"location":"#124-openstack-ovh","title":"1.2.4 OpenStack / OVH","text":"Download your OpenStack Open RC file. It is project-specific and contains the credentials used by Terraform to communicate with OpenStack API. To download, using OpenStack web page go to: Project \u2192 API Access, then click on Download OpenStack RC File then right-click on OpenStack RC File (Identity API v3), Save Link as..., and save the file.
In a terminal located in the same folder as your OpenStack RC file, source the OpenStack RC file:
source *-openrc.sh\n
This command will ask for a password, enter your OpenStack password. Once you are authenticated with your cloud provider, you should be able to communicate with its API. This section lists for each provider some instructions to test this.
"},{"location":"#131-aws","title":"1.3.1 AWS","text":"test_aws.tf
with the following content: provider \"aws\" {\n region = \"us-east-1\"\n}\n\ndata \"aws_ec2_instance_type\" \"example\" {\n instance_type = \"t2.micro\"\n}\n
terraform init\n
terraform plan\n
If everything is configured properly, terraform will output: No changes. Your infrastructure matches the configuration.\n
Otherwise, it will output: Error: error configuring Terraform AWS Provider: no valid credential sources for Terraform AWS Provider found.\n
In a terminal, enter:
gcloud projects list\n
It should output a table with 3 columns PROJECT_ID NAME PROJECT_NUMBER\n
Take note of the project_id
of the Google Cloud project you want to use, you will need it later.
In a terminal, enter:
az account show\n
It should output a JSON dictionary similar to this: {\n \"environmentName\": \"AzureCloud\",\n \"homeTenantId\": \"98467e3b-33c2-4a34-928b-ed254db26890\",\n \"id\": \"4dda857e-1d61-457f-b0f0-e8c784d1fb20\",\n \"isDefault\": true,\n \"managedByTenants\": [],\n \"name\": \"Pay-As-You-Go\",\n \"state\": \"Enabled\",\n \"tenantId\": \"495fc59f-96d9-4c3f-9c78-7a7b5f33d962\",\n \"user\": {\n \"name\": \"user@example.com\",\n \"type\": \"user\"\n }\n}\n
"},{"location":"#134-openstack-ovh","title":"1.3.4 OpenStack / OVH","text":"test_os.tf
with the following content: terraform {\n required_providers {\n openstack = {\n source = \"terraform-provider-openstack/openstack\"\n }\n }\n}\ndata \"openstack_identity_auth_scope_v3\" \"scope\" {\n name = \"my_scope\"\n}\n
terraform init\n
terraform plan\n
If everything is configured properly, terraform will output: No changes. Your infrastructure matches the configuration.\n
Otherwise, it will output: Error: Error creating OpenStack identity client:\n
if the OpenStack cloud API cannot be reached.The default quotas set by Amazon are sufficient to build the Magic Castle AWS examples. To increase the limits, or request access to special resources like GPUs or high performance network interface, refer to Amazon EC2 service quotas.
"},{"location":"#142-google-cloud","title":"1.4.2 Google Cloud","text":"The default quotas set by Google Cloud are sufficient to build the Magic Castle GCP examples. To increase the limits, or request access to special resources like GPUs, refer to Google Compute Engine Resource quotas.
"},{"location":"#143-microsoft-azure","title":"1.4.3 Microsoft Azure","text":"The default quotas set by Microsoft Azure are sufficient to build the Magic Castle Azure examples. To increase the limits, or request access to special resources like GPUs or high performance network interface, refer to Azure subscription and service limits, quotas, and constraints.
"},{"location":"#144-openstack","title":"1.4.4 OpenStack","text":"Minimum project requirements:
Note 1: Magic Castle supposes the OpenStack project comes with a network, a subnet and a router already initialized. If any of these components is missing, you will need to create them manually before launching terraform.
The default quotas set by OVH are sufficient to build the Magic Castle OVH examples. To increase the limits, or request access to special resources like GPUs, refer to OVHcloud - Increasing Public Cloud quotas.
"},{"location":"#2-cloud-cluster-architecture-overview","title":"2. Cloud Cluster Architecture Overview","text":""},{"location":"#3-initialization","title":"3. Initialization","text":""},{"location":"#31-main-file","title":"3.1 Main File","text":"tar xvf magic_castle*.tar.gz
mv magic_castle* hulk
cd hulk
The file main.tf
contains Terraform modules and outputs. Modules are files that define a set of resources that will be configured based on the inputs provided in the module block. Outputs are used to tell Terraform which variables of our module we would like to be shown on the screen once the resources have been instantiated.
This file will be our main canvas to design our new clusters. As long as the module block parameters suffice to our need, we will be able to limit our configuration to this sole file. Further customization will be addressed during the second part of the workshop.
"},{"location":"#32-terraform","title":"3.2 Terraform","text":"Terraform fetches the plugins required to interact with the cloud provider defined by our main.tf
once when we initialize. To initialize, enter the following command:
terraform init\n
The initialization is specific to the folder where you are currently located. The initialization process looks at all .tf
files and fetches the plugins required to build the resources defined in these files. If you replace some or all .tf
files inside a folder that has already been initialized, just call the command again to make sure you have all plugins.
The initialization process creates a .terraform
folder at the root of your current folder. You do not need to look at its content for now.
Once Terraform folder has been initialized, it is possible to fetch the newest version of the modules used by calling:
terraform init -upgrade\n
"},{"location":"#4-configuration","title":"4. Configuration","text":"In the main.tf
file, there is a module named after your cloud provider, i.e.: module \"openstack\"
. This module corresponds to the high-level infrastructure of your cluster.
The following sections describes each variable that can be used to customize the deployed infrastructure and its configuration. Optional variables can be absent from the example module. The order of the variables does not matter, but the following sections are ordered as the variables appear in the examples.
"},{"location":"#41-source","title":"4.1 source","text":"The first line of the module block indicates to Terraform where it can find the files that define the resources that will compose your cluster. In the releases, this variable is a relative path to the cloud provider folder (i.e.: ./aws
).
Requirement: Must be a path to a local folder containing the Magic Castle Terraform files for the cloud provider of your choice. It can also be a git repository. Refer to Terraform documentation on module source for more information.
Post build modification effect: terraform init
will have to be called again and the next terraform apply
might propose changes if the infrastructure describe by the new module is different.
Magic Castle configuration management is handled by Puppet. The Puppet configuration files are stored in a git repository. This is typically ComputeCanada/puppet-magic_castle repository on GitHub.
Leave this variable to its current value to deploy a vanilla Magic Castle cluster.
If you wish to customize the instances' role assignment, add services, or develop new features for Magic Castle, fork the ComputeCanada/puppet-magic_castle and point this variable to your fork's URL. For more information on Magic Castle puppet configuration customization, refer to MC developer documentation.
Requirement: Must be a valid HTTPS URL to a git repository describing a Puppet environment compatible with Magic Castle. If the repo is private, generate an access token with a permission to read the repo content, and provide the token in the config_git_url
like this:
config_git_url = \"https://oauth2:${oauth-key-goes-here}@domain.com/username/repo.git\"\n
This works for GitHub and GitLab (including community edition). Post build modification effect: no effect. To change the Puppet configuration source, destroy the cluster or change it manually on the Puppet server.
"},{"location":"#43-config_version","title":"4.3 config_version","text":"Since Magic Cluster configuration is managed with git, it is possible to specify which version of the configuration you wish to use. Typically, it will match the version number of the release you have downloaded (i.e: 9.3
).
Requirement: Must refer to a git commit, tag or branch existing in the git repository pointed by config_git_url
.
Post build modification effect: none. To change the Puppet configuration version, destroy the cluster or change it manually on the Puppet server.
"},{"location":"#44-cluster_name","title":"4.4 cluster_name","text":"Defines the ClusterName
variable in slurm.conf
and the name of the cluster in the Slurm accounting database (see slurm.conf
documentation).
Requirement: Must be lowercase alphanumeric characters and start with a letter. It can include dashes. cluster_name must be 40 characters or less.
Post build modification effect: destroy and re-create all instances at next terraform apply
.
Defines
resolv.conf
search domain as int.{cluster_name}.{domain}
Optional modules following the current module in the example main.tf
can be used to register DNS records in relation to your cluster if the DNS zone of this domain is administered by one of the supported providers. Refer to section 6. DNS Configuration for more details.
Requirements:
[a-z]([-a-z0-9]*[a-z0-9])
, concatenated with periods.*.domain. IN A x.x.x.x
exists for that domain. You can verify no such record exist with dig
: dig +short '*.${domain}'\n
Post build modification effect: destroy and re-create all instances at next terraform apply
.
Defines the name of the image that will be used as the base image for the cluster nodes.
You can use a custom image if you wish, but configuration management should be mainly done through Puppet. Image customization is mostly envisioned as a way to accelerate the configuration process by applying the security patches and OS updates in advance.
To specify a different image for an instance type, use the image
instance attribute
Requirements: the operating system on the image must be from the RedHat family. This includes CentOS (8, 9), Rocky Linux (8, 9), and AlmaLinux (8, 9).
Post build modification effect: none. If this variable is modified, existing instances will ignore the change and future instances will use the new value.
"},{"location":"#461-aws","title":"4.6.1 AWS","text":"The image field needs to correspond to the Amazon Machine Image (AMI) ID. AMI IDs are specific to regions and architectures. Make sure to use the right ID for the region and CPU architecture you are using (i.e: x86_64).
To find out which AMI ID you need to use, refer to - AlmaLinux OS Amazon Web Services AMIs - CentOS list of official images available on the AWS Marketplace - Rocky Linux
Note: Before you can use the AMI, you will need to accept the usage terms and subscribe to the image on AWS Marketplace. On your first deployment, you will be presented an error similar to this one:
\u2502 Error: Error launching source instance: OptInRequired: In order to use this AWS Marketplace product you need to accept terms and subscribe. To do so please visit https://aws.amazon.com/marketplace/pp?sku=cvugziknvmxgqna9noibqnnsy\n\u2502 status code: 401, request id: 1f04a85a-f16a-41c6-82b5-342dc3dd6a3d\n\u2502\n\u2502 on aws/infrastructure.tf line 67, in resource \"aws_instance\" \"instances\":\n\u2502 67: resource \"aws_instance\" \"instances\" {\n
To accept the terms and fix the error, visit the link provided in the error output, then click on the Click to Subscribe
yellow button."},{"location":"#462-microsoft-azure","title":"4.6.2 Microsoft Azure","text":"The image field for Azure can either be a string or a map.
A string image specification will correspond to the image id. Image ids can be retrieved using the following command-line:
az image builder list\n
A map image specification needs to contain the following fields publisher
, offer
sku
, and optionally version
. The map is used to specify images found in Azure Marketplace. Here is an example:
{\n publisher = \"OpenLogic\",\n offer = \"CentOS-CI\",\n sku = \"7-CI\"\n}\n
"},{"location":"#463-openstack","title":"4.6.3 OpenStack","text":"The image name can be a regular expression. If more than one image is returned by the query to OpenStack, the most recent is selected.
"},{"location":"#47-instances","title":"4.7 instances","text":"The instances
variable is a map that defines the virtual machines that will form the cluster. The map' keys define the hostnames and the values are the attributes of the virtual machines.
Each instance is identified by a unique hostname. An instance's hostname is written as the key followed by its index (1-based). The following map:
instances = {\n mgmt = { type = \"p2-4gb\", tags = [...] },\n login = { type = \"p2-4gb\", count = 1, tags = [...] },\n node = { type = \"c2-15gb-31\", count = 2, tags = [...] },\n gpu-node = { type = \"gpu2.large\", count = 3, tags = [...] },\n}\n
will spawn instances with the following hostnames: mgmt1\nlogin1\nnode1\nnode2\ngpu-node1\ngpu-node2\ngpu-node3\n
Hostnames must follow a set of rules, from hostname
man page:
Valid characters for hostnames are ASCII letters from a to z, the digits from 0 to 9, and the hyphen (-). A hostname may not start with a hyphen.
Two attributes are expected to be defined for each instance: 1. type
: name for varying combinations of CPU, memory, GPU, etc. (i.e: t2.medium
); 2. tags
: list of labels that defines the role of the instance.
Tags are used in the Terraform code to identify if devices (volume, network) need to be attached to an instance, while in Puppet code tags are used to identify roles of the instances.
Terraform tags:
login
: identify instances accessible with SSH from Internet and pointed by the domain name A recordspool
: identify instances created only when their hostname appears in the var.pool
list.proxy
: identify instances accessible with HTTP/HTTPS and pointed by the vhost A recordspublic
: identify instances that need to have a public ip address reachable from Internetpuppet
: identify instances configured as Puppet serversspot
: identify instances that are to be spawned as spot/preemptible instances. This tag is supported in AWS, Azure and GCP. It is ignored by OpenStack and OVH.efa
: attach an Elastic Fabric Adapter network interface to the instance. This tag is supported in AWS.Puppet tags expected by the puppet-magic_castle environment.
login
: identify a login instance (minimum: 2 CPUs, 2GB RAM)mgmt
: identify a management instance i.e: FreeIPA server, Slurm controller, Slurm DB (minimum: 2 CPUs, 6GB RAM)nfs
: identify the instance that acts as an NFS server.node
: identify a compute node instance (minimum: 1 CPUs, 2GB RAM)pool
: when combined with node
, it identifies compute nodes that Slurm can resume/suspend to meet workload demand.proxy
: identify the instance that executes the Caddy reverse proxy and JupyterHub.In the Magic Castle Puppet environment, an instance cannot be tagged as mgmt
and proxy
.
You are free to define your own additional tags.
"},{"location":"#472-optional-attributes","title":"4.7.2 Optional attributes","text":"Optional attributes can be defined:
count
: number of virtual machines with this combination of hostname prefix, type and tags to create (default: 1).image
: specification of the image to use for this instance type. (default: global image
value). Refer to section 10.12 - Create a compute node image to learn how this attribute can be leveraged to accelerate compute node configuration.disk_type
: type of the instance's root disk (default: see the next table).
disk_type
disk_size
(GiB) Azure Premium_LRS
30 AWS gp2
10 GCP pd-ssd
20 OpenStack null
10 OVH null
10 disk_size
: size in gibibytes (GiB) of the instance's root disk containing the operating system and service software (default: see the previous table).
mig
: map of NVIDIA Multi-Instance GPU (MIG) short profile names and count used to partition the instances' GPU, example for an A100: mig = { \"1g.5gb\" = 2, \"2g.10gb\" = 1, \"3g.20gb\" = 1 }\n
This is only functional with MIG supported GPUs, and with x86-64 processors (see NVIDIA/mig-parted issue #30).shard
: total number of shards on the node. Sharding allows sharing the same GPU across multiple jobs. The total number of shards is evenly distributed across all GPUs on the node. For some cloud providers, it is possible to define additional attributes. The following sections present the available attributes per provider.
"},{"location":"#aws","title":"AWS","text":"For instances with the spot
tags, these attributes can also be set:
wait_for_fulfillment
(default: true)spot_type
(default: permanent)instance_interruption_behavior
(default: stop)spot_price
(default: not set)block_duration_minutes
(default: not set) [note 1] For more information on these attributes, refer to aws_spot_instance_request
argument referenceNote 1: block_duration_minutes
is not available to new AWS accounts or accounts without billing history - AWS EC2 Spot Instance requests. When not available, its usage can trigger quota errors like this:
Error requesting spot instances: MaxSpotInstanceCountExceeded: Max spot instance count exceeded\n
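Putting these attributes together, a hedged example of a spot-tagged AWS compute node entry (type, count and price are illustrative only) could be:
node = {\n  type = \"t3.medium\"\n  tags = [\"node\", \"spot\"]\n  count = 5\n  spot_price = 0.02\n}\n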
"},{"location":"#azure","title":"Azure","text":"For instances with the spot
tags, these attributes can also be set:
max_bid_price
(default: not set)eviction_policy
(default: Deallocate
) For more information on these attributes, refer to azurerm_linux_virtual_machine
argument referencegpu_type
: name of the GPU model to attach to the instance. Refer to Google Cloud documentation for the list of available models per regiongpu_count
: number of GPUs of the gpu_type
model to attach to the instanceModifying any part of the map after the cluster is built will only affect the type of instances associated with what was modified at the next terraform apply
.
The volumes
variable is a map that defines the block devices that should be attached to instances that have the corresponding key in their list of tags. To each instance with the tag, unique block devices are attached, no multi-instance attachment is supported.
Each volume in map is defined a key corresponding to its and a map of attributes:
size
: size of the block device in GB.type
(optional): type of volume to use. Default value per provider:Premium_LRS
gp2
pd-ssd
null
null
Volumes with a tag that have no corresponding instance will not be created.
In the following example:
instances = {\u00a0\n server = { type = \"p4-6gb\", tags = [\"nfs\"] }\n}\nvolumes = {\n nfs = {\n home = { size = 100 }\n project = { size = 100 }\n scratch = { size = 100 }\n }\n mds = {\n oss1 = { size = 500 }\n oss2 = { size = 500 }\n }\n}\n
The instance server1
will have three volumes attached to it. The volumes tagged mds
are not created since no instances have the corresponding tag.
To define an infrastructure with no volumes, set the volumes
variable to an empty map:
volumes = {}\n
Post build modification effect: destruction of the corresponding volumes and attachments, and creation of new empty volumes and attachments. If an no instance with a corresponding tag exist following modifications, the volumes will be deleted.
"},{"location":"#49-public_keys","title":"4.9 public_keys","text":"List of SSH public keys that will have access to your cluster sudoer account.
Post build modification effect: trigger scp of hieradata files at next terraform apply
. The sudoer account authorized_keys
file will be updated by each instance's Puppet agent following the copy of the hieradata files.
default value: 0
Defines how many guest user accounts will be created in FreeIPA. Each user account shares the same randomly generated password. The usernames are defined as userX
where X
is a number between 1 and the value of nb_users
(zero-padded, i.e.: user01 if X < 100
, user1 if X < 10
).
If an NFS NFS home
volume is defined, each user will have a home folder on a shared NFS storage hosted on the NFS server node.
User accounts do not have sudoer privileges. If you wish to use sudo
, you will have to login using the sudoer account and the SSH keys listed in public_keys
.
If you would like to add a user account after the cluster is built, refer to section 10.3 and 10.4.
Requirement: Must be an integer, minimum value is 0.
Post build modification effect: trigger scp of hieradata files at next terraform apply
. If nb_users
is increased, new guest accounts will be created during the following Puppet run on mgmt1
. If nb_users
is decreased, it will have no effect: the guest accounts already created will be left intact.
default value: 4 random words separated by dots
Defines the password for the guest user accounts instead of using a randomly generated one.
Requirement: Minimum length 8 characters.
The password can be provided in a PKCS7 encrypted form. Refer to sub-section 4.15 eyaml_key for instructions on how to encrypt the password.
Post build modification effect: trigger scp of hieradata files at next terraform apply
. Password of all guest accounts will be changed to match the new password value.
default value: centos
Defines the username of the account with sudo privileges. The account ssh authorized keys are configured with the SSH public keys with public_keys
.
Post build modification effect: none. To change sudoer username, destroy the cluster or redefine the value of profile::base::sudoer_username
in hieradata
.
default value: empty string
Defines custom variable values that are injected in the Puppet hieradata file. Useful to override common configuration of Puppet classes.
List of useful examples:
profile::base::admin_email: \"me@example.org\"\n
profile::fail2ban::ignore_ip: ['132.203.0.0/16', '8.8.8.8']\n
jupyterhub::enable_otp_auth: false\n
prometheus::alertmanager::route:\n group_by:\n - 'alertname'\n - 'cluster'\n - 'service'\n group_wait: '5s'\n group_interval: '5m'\n repeat_interval: '3h'\n receiver: 'slack'\n\nprometheus::alertmanager::receivers:\n - name: 'slack'\n slack_configs:\n - api_url: 'https://hooks.slack.com/services/ABCDEFG123456'\n channel: \"#channel\"\n send_resolved: true\n username: 'username'\n
Refer to the following Puppet modules' documentation to know more about the key-values that can be defined:
The file created from this string can be found on the Puppet server as /etc/puppetlabs/data/user_data.yaml
Requirement: The string needs to respect the YAML syntax.
Post build modification effect: trigger scp of hieradata files at next terraform apply
. Each instance's Puppet agent will be reloaded following the copy of the hieradata files.
default_value: Empty string
Defines the path to a directory containing a hierarchy of YAML data files. The hierarchy is copied on the Puppet server in /etc/puppetlabs/data/user_data
.
Hierarchy structure:
<dir>/hostnames/<hostname>/*.yaml
<dir>/hostnames/<hostname>.yaml
<dir>/prefixes/<prefix>/*.yaml
<dir>/prefixes/<prefix>.yaml
<dir>/*.yaml
For more information on hieradata, refer to section 4.13 hieradata (optional).
Post build modification effect: trigger scp of hieradata files at next terraform apply
. Each instance's Puppet agent will be reloaded following the copy of the hieradata files.
default value: empty string
Defines the private RSA key required to decrypt the values encrypted with hiera-eyaml PKCS7. This key will be copied on the Puppet server.
Post build modification effect: trigger scp of private key file at next terraform apply
.
If you plan to track the cluster configuration files in git (i.e:main.tf
, user_data.yaml
), it would be a good idea to encrypt the sensitive property values.
Magic Castle uses hiera-eyaml to provide a per-value encryption of sensitive properties to be used by Puppet.
The private key and its corresponding public key wrapped in a X509 certificate can be generated with openssl
:
openssl req -x509 -nodes -newkey rsa:2048 -keyout private_key.pkcs7.pem -out public_key.pkcs7.pem -batch\n
or with eyaml
:
eyaml createkeys --pkcs7-public-key=public_key.pkcs7.pem --pkcs7-private-key=private_key.pkcs7.pem\n
"},{"location":"#4152-encrypting-sensitive-properties","title":"4.15.2 Encrypting sensitive properties","text":"To encrypt a sensitive property with openssl:
echo -n 'your-secret' | openssl smime -encrypt -aes-256-cbc -outform der public_key.pkcs7.pem | openssl base64 -A | xargs printf \"ENC[PKCS7,%s]\\n\"\n
To encrypt a sensitive property with eyaml:
eyaml encrypt -s 'your-secret' --pkcs7-public-key=public_key.pkcs7.pem -o string\n
"},{"location":"#4153-terraform-cloud","title":"4.15.3 Terraform cloud","text":"To provide the value of this variable via Terraform Cloud, encode the private key content with base64:
openssl base64 -A -in private_key.pkcs7.pem\n
Define a variable in your main.tf:
variable \"tfc_eyaml_key\" {}\nmodule \"openstack\" {\n ...\n}\n
Then make sure to decode it before passing it to the cloud provider module:
variable \"tfc_eyaml_key\" {}\nmodule \"openstack\" {\n ...\n eyaml_key = base64decode(var.tfc_eyaml_key)\n ...\n}\n
"},{"location":"#416-firewall_rules-optional","title":"4.16 firewall_rules (optional)","text":"default value:
{\n ssh = { \"from_port\" = 22, \"to_port\" = 22, tag = \"login\", \"protocol\" = \"tcp\", \"cidr\" = \"0.0.0.0/0\" },\n http = { \"from_port\" = 80, \"to_port\" = 80, tag = \"proxy\", \"protocol\" = \"tcp\", \"cidr\" = \"0.0.0.0/0\" },\n https = { \"from_port\" = 443, \"to_port\" = 443, tag = \"proxy\", \"protocol\" = \"tcp\", \"cidr\" = \"0.0.0.0/0\" },\n globus = { \"from_port\" = 2811, \"to_port\" = 2811, tag = \"dtn\", \"protocol\" = \"tcp\", \"cidr\" = \"54.237.254.192/29\" },\n myproxy = { \"from_port\" = 7512, \"to_port\" = 7512, tag = \"dtn\", \"protocol\" = \"tcp\", \"cidr\" = \"0.0.0.0/0\" },\n gridftp = { \"from_port\" = 50000, \"to_port\" = 51000, tag = \"dtn\", \"protocol\" = \"tcp\", \"cidr\" = \"0.0.0.0/0\" }\n}\n
Defines a map of firewall rules that control external traffic to the public nodes. Each rule is defined as a map of key-value pairs and has to be assigned a unique name:
from_port
(req.): the lower part of the allowed port range, valid integer value needs to be between 1 and 65535.to_port
(req.): the higher part of the allowed port range, valid integer value needs to be between 1 and 65535.tag
(req.): instances with this tag will be assigned this firewall rule.ethertype
(opt. default: \"IPv4\"
): the layer 3 protocol type (\"IPv4\"
or \"IPv6\"
).protocol
(opt. default: \"tcp\"
): the layer 4 protocol type.cidr
(opt. default: \"0.0.0.0/0\"
): the remote CIDR, the value needs to be a valid CIDR (i.e. 192.168.0.0/16
).If you would like Magic Castle to be able to transfer files and update the state of the cluster in Puppet, make sure there exists at least one effective firewall rule where from_port <= 22 <= to_port
and for which the external IP address of the machine that executes Terraform is in the CIDR range (i.e: cidr = \"0.0.0.0/0\"
being the most permissive). This corresponds to the ssh
rule in the default firewall rule map. This guarantees that Terraform will be able to use SSH to connect to the cluster from anywhere. For more information about this requirement, refer to Magic Castle's bastion tag computation code.
Post build modification effect: modify the cloud provider firewall rules at next terraform apply
.
default_value: \"alliance\"
Defines the scientific software environment that users have access when they login. Possible values are:
\"alliance\"
/ \"computecanada\"
: Digital Research Alliance of Canada scientific software environment (previously Compute Canada environment)\"eessi\"
: European Environment for Scientific Software Installation (EESSI)null
/ \"\"
: no scientific software environmentPost build modification effect: trigger scp of hieradata files at next terraform apply
.
default_value: []
Defines a list of hostnames with the tag \"pool\"
that have to be online. This variable is typically managed by the workload scheduler through Terraform API. For more information, refer to Enable Magic Castle Autoscaling
Post build modification effect: pool
tagged hosts with name present in the list will be instantiated, others will stay uninstantiated or will be destroyed if previously instantiated.
default_value = false
If true, the base image packages will not be upgraded during the first boot. By default, all packages are upgraded.
Post build modification effect: No effect on currently built instances. Ones created after the modification will take into consideration the new value of the parameter to determine whether they should upgrade the base image packages or not.
"},{"location":"#421-puppetfile-optional","title":"4.21 puppetfile (optional)","text":"default_value = \"\"
Defines a second Puppetfile used to install complementary modules with r10k.
Post build modification effect: trigger scp of Puppetfile at next terraform apply
. Each instance's Puppet agent will be reloaded following the installation of the new modules.
Defines the label of the AWS EC2 region where the cluster will be created (i.e.: us-east-2
).
Requirement: Must be in the list of available EC2 regions.
Post build modification effect: rebuild of all resources at next terraform apply
.
default value: None
Defines the label of the data center inside the AWS region where the cluster will be created (i.e.: us-east-2a
). If left blank, it chosen at random amongst the availability zones of the selected region.
Requirement: Must be in a valid availability zone for the selected region. Refer to AWS documentation to find out how list the availability zones.
"},{"location":"#52-microsoft-azure","title":"5.2 Microsoft Azure","text":""},{"location":"#521-location","title":"5.2.1 location","text":"Defines the label of the Azure location where the cluster will be created (i.e.: eastus
).
Requirement: Must be a valid Azure location. To get the list of available location, you can use Azure CLI : az account list-locations -o table
.
Post build modification effect: rebuild of all resources at next terraform apply
.
default value: None
Defines the name of an already created resource group to use. Terraform will no longer attempt to manage a resource group for Magic Castle if this variable is defined and will instead create all resources within the provided resource group. Define this if you wish to use an already created resource group or you do not have a subscription-level access to create and destroy resource groups.
Post build modification effect: rebuild of all instances at next terraform apply
.
default value:
{\n name = null\n product = null\n publisher = null\n}\n
Purchase plan information for Azure Marketplace image. Certain images from Azure Marketplace requires a terms acceptance or a fee to be used. When using this kind of image, you must supply the plan details.
For example, to use the official AlmaLinux image, you have to first add it to your account. Then to use it with Magic Castle, you must supply the following plan information:
plan = {\n name = \"8_7\"\n product = \"almalinux\"\n publisher = \"almalinux\"\n}\n
"},{"location":"#53-google-cloud","title":"5.3 Google Cloud","text":""},{"location":"#531-project","title":"5.3.1 project","text":"Defines the label of the unique identifier associated with the Google Cloud project in which the resources will be created. It needs to corresponds to GCP project ID, which is composed of the project name and a randomly assigned number.
Requirement: Must be a valid Google Cloud project ID.
Post build modification effect: rebuild of all resources at next terraform apply
.
Defines the name of the specific geographical location where the cluster resources will be hosted.
Requirement: Must be a valid Google Cloud region. Refer to Google Cloud documentation for the list of available regions and their characteristics.
"},{"location":"#533-zone-optional","title":"5.3.3 zone (optional)","text":"default value: None
Defines the name of the zone within the region where the cluster resources will be hosted.
Requirement: Must be a valid Google Cloud zone. Refer to Google Cloud documentation for the list of available zones and their characteristics.
"},{"location":"#54-openstack-and-ovh","title":"5.4 OpenStack and OVH","text":""},{"location":"#541-os_floating_ips-optional","title":"5.4.1 os_floating_ips (optional)","text":"default value: {}
Defines a map as an association of instance names (key) to pre-allocated floating ip addresses (value). Example:
os_floating_ips = {\n login1 = \"132.213.13.59\"\n login2 = \"132.213.13.25\"\n }\n
This variable can be useful if you manage your DNS manually and you would like the keep the same domain name for your cluster at each build.
Post build modification effect: change the floating ips assigned to the public instances.
"},{"location":"#542-os_ext_network-optional","title":"5.4.2 os_ext_network (optional)","text":"default value: None
Defines the name of the external network that provides the floating ips. Define this only if your OpenStack cloud provides multiple external networks, otherwise, Terraform can find it automatically.
Post build modification effect: change the floating ips assigned to the public nodes.
"},{"location":"#544-subnet_id-optional","title":"5.4.4 subnet_id (optional)","text":"default value: None
Defines the ID of the internal IPV4 subnet to which the instances are connected. Define this if you have or intend to have more than one subnets defined in your OpenStack project. Otherwise, Terraform can find it automatically. Can be used to force a v4 subnet when both v4 and v6 exist.
Post build modification effect: rebuild of all instances at next terraform apply
.
Some functionalities in Magic Castle require the registration of DNS records under the cluster name in the selected domain. This includes web services like JupyterHub, Mokey and FreeIPA web portal.
If your domain DNS records are managed by one of the supported providers, follow the instructions in the corresponding sections to have the cluster's DNS records created and tracked by Magic Castle.
If your DNS provider is not supported, you can manually create the records. Refer to the subsection 6.3 for more details.
"},{"location":"#61-cloudflare","title":"6.1 Cloudflare","text":"dns
module for Cloudflare in your main.tf
.output \"hostnames\"
block.terraform init
.CLOUDFLARE_EMAIL
and CLOUDFLARE_API_KEY
, where CLOUDFLARE_EMAIL
is your Cloudflare account email address and CLOUDFLARE_API_KEY
is your account Global API Key available in your Cloudflare profile.If you prefer using an API token instead of the global API key, you will need to configure a token with the following four permissions with the Cloudflare API Token interface.
Section Subsection Permission Zone DNS EditInstead of step 5, export only CLOUDFLARE_API_TOKEN
, CLOUDFLARE_ZONE_API_TOKEN
, and CLOUDFLARE_DNS_API_TOKEN
equal to the API token generated previously.
requirement: Install the Google Cloud SDK
gcloud auth application-default login
dns
module for Google Cloud in your main.tf
.output \"hostnames\"
block.main.tf
's dns
module, configure the variables project
and zone_name
with their respective values as defined by your Google Cloud project.terraform init
.If your DNS provider is not currently supported by Magic Castle, you can create the DNS records manually.
Magic Castle provides a module that creates a text file with the DNS records that can then be imported manually in your DNS zone. To use this module, add the following snippet to your main.tf
:
module \"dns\" {\n source = \"./dns/txt\"\n name = module.openstack.cluster_name\n domain = module.openstack.domain\n public_instances = module.openstack.public_instances\n}\n
Find and replace openstack
in the previous snippet by your cloud provider of choice if not OpenStack (i.e: aws
, gcp
, etc.).
The file will be created after the terraform apply
in the same folder as your main.tf
and will be named as ${name}.${domain}.txt
.
Magic Castle DNS module creates SSHFP records for all instances with a public ip address. These records can be used by SSH clients to verify the SSH host keys of the server. If DNSSEC is enabled for the domain and the SSH client is correctly configured, no host key confirmation will be prompted when connecting to the server.
For more information on how to activate DNSSEC, refer to your DNS provider documentation:
To setup an SSH client to use SSHFP records, add
VerifyHostKeyDNS yes\n
to its configuration file (i.e.: ~/.ssh/config
)."},{"location":"#7-planning","title":"7. Planning","text":"Once your initial cluster configuration is done, you can initiate a planning phase where you will ask Terraform to communicate with your cloud provider and verify that your cluster can be built as it is described by the main.tf
configuration file.
Terraform should now be able to communicate with your cloud provider. To test your configuration file, enter the following command
terraform plan\n
This command will validate the syntax of your configuration file and communicate with the provider, but it will not create new resources. It is only a dry-run. If Terraform does not report any error, you can move to the next step. Otherwise, read the errors and fix your configuration file accordingly.
"},{"location":"#8-deployment","title":"8. Deployment","text":"To create the resources defined by your main, enter the following command
terraform apply\n
The command will produce the same output as the plan
command, but after the output it will ask for a confirmation to perform the proposed actions. Enter yes
.
Terraform will then proceed to create the resources defined by the configuration file. It should take a few minutes. Once the creation process is completed, Terraform will output the guest account usernames and password, the sudoer username and the floating ip of the login node.
Warning: although the instance creation process is finished once Terraform outputs the connection information, you will not be able to connect and use the cluster immediately. The instance creation is only the first phase of the cluster-building process. The configuration: the creation of the user accounts, installation of FreeIPA, Slurm, configuration of JupyterHub, etc.; takes around 15 minutes after the instances are created.
Once it is booted, you can follow an instance configuration process by looking at:
/var/log/cloud-init-output.log
journalctl -u puppet
If unexpected problems occur during configuration, you can provide these logs to the authors of Magic Castle to help you debug.
"},{"location":"#81-deployment-customization","title":"8.1 Deployment Customization","text":"You can modify the main.tf
at any point of your cluster's life and apply the modifications while it is running.
Warning: Depending on the variables you modify, Terraform might destroy some or all resources, and create new ones. The effects of modifying each variable are detailed in the subsections of Configuration.
For example, to increase the number of computes nodes by one. Open main.tf
, add 1 to node
's count
, save the document and call
terraform apply\n
Terraform will analyze the difference between the current state and the future state, and plan the creation of a single new instance. If you accept the action plan, the instance will be created, provisioned and eventually automatically add to the Slurm cluster configuration.
You could do the opposite and reduce the number of compute nodes to 0.
"},{"location":"#9-destruction","title":"9. Destruction","text":"Once you're done working with your cluster and you would like to recover the resources, in the same folder as main.tf
, enter:
terraform destroy -refresh=false\n
The -refresh=false
\u00a0flag is to avoid an issue where one or many of the data sources return no results and stall the cluster destruction with a message like the following:
Error: Your query returned no results. Please change your search criteria and try again.\n
This type of error happens when for example the specified image no longer exists (see issue #40). As for apply
, Terraform will output a plan that you will have to confirm by entering yes
.
Warning: once the cluster is destroyed, nothing will be left, even the shared storage will be erased.
"},{"location":"#91-instance-destruction","title":"9.1 Instance Destruction","text":"It is possible to destroy only the instances and keep the rest of the infrastructure like the floating ip, the volumes, the generated SSH host key, etc. To do so, set the count value of the instance type you wish to destroy to 0.
"},{"location":"#92-reset","title":"9.2 Reset","text":"On some occasions, it is desirable to rebuild some of the instances from scratch. Using terraform taint
, you can designate resources that will be rebuilt at next application of the plan.
To rebuild the first login node :
terraform taint 'module.openstack.openstack_compute_instance_v2.instances[\"login1\"]'\nterraform apply\n
"},{"location":"#10-customize-cluster-software-configuration","title":"10. Customize Cluster Software Configuration","text":"Once the cluster is online and configured, you can modify its configuration as you see fit. We list here how to do most commonly asked for customizations.
Some customizations are done from the Puppet server instance (puppet
). To connect to the puppet server, follow these steps:
ssh -A centos@cluster_ip
. Replace centos
by the value of sudoer_username
if it is different.ssh puppet
Note on Google Cloud: In GCP, OS Login lets you use Compute Engine IAM roles to manage SSH access to Linux instances. This feature is incompatible with Magic Castle. Therefore, it is turned off in the instances metadata (enable-oslogin=\"FALSE\"
). The only account with sudoer rights that can log in the cluster is configured by the variable sudoer_username
(default: centos
).
If you plan to modify configuration files manually, you will need to disable Puppet. Otherwise, you might find out that your modifications have disappeared in a 30-minute window.
Puppet executes a run every 30 minutes and at reboot. To disable puppet:
sudo puppet agent --disable \"<MESSAGE>\"\n
"},{"location":"#102-replace-the-guest-account-password","title":"10.2 Replace the Guest Account Password","text":"Refer to section 4.11.
"},{"location":"#103-add-ldap-users","title":"10.3 Add LDAP Users","text":"Users can be added to Magic Castle LDAP database (FreeIPA) with either one of the following methods: hieradata, command-line, and Mokey web-portal. Each method is presented in the following subsections.
New LDAP users are automatically assigned a home folder on NFS.
Magic Castle determines if an LDAP user should be member of a Slurm account based on its POSIX groups. When a user is added to a POSIX group, a daemon try to match the group name to the following regular expression:
(ctb|def|rpp|rrg)-[a-z0-9_-]*\n
If there is a match, the user will be added to a Slurm account with the same name, and will gain access to the corresponding project folder under /project
.
Note: The regular expression represents how Compute Canada names its resources allocation. The regular expression can be redefined, see profile::accounts:::project_regex
Using the hieradata variable in the main.tf
, it is possible to define LDAP users.
Examples of LDAP user definition with hieradata are provided in puppet-magic_castle documentation.
"},{"location":"#1032-command-line","title":"10.3.2 Command-Line","text":"To add a user account after the cluster is built, log in mgmt1
and call:
kinit admin\nIPA_GUEST_PASSWD=<new_user_passwd> /sbin/ipa_create_user.py <username> [--group <group_name>]\nkdestroy\n
"},{"location":"#1033-mokey","title":"10.3.3 Mokey","text":"If user sign-up with Mokey is enabled, users can create their own account at
https://mokey.yourcluster.domain.tld/auth/signup\n
It is possible that an administrator is required to enable the account with Mokey. You can access the administrative panel of FreeIPA at :
https://ipa.yourcluster.domain.tld/\n
The FreeIPA administrator credentials can be retrieved from an encrypted file on the Puppet server. Refer to section 10.14 to know how.
"},{"location":"#104-increase-the-number-of-guest-accounts","title":"10.4 Increase the Number of Guest Accounts","text":"To increase the number of guest accounts after creating the cluster with Terraform, simply increase the value of nb_users
, then call :
terraform apply\n
Each instance's Puppet agent will be reloaded following the copy of the hieradata files, and the new accounts will be created.
"},{"location":"#105-restrict-ssh-access","title":"10.5 Restrict SSH Access","text":"By default, instances tagged login
have their port 22 opened to entire world. If you know the range of ip addresses that will connect to your cluster, we strongly recommend that you limit the access to port 22 to this range.
To limit the access to port 22, refer to section 4.14 firewall_rules, and replace the cidr
of the ssh
rule to match the range of ip addresses that have be the allowed to connect to the cluster. If there are more than one range, create multiple rules with distinct names.
The default Python kernel corresponds to the Python installed in /opt/ipython-kernel
. Each compute node has its own copy of the environment. To add packages to this environment, add the following lines to hieradata
in main.tf
:
jupyterhub::kernel::venv::packages:\n - package_A\n - package_B\n - package_C\n
and replace package_*
by the packages you need to install. Then call:
terraform apply\n
"},{"location":"#107-activate-globus-endpoint","title":"10.7 Activate Globus Endpoint","text":"No longer supported
"},{"location":"#108-recovering-from-puppet-rebuild","title":"10.8 Recovering from puppet rebuild","text":"The modifications of some of the parameters in the main.tf
file can trigger the rebuild of the puppet
instance. This instance hosts the Puppet Server on which depends the Puppet agent of the other instances. When puppet
is rebuilt, the other Puppet agents cease to recognize Puppet Server identity since the Puppet Server identity and certificates have been regenerated.
To fix the Puppet agents, you will need to apply the following commands on each instance other than puppet
once puppet
is rebuilt:
sudo systemctl stop puppet\nsudo rm -rf /etc/puppetlabs/puppet/ssl/\nsudo systemctl start puppet\n
Then, on puppet
, you will need to sign the new certificate requests made by the instances. First, you can list the requests:
sudo /opt/puppetlabs/bin/puppetserver ca list\n
Then, if every instance is listed, you can sign all requests:
sudo /opt/puppetlabs/bin/puppetserver ca sign --all\n
If you prefer, you can sign individual request by specifying their name:
sudo /opt/puppetlabs/bin/puppetserver ca sign --certname NAME[,NAME]\n
"},{"location":"#109-dealing-with-banned-ip-addresses-fail2ban","title":"10.9 Dealing with banned ip addresses (fail2ban)","text":"Login nodes run fail2ban, an intrusion prevention software that protects login nodes from brute-force attacks. fail2ban is configured to ban ip addresses that attempted to login 20 times and failed in a window of 60 minutes. The ban time is 24 hours.
In the context of a workshop with SSH novices, the 20-attempt rule might be triggered, resulting in participants banned and puzzled, which is a bad start for a workshop. There are solutions to mitigate this problem.
"},{"location":"#1091-define-a-list-of-ip-addresses-that-can-never-be-banned","title":"10.9.1 Define a list of ip addresses that can never be banned","text":"fail2ban keeps a list of ip addresses that are allowed to fail to login without risking jail time. To add an ip address to that list, add the following lines to the variable hieradata
\u00a0in main.tf
:
profile::fail2ban::ignoreip:\n - x.x.x.x\n - y.y.y.y\n
where x.x.x.x
and y.y.y.y
are ip addresses you want to add to the ignore list. The ip addresses can be written using CIDR notations. The ignore ip list on Magic Castle already includes 127.0.0.1/8
and the cluster subnet CIDR. Once the line is added, call:
terraform apply\n
"},{"location":"#1092-remove-fail2ban-ssh-route-jail","title":"10.9.2 Remove fail2ban ssh-route jail","text":"fail2ban rule that banned ip addresses that failed to connect with SSH can be disabled. To do so, add the following line to the variable hieradata
\u00a0in main.tf
:
fail2ban::jails: ['ssh-ban-root']\n
This will keep the jail that automatically ban any ip that tries to login as root, and remove the ssh failed password jail. Once the line is added, call:
terraform apply\n
"},{"location":"#1093-unban-ip-addresses","title":"10.9.3 Unban ip addresses","text":"fail2ban ban ip addresses by adding rules to iptables. To remove these rules, you need to tell fail2ban to unban the ips.
To list the ip addresses that are banned, execute the following command:
sudo fail2ban-client status ssh-route\n
To unban ip addresses, enter the following command followed by the ip addresses you want to unban:
sudo fail2ban-client set ssh-route unbanip\n
"},{"location":"#1094-disable-fail2ban","title":"10.9.4 Disable fail2ban","text":"While this is not recommended, fail2ban can be completely disabled. To do so, add the following line to the variable hieradata
\u00a0in main.tf
:
fail2ban::service_ensure: 'stopped'\n
then call :
terraform apply\n
"},{"location":"#1011-set-selinux-in-permissive-mode","title":"10.11 Set SELinux in permissive mode","text":"SELinux can be set in permissive mode to debug new workflows that would be prevented by SELinux from working properly. To do so, add the following line to the variable hieradata
\u00a0in main.tf
:
selinux::mode: 'permissive'\n
"},{"location":"#1012-create-a-compute-node-image","title":"10.12 Create a compute node image","text":"When scaling the compute node pool, either manually by changing the count or automatically with Slurm autoscale, it can become beneficial to reduce the time spent configuring the machine when it boots for the first time, hence reducing the time requires before it becomes available in Slurm. One way to achieve this is to clone the root disk of a fully configured compute node and use it as the base image of future compute nodes.
This process has three steps:
The following subsection explains how to accomplish each step.
Warning: While it will work in most cases, avoid re-using the compute node image of a previous deployment. The preparation steps cleans most of the deployment specific configuration and secrets, but there is no guarantee that the configuration will be entirely compatible with a different deployment.
"},{"location":"#10121-prepare-the-volume-for-cloning","title":"10.12.1 Prepare the volume for cloning","text":"The environment puppet-magic_castle installs a script that prepares the volume for cloning named prepare4image.sh
.
To make sure a node is ready for cloning, open its puppet agent log and validate the catalog was successfully applied at least once:
journalctl -u puppet | grep \"Applied catalog\"\n
To prepare the volume for cloning, execute the following line while connected to the compute node:
sudo /usr/sbin/prepare4image.sh\n
Be aware that, since it is preferable for the instance to be powered off when cloning its volume, the script halts the machine once it is completed. Therefore, after executing prepare4image.sh
, you will be disconnected from the instance.
The script prepare4image.sh
executes the following steps in order:
slurmd
)/etc
/etc/fstab
/var/log/message
contentOnce the instance is powered off, access your cloud provider dashboard, find the instance and follow the provider's instructions to create the image.
Note down the name/id of the image you created, it will be needed during the next step.
"},{"location":"#10123-configure-magic-castle-terraform-code-to-use-the-new-image","title":"10.12.3 Configure Magic Castle Terraform code to use the new image","text":"Edit your main.tf
and add image = \"name-or-id-of-your-image\"
to the dictionary defining the instance. The instance previously powered off will be powered on and future non-instantiated machines will use the image at the next execution of terraform apply
.
If the cluster is composed of heterogeneous compute nodes, it is possible to create an image for each type of compute nodes. Here is an example with Google Cloud
instances = {\n mgmt = { type = \"n2-standard-2\", tags = [\"puppet\", \"mgmt\", \"nfs\"], count = 1 }\n login = { type = \"n2-standard-2\", tags = [\"login\", \"public\", \"proxy\"], count = 1 }\n node = {\n type = \"n2-standard-2\"\n tags = [\"node\", \"pool\"]\n count = 10\n image = \"rocky-mc-cpu-node\"\n }\n gpu = {\n type = \"n1-standard-2\"\n tags = [\"node\", \"pool\"]\n count = 10\n gpu_type = \"nvidia-tesla-t4\"\n gpu_count = 1\n image = \"rocky-mc-gpu-node\"\n }\n}\n
"},{"location":"#1013-read-and-edit-secret-values-generated-at-boot","title":"10.13 Read and edit secret values generated at boot","text":"During the cloud-init initialization phase, bootstrap.sh
script is executed. This script generates a set of encrypted secret values that are required by the Magic Castle Puppet environment:
profile::consul::acl_api_token
profile::freeipa::mokey::password
profile::freeipa::server::admin_password
profile::freeipa::server::ds_password
profile::slurm::accounting::password
profile::slurm::base::munge_key
To read or change the value of one of these keys, use eyaml edit
command on the puppet
host, like this:
sudo /opt/puppetlabs/puppet/bin/eyaml edit \\\n --pkcs7-private-key /etc/puppetlabs/puppet/eyaml/boot_private_key.pkcs7.pem \\\n --pkcs7-public-key /etc/puppetlabs/puppet/eyaml/boot_public_key.pkcs7.pem \\\n /etc/puppetlabs/code/environments/production/data/bootstrap.yaml\n
It is also possible to redefine the values of these keys by adding the key-value pair to the hieradata configuration file. Refer to section 4.13 hieradata. User defined values take precedence over boot generated values in the Magic Castle Puppet data hierarchy.
"},{"location":"#1014-expand-a-volume","title":"10.14 Expand a volume","text":"Volumes defined in the volumes
map can be expanded at will. To enable online extension of a volume, add enable_resize = true
to its specs map. You can then increase the size at will. The corresponding volume will be expanded by the cloud provider and the filesystem will be extended by Puppet.
You can modify the Terraform module files in the folder named after your cloud provider (e.g: gcp
, openstack
, aws
, etc.)
Figure 1 (below) illustrates how Magic Castle is structured to provide a unified interface between multiple cloud providers. Each blue block is a file or a module, while white blocks are variables or resources. Arrows indicate variables or resources that contribute to the definition of the linked variables or resources. The figure can be read as a flow-chart from top to bottom. Some resources and variables have been left out of the chart to avoid cluttering it further.
Figure 1. Magic Castle Terraform Project Structure
main.tf
: User provides the instances and volumes structure they wants as _map_s. instances = {\n mgmt = { type = \"p4-7.5gb\", tags = [\"puppet\", \"mgmt\", \"nfs\"] }\n login = { type = \"p2-3.75gb\", tags = [\"login\", \"public\", \"proxy\"] }\n node = { type = \"p2-3.75gb\", tags = [\"node\"], count = 2 }\n}\n\nvolumes = {\n nfs = {\n home = { size = 100 }\n project = { size = 500 }\n scratch = { size = 500 }\n }\n}\n
common/design
:
instances
map is expanded to form a new map where each entry represents a single host. instances = {\n mgmt1 = {\n type = \"p2-3.75gb\"\n tags = [\"puppet\", \"mgmt\", \"nfs\"]\n }\n login1 = {\n type = \"p2-3.75gb\"\n tags = [\"login\", \"public\", \"proxy\"]\n }\n node1 = {\n type = \"p2-3.75gb\"\n tags = [\"node\"]\n }\n node2 = {\n type = \"p2-3.75gb\"\n tags = [\"node\"]\n }\n}\n
volumes
map is expanded to form a new map where each entry represent a single volume volumes = {\n mgmt1-nfs-home = { size = 100 }\n mgmt1-nfs-project = { size = 100 }\n mgmt1-nfs-scratch = { size = 500 }\n}\n
network.tf
: the instances
map from common/design
is used to generate a network interface (nic) for each host, and a public ip address for each host with the public
tag.
resource \"provider_network_interface\" \"nic\" {\n for_each = module.design.instances\n ...\n}\n
common/configuration
: for each host in instances
, a cloud-init yaml config that includes puppetservers
is generated. These configs are outputted to a user_data
map where the keys are the hostnames.
user_data = {\n for key, values in var.instances :\n key => templatefile(\"${path.module}/puppet.yaml\", { ... })\n}\n
infrastructure.tf
: for each host in instances
, an instance resource as defined by the selected cloud provider is generated. Each instance is initially configured by its user_data
cloud-init yaml config.
resource \"provider_instance\" \"instances\" {\n for_each = module.design.instance\n user_data = module.instance_config.user_data[each.key]\n ...\n}\n
infrastructure.tf
: for each volume in volumes
, a block device as defined by the selected cloud provider is generated and attached it to its matching instance using an attachment
resource.
resource \"provider_volume\" \"volumes\" {\n for_each = module.design.volumes\n size = each.value.size\n ...\n}\nresource \"provider_attachment\" \"attachments\" {\n for_each = module.design.volumes\n instance_id = provider_instance.instances[each.value.instance].id\n volume_id = provider_volume.volumes[each.key].id\n ...\n}\n
infrastructure.tf
: the created instances' information are consolidated in a map named inventory
.
inventory = {\n mgmt1 = {\n public_ip = \"\"\n local_ip = \"10.0.0.1\"\n id = \"abc1213-123-1231\"\n tags = [\"mgmt\", \"puppet\", \"nfs\"]\n }\n ...\n}\n
common/provision
: the information from created instances is consolidated and written in a yaml file namedterraform_data.yaml
that is uploaded on the Puppet server as part of the hieradata.
resource \"terraform_data\" \"deploy_puppetserver_files\" {\n ...\n provisioner \"file\" {\n content = var.terraform_data\n destination = \"terraform_data.yaml\"\n }\n ...\n}\n
outputs.tf
: the information of all instances that have a public address are output as a map named public_instances
.
In the previous section, we have used generic resource name when writing HCL code that defines these resources. The following table indicate what resource is used for each provider based on its role in the cluster.
Resource AWS Azure Google Cloud Platform OpenStack OVH network aws_vpc azurerm_virtual_network google_compute_network prebuilt openstack_networking_network_v2 subnet aws_subnet azurerm_subnet google_compute_subnetwork prebuilt openstack_networking_subnet_v2 router aws_route not used google_compute_router built-in not used nat aws_internet_gateway not used google_compute_router_nat built-in not used firewall aws_security_group azurerm_network_security_group google_compute_firewall openstack_compute_secgroup_v2 openstack_compute_secgroup_v2 nic aws_network_interface azurerm_network_interface google_compute_address openstack_networking_port_v2 openstack_networking_port_v2 public ip aws_eip azurerm_public_ip google_compute_address openstack_networking_floatingip_v2 openstack_networking_network_v2 instance aws_instance azurerm_linux_virtual_machine google_compute_instance openstack_compute_instance_v2 openstack_compute_instance_v2 volume aws_ebs_volume azurerm_managed_disk google_compute_disk openstack_blockstorage_volume_v3 openstack_blockstorage_volume_v3 attachment aws_volume_attachment azurerm_virtual_machine_data_disk_attachment google_compute_attached_disk openstack_compute_volume_attach_v2 openstack_compute_volume_attach_v2"},{"location":"design/#using-reference-design-to-extend-for-a-new-cloud-provider","title":"Using reference design to extend for a new cloud provider","text":"Magic Castle currently supports five cloud providers, but its design makes it easy to add new providers. This section presents a step-by-step guide to add a new cloud provider support to Magic Castle.
Identify the resources. Using the Resource per provider table, read the cloud provider Terraform documentation, and identify the name for each resource in the table.
Check minimum requirements. Once all resources have been identified, you should be able to determine if the cloud provider can be used to deploy Magic Castle. If you found a name for each resource listed in table, the cloud provider can be supported. If some resources are missing, you will need to read the provider's documentation to determine if the absence of the resource can be compensated for somehow.
Initialize the provider folder. Create a folder named after the provider. In this folder, create two symlinks, one pointing to common/variables.tf
and the other to common/outputs.tf
. These files define the interface common to all providers supported by Magic Castle.
Define cloud provider specifics variables. Create a file named after your provider provider_name.tf
\u00a0and define variables that are required by the provider but not common to all providers, for example the availability zone or the region. In this file, define two local variables named cloud_provider
and cloud_region
.
Initialize the infrastructure. Create a file named infrastructure.tf
. In this file:
provider \"provider_name\" {\n region = var.region\n}\n
module \"design\" {\n source = \"../common/design\"\n cluster_name = var.cluster_name\n domain = var.domain\n instances = var.instances\n pool = var.pool\n volumes = var.volumes\n}\n
Create the networking infrastructure. Create a file named network.tf
and define the network, subnet, router, nat, firewall, nic and public ip resources using the module.design.instances
map.
Create the volumes. In infrastructure.tf
, define the volumes
resource using module.design.volumes
.
Consolidate the instances' information. In infrastructure.tf
, define a local variable named inventory
that will be a map containing the following keys for each instance: public_ip
, local_ip
, prefix
, tags
, and specs
(#cpu, #gpus, ram, volumes). For the volumes, you need to provide the paths under which the volumes will be found on the instances to which they are attached. This is typically derived from the volume id. Here is an example:
volumes = contains(keys(module.design.volume_per_instance), x) ? {\n for pv_key, pv_values in var.volumes:\n pv_key => {\n for name, specs in pv_values:\n name => [\"/dev/disk/by-id/*${substr(provider.volumes[\"${x}-${pv_key}-${name}\"].id, 0, 20)}\"]\n } if contains(values.tags, pv_key)\n } : {}\n
Create the instance configurations. In infrastructure.tf
, include the common/configuration
module like this:
module \"configuration\" {\n source = \"../common/configuration\"\n inventory = local.inventory\n config_git_url = var.config_git_url\n config_version = var.config_version\n sudoer_username = var.sudoer_username\n public_keys = var.public_keys\n domain_name = module.design.domain_name\n cluster_name = var.cluster_name\n guest_passwd = var.guest_passwd\n nb_users = var.nb_users\n software_stack = var.software_stack\n cloud_provider = local.cloud_provider\n cloud_region = local.cloud_region\n}\n
Create the instances. In infrastructure.tf
, define the instances
resource using module.design.instances_to_build
for the instance attributes and module.configuration.user_data
for the initial configuration.
Attach the volumes. In infrastructure.tf
, define the attachments
resource using module.design.volumes
and refer to the attribute each.value.instance
to retrieve the instance's id to which the volume needs to be attached.
Identify the public instances. In infrastructure.tf
, define a local variable named public_instances
that contains the attributes of instances that are publicly accessible from Internet and their ids.
locals {\n public_instances = { for host in keys(module.design.instances_to_build):\n host => merge(module.configuration.inventory[host], {id=cloud_provider_instance_resource.instances[host].id})\n if contains(module.configuration.inventory[host].tags, \"public\")\n }\n}\n
Include the provision module to transmit Terraform data to the Puppet server. In infrastructure.tf
, include the common/provision
module like this
module \"provision\" {\n source = \"../common/provision\"\n bastions = local.public_instances\n puppetservers = module.configuration.puppetservers\n tf_ssh_key = module.configuration.ssh_key\n terraform_data = module.configuration.terraform_data\n terraform_facts = module.configuration.terraform_facts\n hieradata = var.hieradata\n sudoer_username = var.sudoer_username\n}\n
Identify the resources. For Digital Ocean, Oracle Cloud and Alibaba Cloud, we get the following resource mapping: | Resource | Digital Ocean | Oracle Cloud | Alibaba Cloud | | ----------- | :-------------------- | :-------------------- | :-------------------- | | network | digitalocean_vpc | oci_core_vcn | alicloud_vpc | | subnet | built in vpc | oci_subnet | alicloud_vswitch | | router | n/a | oci_core_route_table | built in vpc | | nat | n/a | oci_core_internet_gateway | alicloud_nat_gateway | | firewall | digitalocean_firewall | oci_core_security_list | alicloud_security_group | | nic | n/a | built in instance | alicloud_network_interface | | public ip | digitalocean_floating_ip | built in instance | alicloud_eip | | instance | digitalocean_droplet | oci_core_instance | alicloud_instance | | volume | digitalocean_volume | oci_core_volume | alicloud_disk | | attachment | digitalocean_volume_attachment | oci_core_volume_attachment | alicloud_disk_attachment |
Check minimum requirements. In the preceding table, we can see Digital Ocean does not have the ability to define a network interface. The documentation also leads us to conclude that it is not possible to define the private ip address of the instances before creating them. Because the Puppet server ip address is required before generating the cloud-init YAML config for all instances, including the Puppet server itself, this means it impossible to use Digital Ocean to spawn a Magic Castle cluster. Oracle Cloud presents the same issue, however, after reading the instance documentation, we find that it is possible to define a static ip address as a string in the instance attribute. It would therefore be possible to create a datastructure in Terraform that would associate each instance hostname with an ip address in the subnet CIDR. Alibaba cloud has an answer for each resource, so we will use this provider in the following steps.
Initialize the provider folder. In a terminal:
git clone https://github.com/ComputeCanada/magic_castle.git\ncd magic_castle\nmkdir alicloud\ncd aliclcoud\nln -s ../common/{variables,outputs}.tf .\n
Define cloud provider specifics variables. Add the following to a new file alicloud.tf
:
variable \"region\" { }\nlocals {\n cloud_provider = \"alicloud\"\n cloud_region = var.region\n}\n
Initialize the infrastructure. Add the following to a new file infrastructure.tf
:
provider \"alicloud\" {\n region = var.region\n}\n\nmodule \"design\" {\n source = \"../common/design\"\n cluster_name = var.cluster_name\n domain = var.domain\n instances = var.instances\n pool = var.pool\n volumes = var.volumes\n}\n
Create the networking infrastructure. network.tf
base template:
resource \"alicloud_vpc\" \"network\" { }\nresource \"alicloud_vswitch\" \"subnet\" { }\nresource \"alicloud_nat_gateway\" \"nat\" { }\nresource \"alicloud_security_group\" \"firewall\" { }\nresource \"alicloud_security_group_rule\" \"allow_in_services\" { }\nresource \"alicloud_security_group\" \"allow_any_inside_vpc\" { }\nresource \"alicloud_security_group_rule\" \"allow_ingress_inside_vpc\" { }\nresource \"alicloud_security_group_rule\" \"allow_egress_inside_vpc\" { }\nresource \"alicloud_network_interface\" \"nic\" { }\nresource \"alicloud_eip\" \"public_ip\" { }\nresource \"alicloud_eip_association\" \"eip_asso\" { }\n
Create the volumes. Add and complete the following snippet to infrastructure.tf
:
resource \"alicloud_disk\" \"volumes\" {\n for_each = module.design.volumes\n}\n
Consolidate the instances' information. Add the following snippet to infrastructure.tf
:
locals {\n inventory = { for x, values in module.design.instances :\n x => {\n public_ip = contains(values[\"tags\"], \"public\") ? alicloud_eip.public_ip[x].public_ip : \"\"\n local_ip = alicloud_network_interface.nic[x].private_ip\n tags = values[\"tags\"]\n id = alicloud_instance.instances[x].id\n specs = {\n cpus = ...\n gpus = ...\n ram = ...\n volumes = contains(keys(module.design.volume_per_instance), x) ? {\n for pv_key, pv_values in var.volumes:\n pv_key => {\n for name, specs in pv_values:\n name => [\"/dev/disk/by-id/virtio-${replace(alicloud_disk.volumes[\"${x}-${pv_key}-${name}\"].id, \"d-\", \"\")}\"]\n } if contains(values.tags, pv_key)\n } : {}\n }\n }\n }\n}\n
Create the instance configurations. In infrastructure.tf
, include the common/configuration
module like this:
module \"configuration\" {\n source = \"../common/configuration\"\n inventory = local.inventory\n config_git_url = var.config_git_url\n config_version = var.config_version\n sudoer_username = var.sudoer_username\n public_keys = var.public_keys\n domain_name = module.design.domain_name\n cluster_name = var.cluster_name\n guest_passwd = var.guest_passwd\n nb_users = var.nb_users\n software_stack = var.software_stack\n cloud_provider = local.cloud_provider\n cloud_region = local.cloud_region\n}\n
Create the instances. Add and complete the following snippet to infrastructure.tf
:
resource \"alicloud_instance\" \"instances\" {\n for_each = module.design.instances\n}\n
Attach the volumes. Add and complete the following snippet to infrastructure.tf
:
resource \"alicloud_disk_attachment\" \"attachments\" {\n for_each = module.design.volumes\n}\n
Identify the public instances. In infrastructure.tf
, define a local variable named public_instances
that contains the attributes of instances that are publicly accessible from Internet and their ids.
locals {\n public_instances = { for host in keys(module.design.instances_to_build):\n host => merge(module.configuration.inventory[host], {id=alicloud_instance.instances[host].id})\n if contains(module.configuration.inventory[host].tags, \"public\")\n }\n}\n
Include the provision module to transmit Terraform data to the Puppet server. In infrastructure.tf
, include the common/provision
module like this
module \"provision\" {\n source = \"../common/provision\"\n bastions = local.public_instances\n puppetservers = module.configuration.puppetservers\n tf_ssh_key = module.configuration.ssh_key\n terraform_data = module.configuration.terraform_data\n terraform_facts = module.configuration.terraform_facts\n hieradata = var.hieradata\n}\n
Once your new provider is written, you can write an example that will use the module to spawn a Magic Castle cluster with that provider.
module \"alicloud\" {\n source = \"./alicloud\"\n config_git_url = \"https://github.com/ComputeCanada/puppet-magic_castle.git\"\n config_version = \"main\"\n\n cluster_name = \"new\"\n domain = \"my.cloud\"\n image = \"centos_7_9_x64_20G_alibase_20210318.vhd\"\n nb_users = 10\n\n instances = {\n mgmt = { type = \"ecs.g6.large\", tags = [\"puppet\", \"mgmt\", \"nfs\"] }\n login = { type = \"ecs.g6.large\", tags = [\"login\", \"public\", \"proxy\"] }\n node = { type = \"ecs.g6.large\", tags = [\"node\"], count = 1 }\n }\n\n volumes = {\n nfs = {\n home = { size = 10 }\n project = { size = 50 }\n scratch = { size = 50 }\n }\n }\n\n public_keys = [file(\"~/.ssh/id_rsa.pub\")]\n\n # Alicloud specifics\n region = \"us-west-1\"\n}\n
"},{"location":"developers/","title":"Magic Castle Developer Documentation","text":""},{"location":"developers/#table-of-content","title":"Table of Content","text":"To develop for Magic Castle you will need: * Terraform (>= 1.4.0) * git * Access to a Cloud (e.g.: Compute Canada Arbutus) * Ability to communicate with the cloud provider API from your computer * A cloud project with enough room for the resource described in section Magic Caslte Doc 1.1. * [optional] Puppet Development Kit (PDK)
"},{"location":"developers/#2-where-to-start","title":"2. Where to start","text":"The Magic Castle project is defined by Terraform infrastructure-as-code component that is responsible of generating a cluster architecture in a cloud and a Puppet environment component that configures the cluster instances based on their role.
If you wish to add a device, an instance, a new networking interface, or a filesystem, you will most likely need to develop some Terraform code. The project structure for Terraform code is described in the reference design document. The document also describes how one could work with the current Magic Castle code to add support for another cloud provider.
If you wish to add a service to one of the Puppet environments, install new software, or modify an instance configuration or role, you will most likely need to develop some Puppet code. The following section provides more details on the Puppet environments available and how to develop them.
"},{"location":"developers/#3-puppet-environment","title":"3. Puppet environment","text":"Magic Castle Terraform code initialized every instances to be a Puppet agent and an instance with the tag puppet
 as the Puppet main server. The Puppet main server hosts a folder containing the configuration code for the cluster's instances; this folder is called a Puppet environment, and it is pulled from GitHub during the initial configuration of the Puppet main server.
The source of that environment is provided to Terraform using the variable config_git_url
.
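For example, pointing a cluster at a fork of the Puppet environment is a matter of setting these two variables in the main.tf; the URL below is illustrative:
module \"openstack\" {\n # ... other parameters unchanged\n config_git_url = \"https://github.com/MyOrg/puppet-magic_castle.git\"\n config_version = \"main\"\n}\n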
A repository describing a Magic Castle Puppet environment must contain at least the following files and folders:
config_git_repo\n\u2523 Puppetfile\n\u2523 environment.conf\n\u2523 hiera.yaml\n\u2517 data\n \u2517 common.yaml\n\u2517 manifests/\n \u2517 site.pp\n
Puppetfile
specifies the Puppet modules that need to be installed in the environment.environment.conf
overrides the primary server default settings for the environment.hiera.yaml
configures an ordered list of YAML file data sources.data/common.yaml
 is the common data source for the instances, part of the hierarchy defined by hiera.yaml
.manifests/site.pp
defines how each instance will be configured based on their hostname and/or tags.An example of a bare-bone Magic Castle Puppet environment is available on GitHub: MagicCastle/puppet-environment, while the Puppet environment that replicates a Compute Canada HPC cluster is named ComputeCanada/puppet-magic_castle.
"},{"location":"developers/#terraform_datayaml-a-bridge-between-terraform-and-puppet","title":"terraform_data.yaml: a bridge between Terraform and Puppet","text":"To provide information on the deployed resources and the value of the input parameters, Magic Castle Terraform code uploads to the Puppet main server a file named terraform_data.yaml
, in the folder /etc/puppetlabs/data/
. There is also a symlink created in /etc/puppetlabs/code/environment/production/data/
to ease its usage inside the Puppet environment.
When included in the data hierarchy (hiera.yaml
), terraform_data.yaml
can provide information about the instances, the volumes and the variables set by the user through the main.tf
file. The file has the following structure:
---\nterraform:\n data:\n cluster_name: \"\"\n domain_name: \"\"\n guest_passwd: \"\"\n nb_users: \"\"\n public_keys: []\n sudoer_username: \"\"\n instances:\n host1:\n hostkeys:\n rsa: \"\"\n ed25519: \"\"\n local_ip: \"x.x.x.x\"\n prefix: \"host\"\n public_ip: \"\"\n specs:\n \"cpus\": 0\n \"gpus\": 0\n \"ram\": 0\n tags:\n - \"tag_1\"\n - \"tag_2\"\n tag_ip:\n tag_1:\n - x.x.x.x\n tag_2:\n - x.x.x.x\n volumes:\n volume_tag1:\n volume_1:\n - \"/dev/disk/by-id/123-*\"\n volume_2:\n - \"/dev/disk/by-id/123-abc-*\"\n
The values provided by terraform_data.yaml
can be accessed in Puppet by using the lookup()
function. For example, to access an instance's list of tags:
lookup(\"terraform.instances.${::hostname}.tags\")\n
The data source can also be used to define a key in another data source YAML file by using the alias()
function. For example, to define the number of guest accounts using the value of nb_users
, we could add this to common.yaml
profile::accounts::guests::nb_accounts: \"%{alias('terraform.data.nb_users')}\"\n
"},{"location":"developers/#configuring-instances-sitepp-and-classes","title":"Configuring instances: site.pp and classes","text":"The configuration of each instance is defined in manifests/site.pp
file of the Puppet environment. In this file, it is possible to define a configuration based on an instance hostname
node \"mgmt1\" { }\n
or using the instance tags by defining the configuration for the default
node : node default {\n $instance_tags = lookup(\"terraform.instances.${::hostname}.tags\")\n if 'tag_1' in $instances_tags { }\n}\n
It is possible to define Puppet resource directly in site.pp
. However, above a certain level of complexity, which can be reach fairly quickly, it is preferable to define classes and include these classes in site.pp
based on the node hostname or tags.
Classes can be defined in the Puppet environment under the following path: site/profile/manifests
. These classes are named profile classes and the philosophy behind it is explained in Puppet documentation. Because these classes are defined in site/profile
, their name has to start with the prefix profile::
.
It is also possible to include classes defined externally and installed using the Puppetfile
. These classes installed by r10k can be found in the modules
folder of the Puppet environment.
To test new additions to puppet.yaml
, it is possible to execute cloud-init phases manually. There are four steps that can be executed sequentially: init local, init modules config and modules final. Here are the corresponding commands to execute each step:
cloud-init init --local\ncloud-init init\ncloud-init modules --mode=config\ncloud-init modules --mode=final\n
It is also possible to clean a cloud-init execution and have it execute again at next reboot. To do so, enter the following command:
cloud-init clean\n
Add -r
to the previous command to reboot the instance once cloud-init has finishing cleaning."},{"location":"developers/#42-selinux","title":"4.2 SELinux","text":"SELinux is enabled on every instances of a Magic Castle cluster. Some applications do not provide SELinux policies which can lead to their malfunctionning when SELinux is enabled. It is possible to track down the reasons why SELinux is preventing an application to work properly using the command-line tool ausearch
.
If you suspect application app-a
to be denied by SELinux to work properly, run the following command as root:
ausearch -c app-a --raw | grep denied\n
To see all requests denied by SELinux:
ausearch --raw | grep denied\n
Sometime, the denials are hidden from regular logging. To display all denials, run the following command as root:
semodule --disable_dontaudit --build\n
then re-execute the application that is not working properly. Once you have found the denials that are the cause of the problem, you can create a new policy to allow the requests that were previously denied with the following command:
ausearch -c app-a --raw | grep denied | audit2allow -a -M app-a\n
Finally, you can install the generated policy using the command provided by auditallow
.
If you need to tweak an existing enforcement file and you want to recompile the policy package, you can with the following commands:
checkmodule -M -m -o my_policy.mod my_policy.te\nsemodule_package -o my_policy.pp -m my_policy.mod\n
"},{"location":"developers/#references","title":"References","text":"To build a release, use the script release.sh
located at the root of Magic Castle git repo.
Usage: release.sh VERSION [provider ...]\n
The script creates a folder named releases
where it was called. The VERSION
argument is expected to correspond to git tag in the puppet-magic_castle
repo. It could also be a branch name or a commit. If the provider optional argument is left blank, release files will be built for all providers currently supported by Magic Castle.
Examples:
$ ./release.sh main openstack\n
$ ./release.sh 5.8 gcp
$ ./release.sh 5.7 azure ovh\n
* The documentation provides instructions on how to add support for other cloud providers.
"},{"location":"matrix/#supported-operating-systems","title":"Supported operating systems","text":"Name CentOS 7 CentOS 8 Rocky Linux 8 AlmaLinux 8 Debian 10 Ubuntu 18 Ubuntu 20 Windows 10 AWS ParallelCluster yes yes yes yes yes no yes no Azure CycleCloud yes yes yes yes yes no yes - Azure HPC On-Demand Platform yes no no yes no yes no yes Google HPC-Toolkit yes no no no no no no no Cluster in the Cloud no yes no no no no no no ElastiCluster yes yes yes yes no no no no Magic Castle no yes yes yes no no no no On-Demand Data Centre - - - - - - - - Slurm on GCP yes no yes no yes no yes no"},{"location":"matrix/#supported-job-schedulers","title":"Supported job schedulers","text":"Name AwsBatch Grid Engine HTCondor Moab Open PBS PBS Pro Slurm AWS ParallelCluster yes no no no no no yes Azure CycleCloud no yes yes no no yes yes Azure HPC On-Demand Platform no no no no yes no yes Google HPC-Toolkit no no no no no no yes Cluster in the Cloud no no no no no no yes ElastiCluster no yes no no no no yes Magic Castle no no no no no no yes On-Demand Data Centre no no no yes no no no Slurm on GCP no no no no no no yes"},{"location":"matrix/#technologies","title":"Technologies","text":"Name Infrastructure configuration Programming languages Configuration management Scientific software AWS ParallelCluster CLI generating YAML Python Chef Spack Azure CycleCloud WebUI or CLI + templates Python Chef Bring your own Azure HPC On-Demand Platform YAML files + shell scripts Shell, Terraform Ansible, Packer CVMFS Cluster in the Cloud CLI generating Terraform code Python, Terraform Ansible, Packer EESSI ElastiCluster CLI interpreting an INI file Python, Shell Ansible Bring your own Google HPC-Toolkit CLI generating Terraform code Go, Terraform Ansible, Packer Spack Magic Castle Terraform modules Terraform Puppet CC-CVMFS, EESSI On-Demand Data Centre - - - - Slurm GCP Terraform modules Terraform Ansible, Packer Spack"},{"location":"sequence/","title":"Magic Castle Sequence Diagrams","text":"The following sequence diagrams illustrate the inner working of Magic Castle once terraform apply
is called. Some details were left out of the diagrams, but every diagram is followed by references to the code files that were used to build it.
puppet-magic_castle.git
does not have to refer to ComputeCanada/puppet-magic_castle.git
repo. Users can use their own fork. See the developer documentation for more details.magic_castle:/common/design/main.tf
magic_castle:/openstack/network-1.tf
magic_castle:/openstack/network-2.tf
magic_castle:/common/configuration/main.tf
magic_castle:/openstack/infrastructure.tf
magic_castle:/common/provision/main.tf
magic_castle:/dns/cloudflare/main.tf
config_git_url repo
does not have to refer to ComputeCanada/puppet-magic_castle.git
repo. Users can use their own fork. See the developer documentation for more details.magic_castle:/common/configuration/puppet.yaml
puppet-magic_castle:/manifests/site.pp
puppet-magic_castle:/profile/manifests/base.pp
puppet-magic_castle:/profile/manifests/consul.pp
puppet-magic_castle:/profile/manifests/freeipa.pp
puppet-magic_castle:/profile/manifests/consul.pp
puppet-magic_castle:/profile/manifests/cvmfs.pp
puppet-magic_castle:/profile/manifests/slurm.pp
This document explains how to use Magic Castle with Terraform Cloud.
"},{"location":"terraform_cloud/#what-is-terraform-cloud","title":"What is Terraform Cloud?","text":"Terraform Cloud is HashiCorp\u2019s managed service that allows to provision infrastructure using a web browser or a REST API instead of the command-line. This also means that the provisioned infrastructure parameters can be modified by a team and the state is stored in the cloud instead of a local machine.
When provisioning in a commercial cloud, Terraform Cloud can also provide a cost estimate of the resources.
"},{"location":"terraform_cloud/#getting-started-with-terraform-cloud","title":"Getting started with Terraform Cloud","text":"main.tf
available for the cloud of your choicemain.tf
You will be redirected automatically to your new workspace.
"},{"location":"terraform_cloud/#providing-cloud-provider-credentials-to-terraform-cloud","title":"Providing cloud provider credentials to Terraform Cloud","text":"Terraform Cloud will invoke Terraform command-line in a remote virtual environment. For the CLI to be able to communicate with your cloud provider API, we need to define environment variables that Terraform will use to authenticate. The next sections explain which environment variables to define for each cloud provider and how to retrieve the values of the variable from the provider.
If you plan on using these environment variables with multiple workspaces, it is recommended to create a credential variable set in Terraform Cloud.
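A credential variable set can be created through the Terraform Cloud web interface, or, as a sketch, with the tfe Terraform provider. The resource and argument names below are assumptions to verify against the tfe provider documentation:
resource \"tfe_variable_set\" \"cloud_creds\" {\n name = \"cloud-provider-credentials\"\n organization = \"my-org\"\n}\n\nresource \"tfe_variable\" \"aws_access_key_id\" {\n key = \"AWS_ACCESS_KEY_ID\"\n value = \"an-access-key\"\n category = \"env\"\n sensitive = true\n variable_set_id = tfe_variable_set.cloud_creds.id\n}\n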
"},{"location":"terraform_cloud/#aws","title":"AWS","text":"You need to define these environment variables: - AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
(sensitive)
The values of these variables can either correspond to an access key created on the AWS Security Credentials - Access keys page, or you can add a user dedicated to Terraform Cloud in AWS IAM Users and use its access key.
"},{"location":"terraform_cloud/#azure","title":"Azure","text":"You need to define these environment variables: - ARM_CLIENT_ID
- ARM_CLIENT_SECRET
(sensitive) - ARM_SUBSCRIPTION_ID
- ARM_TENANT_ID
Refer to Terraform Azure Provider - Creating a Service Principal to know how to create a Service Principal and retrieve the values for these environment variables.
"},{"location":"terraform_cloud/#google-cloud","title":"Google Cloud","text":"You need to define this environment variable: - GOOGLE_CLOUD_KEYFILE_JSON
(sensitive)
The value of the variable will be the content of a Google Cloud service account JSON key file expressed as a single-line string. Example:
{\"type\": \"service_account\",\"project_id\": \"project-id-1234\",\"private_key_id\": \"abcd1234\",...}\n
You can use jq
to format the string from the JSON file provided by Google:
jq . -c project-name-123456-abcdefjg.json\n
"},{"location":"terraform_cloud/#openstack-ovh","title":"OpenStack / OVH","text":"You need to define these environment variables: - OS_AUTH_URL
- OS_PROJECT_ID
- OS_REGION_NAME
- OS_INTERFACE
- OS_IDENTITY_API_VERSION
- OS_USER_DOMAIN_NAME
- OS_USERNAME
- OS_PASSWORD
(sensitive)
Apart from OS_PASSWORD
, the values for these variables are available in the OpenStack RC file provided for your project.
If you prefer to use OpenStack application credentials, you need to define at least these variables: - OS_AUTH_TYPE
\u00a0 - OS_AUTH_URL
- OS_APPLICATION_CREDENTIAL_ID
- OS_APPLICATION_CREDENTIAL_SECRET
and potentially these too: - OS_IDENTITY_API_VERSION
\u00a0 - OS_REGION_NAME
- OS_INTERFACE
The values for these variables are available in the OpenStack RC file provided when creating the application credentials.
"},{"location":"terraform_cloud/#providing-dns-provider-credentials-to-terraform-cloud","title":"Providing DNS provider credentials to Terraform Cloud","text":"Terraform Cloud will invoke Terraform command-line in a remote virtual environment. For the CLI to be able to communicate with your DNS provider API, we need to define environment variables that Terraform will use to authenticate. The next sections explain which environment variables to define for each DNS provider and how to retrieve the values of the variable from the provider.
"},{"location":"terraform_cloud/#cloudflare","title":"CloudFlare","text":"Refer to DNS - CloudFlare section of Magic Castle main documentation to determine which environment variables needs to be set.
"},{"location":"terraform_cloud/#google-cloud-dns","title":"Google Cloud DNS","text":"Refer to DNS - Google Cloud section of Magic Castle main documentation to determine which environment variables needs to be set.
"},{"location":"terraform_cloud/#managing-magic-castle-variables-with-terraform-cloud-ui","title":"Managing Magic Castle variables with Terraform Cloud UI","text":"It is possible to use Terraform Cloud web interface to define variable values in your main.tf
. For example, you may want to define a guest password without writing it directly in main.tf
 to avoid displaying it publicly.
To manage a variable with Terraform Cloud: 1. edit your main.tf
to define the variables you want to manage. In the following example, we want to manage the number of nodes and the guest password.
Add the variables at the beginning of main.tf:\n variable \"nb_nodes\" {}\n variable \"password\" {}\n\nThen replace the static values with the variables in main.tf.\n\nCompute node count:\n node = { type = \"p2-3gb\", tags = [\"node\"], count = var.nb_nodes }\n\nGuest password:\n guest_passwd = var.password\n
main.tf
. Check \"Sensitive\" if the variable content should never be shown in the UI or the API.You may edit the variables at any point of your cluster's lifetime.
"},{"location":"terraform_cloud/#applying-changes","title":"Applying changes","text":"To create your cluster, apply changes made to your main.tf
or the variables, you will need to queue a plan. When you push to the default branch of the linked git repository, a plan will be automatically created. You can also create a plan manually. To do so, click on the \"Queue plan manually\" button inside your workspace, then \"Queue plan\".
Once the plan has been successfully created, you can apply it using the \"Runs\" section. Click on the latest queued plan, then on the \"Apply plan\" button at the bottom of the plan page.
"},{"location":"terraform_cloud/#auto-apply","title":"Auto apply","text":"It is possible to apply automatically a successful plan. Go in the \"Settings\" section, and under \"Apply method\" select \"Auto apply\". Any following successful plan will then be automatically applied.
"},{"location":"terraform_cloud/#magic-castle-terraform-cloud-and-the-cli","title":"Magic Castle, Terraform Cloud and the CLI","text":"Terraform cloud only allows to apply or destroy the plan as stated in the main.tf, but sometimes it can be useful to run some other terraform commands that are only available through the command-line interface, for example terraform taint
.
It is possible to import the Terraform state of a cluster onto your local computer and then use the CLI on it.
Log in to Terraform Cloud:
terraform login\n
Create a folder where the terraform state will be stored:
mkdir my-cluster-1\n
Create a file named cloud.tf
with the following content in your cluster folder:
terraform {\n cloud {\n organization = \"REPLACE-BY-YOUR-TF-CLOUD-ORG\"\n workspaces {\n name = \"REPLACE-BY-THE-NAME-OF-YOUR-WORKSPACE\"\n }\n }\n}\n
replace the values of organization
and name
 with the appropriate values for your cluster. Initialize the folder and retrieve the state:
terraform init\n
To confirm the workspace has been properly imported locally, you can list the resources using:
terraform state list\n
"},{"location":"terraform_cloud/#enable-magic-castle-autoscaling","title":"Enable Magic Castle Autoscaling","text":"Magic Castle in combination with Terraform Cloud (TFE) can be configured to give Slurm the ability to create and destroy instances based on the job queue content.
To enable this feature: 1. Create a TFE API Token and save it somewhere safe.
1.1. If you subscribe to the Terraform Cloud Team & Governance plan, you can generate\na [Team API Token](https://www.terraform.io/cloud-docs/users-teams-organizations/api-tokens#team-api-tokens).\nThe team associated with this token requires no access to the organization and can be secret.\nIt does not have to include any member. A Team API token is preferable as its permissions can be\nrestricted to the minimum required for autoscaling purposes.\n
Create a workspace in TFE
2.1. Make sure the repo is private as it will contain the API token.
2.2. If you generated a Team API Token in 1, provide access to the workspace to the team:
2.3 In Configure settings, under Advanced options, for Apply method, select Auto apply.
Create the environment variables of the cloud provider credentials in TFE
pool
in TFE. Set value to []
and check HCL.data.yaml
 in your git repo with the following content: ---\nprofile::slurm::controller::tfe_token: <TFE API token>\nprofile::slurm::controller::tfe_workspace: <TFE workspace id>\n
Complete the file by replacing <TFE API TOKEN>
with the token generated at step 1 and <TFE workspace id>
(i.e.: ws-...
 ) with the id of the workspace created at step 2. It is recommended to encrypt the TFE API token before committing data.yaml
in git. Refer to section 4.15 of README.md to know how to encrypt the token.data.yaml
in git and push.Modify main.tf
:
main.tf
.variable \"pool\" { description = \"Slurm pool of compute nodes\" }\n
instances
with the tags pool
and node
. These are the nodes that Slurm will be able to create and destroy (a consolidated main.tf sketch follows these steps).pool = var.pool\n
public_keys =
, replace [file(\"~/.ssh/id_rsa.pub\")]
by a list of SSH public keys that will have admin access to the cluster.public_keys = ...
, add hieradata = file(\"data.yaml\")
.Go to your workspace in TFE, click on Actions -> Start a new run -> Plan and apply -> Start run. Then, click on \"Confirm & Apply\" and \"Confirm Plan\".
To reduce the time required for compute nodes to become available in Slurm, consider creating a compute node image.
JupyterHub will time out by default after 300 seconds if a node is not spawned yet. Since it may take longer than this to spawn a node, even with an image created, consider increasing the timeout by adding the following to your YAML configuration file:
jupyterhub::jupyterhub_config_hash:\n SlurmFormSpawner:\n start_timeout: 900\n
Slurm 23 adds the possibility for sinfo
to report nodes that are not yet spawned. This is useful if you want JupyterHub to be aware of those nodes, for example if you want to allow to use GPU nodes without keeping them online at all time. To use that version of Slurm, add the following to your YAML configuration file:
profile::slurm::base::slurm_version: '23.02'\n
"},{"location":"terraform_cloud/#troubleshoot-autoscaling-with-terraform-cloud","title":"Troubleshoot autoscaling with Terraform Cloud","text":"If after enabling autoscaling with Terraform Cloud for your Magic Castle cluster, the number of nodes does not increase when submitting jobs, verify the following points:
profile::slurm::controller::tfe_workspace
in data.yaml
.squeue
on the cluster, and verify the reasons why jobs are still in the queue. If under the column (Reason)
, there is the keyword ReqNodeNotAvail
, it implies Slurm tried to boot the listed nodes, but they would not show up before the timeout, therefore Slurm marked them as down. It can happen if your cloud provider is slow to build the instances, or following a configuration problem like in 2. When Slurm marks a node as down, a trace is left in slurmctld's log - using zgrep on the slurm controller node (typically mgmt1
): sudo zgrep \"marking down\" /var/log/slurm/slurmctld.log*\n
To tell Slurm these nodes are available again, enter the following command: sudo /opt/software/slurm/bin/scontrol update nodename=node[Y-Z] state=IDLE\n
Replace node[Y-Z]
by the hostname range listed next to ReqNodeNotAvail
in squeue
.mgmt1:/var/log/slurm
, look for errors in the file slurm_resume.log
.