This repository contains the configuration for deploying 2 Slurm clusters inside a single Nectar project/tenancy.
The National GPU Cluster is made up of two separate Slurm clusters, one based at Monash University and one at the University of Queensland.
The cluster you use depends on where your data is located. Globus has been (or will be) installed to enable movement of your data to the compute.
This service has been configured to provide national access to compute running CryoSPARC (www.cryosparc.com). To gain access, a user needs to obtain a license key.
For instructions on user provisioning and Strudel2 configuration, please refer to 'USER_PROVISIONING.md'.
Required NecTAR resources - www.nectar.org.au
To deploy this cluster, you require a Nectar project/tenancy with a minimum amount of hardware. A sketch for checking your project quota follows the lists below.
Suggested machine flavours:
- t3.xsmall - bastion host
- t3.medium - login hosts, ideally 2 of these
- t3.medium - SQL host
- large machines - compute nodes. The quantity depends on your requirements
Volume Storage:
- the scripts create 2 volumes; the total size could be quite large depending on your requirements
Advanced Networking:
- networks: 1 per cluster
- routers: 1 per cluster
- floating IPs: 1 per cluster
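Before deploying anything, it may help to confirm that the project quota covers the resources above. A minimal sketch, assuming the OpenStack CLI is installed and your RC file has been sourced (both are covered in later steps):
# Show the current project's quotas (instances, cores, RAM, volumes, networks, floating IPs)
openstack quota show
# List the flavours matching the suggested t3 sizes
openstack flavor list | grep -i t3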
HPCasCode is a repository of Ansible scripts and a CI/CD pipeline used to build the clusters maintained by the Monash eResearch Centre HPC team (M3 and MonARCH). Learn more in the README in the git repository (https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode).
- Log into your preferred repository:
- Gitlab: https://gitlab.erc.monash.edu.au/
- GitHub: https://github.com/
- Create a new repository where appropriate.
- Clone the repository you just created to your local machine and change into the directory, e.g.
cd national-gpu-cluster
- Add the HPCasCode repository as a submodule (more on submodules: https://www.adoclib.com/blog/git-fork-vs-submodules-vs-subtree.html).
git submodule add [email protected]:hpc-team/HPCasCode.git
- Commit the changes you just made and push them.
git add HPCasCode
git commit -m "adding submodule"
git push origin
- You'll need to create some folders and some symlinks to the HPCasCode submodule, and copy some files across too. Choose an appropriate cluster name, e.g. qcif or monash.
export CLUSTER=clusterName
mkdir files_$CLUSTER
mkdir vars_$CLUSTER
mkdir plays
mkdir ansible-ops
cd plays
# We can't symlink the plays directory itself; each playbook file needs its own link.
ln -s ../HPCasCode/CICD/plays/*.yml .
cd ../
ln -s ./HPCasCode/CICD/master_playbook.yml .
ln -s ./HPCasCode/CICD/pre_templates .
cp ./HPCasCode/CICD/ansible.cfg ./ansible-ops/ansible_$CLUSTER.cfg
.. note::
   Edit ./ansible-ops/ansible_$CLUSTER.cfg.
   Before:
       [ssh_connection]
       ssh_args = -F ./ssh.cfg
   After:
       [ssh_connection]
       ssh_args = -F ./ansible-ops/ssh_clusterName.cfg
   e.g. ssh_args = -F ./ansible-ops/ssh_qcif.cfg
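If you prefer to script that edit, it can be done with sed. This is only a sketch, assuming GNU sed and a single ssh_args line in the file; review the result before relying on it:
# Point ssh_args at the per-cluster ssh config instead of ./ssh.cfg (GNU sed in-place edit)
sed -i "s|^ssh_args = -F ./ssh.cfg|ssh_args = -F ./ansible-ops/ssh_${CLUSTER}.cfg|" ./ansible-ops/ansible_${CLUSTER}.cfg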
cp HPCasCode/CICD/infra/os_create.yml os_create_$CLUSTER.yml
cp HPCasCode/CICD/infra/os_delete.yml os_delete_$CLUSTER.yml
cp HPCasCode/CICD/infra/os_vars.yml.j2 os_vars_$CLUSTER.yml
- Set up an SSH key for managing the cluster.
  a. You can use the existing keys by copying the cert-authority line in /local_home/ec2-user/.ssh/authorized_keys on m3-login1.
  b. For MLeRP we created new keys and stored them in /local_home/ec2-user/mlerp_keys on m3-login1. If you create your own, just make sure you don't push them to git or anything crazy (and ask for help if you're unsure what "crazy" means).
  c. Copy this line to use when editing the os_vars_$CLUSTER.yml file.
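If you go with option b and create new keys, the sketch below shows one way to do it with ssh-keygen; the adminca filename is only an assumption based on the MLeRP example, and the private key must be kept out of git:
# Generate a new CA key pair for signing SSH certificates (assumed filename)
ssh-keygen -t ed25519 -f adminca -C "cluster admin CA"
# The cert-authority line to copy into os_vars_$CLUSTER.yml is then of the form:
#   cert-authority <contents of adminca.pub>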
- Activate a Python environment with ansible, python-openstackclient, python-novaclient, joblib and ssossh installed, e.g.
python3 -m venv python_venv
source python_venv/bin/activate
pip install ansible python-openstackclient python-novaclient joblib ssossh
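As a quick sanity check that the environment has everything it needs (versions will differ from site to site):
# Both commands should run without errors inside the venv
ansible --version
openstack --version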
- Now you'll need to edit the os_vars_$CLUSTER.yml file, paying particular attention to:
  a. clustername: if the cluster is to be deployed at multiple Nectar sites, consider including the site name, as this makes identifying instances easier, e.g. gerp-qcif, gerp-monash.
.. note::
   If you need to separate words, use a '-', not a '_', as underscores cause issues with the ansible scripts when setting up /etc/hosts on the nodes.
  b. ssh public key: from step 7; MLeRP uses /home/ec2-user/mlerp_keys/adminca.pub
  c. Ext-network: gerp-qcif uses QRIScloud. To determine the networks available to you, run the command below (a filtering sketch also follows this list):
openstack network list
  d. Availability zone: gerp-qcif uses QRIScloud. To determine the availability zones, run:
openstack availability zone list
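If the network list is long, filtering to external networks usually narrows down the candidates for Ext-network. This is a hedged aside; the --external flag depends on your python-openstackclient version:
# Show only external (provider) networks, the usual choice for Ext-network
openstack network list --external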
- Edit os_create_$CLUSTER.yml for the virtual machine flavours you'll use and adjust the storage sizes accordingly. If required, also adjust the size of the storage volumes to be created. You can find the available flavours with:
openstack flavor list
- Ensure you have access to a Nectar tenancy and the OpenStack RC file, and that you have sourced the file before running the following commands (see https://tutorials.rc.nectar.org.au/openstack-cli/04-credentials).
source National_CryoSPARC_Service-openrc.sh
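To confirm the RC file was sourced correctly and your credentials work, requesting a token is a simple test:
# Should print a token and project ID rather than an authentication error
openstack token issue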
- Edit the file 'setcluster_gerp.sh', adding support for your selected clusterName, e.g. 'qcif' or 'monash'.
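The structure of setcluster_gerp.sh is specific to this repository, so the snippet below is only a sketch of what the addition might look like if the script switches on its first argument; adapt it to whatever is already there:
# Hypothetical branch for an extra clusterName (adjust to the script's real layout)
case "$1" in
  qcif|monash|yourcluster)
    export CLUSTER="$1"
    ;;
  *)
    echo "Unknown cluster: $1" >&2
    ;;
esac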
- Source setcluster_gerp.sh to configure your bash session ready for deployment of your cluster. This will set up the symbolic links required to support a multi-cluster deployment from this repository.
source setcluster_gerp.sh qcif
- Run the ansible playbook to create your instances:
ansible-playbook os_create_$CLUSTER.yml
- Create the inventory file, which will contain a list of all the VMs you just created (the second line shows the concrete form used for gerp-qcif):
python HPCasCode/scripts/make_inventory.py > inventory_$CLUSTER.yml
python HPCasCode/scripts/make_inventory.py gerp-qcif > inventory_$CLUSTER.yml
- Create your own copy of the versions.yml file.
cp ./HPCasCode/CICD/vars/versions.yml ./vars_$CLUSTER/versions.yml
- Using the recently created inventory_$CLUSTER.yml file, create the remaining files (again, the second line shows the concrete gerp-qcif form):
python HPCasCode/CICD/make_files.py inventory_$CLUSTER.yml os_vars_$CLUSTER.yml ./vars_$CLUSTER/versions.yml
python HPCasCode/CICD/make_files.py inventory_qcif.yml os_vars_qcif.yml vars_qcif/versions.yml
- Create a customised ssh.cfg file.
mv ssh.cfg ansible-ops/ssh_$CLUSTER.cfg
.. note::
   Edit ansible-ops/ssh_$CLUSTER.cfg and update 'UserKnownHostsFile' to the per-cluster path, e.g. UserKnownHostsFile ./known_hosts_qcif
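As with ansible.cfg, this edit can be scripted; a sketch assuming GNU sed and a single UserKnownHostsFile entry, so check the file afterwards:
# Point UserKnownHostsFile at a per-cluster known_hosts file (GNU sed in-place edit)
sed -i "s|^[[:space:]]*UserKnownHostsFile .*|UserKnownHostsFile ./known_hosts_${CLUSTER}|" ansible-ops/ssh_${CLUSTER}.cfg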
- Use ansible-vault to encrypt the passwords file. You will need to choose a strong password for this cluster.
ansible-vault encrypt vars/passwords.yml
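Once encrypted, the file can still be inspected or changed without leaving a decrypted copy on disk, for example:
# View the encrypted passwords (prompts for the vault password)
ansible-vault view vars/passwords.yml
# Edit in place, re-encrypting on save
ansible-vault edit vars/passwords.yml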
- Use ssossh to authenticate and log in. This part may be optional depending on the key used in step 7.
ssossh
- ssh into the bastion node to ensure everything works as expected. Open the file inventory_$CLUSTER.yml to obtain the allocated IP.
ssh -F ansible-ops/ssh_$CLUSTER.cfg ec2-user@<bastion-ip>
- Ping all your instances to check you can access those too.
ansible -i inventory_$CLUSTER.yml -m ping all
- Complete the symbolic links to the vars and files folders.
cd plays ; ln -s ../vars ; ln -s ../files
- Deploy your cluster. Enjoy!!
ansible-playbook -i inventory_$CLUSTER.yml ./master_playbook.yml
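If vault-encrypted variables (such as vars/passwords.yml) are loaded by the playbook, ansible-playbook will also need the vault password. Whether this applies depends on how the vars are wired in, so treat this as a hedged variant:
# Same deployment, prompting for the vault password
ansible-playbook -i inventory_$CLUSTER.yml ./master_playbook.yml --ask-vault-pass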
Git add, git commit and git push the work you’ve just done to the remote repository!
- Change to the folder of your repository.
cd national-gpu-cluster
- Activate your python virtual environment.
source python_venv/bin/activate
- Source your Nectar Openstack RC file.
source National_CryoSPARC_Service-openrc.sh
- Set your cluster. e.g.
source setcluster_gerp.sh qcif
Now you are set up and ready to run your Ansible scripts.
- Clone this repository
git clone --recurse-submodules https://gitlab.erc.monash.edu.au/hpc-team/national-gpu-cluster.git
- Change to the folder of your repository.
cd national-gpu-cluster
- Create a python virtual environment.
python3 -m venv python_venv
- Activate your python virtual environment.
source python_venv/bin/activate
- Install python packages
pip install ansible python-openstackclient python-novaclient joblib ssossh
- Source your Nectar Openstack RC file.
source National_CryoSPARC_Service-openrc.sh
- Set your cluster to 'monash' or 'qcif'
source setcluster_gerp.sh qcif
source setcluster_gerp.sh monash
- Sign in as administrator
ssossh
Now you are ready to run Ansible to maintain your cluster.
### Warning: this is destructive. All storage and virtual machines will be deleted/destroyed.
- Change to the folder of your repository.
cd national-gpu-cluster
- Activate your python virtual environment.
source python_venv/bin/activate
- Source your Nectar Openstack RC file.
source National_CryoSPARC_Service-openrc.sh
- Set your cluster. e.g.
source setcluster_gerp.sh qcif
- Destroy the cluster.
ansible-playbook os_delete_$CLUSTER.yml
- Recreate your instances:
ansible-playbook os_create_$CLUSTER.yml
- Recreate the inventory file, which will contain a list of all the VMs you just created (the second line shows the concrete gerp-qcif form):
python HPCasCode/scripts/make_inventory.py > inventory_$CLUSTER.yml
python HPCasCode/scripts/make_inventory.py gerp-qcif > inventory_$CLUSTER.yml
- Recreate the remaining files. The existing passwords file will need to be deleted.
rm vars/passwords.yml
python HPCasCode/CICD/make_files.py inventory_$CLUSTER.yml os_vars_$CLUSTER.yml ./vars_$CLUSTER/versions.yml
python HPCasCode/CICD/make_files.py inventory_qcif.yml os_vars_qcif.yml vars_qcif/versions.yml
- Recreate the customised ssh.cfg file.
vi ansible-ops/ssh_$CLUSTER.cfg
.. note::
   Edit ansible-ops/ssh_$CLUSTER.cfg and update 'UserKnownHostsFile' to the per-cluster path, e.g. UserKnownHostsFile ./known_hosts_qcif. A backup file containing the previous settings may be found at 'ansible-ops/ssh_$CLUSTER-BAK.cfg'.
- Delete the existing 'known_hosts' file, as stale entries will cause SSH warnings when the new cluster is deployed.
rm known_hosts_$CLUSTER
- The GERP cluster has some special settings required for deployment that are overwritten by step 8. Use the '-BAK' files to restore these special settings.
cp ansible-ops/gres-BAK.conf files/gres.conf
cp ansible-ops/job_container-BAK-$CLUSTER.conf files/job_container.conf
cp ansible-ops/mig_config-BAK.yml files/mig_config.yml
cp ansible-ops/slurm-BAK-$CLUSTER.conf files/slurm.conf
- Use ansible-vault to encrypt the passwords file. You will need to choose a strong password for this cluster.
ansible-vault encrypt vars/passwords.yml
- Use ssossh to authenticate and log in. This part may be optional depending on the key used in step 7 when setting up the cluster originally.
ssossh
- ssh into the bastion node to ensure everything works as expected. Open the file inventory_$CLUSTER.yml to obtain the allocated IP.
ssh -F ansible-ops/ssh_$CLUSTER.cfg ec2-user@<bastion-ip>
- Ping all your instances to check you can access those too.
ansible -i inventory_$CLUSTER.yml -m ping all
- Redeploy your cluster. Enjoy!!
ansible-playbook -i inventory_$CLUSTER.yml ./master_playbook.yml
Git add, git commit and git push the work you’ve just done to the remote repository!
Ensure the requirements are set up to run CryoSPARC on the cluster:
ansible-playbook -i inventory_$CLUSTER.yml ./cryosparc.yml
Globus was manually configured. Please refer to the files for details:
- Globus Install GERP MONASH.txt
- Globus Install GERP QCIF.txt
ansible-playbook -i inventory_$CLUSTER.yml
After updating node settings, Slurm may need to be restarted:
ansible all -i inventory_$CLUSTER.yml -l ComputeNodes -m shell -a "systemctl restart slurmd" --become
ansible all -i inventory_$CLUSTER.yml -l LoginNodes -m shell -a "systemctl restart slurmctld" --become
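After the restarts it is worth confirming the nodes have rejoined the cluster; assuming sinfo is on the PATH on the login nodes, a quick check could be:
# Nodes should report as idle/alloc rather than down after a successful restart
ansible all -i inventory_$CLUSTER.yml -l LoginNodes -m shell -a "sinfo" --become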
Destroy all virtual machines and volume storage to redeploy the whole slurm cluster.
ansible-playbook os_delete_$CLUSTER.yml