Fix OFED clash with OHPC #465

Draft: wants to merge 23 commits into base: main
59 changes: 31 additions & 28 deletions .github/workflows/fatimage.yml
@@ -15,37 +15,26 @@ jobs:
   openstack:
     name: openstack-imagebuild
     concurrency:
-      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.os_version }}-${{ matrix.build }} # to branch/PR + OS + build
+      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.builds.label }}
       cancel-in-progress: true
     runs-on: ubuntu-22.04
     strategy:
       fail-fast: false # allow other matrix jobs to continue even if one fails
       matrix: # build RL8+OFED, RL9+OFED, RL9+OFED+CUDA versions
-        os_version:
-          - RL8
-          - RL9
-        build:
-          - openstack.openhpc
-          - openstack.openhpc-cuda
-        exclude:
-          - os_version: RL8
-            build: openstack.openhpc-cuda
+        builds:
+          - label: openhpc-RL8-ofed
+            source_image_name: RL8-ofed
+            inventory_groups: 'control,login,compute'
+          - label: openhpc-RL9-ofed
+            source_image_name: RL9-ofed
+            inventory_groups: 'control,login,compute'
+          - label: openhpc-RL9-cuda
+            source_image_name: RL9-cuda
+            inventory_groups: 'control,login,compute'
     env:
       ANSIBLE_FORCE_COLOR: True
       OS_CLOUD: openstack
       CI_CLOUD: ${{ github.event.inputs.ci_cloud }}
-      SOURCE_IMAGES_MAP: |
-        {
-          "RL8": {
-            "openstack.openhpc": "rocky-latest-RL8",
-            "openstack.openhpc-cuda": "rocky-latest-cuda-RL8"
-          },
-          "RL9": {
-            "openstack.openhpc": "rocky-latest-RL9",
-            "openstack.openhpc-cuda": "rocky-latest-cuda-RL9"
-          }
-        }
-
     steps:
       - uses: actions/checkout@v2

@@ -79,6 +68,20 @@ jobs:
           . venv/bin/activate
           . environments/.stackhpc/activate

+      - name: Select branch-specific or latest nightly image
+        id: select_source_image
+        run: |
+          . venv/bin/activate
+          . environments/.stackhpc/activate
+          BRANCH=${{ github.ref_name }}
+          BRANCH_VERSION=${BRANCH//\//-} # replace '/' with '-' using bash parameter expansion
+          NIGHTLY_IMAGE_ID=$( \
+            openstack image show -c id -f value ${{ matrix.builds.source_image_name }}-${BRANCH_VERSION} || \
+            openstack image show -c id -f value ${{ matrix.builds.source_image_name }}-latest \
+          )
+          echo selected source_image $NIGHTLY_IMAGE_ID: $(openstack image show -c name -f value $NIGHTLY_IMAGE_ID)
+          echo "source_image_id=$NIGHTLY_IMAGE_ID" >> "$GITHUB_OUTPUT"
+
       - name: Build fat image with packer
         id: packer_build
         run: |
@@ -88,15 +91,15 @@ jobs:
           cd packer/
           packer init .

-          PACKER_LOG=1 packer build \
+          packer build \
             -on-error=${{ vars.PACKER_ON_ERROR }} \
-            -only=${{ matrix.build }} \
             -var-file=$PKR_VAR_environment_root/${{ env.CI_CLOUD }}.pkrvars.hcl \
-            -var "source_image_name=${{ env.SOURCE_IMAGE }}" \
+            -var source_image=${{ steps.select_source_image.outputs.source_image_id }} \
+            -var image_name=${{ matrix.builds.label }} \
+            -var inventory_groups=${{ matrix.builds.inventory_groups }} \
             openstack.pkr.hcl
         env:
-          PKR_VAR_os_version: ${{ matrix.os_version }}
-          SOURCE_IMAGE: ${{ fromJSON(env.SOURCE_IMAGES_MAP)[matrix.os_version][matrix.build] }}
+          PACKER_LOG: '1'

       - name: Get created image names from manifest
         id: manifest
@@ -113,7 +116,7 @@ jobs:
       - name: Upload manifest artifact
         uses: actions/upload-artifact@v4
         with:
-          name: image-details-${{ matrix.build }}-${{ matrix.os_version }}
+          name: image-details-${{ matrix.builds.label }}
           path: |
             ./image-id.txt
             ./image-name.txt
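
The "Select branch-specific or latest nightly image" step in this workflow combines two ideas: turning a branch name into an image-name suffix by replacing `/` with `-`, and using `||` to fall back to the `-latest` image when no branch-specific image exists. A minimal sketch of that logic follows. It assumes a hypothetical `lookup_image_id` function standing in for `openstack image show -c id -f value NAME`, with invented image names and IDs, so that it can run without a cloud. It uses `tr` as a portable equivalent of the bash-only `${BRANCH//\//-}` expansion.

```shell
# Hypothetical stand-in for `openstack image show -c id -f value NAME`.
# The image names and IDs below are invented for illustration.
lookup_image_id() {
  case "$1" in
    RL9-ofed-latest)   echo "id-latest" ;;
    RL9-ofed-fix-ofed) echo "id-branch" ;;
    *)                 return 1 ;;  # image not found
  esac
}

# Replace every '/' in a branch name with '-'.
# Portable equivalent of the bash-only ${BRANCH//\//-} used in the workflow.
branch_version() {
  printf '%s\n' "$1" | tr '/' '-'
}

# Prefer the branch-specific nightly image, else fall back to <base>-latest.
select_source_image() {
  lookup_image_id "$1-$(branch_version "$2")" || lookup_image_id "$1-latest"
}

select_source_image RL9-ofed fix/ofed   # a branch-specific image exists
select_source_image RL9-ofed main       # no such image, falls back to -latest
```

Because the fallback is the right-hand side of `||`, a failure of both lookups still yields a non-zero status, which in the real step would leave `NIGHTLY_IMAGE_ID` empty.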
82 changes: 39 additions & 43 deletions .github/workflows/nightlybuild.yml
@@ -1,3 +1,5 @@
+# NB: When run on a non-main branch (via workflow_dispatch), image scanning and distribution to other clouds do not happen,
+# on the basis that in this case a fatimage must be built and will be scanned.
 name: Build nightly image
 on:
   workflow_dispatch:
@@ -14,34 +16,30 @@ on:
     - cron: '0 0 * * *' # Run at midnight

 jobs:
-  openstack:
-    name: openstack-imagebuild
+  build:
+    name: nightly-imagebuild
     concurrency:
-      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.os_version }}-${{ matrix.build }} # to branch/PR + OS + build
+      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.builds.label }}
       cancel-in-progress: true
     runs-on: ubuntu-22.04
     strategy:
       fail-fast: false # allow other matrix jobs to continue even if one fails
-      matrix: # build RL8, RL9, RL9+CUDA versions
-        os_version:
-          - RL8
-          - RL9
-        build:
-          - openstack.rocky-latest
-          - openstack.rocky-latest-cuda
-        exclude:
-          - os_version: RL8
-            build: openstack.rocky-latest-cuda
-
+      matrix:
+        builds:
+          - label: RL8-ofed
+            source_image_name: Rocky-8-GenericCloud-Base-8.9-20231119.0.x86_64.qcow2
+            inventory_groups: 'update,ofed'
+          - label: RL9-ofed
+            source_image_name: Rocky-9-GenericCloud-Base-9.4-20240523.0.x86_64.qcow2
+            inventory_groups: 'update,ofed'
+          - label: RL9-cuda
+            source_image_name: Rocky-9-GenericCloud-Base-9.4-20240523.0.x86_64.qcow2
+            inventory_groups: 'update,ofed,cuda'
     env:
       ANSIBLE_FORCE_COLOR: True
       OS_CLOUD: openstack
       CI_CLOUD: ${{ github.event.inputs.ci_cloud || vars.CI_CLOUD }}
-      SOURCE_IMAGES_MAP: |
-        {
-          "RL8": "Rocky-8-GenericCloud-Base-8.9-20231119.0.x86_64.qcow2",
-          "RL9": "Rocky-9-GenericCloud-Base-9.4-20240523.0.x86_64.qcow2"
-        }
+      IMAGE_VERSION: ${{ github.event_name == 'schedule' && 'latest' || github.ref_name }}

     steps:
       - uses: actions/checkout@v2
@@ -85,18 +83,18 @@ jobs:
           cd packer/
           packer init .

-          PACKER_LOG=1 packer build \
+          packer build \
             -on-error=${{ vars.PACKER_ON_ERROR }} \
-            -only=${{ matrix.build }} \
             -var-file=$PKR_VAR_environment_root/${{ env.CI_CLOUD }}.pkrvars.hcl \
-            -var "source_image_name=${{ env.SOURCE_IMAGE }}" \
+            -var source_image_name=${{ matrix.builds.source_image_name }} \
+            -var image_name=${{ matrix.builds.label }} \
+            -var image_version=${{ env.IMAGE_VERSION }} \
+            -var inventory_groups=${{ matrix.builds.inventory_groups }} \
             openstack.pkr.hcl
-
         env:
-          PKR_VAR_os_version: ${{ matrix.os_version }}
-          SOURCE_IMAGE: ${{ fromJSON(env.SOURCE_IMAGES_MAP)[matrix.os_version] }}
+          PACKER_LOG: '1'

-      - name: Get created image names from manifest
+      - name: Get image info and ensure it can be used for subsequent builds
         id: manifest
         run: |
           . venv/bin/activate
@@ -105,8 +103,10 @@ jobs:
             sleep 5
           done
           IMAGE_NAME=$(openstack image show -f value -c name $IMAGE_ID)
+          echo image: ${IMAGE_NAME} ${IMAGE_ID}
           echo "image-name=${IMAGE_NAME}" >> "$GITHUB_OUTPUT"
           echo "image-id=$IMAGE_ID" >> "$GITHUB_OUTPUT"
+          openstack image unset --property signature_verified $IMAGE_ID

       - name: Delete old latest image
         run: |
@@ -122,9 +122,10 @@ jobs:

   upload:
     name: upload-nightly-targets
-    needs: openstack
+    needs: build
+    if: github.ref_name == 'main'
     concurrency:
-      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.os_version }}-${{ matrix.image }}-${{ matrix.target_cloud }}
+      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.builds.label }}-${{ matrix.target_cloud }}
       cancel-in-progress: true
     runs-on: ubuntu-22.04
     strategy:
@@ -134,21 +135,16 @@ jobs:
           - LEAFCLOUD
           - SMS
           - ARCUS
-        os_version:
-          - RL8
-          - RL9
-        image:
-          - rocky-latest
-          - rocky-latest-cuda
+        builds:
+          - image: RL8-ofed-latest
+          - image: RL9-ofed-latest
+          - image: RL9-cuda-latest
         exclude:
-          - os_version: RL8
-            image: rocky-latest-cuda
-          - target_cloud: LEAFCLOUD
+          - target_cloud: LEAFCLOUD # why?? Should this not be source_cloud/vars.CI_CLOUD
     env:
       OS_CLOUD: openstack
       SOURCE_CLOUD: ${{ github.event.inputs.ci_cloud || vars.CI_CLOUD }}
       TARGET_CLOUD: ${{ matrix.target_cloud }}
-      IMAGE_NAME: "${{ matrix.image }}-${{ matrix.os_version }}"
     steps:
       - uses: actions/checkout@v2

@@ -176,16 +172,16 @@ jobs:
         run: |
           . venv/bin/activate
           export OS_CLIENT_CONFIG_FILE=~/.config/openstack/source_clouds.yaml
-          openstack image save --file ${{ env.IMAGE_NAME }} ${{ env.IMAGE_NAME }}
+          openstack image save --file ${{ matrix.builds.image }} ${{ matrix.builds.image }}
         shell: bash

       - name: Upload to target cloud
         run: |
           . venv/bin/activate
           export OS_CLIENT_CONFIG_FILE=~/.config/openstack/target_clouds.yaml

-          openstack image create "${{ env.IMAGE_NAME }}" \
-            --file "${{ env.IMAGE_NAME }}" \
+          openstack image create "${{ matrix.builds.image }}" \
+            --file "${{ matrix.builds.image }}" \
             --disk-format qcow2 \
         shell: bash

@@ -194,9 +190,9 @@ jobs:
           . venv/bin/activate
           export OS_CLIENT_CONFIG_FILE=~/.config/openstack/target_clouds.yaml

-          IMAGE_COUNT=$(openstack image list --name ${{ env.IMAGE_NAME }} -f value -c ID | wc -l)
+          IMAGE_COUNT=$(openstack image list --name ${{ matrix.builds.image }} -f value -c ID | wc -l)
           if [ "$IMAGE_COUNT" -gt 1 ]; then
-            OLD_IMAGE_ID=$(openstack image list --sort created_at:asc --name "${{ matrix.builds.image }}" -f value -c ID | head -n 1)
+            OLD_IMAGE_ID=$(openstack image list --sort created_at:asc --name "${{ matrix.builds.image }}" -f value -c ID | head -n 1)
             openstack image delete "$OLD_IMAGE_ID"
           else
             echo "Only one image exists, skipping deletion."
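
The final step of the upload job keeps only the newest copy of an image on the target cloud: it counts the images with the given name and, if there is more than one, deletes the oldest. The selection logic can be sketched on its own as below. This is an illustration under stated assumptions, not the workflow's code: `image_to_delete` is a hypothetical helper, and it expects image IDs on stdin, oldest first, as `openstack image list --sort created_at:asc ... -f value -c ID` would print them.

```shell
# Hypothetical helper: reads image IDs on stdin, oldest first, and prints
# the ID that should be deleted. Prints nothing when zero or one image exists.
image_to_delete() {
  ids=$(cat)
  # Count non-empty lines; `|| true` keeps a zero count from being an error.
  count=$(printf '%s\n' "$ids" | grep -c . || true)
  if [ "$count" -gt 1 ]; then
    printf '%s\n' "$ids" | head -n 1
  fi
}

# Two images with the same name: the older one is selected for deletion.
printf 'old-id\nnew-id\n' | image_to_delete
```

Note that, like the workflow step, this removes only one image per run, so more than two same-named images would take several runs to clear.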
2 changes: 2 additions & 0 deletions ansible/.gitignore
@@ -60,3 +60,5 @@ roles/*
 !roles/tuned/**
 !roles/lustre/
 !roles/lustre/**
+!roles/builder/
+!roles/builder/**
10 changes: 8 additions & 2 deletions ansible/fatimage.yml
@@ -207,8 +207,14 @@
   gather_facts: yes
   tags: finalise
   tasks:
-    - name: Cleanup image
-      import_tasks: cleanup.yml
+    - name: Carry out checks on image
+      import_role:
+        name: builder
+        tasks_from: checks.yml
+    - name: Finalise image
+      import_role:
+        name: builder
+        tasks_from: finalise.yml

     - name: Shutdown Packer VM
       community.general.shutdown:
1 change: 1 addition & 0 deletions ansible/roles/builder/defaults/main.yml
@@ -0,0 +1 @@
+builder_delete_syslog: false
29 changes: 29 additions & 0 deletions ansible/roles/builder/tasks/checks.yml
@@ -0,0 +1,29 @@
+- name: Check whether OFED is installed
+  command: ofed_info
+  changed_when: false
+  failed_when:
+    - _ofed_info.rc > 0
+    - "'No such file or directory' not in _ofed_info.msg"
+  register: _ofed_info
+
+- name: Get package facts
+  package_facts:
+
+- name: Check e.g. libfabric package hasn't downgraded OFED-installed packages
+  assert:
+    that: "'mlnx' in ansible_facts.packages[item].0.version"
+    fail_msg: "OFED is installed but package {{ item }} has a non-OFED version: {{ ansible_facts.packages[item].0.version }}"
+  when: "'MLNX_OFED_LINUX-' in _ofed_info.stdout"
+  loop: "{{ builder_ofed_check_packages }}"
+  vars:
+    builder_ofed_check_packages:
+      - ibacm
+      - infiniband-diags
+      - libibumad
+      - libibverbs
+      - libibverbs-utils
+      - librdmacm
+      - librdmacm-utils
+      - rdma-core-devel
+      - rdma-core # didn't actually see this one get downgraded
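
The assertion in `checks.yml` treats a package as OFED-provided when its version string contains `mlnx`. The same test can be sketched as a standalone shell function, which may be handy for checking a built image by hand. This is an illustration only: `check_ofed_versions` is a hypothetical helper, and the version strings in the example are invented. On a real host the input pairs could be produced with something like `rpm -q --qf '%{NAME} %{VERSION}\n' ibacm libibverbs rdma-core`.

```shell
# Hypothetical helper: reads "package version" pairs on stdin, reports every
# package whose version does not contain "mlnx", and returns non-zero if
# any such package was found.
check_ofed_versions() {
  bad=0
  while read -r pkg version; do
    case "$version" in
      *mlnx*) ;;  # version looks OFED-built, nothing to report
      *) echo "$pkg has non-OFED version: $version"; bad=1 ;;
    esac
  done
  return "$bad"
}

# Invented sample input: one OFED-built package and one downgraded package.
printf 'ibacm 2307mlnx47\nlibibverbs 48.0\n' | check_ofed_versions || echo "check failed"
```

Unlike the Ansible task, this sketch does not first confirm that OFED is installed, so it is only meaningful on images where `ofed_info` reports an `MLNX_OFED_LINUX-` release.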

75 changes: 75 additions & 0 deletions ansible/roles/builder/tasks/finalise.yml
@@ -0,0 +1,75 @@
+# Finalise a Packer build VM
+
+- meta: flush_handlers
+
+- name: Remove dnf caches
+  command: dnf clean all
+
+# If image build happens on a Neutron subnet with property dns_nameservers defined, then cloud-init
+# disables NetworkManager's control of /etc/resolv.conf and appends nameservers itself.
+# We don't want network configuration during instance boot to depend on the configuration
+# of the network the builder was on, so we reset these aspects.
+- name: Delete /etc/resolv.conf
+  file:
+    path: /etc/resolv.conf
+    state: absent
+  when: "'resolv_conf' not in group_names" # if it's been overridden, deleting it is the wrong thing to do
+
+- name: Reenable NetworkManager control of resolv.conf
+  # NB: This *doesn't* delete the 90-dns-none.conf file created by the resolv_conf role
+  # as if nameservers are explicitly being set by that role we don't want to allow NM
+  # to override it again.
+  file:
+    path: /etc/NetworkManager/conf.d/99-cloud-init.conf
+    state: absent
+
+- name: Get remote environment for ansible_user
+  setup:
+    gather_subset: env
+  become: no
+
+- name: Delete any injected ssh config for ansible_user
+  file:
+    path: "{{ ansible_env.HOME }}/.ssh/"
+    state: absent
+
+- name: Run cloud-init cleanup
+  command: cloud-init clean --logs --seed
+
+- name: Cleanup /tmp
+  shell: rm -rf /tmp/* # shell rather than command: the command module does not expand globs
+
+- name: Get package facts
+  package_facts:
+
+- name: Ensure image summary directory exists
+  file:
+    path: /var/lib/image/
+    state: directory
+    owner: root
+    group: root
+    mode: u=rwX,go=rX
+
+- name: Write image summary
+  copy:
+    content: "{{ image_info | to_nice_json }}"
+    dest: /var/lib/image/image.json
+  vars:
+    image_info:
+      branch: "{{ lookup('pipe', 'git rev-parse --abbrev-ref HEAD') }}"
+      build: "{{ ansible_nodename | split('.') | first }}" # hostname is image name, which contains build info
+      os: "{{ ansible_distribution }} {{ ansible_distribution_version }}"
+      kernel: "{{ ansible_kernel }}"
+      ofed: "{{ ansible_facts.packages['mlnx-ofa_kernel'].0.version | default('-') }}"
+      cuda: "{{ ansible_facts.packages['cuda'].0.version | default('-') }}"
+      slurm-ohpc: "{{ ansible_facts.packages['slurm-ohpc'].0.version | default('-') }}"
+      ondemand: "{{ ansible_facts.packages['ondemand'].0.version | default('-') }}"
+
+- name: Clear system logs
+  file:
+    path: /var/log/messages
+    state: absent
+  when: builder_delete_syslog | bool
+
+- name: Shutdown Packer VM
+  community.general.shutdown:
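
Two small transformations shape the image summary written by `finalise.yml`: the `build` field is the first dot-separated label of the node name, and each package version falls back to `-` when the package is absent. A rough shell analogue is sketched below to make the resulting values concrete. Both function names are hypothetical, and the node name in the example is invented.

```shell
# First dot-separated label of a node name, like `split('.') | first`.
build_label() {
  printf '%s\n' "${1%%.*}"
}

# The given version, or '-' when it is unset or empty. This is only loosely
# like Jinja's default('-'), which by default replaces undefined values
# but leaves empty strings alone.
version_or_dash() {
  printf '%s\n' "${1:--}"
}

build_label 'openhpc-RL9-ofed.novalocal'   # invented node name
version_or_dash ''                         # package not installed
```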