
Infrastructure Management

Enrique Garcia edited this page Apr 18, 2024 · 14 revisions

The following documentation describes the different steps to set up the VRE infrastructure and deploy it.

Cluster Setup

This documentation is generalised as much as possible. However, CERN makes use of certain resources that can only be used and/or managed by CERN users. The sections that are CERN dependent are flagged throughout the document.

Table of contents

  1. Create a k8s cluster in CERN OpenStack
  2. How to connect to the cluster
  3. Software stack to manage the cluster
  4. Create a git repository to manage the cluster
  5. Cluster configuration

1. Create a k8s cluster in CERN OpenStack

Note that CERN OpenStack is a CERN service. You will need to adapt this section to your own service provider.

The creation of a CERN OpenStack cluster can only be done from aiadm. The CERN k8s documentation (CERN restricted access) can be found at this link.

The cluster can be created either via the OpenStack GUI or via the command line. Please adapt the following snippet to your own use case.

#!/bin/bash

openstack coe cluster create <CLUSTER_NAME> \
     --keypair <KEYPAIR> \
     --cluster-template <K8S_VERSION> \
     --master-count 1 \
     --node-count 6 \
     --flavor m2.large \
     --master-flavor m2.medium \
     --merge-labels \
     --labels <key1>=<value1> \
     --labels <key2>=<value2>

2. How to connect to the cluster

Once the cluster has been successfully created, use the following commands to download the configuration file needed to connect to it:

$ openstack coe cluster list
# This command creates a `config` file in the current directory
$ openstack coe cluster config <CLUSTER_NAME> 

Finally, move this file to the ~/.kube/ directory and export the following environment variable:

mv config ~/.kube/config_<CLUSTER_NAME>
export KUBECONFIG=$HOME/.kube/config_<CLUSTER_NAME>
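
To confirm the kubeconfig is picked up, something like the following sketch can be run (the cluster name here is purely hypothetical; the kubectl checks require network access to the cluster):

```shell
# Hypothetical cluster name, for illustration only
CLUSTER_NAME="vre-cluster"
export KUBECONFIG="$HOME/.kube/config_${CLUSTER_NAME}"
echo "KUBECONFIG=$KUBECONFIG"

# The following checks require kubectl and access to the cluster:
#   kubectl config current-context
#   kubectl get nodes -o wide
```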

3. Software stack to manage the cluster

For the CERN VRE, we use the following software stack to manage and interact with the cluster. We highly recommend using a package/environment manager to isolate this tooling.
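
As a quick sanity check of the local tooling, the sketch below (assuming the usual kubectl/helm/flux/openstack command-line tools) reports which of them are already on PATH:

```shell
# Report which of the commonly-needed tools are already on PATH
# (kubectl/helm/flux/openstack is an assumption about the stack)
missing=0
for tool in kubectl helm flux openstack; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found"
        missing=$((missing + 1))
    fi
done
echo "missing tools: $missing"
```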

4. Create a git repository and use flux to manage the cluster

Create a git repository

Connect it with Flux

Flux was installed manually, at version v2.0.0-rc.5, via: flux bootstrap github --owner=vre-hub --repository=vre --branch=main --path=infrastructure/cluster/flux-v2 --author-name flux-ops

The Flux version was pinned to v2.0.0-rc.5; higher Flux versions are incompatible with the current cluster version (v1.26.7, April '24). To install this specific Flux version, run curl -s https://fluxcd.io/install.sh | sudo FLUX_VERSION=v2.0.0-rc.5 bash

  • To bootstrap the repository you will need to pass a valid GitHub PAT.
  • After running the above command, a new deploy key will be automatically set up in the repository configuration under the username of the person who ran the command.

Using the above command, all the manifests inside the path infrastructure/cluster/flux-v2 will be automatically deployed to the VRE cluster.

Refer to the official Flux docs for information on how to add manifests (e.g. Helm charts) and kustomizations.
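
As a sketch (with hypothetical source and release names), Helm charts can be added declaratively by generating manifests with the flux CLI's --export flag and committing them under the watched path:

```shell
# Hypothetical example: generate Flux manifests for a Helm chart with
# --export and commit them under the path watched by Flux
CHART_PATH="infrastructure/cluster/flux-v2"

# Requires the flux CLI; --export only prints YAML, it does not touch
# the cluster
flux create source helm rucio \
    --url=https://rucio.github.io/helm-charts \
    --interval=10m \
    --export > "${CHART_PATH}/rucio-helm-source.yaml"

flux create helmrelease rucio-server \
    --source=HelmRepository/rucio \
    --chart=rucio-server \
    --target-namespace=rucio-vre \
    --interval=10m \
    --export > "${CHART_PATH}/rucio-server-release.yaml"
```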

5. Cluster configuration

Ingress nodes

Label some worker nodes as ingress nodes, for future ingress traffic into the cluster.

#!/bin/bash

NODE_PREFIX=$(kubectl get nodes -l magnum.openstack.org/role=worker --sort-by .metadata.name -o name | head -n 1)
NODE_PREFIX=${NODE_PREFIX%-0}
echo $NODE_PREFIX

# example with the first 3 nodes
FIRST_3NODES=$(kubectl get nodes -l magnum.openstack.org/role=worker --sort-by .metadata.name -o name | head -n 3)
for NODE in $FIRST_3NODES
do
    kubectl label "$NODE" role=ingress
done
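
The %-0 suffix-stripping used above is plain POSIX parameter expansion; the sketch below shows it in isolation with a hypothetical node name, together with the command to verify the labels on the cluster:

```shell
# The %-suffix stripping is plain POSIX parameter expansion; shown here
# in isolation with a hypothetical node name
NODE="node/vre-cluster-abc123-node-0"
PREFIX=${NODE%-0}
echo "$PREFIX"

# On the cluster, verify which nodes now carry the ingress role:
#   kubectl get nodes -l role=ingress
```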

Create a new DB

Create and request a new DB instance at https://dbod.web.cern.ch/pages/dashboard. Have a look at the corresponding section of the wiki.

Bootstrap the DB

Check the Rucio documentation, or run the following snippet after adapting it to your setup:

#!/bin/bash
docker run --rm \
  -e RUCIO_CFG_DATABASE_DEFAULT="<DB_URL>" \
  -e RUCIO_CFG_BOOTSTRAP_USERPASS_IDENTITY="<USER>" \
  -e RUCIO_CFG_BOOTSTRAP_USERPASS_PWD="<PASS>" \
  rucio/rucio-init
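
The <DB_URL> placeholder is an SQLAlchemy-style connection string. A minimal sketch, with purely hypothetical credentials, of how such a URL is assembled for a PostgreSQL DBOD instance:

```shell
# Hypothetical connection parameters; RUCIO_CFG_DATABASE_DEFAULT expects
# an SQLAlchemy-style URL of this shape for PostgreSQL
DB_USER="rucio"
DB_PASS="changeme"
DB_HOST="dbod-vre.cern.ch"
DB_PORT="6600"
DB_NAME="rucio"
DB_URL="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
echo "$DB_URL"
```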

Install SealedSecrets in the cluster

Have a look at https://github.com/vre-hub/vre/wiki/Infrastructure-Management#secrets-management

Install the different services via Flux and Helm

There are two strategies for using Flux + Helm:

  • either push the flux/k8s manifests directly to the cluster and debug using Flux,
  • or apply the manifest by hand (kubectl apply ...) and, once any problems have been debugged, commit it to the repository - in this case Flux won't do anything, as the desired state has already been applied.
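
For the first strategy, a typical debugging loop with the flux CLI looks like the following sketch (requires cluster access; flux-system is the default bootstrap Kustomization name):

```shell
# Typical Flux debugging loop (requires the flux CLI and cluster access);
# flux-system is the default bootstrap Kustomization
KUSTOMIZATION="flux-system"
flux get kustomizations
flux reconcile kustomization "$KUSTOMIZATION" --with-source
flux logs --level=error --all-namespaces
```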

Old documentation


The following contains information on the infrastructure and its deployment.

Cluster Setup

Resources are created either via Terraform or Flux. More details can be found here.

Namespaces

Namespace        Description
shared-services  Namespace for shared resources
flux-system      Namespace for Flux
rucio-vre        Namespace for Rucio resources
reana            Namespace for REANA resources
jhub             Namespace for JupyterHub

Networking and Firewall

The two necessary services, Rucio and JupyterHub, are exposed to the internet. The ingress controllers are part of a LanDB set; see here for documentation. To add new nodes as additional ingresses, add them to the LanDB set VRE-INGRESS-NODES.

Certificates and Domains

While Rucio relies on CERN Host- and Grid Robot Certificates, we also have a globally valid Sectigo certificate covering the following domains:

  • vre.cern.ch (not used atm)
  • rucio-auth-vre.cern.ch (not used atm)
  • rucio-ui-vre.cern.ch (not used atm)
  • rucio-vre.cern.ch (not used atm)
  • reana-vre.cern.ch (not used atm)
  • jhub-vre.cern.ch (used for JupyterHub)

The TLS key and cert can be found in tbag.

In the future, certificates shall be issued automatically within the cluster through cert-manager; please refer to the corresponding documentation for more information on the CERN setup.

Database

The VRE databases run in an instance on CERN's Database On Demand (DBOD). The instance is called vre, and connection details can be found in tbag (each DB has a dedicated r/w user for use with an application - do not use the main admin user).

URL: https://dbod.web.cern.ch/ User Guide: https://dbod-user-guide.web.cern.ch/

At the moment, there are three central databases in the vre postgres instance:

  • rucio for the Rucio Helm Chart Installation rucio-cvre-*
  • reana for the REANA Helm Chart Installation reana-cvre-*
  • jhub for the Zero to JupyterHub Helm Installation jhub-cvre-*

The output of the PSQL \l command:

List of databases
   Name    |   Owner   | Encoding |   Collate   |    Ctype    | ICU Locale | Locale Provider |   Access privileges   
-----------+-----------+----------+-------------+-------------+------------+-----------------+-----------------------
 admin     | <output-omitted>     | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            | 
 dod_dbmon | <output-omitted> | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            | 
 dod_pmm   | <output-omitted>    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            | 
 jhub      | <output-omitted>       | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            | 
 postgres  | <output-omitted>   | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            | 
 reana     | <output-omitted>      | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            | 
 rucio     | <output-omitted>      | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            | =Tc/admin            +
           |           |          |             |             |            |                 | admin=CTc/admin      +
           |           |          |             |             |            |                 | rucio=CTc/admin
 template0 | <output-omitted>  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            | =c/postgres          +
           |           |          |             |             |            |                 | postgres=CTc/postgres
 template1 | <output-omitted>   | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            | =c/postgres          +
           |           |          |             |             |            |                 | postgres=CTc/postgres
(9 rows)

Additional databases may be added; see the PSQL section. This requires adding the host to pg_hba.conf (edited on the DBOD Dashboard - see the DBOD docs and here):

..
# TYPE  DATABASE  USER    ADDRESS    METHOD
host    <db>      <user>  0.0.0.0/0  md5
..
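
A hypothetical sketch of adding such a database and its dedicated r/w user via psql (host, port, and names are illustrative; the admin account is used only for this one-off creation, as noted above):

```shell
# Hypothetical sketch: create a dedicated database and r/w user for a new
# service (requires psql and network access to the DBOD instance)
DB_HOST="dbod-vre.cern.ch"
DB_PORT="6600"
NEW_USER="myservice"
NEW_DB="myservice"
psql -h "$DB_HOST" -p "$DB_PORT" -U admin -d admin \
    -c "CREATE USER ${NEW_USER} WITH PASSWORD 'changeme';"
psql -h "$DB_HOST" -p "$DB_PORT" -U admin -d admin \
    -c "CREATE DATABASE ${NEW_DB} OWNER ${NEW_USER};"
```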

Secrets Management

Create a sealed secret file running the command below:

kubectl create secret generic secret-name --dry-run=client --from-literal=foo=bar -o [json|yaml] | \
kubeseal \
    --controller-name=sealed-secrets-controller \
    --controller-namespace=kube-system \
    --format yaml > mysealedsecret.[json|yaml]

kubectl create -f mysealedsecret.[json|yaml]

In order to (1) encrypt the secret from a .yaml file, (2) save it to the GitHub repository and (3) apply it to the cluster, run the following script, first setting these variables:

  • the namespace: export NS="<your_ns>"
  • the .yaml file containing your secret: export yaml_file_secret="<your_yaml_file>"
  • the path to your clone of this repo: export SECRETS_STORE="<path_to_github_dir_>/vre-hub/vre/iac/secrets/rucio/"

# kubeseal controller namespace
CONTROLLER_NS="shared-services"
CONTROLLER_NAME="sealed-secrets-cvre"
YAML_PRFX="ss_"

# name of output secret to apply
OUTPUT_SECRET=${SECRETS_STORE}${YAML_PRFX}${yaml_file_secret}

echo "Create and apply main server secrets"

cat ${yaml_file_secret} | kubeseal --controller-name=${CONTROLLER_NAME} --controller-namespace=${CONTROLLER_NS} --format yaml --namespace=${NS} > ${OUTPUT_SECRET}

kubectl apply -f ${OUTPUT_SECRET}
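
A hypothetical invocation of the script above, showing how the output path is derived (the file and directory names are illustrative):

```shell
# Hypothetical invocation: the three variables are exported first, and
# the output path is derived exactly as in the script above
export NS="rucio-vre"
export yaml_file_secret="rucio-db-secret.yaml"
export SECRETS_STORE="$HOME/vre/iac/secrets/rucio/"

YAML_PRFX="ss_"
OUTPUT_SECRET=${SECRETS_STORE}${YAML_PRFX}${yaml_file_secret}
echo "$OUTPUT_SECRET"
```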

Services

There are two ways to expose a service to the intranet or internet (with firewall opening) at CERN:

Option A: LoadBalancer

The LoadBalancer is created automatically (see LBaaS CERN, kubernetes-service-type-loadbalancer and LBaaS); after that, the URL needs to be set as the description:

openstack loadbalancer set --description my-domain-name mylb

Some useful commands:

openstack loadbalancer list
openstack loadbalancer show <lb>
openstack server list
openstack server show <server>
openstack server unset --property <property> <server>
openstack server show <name>

To open the firewall for the services that have been created, make a firewall request.

These are our firewall opening details at CERN:

If you want to expose the services in order to get a certificate from e.g. Let's Encrypt, follow this documentation, which will create the certificates needed to allow external traffic. In our case the CERN e-group is called ESCAPE-WP2-CERN-ACME-LANDBSET, and we have a set associated with it called ESCAPE-WP2-CERN-ACME.

If you experience slow service issues, or you want to find out more about load balancing at CERN:

  • Firewall: https://clouddocs.web.cern.ch/containers/tutorials/firewall.html
  • Session persistence: https://clouddocs.web.cern.ch/networking/load_balancing.html#session-persistence
  • Getting a keytab for a load balancer: https://clouddocs.web.cern.ch/networking/load_balancing.html#getting-a-keytab-for-a-load-balancer

Option B: NodePort and Ingress. This requires several worker nodes labelled with role=ingress:

kubectl label node <node-name> role=ingress

IaC

Infrastructure as code (IaC) is used to create cluster resources. We use Terraform and Flux for that purpose. The Terraform providers in use are listed below.

Terraform

Terraform supports many different providers that can be used to create resources for a specific component like OpenStack, Helm, or K8s.

Terraform is very well documented and supported, please read through the docs if you're new to Terraform.

Terraform Providers

Terraform OpenStack Provider

⚠️ The OpenStack terraform provider does not support changes to existing resources. Every change will result in a replacement operation!

Terraform Helm Provider

The Terraform Helm provider supports two ways of declaring Helm Chart values, either by values file:

  values = [
    "${file("rucio/values-server.yaml")}"
  ]

or by setting single values:

  set {
    name = "config.database.default"
    value = data.kubernetes_secret_v1.rucio_db_secret.data.dbconnectstring
  }

so an entire Helm Chart definition with Terraform could look like this:

resource "helm_release" "rucio-server-chart" {
  name       = "rucio-server-${var.resource-suffix}"
  repository = "https://rucio.github.io/helm-charts"
  chart      = "rucio-server"
  version    = "1.30.0"
  namespace  = var.ns-rucio

  values = [
    "${file("rucio/values-server.yaml")}"
  ]

  set {
    name = "config.database.default"
    value = data.kubernetes_secret_v1.rucio_db_secret.data.dbconnectstring
  }
}

Flux

Flux was installed manually via: flux bootstrap github --owner=vre-hub --repository=vre --branch=main --path=infrastructure/cluster/flux --author-name flux-ops

Manifests inside the path infrastructure/cluster/flux-v2 will be automatically deployed to the VRE cluster.

Refer to the official Flux docs for information on adding manifests, e.g. Helm charts, and kustomizations.

EOS

EOS is mounted through a DaemonSet on labelled nodes, which allows the mount to propagate to the users in JupyterLab. For specific setup instructions, refer to the EOS Docs and the CERN K8s EOS Docs. It is important that the OAuth token has its permissions restricted to owner read/write (chmod 0600).
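
A self-contained way to check the permission bits looks like the sketch below (a temporary file stands in for the real token file here):

```shell
# Self-contained check of the permission bits; a temporary file stands in
# for the real OAuth token
TOKEN_FILE="$(mktemp)"
chmod 0600 "$TOKEN_FILE"
# GNU stat first, BSD stat as a fallback
PERMS=$(stat -c %a "$TOKEN_FILE" 2>/dev/null || stat -f %Lp "$TOKEN_FILE")
echo "token permissions: $PERMS"
rm -f "$TOKEN_FILE"
```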