Merge branch 'main' into merge_rucio_configs
garciagenrique authored Aug 13, 2024
2 parents f70b1a1 + 92254a9 commit 61a83d8
Showing 67 changed files with 13,191 additions and 1,464 deletions.
9 changes: 5 additions & 4 deletions AUTHORS.md
@@ -1,7 +1,8 @@
# Authors

List of contributors in alphabetical order:
List of contributors:

- Enrique Garcia Garcia <[email protected]>
- Elena Gazzarrini <[email protected]>
- Domenic Gosein <[email protected]>
- Elena Gazzarrini (CERN), 2022-2024
- Enrique Garcia Garcia (CERN), 2022-2025
- Domenic Gosein (CERN), 2022-2023
- Giovanni Guerrieri (CERN), 2024-
27 changes: 15 additions & 12 deletions containers/base-ops/Dockerfile
@@ -1,34 +1,37 @@
ARG BASEIMAGE=rucio/rucio-server
ARG BASETAG=release-1.30.0
ARG BASETAG=release-34.6.0
ARG BUILD_DATE

FROM ${BASEIMAGE}:${BASETAG}
LABEL maintainer="VRE Team @ CERN 22/23 - E. Garcia, E. Gazzarrini, D. Gosein"
LABEL maintainer="VRE Team @ CERN 23/24 - E. Garcia, G. Guerrieri"
LABEL org.opencontainers.image.source https://github.com/vre-hub/vre
LABEL org.label-schema.build-date=${BUILD_DATE}

USER root

# Install epel-release
RUN dnf install -y epel-release

# cleanup yum cache
RUN yum upgrade -y \
&& yum clean all \
&& rm -rf /var/cache/yum
RUN dnf upgrade -y \
&& dnf clean all \
&& rm -rf /var/cache/dnf

# install useful tools
RUN yum -y install git htop wget voms-clients-cpp
RUN pip install --upgrade pip
RUN dnf -y install git htop wget voms-clients-cpp
RUN python3 -m pip install --upgrade pip

# EGI trust anchors
RUN curl -Lo /etc/yum.repos.d/egi-trustanchors.repo https://repository.egi.eu/sw/production/cas/1/current/repo-files/egi-trustanchors.repo \
&& yum update -y
&& dnf update -y

RUN yum -y install gfal2* python3-gfal2 xrootd-client voms-clients-java
RUN yum -y install ca-certificates ca-policy-egi-core
RUN dnf -y install gfal2* python3-gfal2 xrootd-client voms-clients-java
RUN dnf -y install ca-certificates ca-policy-egi-core

# Install CERN CA certs from CERN maintained mirrors
# This will add a `CERN-bundle.pem` file (among others) into `/etc/pki/tls/certs/`
COPY ./linuxsupport7s-stable.repo /etc/yum.repos.d/
RUN yum install -y CERN-CA-certs
RUN dnf -y --repofrompath='tmpcern,https://linuxsoft.cern.ch/cern/alma/$releasever/CERN/$basearch/' upgrade almalinux-release --nogpgcheck
RUN dnf install -y CERN-CA-certs

# ESCAPE VOMS setup
RUN mkdir -p /etc/vomses \
10 changes: 0 additions & 10 deletions containers/base-ops/linuxsupport7s-stable.repo

This file was deleted.

86 changes: 86 additions & 0 deletions infrastructure/cluster/README.md
@@ -0,0 +1,86 @@
# CERN-VRE cluster

## Cluster administration

The CERN-VRE cluster is composed of 3 master nodes plus 20 worker nodes, running on the CERN OpenStack instance.

- master nodes: `m2.large` flavour; 4 VCPUs, 7.3 GB RAM and 40 GB disk.
- worker nodes: `m2.xlarge` flavour; 8 VCPUs, 14.6 GB RAM and 80 GB disk.

10 nodes (0 to 9) are "reserved" for infrastructure management: k8s, rucio, jhub, reana ...
10 nodes (10 to 22 - nodes 14, 16 and 21 don't exist) are tagged for computing purposes: jhub-sessions and os, cvmfs and CephFS "connectors". WIP: reana sessions should be spawned here too.

As of 14 Feb 2024, the nodes have been labelled as follows:
`kubectl label node <NODE_NAME> jupyter=singleuser`
and these same 10 nodes also need to be tainted so that they only allow Jupyter sessions:
`kubectl taint nodes <NODE_NAME> jupyter=singleuser:NoSchedule`

As of 21 Jun 2024, the node labels and taints have been removed.
Reana was not able to reach cvmfs (the ds was not deploying any nodeplugin on the nodes because of the above restrictions). It was easier to untaint and unlabel everything than to make the whole Reana deployment tolerate the taints.
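
For reference, removing a taint or a label uses the same `kubectl` commands with a trailing `-`. The sketch below only builds the command strings for a list of nodes; the helper name `removal_commands` and the node names are hypothetical, and nothing is run against a cluster.

```python
def removal_commands(nodes, key="jupyter", value="singleuser", effect="NoSchedule"):
    """Build the kubectl commands that remove the taint and the label from each node.

    A trailing '-' tells kubectl to remove the taint or label instead of adding it.
    """
    commands = []
    for node in nodes:
        commands.append(f"kubectl taint nodes {node} {key}={value}:{effect}-")
        commands.append(f"kubectl label node {node} {key}-")
    return commands


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    for cmd in removal_commands(["node-10", "node-11"]):
        print(cmd)
```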

Each Jupyter session is therefore spawned within the above nodes (by adding `memory`, `nodeSelector` and `extraTolerations` to the jhub-release manifest, as shown below).
Resources have been assigned/organised without much prior experience, based on the following
[zero-to-jupyterhub-k8s documentation](https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/main/docs/source/administrator/optimization.md#balancing-guaranteed-vs-maximum-memory-and-cpu).

```yaml
singleuser:
  # cpu:
  #   limit: 4
  #   guarantee: 0.05
  memory:
    # ratio of 3:2 that is lower than the 2:1 suggested (on the above link)
    # This should allow a
    limit: 4G
    guarantee: 2G

  nodeSelector:
    jupyter: singleuser
  extraTolerations:
    - key: jupyter
      operator: Equal
      value: singleuser
      effect: NoSchedule
```
Notes:
- `cpu` requests have been commented out. Point 3 of the above link reads: **If you set resource limits but omit resource requests, then k8s will assume you imply the same resource requests as your limits. No assumptions are made in the other direction.**
- A ratio of `3.5:3` on `memory` will be set up on the cluster for the ET school on 20 February 2024. To be elaborated / linked with a **lessons learned** post.
- `nodeSelector` and `extraTolerations` would need to be applied to the `eos` and `cephfs` `Daemonsets` too, so that they are not deployed across the first 10 nodes.
- Investigate how the `prePuller` config and the `continuous-image-puller` pods can be restricted in a `nodeSelector` way. --> `Jhub` understood that the image puller should only be on the nodes assigned for jupyter `:)`, although there is a `Daemonset` that controls them.
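
As a rough illustration of what the `limit`/`guarantee` pair implies for one worker node, the sketch below counts how many sessions fit by guarantee and how far memory would be overcommitted if every session reached its limit. It assumes the full 14.6 GB of a worker is schedulable (it ignores memory reserved for the system and kubelet) and treats `G` and `GB` as the same unit; the function name `sessions_per_node` is hypothetical.

```python
import math


def sessions_per_node(node_ram_gb, guarantee_gb, limit_gb):
    """Return (sessions that fit by guarantee, worst-case overcommit factor)."""
    # Pods are placed according to their requests (the "guarantee"),
    # so the guarantee decides how many sessions fit on one node.
    fit = math.floor(node_ram_gb / guarantee_gb)
    # If every session grew to its limit at once, demand would be fit * limit.
    overcommit = (fit * limit_gb) / node_ram_gb
    return fit, overcommit


if __name__ == "__main__":
    # Values from the manifest above (4G limit, 2G guarantee) on a 14.6 GB worker.
    print(sessions_per_node(14.6, guarantee_gb=2, limit_gb=4))
    # The 3.5:3 ratio mentioned in the notes.
    print(sessions_per_node(14.6, guarantee_gb=3, limit_gb=3.5))
```

Under these assumptions, the 4G/2G setting fits 7 sessions per worker by guarantee, while the 3.5:3 setting fits 4 and keeps the summed limits below the node's RAM.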


### Patching the cluster
#### EOS
`Daemonset` `cern-magnum-eosxd` (`registry.cern.ch/magnum/eosd:4.8.51-1.2`) was deployed by default. The `Daemonset` was patched as follows:

```bash
$ kubectl patch ds -n kube-system cern-magnum-eosxd --patch-file node_and_tolerations_jup.yaml
daemonset.apps/cern-magnum-eosxd patched
```
with `cat node_and_tolerations_jup.yaml`
```yaml
spec:
  template:
    spec:
      nodeSelector:
        jupyter: singleuser
      tolerations:
        - effect: NoSchedule
          key: jupyter
          operator: Equal
          value: singleuser
```

**Open doubt** Not sure why the tolerations patch needs to be applied to the whole `ds`, as the `ds` itself doesn't use these tolerations. However, the pods it spawns contain both the `nodeSelector` and the `tolerations`.
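
A possible reading of the doubt above: the patch sits under `spec.template.spec`, which is the pod template of the `Daemonset`, so it is the spawned pods that carry the `nodeSelector` and `tolerations`, and a pod needs a matching toleration to be scheduled onto a node with a `NoSchedule` taint. The sketch below models the matching rule as described in the Kubernetes taints and tolerations documentation; it is a simplified illustration (the function name `tolerates` is hypothetical), not the scheduler's actual code.

```python
def tolerates(toleration, taint):
    """Return True if a toleration matches a taint (simplified model)."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key with operator Exists matches every key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Operator Equal: key and value must both match.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


if __name__ == "__main__":
    # Taint and toleration as used in this README.
    taint = {"key": "jupyter", "value": "singleuser", "effect": "NoSchedule"}
    toleration = {"key": "jupyter", "operator": "Equal",
                  "value": "singleuser", "effect": "NoSchedule"}
    print(tolerates(toleration, taint))
```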


#### CVMFS
`Daemonset` `cvmfs-cvmfs-csi-nodeplugin` () was deployed into the cern-vre (not by default but manually, by the k8s team if I recall correctly), and patched as follows:

```bash
$ kubectl patch ds -n kube-system cvmfs-cvmfs-csi-nodeplugin --patch-file node_and_tolerations_jup.yaml
daemonset.apps/cvmfs-cvmfs-csi-nodeplugin patched
```

#### CEPHFS
Nothing done for the moment - not sure whether master nodes (or any monitoring) send/connect to CephFS for any reason.
47 changes: 0 additions & 47 deletions infrastructure/cluster/flux-v2/cvmfs/cvmfs.yaml

This file was deleted.

32 changes: 0 additions & 32 deletions infrastructure/cluster/flux-v2/dask/README.md

This file was deleted.

8 changes: 0 additions & 8 deletions infrastructure/cluster/flux-v2/dask/dask-charts.yaml

This file was deleted.

9 changes: 0 additions & 9 deletions infrastructure/cluster/flux-v2/dask/dask-gateway-chart.yaml

This file was deleted.

10 changes: 0 additions & 10 deletions infrastructure/cluster/flux-v2/dask/dask-gateway-configmap.yaml

This file was deleted.

60 changes: 0 additions & 60 deletions infrastructure/cluster/flux-v2/dask/dask-gateway-release.yaml

This file was deleted.

40 changes: 0 additions & 40 deletions infrastructure/cluster/flux-v2/dask/dask-release.yaml

This file was deleted.

6 changes: 0 additions & 6 deletions infrastructure/cluster/flux-v2/jhub/jhub-ns.yaml

This file was deleted.
