
Support for NVMe volumes #384

Open · artem-zinnatullin opened this issue Jun 22, 2021 · 10 comments

@artem-zinnatullin

Hi!

We're looking for an automated way to provision PersistentVolumeClaims against locally mounted NVMe drives on DigitalOcean https://www.digitalocean.com/blog/introducing-storage-optimized-droplets-with-nvme-ssds/

We've tried the local StorageClass (https://kubernetes.io/docs/concepts/storage/storage-classes/#local); it does work, however it is not automated at all, unlike DO Block Storage in k8s:

  • We have to manually create PersistentVolumes (a minimal sketch of such a PV follows this list)
  • Each PersistentVolume has to be constrained to a particular node with nodeAffinity
  • Each PersistentVolume has to have its capacity manually defined; however, the capacity does not act as a limit, since the NVMe storage is mounted as the root / filesystem on Premium and Storage Optimized Droplets with NVMe
  • Each PersistentVolume must have only one associated PersistentVolumeClaim, otherwise Pods using it will not be scheduled
  • Each new Node added to the cluster has to have its PVs and PVCs configured manually, which defeats the benefit of k8s autoscaling.
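
For reference, a minimal sketch of the kind of PersistentVolume that currently has to be created by hand for each node (the name, capacity, path, and node hostname below are placeholders):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nvme-pv-node-a          # placeholder name
spec:
  capacity:
    storage: 100Gi              # placeholder; not enforced as a limit on the shared root filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/nvme/pv1         # placeholder path on the node's NVMe-backed filesystem
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - pool-example-node-a # placeholder node hostname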

We're looking into CSI implementations like https://github.com/minio/direct-csi, however the major blocker there is that it only works with additional (non-root) disks, while DigitalOcean Premium Droplets use the NVMe drive as root /.

The question is: can you consider adding support for DigitalOcean NVMe drives to csi-digitalocean please? :)

Thanks!

@adamwg (Contributor) commented Jun 22, 2021

Hello,

We are considering adding support for dynamic provisioning of local storage volumes in DOKS, however it likely will not be implemented in this CSI driver.

The significant caveat to using node-local NVMe/SSD storage is that it is indeed node-local - we can't detach it from one node and attach it to another. This means it's really only useful for ephemeral purposes, since we expect nodes to be replaced in the course of normal cluster operations (e.g., due to health or for upgrade).

If you're able to share, I'd be interested to hear more about your use-case for local storage. We can connect over email if you'd rather discuss privately.

Thanks!

cc @bikram20

@artem-zinnatullin (Author)

We are considering adding support for dynamic provisioning of local storage volumes in DOKS

That's great news!

however it likely will not be implemented in this CSI driver.

Interesting, how would it be exposed and mounted then?

The significant caveat to using node-local NVMe/SSD storage is that it is indeed node-local - we can't detach it from one node and attach it to another. This means it's really only useful for ephemeral purposes, since we expect nodes to be replaced in the course of normal cluster operations (e.g., due to health or for upgrade).

We do understand this caveat. There are cases when it's fine: we want to run a distributed database and a distributed object store on NVMe storage. Due to performance requirements we do want to use the NVMe drives that DigitalOcean offers. In our case the applications are distributed, meaning that a Node shutting down (say, for an upgrade) is fine since other nodes act as replicas; this is achieved via affinity rules in the app deployment so that pods of these apps are not scheduled onto nodes that already run one.
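
For illustration, one common way to express that constraint is a podAntiAffinity rule in the workload's pod spec; a minimal sketch (the app label is a placeholder):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-distributed-db   # placeholder label identifying the database pods
      topologyKey: kubernetes.io/hostname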

If you're able to share, I'd be interested to hear more about your use-case for local storage. We can connect over email if you'd rather discuss privately.

Let's continue publicly in this issue; there is very little public discussion on this topic, so I'd like to use this thread as an opportunity to add more information about using local NVMe drives with Kubernetes to the internet :)

@adamwg (Contributor) commented Jun 22, 2021

We are considering adding support for dynamic provisioning of local storage volumes in DOKS

That's great news!

however it likely will not be implemented in this CSI driver.

Interesting, how would it be exposed and mounted then?

We would add an additional StorageClass with a separate provisioner, potentially leveraging an existing project like the direct-csi driver you linked. There's nothing DO-specific about node-local storage, so no need to add it to the DO CSI driver.
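
Purely as an illustration of what such an additional StorageClass might look like (the class and provisioner names here are hypothetical and would depend on whichever driver is chosen):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme                    # hypothetical class name
provisioner: local.csi.example.com    # hypothetical provisioner; depends on the chosen driver
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete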

@artem-zinnatullin (Author)

Sounds good!

@artem-zinnatullin (Author)

Submitted a related issue on partitioning NVMe drives for DOKS nodes: digitalocean/DOKS#27. Basically, we can't repartition the NVMe drive right now.

@kainz commented Aug 4, 2021

This sort of provisioning is also useful for running your own database workloads on nodes if you need local NVMe performance. Yes, the storage is 'ephemeral', but that is something database management tools like the Zalando Postgres operator or Stolon can take into account, especially when combined with things like pod disruption budgets.
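
As a rough sketch, a PodDisruptionBudget along these lines can keep a quorum of database pods available while nodes are recycled (the name, selector label, and minAvailable value are placeholders and depend on the operator in use):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pg-cluster-pdb            # placeholder name
spec:
  minAvailable: 2                 # placeholder; keep enough replicas for quorum
  selector:
    matchLabels:
      cluster-name: my-pg-cluster # placeholder; use the labels your operator sets on database pods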

You can implement solutions for that need today by running self-managed k8s clusters alongside a managed one, but the administration workload multiplies accordingly in that case. Managed DOKS, as of 1.20 at least, is almost there with the ability to run so_1.5_* plan node pools. If you offered a way to upgrade a node pool in-place, an operator needing to run a local datastore could run it entirely in managed DOKS.

In my particular use case, I have clients who need to run PostgreSQL services with custom extensions and replication patterns, which disqualifies most managed SQL offerings as well; hence my interest in closing the feature gaps in managing ephemeral storage on cloud instances/Droplets.

@kallisti5

Hm. Vultr has been offering NVMe as the default for their managed Kubernetes solution for a while. This is a big difference, at no additional cost.

@bikram20

@kallisti5 What kind of workloads are you looking to run on NVMe local storage? Would you be okay with ephemeral nodes? Nodes are recycled during release upgrades.

@kallisti5 commented Feb 14, 2022

@bikram20 Overall I'm trying to find a cost-effective way to leverage the standard DO instance sizes.

Running a reliable ReadWriteMany storage model is pretty difficult at DigitalOcean. My solution was Longhorn (https://longhorn.io), since it maintains and grooms RWX replicas across the Kubernetes nodes directly, using the large amount of otherwise wasted space on each node-pool Droplet to save costs (the 4 vCPU / 8 GiB nodes have over 100 GiB that goes unused for most people using DO's CSI). It also automatically backs up data to S3.

NVMe though would probably be the minimum requirement to maintain replicas within a reasonable timeframe.

DO really needs a managed storage solution that can do RWX like Gluster or NFS.
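
For what it's worth, a minimal sketch of an RWX claim against Longhorn's default storage class (the claim name and size are placeholders; Longhorn installs a StorageClass named "longhorn" by default):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-packages        # placeholder name
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: longhorn   # Longhorn's default StorageClass name
  resources:
    requests:
      storage: 300Gi           # placeholder size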

The workload itself is 300 GiB+ of software packages for Haiku (https://haiku-os.org) plus some other infrastructure.

@AlbinoDrought commented Aug 1, 2024

For others that are interested, a potential workaround is to mount file containers. Here's an example (original source):

File Container YAML
---
apiVersion: v1
kind: Namespace
metadata:
  name: xfs-disk-setup

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: xfs-disk-setup
  namespace: xfs-disk-setup
  labels:
    app: xfs-disk-setup
spec:
  selector:
    matchLabels:
      app: xfs-disk-setup
  template:
    metadata:
      labels:
        app: xfs-disk-setup
    spec:
      tolerations:
      - operator: Exists
      containers:
      - name: xfs-disk-setup
        image: docker.io/scylladb/local-csi-driver:latest
        imagePullPolicy: IfNotPresent
        command:
        - "/bin/bash"
        - "-euExo"
        - "pipefail"
        - "-O"
        - "inherit_errexit"
        - "-c"
        - |
          img_path="/host/var/persistent-volumes/persistent-volume.img"
          img_dir=$( dirname "${img_path}" )
          mount_path="/host/mnt/persistent-volumes"
          
          mkdir -p "${img_dir}"
          if [[ ! -f "${img_path}" ]]; then
            dd if=/dev/zero of="${img_path}" bs=1024 count=0 seek=10485760
          fi
          
          FS=$(blkid -o value -s TYPE "${img_path}" || true)
          if [[ "${FS}" != "xfs" ]]; then
            mkfs --type=xfs "${img_path}"
          fi
          
          mkdir -p "${mount_path}"
          
          remount_opt=""
          if mountpoint "${mount_path}"; then
            remount_opt="remount,"
          fi
          mount -t xfs -o "${remount_opt}prjquota" "${img_path}" "${mount_path}"
          
          sleep infinity
        securityContext:
          privileged: true
        volumeMounts:
        - name: hostfs
          mountPath: /host
          mountPropagation: Bidirectional
      volumes:
      - name: hostfs
        hostPath:
          path: /

You can then use https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner or any other local volume "provisioner" like normal.
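
As a rough sketch, once the provisioner has discovered the mount and created a PersistentVolume, workloads can claim it like any other local volume (the claim name and class name below are placeholders; the class must match the provisioner's configuration):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-scratch              # placeholder name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage  # placeholder; must match the class the provisioner is configured with
  resources:
    requests:
      storage: 9Gi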

The above DaemonSet creates a sparse file by default. To instead reserve the amount of space specified, try a syntax like dd if=/dev/zero of="${img_path}" bs=1M count=${img_size_mb} instead of dd if=/dev/zero of="${img_path}" bs=1024 count=0 seek=10485760.

I benchmarked this on s-2vcpu-4gb-120gb-intel with a non-sparse file container mounted. Here's the fio config:

[read]
    direct=1
    bs=8k
    size=1G
    time_based=1
    runtime=240
    ioengine=libaio
    iodepth=32
    end_fsync=1
    log_avg_msec=1000
    directory=/data
    rw=read
    write_bw_log=read
    write_lat_log=read
    write_iops_log=read

and here are the results:

Storage class comparison: IOPS and bandwidth charts for the Local File Container vs. DigitalOcean Block Storage (images in the original comment).

The block storage benchmarks match what is currently listed on the Limits page (7500 IOPS * 8k blocksize = 60MB/s).


If you're able to share, I'd be interested to hear more about your use-case for local storage. We can connect over email if you'd rather discuss privately.

Not OP, but I'm interested in this for use with CloudNative-PG as an alternative to Managed Databases (we have different RPO requirements).
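
As a rough sketch, pointing a CloudNative-PG Cluster at such a storage class looks something like this (the name, size, and storage class are placeholders; the class would be whatever exposes the local file container above):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-local                 # placeholder name
spec:
  instances: 3
  storage:
    storageClass: local-storage  # placeholder; the class backed by the node-local volumes
    size: 50Gi                   # placeholder size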

For what it's worth, here are our rudimentary pgbench results for CloudNative-PG using the above local file container vs. a managed database:

Results per database, showing the schema init, a select-only run, and a read/write (TPC-B) run:

CloudNative-PG (local file container)
done in 356.09 s 
(drop tables 0.00 s, 
create tables 0.02 s, 
client-side generate 212.51 s, 
vacuum 10.92 s, 
primary keys 132.64 s).
pgbench (16.3 (Debian 16.3-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 8
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 152322
number of failed transactions: 0 (0.000%)
latency average = 1.571 ms
initial connection time = 90.086 ms
tps = 5092.207290 (without initial connection time)
Stream closed EOF for default/pgbench-run3ro-snc5r (pgbench)
pgbench (16.3 (Debian 16.3-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 64
number of threads: 64
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 37106
number of failed transactions: 0 (0.000%)
latency average = 50.980 ms
initial connection time = 602.344 ms
tps = 1255.393173 (without initial connection time)
Stream closed EOF for default/pgbench-run64x64-sd9hb (pgbench)

Managed (1x s-4gb-2vcpu)
done in 295.82 s 
(drop tables 0.00 s, 
create tables 0.00 s, 
client-side generate 187.63 s, 
vacuum 1.59 s, 
primary keys 106.59 s).
pgbench (16.3 (Debian 16.3-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 8
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 132639
number of failed transactions: 0 (0.000%)
latency average = 1.805 ms
initial connection time = 85.801 ms
tps = 4432.812902 (without initial connection time)
Stream closed EOF for default/pgbench-run3ro-cloud-6n5ss (pgbench)
pgbench (16.3 (Debian 16.3-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 64
number of threads: 64
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 25884
number of failed transactions: 0 (0.000%)
latency average = 72.894 ms
initial connection time = 691.792 ms
tps = 877.986444 (without initial connection time)
Stream closed EOF for default/pgbench-run64x64-cloud-v679b (pgbench)
