Add new Healer operation #40

Open
wants to merge 3 commits into
base: main
1 change: 1 addition & 0 deletions Makefile
@@ -25,3 +25,4 @@ clean-deps:
$(MAKE) -C identity $@
$(MAKE) -C reclaimspace $@
$(MAKE) -C replication $@
$(MAKE) -C healer $@
20 changes: 20 additions & 0 deletions healer/Makefile
@@ -0,0 +1,20 @@
# Copyright 2022 The csi-addons Authors. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

PROTO := healer.proto
PROTO_SOURCE := README.md

all: install-deps $(PROTO) build

include ../release-tools/build.make
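
Since `PROTO_SOURCE` points at `README.md` and the resulting `healer.proto` carries a "Code generated by make" header, the proto file appears to be assembled from the `protobuf` fenced blocks in the README. The actual rule lives in `release-tools/build.make`, which is not part of this diff, so the following is only a rough Python sketch of that kind of extraction. The function name `extract_proto` and the exact handling of the header are assumptions for illustration.

```python
# Build the fence marker programmatically so this example does not
# contain a literal fence inside its own code block.
FENCE = "`" * 3

# Header seen at the top of the generated healer.proto in this PR.
GENERATED_HEADER = "// Code generated by make; DO NOT EDIT."


def extract_proto(markdown: str) -> str:
    """Concatenate the bodies of all protobuf fenced blocks in a README.

    Hypothetical stand-in for the make rule: fence lines are dropped and
    the block bodies are emitted in document order below the header.
    """
    out = [GENERATED_HEADER]
    inside = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if not inside and stripped == FENCE + "protobuf":
            inside = True       # opening fence: start copying
        elif inside and stripped == FENCE:
            inside = False      # closing fence: stop copying
        elif inside:
            out.append(line)    # body line: keep verbatim
    return "\n".join(out) + "\n"
```

One consequence of this workflow is that any wording change to the message comments should be made in `healer/README.md` and regenerated, rather than edited in `healer.proto` directly.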
130 changes: 130 additions & 0 deletions healer/README.md
@@ -0,0 +1,130 @@
# CSI-Addons Operation: Healer

## Terminology

| Term | Definition |
| -------- | ------------------------------------------------------------------------------------- |
| VolumeID | The identifier of the volume generated by the plugin. |
| CO | Container Orchestration system that communicates with plugins using CSI service RPCs. |
| SP | Storage Provider, the vendor of a CSI plugin implementation. |
| RPC | [Remote Procedure Call](https://en.wikipedia.org/wiki/Remote_procedure_call). |

## Objective

Define a standard that will enable storage providers (SP) to
perform node-level volume health checks and healing operations.

### Goals in MVP

The new extension will define a procedure that

* can be called for existing volumes
* interacts with the Node-Plugin to check the health condition of the volume
* makes it possible for the SP to heal the volumes if they are in an abnormal
condition

### Non-Goals in MVP

* Implementation of healing logic is OPTIONAL and completely SP specific

## Solution Overview

This specification defines an interface along with the minimum operational and
packaging recommendations for a storage provider (SP) to implement
health check and heal operations for volumes. The interface declares the
RPCs that a plugin MUST expose.

## RPC Interface

* **Node Service**: The Node plugin MUST implement this RPC.

```protobuf
syntax = "proto3";
package healer;

import "github.com/container-storage-interface/spec/lib/go/csi/csi.proto";
import "google/protobuf/descriptor.proto";

option go_package = "github.com/csi-addons/spec/lib/go/healer";

// HealerNode holds the RPC method for running heal operations on the
// active (staged/published) volume.
service HealerNode {
// NodeHealer is a procedure that gets called on the CSI NodePlugin.
rpc NodeHealer (NodeHealerRequest)
Contributor

Maybe call this HealVolume?

I would expect a way to check if a volume needs healing. The current NodeHealer call seems to do that, but also is expected to do the healing?

It would probably be useful to have a check, and only perform healing if the check reports that it is needed. This can then provide better feedback and tracking of the actions that were performed. It might even be useful to fall back to some other recovery/healing mechanism in case volume healing fails (an extension such as HealNode, which might do a reboot or inform medik8s or similar?).

Contributor Author

@nixpanic
NodeHealer() is called on the CSI NodePlugin, so it's up to the SP whether to implement/perform healing, right? If so, why do we need a flag/check?

I'm probably missing something. Is this flag set as part of the request or the response?

  • If it is part of the request, who will set it? We don't have any CRDs here.
  • If it is part of the response, then maybe, based on the response, CSI-Addons might want to do a reboot or similar?

Could you please clarify?
Also, can't we detect that healing failed from the message string that is part of the response?

Contributor

I would consider two procedures, plus a CO-side fallback:

  1. GetHealth() returning the state of the volume, possibly with a selection of recovery options (VolumeHealing or NodeHealing required)
  2. HealVolume() doing the actual healing, if it can heal the volume at all
  3. the CSI-Addons implementation for the CO can then implement a HealNode function (with Kubernetes medik8s might be an option)

This makes it easier to follow the actions, and improve the reporting to users (Kubernetes events maybe?).

CRDs are Kubernetes specific, so such definitions should go in the kubernetes-csi-addons repository instead. There probably should be a way for the CO to configure the checking interval, maybe even per volume. When using Kubernetes, an annotation on the PVC might be suitable for that (depending on the number of options).

returns (NodeHealerResponse) {}
}
```
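
The split suggested in the review thread (check health first, heal only when needed, and let the CO fall back to a node-level action) could be sketched roughly as follows. This is not part of the PR: `Recovery`, `Health`, `reconcile`, and the `get_health`, `heal_volume`, and `heal_node` callables are all hypothetical names used only to make the proposed control flow concrete.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Recovery(Enum):
    # Hypothetical recovery hints a GetHealth-style call might return.
    NONE = 0            # no recovery option suggested
    VOLUME_HEALING = 1  # SP believes healing the volume can fix it
    NODE_HEALING = 2    # only a node-level action (reboot, medik8s, ...) helps


@dataclass
class Health:
    abnormal: bool
    message: str
    recovery: Recovery = Recovery.NONE


def reconcile(volume_id, get_health, heal_volume, heal_node) -> List[str]:
    """Check first, heal only when needed, then fall back to node healing.

    Returns the actions taken, so the CO could report them to users
    (for example as Kubernetes events).
    """
    actions = []
    health = get_health(volume_id)
    actions.append(f"checked: {health.message}")
    if not health.abnormal:
        return actions                  # healthy: nothing else to do
    if health.recovery is Recovery.VOLUME_HEALING:
        actions.append("heal_volume")
        if heal_volume(volume_id):      # True means healing succeeded
            return actions
    # Volume healing was not offered or did not succeed: escalate.
    actions.append("heal_node")
    heal_node(volume_id)
    return actions
```

Keeping the check separate from the healing step means every action is recorded individually, which is the reporting benefit the reviewer points out.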

### NodeHealer

```protobuf
// NodeHealerRequest contains the information needed to identify the
// location where the volume is mounted so that local filesystem or
// block-device operations to heal volume can be executed.
message NodeHealerRequest {
// The ID of the volume. This field is REQUIRED.
string volume_id = 1;

// The path on which volume is available. This field is REQUIRED.
// This field overrides the general CSI size limit.
// SP SHOULD support the maximum path length allowed by the operating
// system/filesystem, but, at a minimum, SP MUST accept a max path
// length of at least 128 bytes.
string volume_path = 2;

// The path where the volume is staged, if the plugin has the
// STAGE_UNSTAGE_VOLUME capability, otherwise empty.
// If not empty, it MUST be an absolute path in the root
// filesystem of the process serving this request.
// This field is OPTIONAL.
// This field overrides the general CSI size limit.
// SP SHOULD support the maximum path length allowed by the operating
// system/filesystem, but, at a minimum, SP MUST accept a max path
// length of at least 128 bytes.
string staging_target_path = 3;

// Volume capability describing how the CO intends to use this volume.
// This allows the SP to determine whether the volume is being used as a
// block device or a mounted file system. For example, if the volume is
// being used as a block device, the SP MAY choose to skip filesystem
// operations while healing. If volume_capability is omitted, the SP MAY
// determine the access_type from the given volume_path and perform
// healing. This field is OPTIONAL.
csi.v1.VolumeCapability volume_capability = 4;

// Secrets required by plugin to complete the healer operation.
// This field is OPTIONAL.
map<string, string> secrets = 5 [(csi.v1.csi_secret) = true];

// Volume context as returned by SP in
// CreateVolumeResponse.Volume.volume_context.
// This field is OPTIONAL and MUST match the volume_context of the
// volume identified by `volume_id`.
map<string, string> volume_context = 6;
}

// NodeHealerResponse holds the information about the result of the
// NodeHealerRequest call.
message NodeHealerResponse {
// Normal volumes are available for use and operating optimally.
// An abnormal volume does not meet these criteria.
// This field is REQUIRED.
bool abnormal = 1;

// The message describing the condition of the volume.
// This field is REQUIRED.
string message = 2;
}
```
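
To illustrate the field requirements stated in the comments above, here is a small sketch of the checks a Node plugin might run before doing any work. It uses a plain dataclass as a stand-in for the generated message rather than real gRPC bindings, and the `validate` helper is a hypothetical name. A non-`None` result would correspond to returning `3 INVALID_ARGUMENT` to the caller.

```python
import posixpath
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NodeHealerRequest:
    # Plain-Python stand-in for the protobuf message, not generated code.
    # volume_capability is left out to keep the sketch short.
    volume_id: str = ""
    volume_path: str = ""
    staging_target_path: str = ""
    secrets: Dict[str, str] = field(default_factory=dict)
    volume_context: Dict[str, str] = field(default_factory=dict)


def validate(req: NodeHealerRequest) -> Optional[str]:
    """Return an error description, or None when the request is acceptable."""
    if not req.volume_id:
        return "volume_id is REQUIRED"
    if not req.volume_path:
        return "volume_path is REQUIRED"
    # staging_target_path is OPTIONAL, but MUST be absolute when present.
    if req.staging_target_path and not posixpath.isabs(req.staging_target_path):
        return "staging_target_path MUST be an absolute path when set"
    return None
```

For example, `validate(NodeHealerRequest(volume_id="v"))` reports the missing `volume_path`, while a request with both required fields and no staging path passes.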

#### NodeHealer Errors

| Condition | gRPC Code | Description | Recovery Behavior |
| ---------------------------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Missing required field | 3 INVALID_ARGUMENT | Indicates that a required field is missing from the request. | Caller MUST fix the request by adding the missing required field before retrying. |
| Volume does not exist | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Call not implemented | 12 UNIMPLEMENTED | The invoked RPC is not implemented by the CSI-driver or disabled in the driver's current mode of operation. | Caller MUST NOT retry. |
| Operation pending for volume | 10 ABORTED | Indicates that there is already an operation pending for the specified `volume_id`. In general the CSI-Addons CO plugin is responsible for ensuring that there is no more than one call "in-flight" per `volume_id` at a given time. However, in some circumstances, the CSI-Addons CO plugin MAY lose state (for example when it crashes and restarts), and MAY issue multiple calls simultaneously for the same `volume_id`. The CSI-driver SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified `volume_id`, and then retry with exponential back off. |
| Not authenticated | 16 UNAUTHENTICATED | The invoked RPC does not carry secrets that are valid for authentication. | Caller SHALL either fix the secrets provided in the RPC, or otherwise regalvanize said secrets such that they will pass authentication by the Plugin for the attempted RPC, after which point the caller MAY retry the attempted RPC. |
| Unknown error | 2 UNKNOWN | Indicates that an unknown error was generated. | Caller MUST study the logs before retrying. |
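
The recovery column above can be read as a small caller-side retry policy. The sketch below shows one way a CSI-Addons CO plugin might encode it. The numeric values are the standard gRPC status codes from the table, while `next_delay`, `BACKOFF_CODES`, and the `base` and `cap` defaults are assumptions chosen for illustration, not values from the specification.

```python
from enum import IntEnum
from typing import Optional


class Code(IntEnum):
    # Standard gRPC status codes listed in the error table.
    UNKNOWN = 2
    INVALID_ARGUMENT = 3
    NOT_FOUND = 5
    ABORTED = 10
    UNIMPLEMENTED = 12
    UNAUTHENTICATED = 16


# Codes for which the table asks for a retry with exponential back off.
# For NOT_FOUND the caller is expected to verify the volume_id first.
BACKOFF_CODES = {Code.NOT_FOUND, Code.ABORTED}


def next_delay(code: Code, attempt: int,
               base: float = 1.0, cap: float = 60.0) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (0-based), or None.

    None means automatic retry is not appropriate: the caller has to act
    first (fix the request, fix the secrets, study the logs) or must not
    retry at all (UNIMPLEMENTED).
    """
    if code not in BACKOFF_CODES:
        return None
    return min(cap, base * (2 ** attempt))
```

With these defaults the delays for `ABORTED` grow as 1, 2, 4, 8 seconds and so on, capped at 60 seconds.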
73 changes: 73 additions & 0 deletions healer/healer.proto
@@ -0,0 +1,73 @@
// Code generated by make; DO NOT EDIT.
syntax = "proto3";
package healer;

import "github.com/container-storage-interface/spec/lib/go/csi/csi.proto";
import "google/protobuf/descriptor.proto";

option go_package = "github.com/csi-addons/spec/lib/go/healer";

// HealerNode holds the RPC method for running heal operations on the
// active (staged/published) volume.
service HealerNode {
// NodeHealer is a procedure that gets called on the CSI NodePlugin.
rpc NodeHealer (NodeHealerRequest)
returns (NodeHealerResponse) {}
}
// NodeHealerRequest contains the information needed to identify the
// location where the volume is mounted so that local filesystem or
// block-device operations to heal volume can be executed.
message NodeHealerRequest {
// The ID of the volume. This field is REQUIRED.
string volume_id = 1;

// The path on which volume is available. This field is REQUIRED.
// This field overrides the general CSI size limit.
// SP SHOULD support the maximum path length allowed by the operating
// system/filesystem, but, at a minimum, SP MUST accept a max path
// length of at least 128 bytes.
string volume_path = 2;

// The path where the volume is staged, if the plugin has the
// STAGE_UNSTAGE_VOLUME capability, otherwise empty.
// If not empty, it MUST be an absolute path in the root
// filesystem of the process serving this request.
// This field is OPTIONAL.
// This field overrides the general CSI size limit.
// SP SHOULD support the maximum path length allowed by the operating
// system/filesystem, but, at a minimum, SP MUST accept a max path
// length of at least 128 bytes.
string staging_target_path = 3;

// Volume capability describing how the CO intends to use this volume.
// This allows the SP to determine whether the volume is being used as a
// block device or a mounted file system. For example, if the volume is
// being used as a block device, the SP MAY choose to skip filesystem
// operations while healing. If volume_capability is omitted, the SP MAY
// determine the access_type from the given volume_path and perform
// healing. This field is OPTIONAL.
csi.v1.VolumeCapability volume_capability = 4;

// Secrets required by plugin to complete the healer operation.
// This field is OPTIONAL.
map<string, string> secrets = 5 [(csi.v1.csi_secret) = true];

// Volume context as returned by SP in
// CreateVolumeResponse.Volume.volume_context.
// This field is OPTIONAL and MUST match the volume_context of the
// volume identified by `volume_id`.
map<string, string> volume_context = 6;
}

// NodeHealerResponse holds the information about the result of the
// NodeHealerRequest call.
message NodeHealerResponse {
// Normal volumes are available for use and operating optimally.
// An abnormal volume does not meet these criteria.
// This field is REQUIRED.
bool abnormal = 1;

// The message describing the condition of the volume.
// This field is REQUIRED.
string message = 2;
}
23 changes: 23 additions & 0 deletions identity/README.md
@@ -239,6 +239,27 @@ message Capability {
Type type = 1;
}

// Healer contains the features of the Healer operation that the
// CSI-driver supports.
message Healer {
// Type describes a CSI Service that CSI-drivers can support.
enum Type {
// UNKNOWN indicates that the CSI-driver does not support the Healer
// operation in the current mode. The CSI-Addons CO plugin will most
// likely ignore this node for the Healer operation.
UNKNOWN = 0;

// HEALER indicates that the CSI-driver provides RPCs for a
// Healer operation.
// The presence of this capability determines whether the CSI-Addons CO
// plugin can invoke RPCs that require access to the storage system,
// similar to the CSI Controller (provisioner).
HEALER = 1;
}
// type contains the Type of CSI Service that the CSI-driver supports.
Type type = 1;
}

// Additional CSI-Addons operations will need to be added here.

oneof type {
@@ -248,6 +269,8 @@ message Capability {
ReclaimSpace reclaim_space = 2;
// NetworkFence operation capabilities
NetworkFence network_fence = 3;
// Healer operation capabilities
Healer healer = 4;

// Additional CSI-Addons operations need to be appended to this list.
}
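
On the caller side, the new capability would typically be checked before `NodeHealer` is ever invoked. The following sketch models only the `healer` branch of the `oneof type` with plain Python classes. `HealerType`, `Capability`, and `supports_healer` are illustrative stand-ins and not the generated bindings.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Optional


class HealerType(IntEnum):
    # Mirrors Capability.Healer.Type from the diff above.
    UNKNOWN = 0
    HEALER = 1


@dataclass
class Capability:
    # Stand-in for the `oneof type`: only the healer branch is modelled.
    # None means some other operation's capability is set instead.
    healer: Optional[HealerType] = None


def supports_healer(capabilities: Iterable[Capability]) -> bool:
    """True when any advertised capability carries Healer.Type.HEALER."""
    return any(c.healer == HealerType.HEALER for c in capabilities)
```

A driver that reports only `UNKNOWN`, or no Healer capability at all, would be skipped for the Healer operation, which matches the comment on the `UNKNOWN` value.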
23 changes: 23 additions & 0 deletions identity/identity.proto
@@ -139,6 +139,27 @@ message Capability {
Type type = 1;
}

// Healer contains the features of the Healer operation that the
// CSI-driver supports.
message Healer {
// Type describes a CSI Service that CSI-drivers can support.
enum Type {
// UNKNOWN indicates that the CSI-driver does not support the Healer
// operation in the current mode. The CSI-Addons CO plugin will most
// likely ignore this node for the Healer operation.
UNKNOWN = 0;

// HEALER indicates that the CSI-driver provides RPCs for a
// Healer operation.
// The presence of this capability determines whether the CSI-Addons CO
// plugin can invoke RPCs that require access to the storage system,
// similar to the CSI Controller (provisioner).
HEALER = 1;
}
// type contains the Type of CSI Service that the CSI-driver supports.
Type type = 1;
}

// Additional CSI-Addons operations will need to be added here.

oneof type {
@@ -148,6 +169,8 @@ message Capability {
ReclaimSpace reclaim_space = 2;
// NetworkFence operation capabilities
NetworkFence network_fence = 3;
// Healer operation capabilities
Healer healer = 4;

// Additional CSI-Addons operations need to be appended to this list.
}