-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new Healer operation #40
Open
pkalever
wants to merge
3
commits into
csi-addons:main
Choose a base branch
from
pkalever:healer
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,3 +25,4 @@ clean-deps: | |
$(MAKE) -C identity $@ | ||
$(MAKE) -C reclaimspace $@ | ||
$(MAKE) -C replication $@ | ||
$(MAKE) -C healer $@ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
# Copyright 2022 The csi-addons Authors. All rights reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
PROTO := healer.proto | ||
PROTO_SOURCE := README.md | ||
|
||
all: install-deps $(PROTO) build | ||
|
||
include ../release-tools/build.make |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,130 @@ | ||
# CSI-Addons Operation: Healer | ||
|
||
## Terminology | ||
|
||
| Term | Definition | | ||
| -------- | ------------------------------------------------------------------------------------- | | ||
| VolumeID | The identifier of the volume generated by the plugin. | | ||
| CO | Container Orchestration system that communicates with plugins using CSI service RPCs. | | ||
| SP | Storage Provider, the vendor of a CSI plugin implementation. | | ||
| RPC | [Remote Procedure Call](https://en.wikipedia.org/wiki/Remote_procedure_call). | | ||
|
||
## Objective | ||
|
||
Define a standard that will enable storage providers (SP) to | ||
perform node level volume health check and healing operations. | ||
|
||
### Goals in MVP | ||
|
||
The new extension will define a procedure that | ||
|
||
* can be called for existing volumes | ||
* interacts with the Node-Plugin to check the health condition of the volume | ||
* makes it possible for the SP to heal the volumes if they are in abnormal | ||
condition | ||
|
||
### Non-Goals in MVP | ||
|
||
* Implementation of healing logic is OPTIONAL and completely SP specific | ||
|
||
## Solution Overview | ||
|
||
This specification defines an interface along with the minimum operational and | ||
packaging recommendations for a storage provider (SP) to implement a | ||
health check and heal operations for volumes. The interface declares the | ||
RPCs that a plugin MUST expose. | ||
|
||
## RPC Interface | ||
|
||
* **Node Service**: The Node plugin MUST implement this RPC. | ||
|
||
```protobuf | ||
syntax = "proto3"; | ||
package healer; | ||
|
||
import "github.com/container-storage-interface/spec/lib/go/csi/csi.proto"; | ||
import "google/protobuf/descriptor.proto"; | ||
|
||
option go_package = "github.com/csi-addons/spec/lib/go/healer"; | ||
|
||
// HealerNode holds the RPC method for running heal operations on the | ||
// active (staged/published) volume. | ||
service HealerNode { | ||
// NodeHealer is a procedure that gets called on the CSI NodePlugin. | ||
rpc NodeHealer (NodeHealerRequest) | ||
returns (NodeHealerResponse) {} | ||
} | ||
``` | ||
|
||
### NodeHealer | ||
|
||
```protobuf | ||
// NodeHealerRequest contains the information needed to identify the | ||
// location where the volume is mounted so that local filesystem or | ||
// block-device operations to heal volume can be executed. | ||
message NodeHealerRequest { | ||
// The ID of the volume. This field is REQUIRED. | ||
string volume_id = 1; | ||
|
||
// The path on which volume is available. This field is REQUIRED. | ||
// This field overrides the general CSI size limit. | ||
// SP SHOULD support the maximum path length allowed by the operating | ||
// system/filesystem, but, at a minimum, SP MUST accept a max path | ||
// length of at least 128 bytes. | ||
string volume_path = 2; | ||
|
||
// The path where the volume is staged, if the plugin has the | ||
// STAGE_UNSTAGE_VOLUME capability, otherwise empty. | ||
// If not empty, it MUST be an absolute path in the root | ||
// filesystem of the process serving this request. | ||
// This field is OPTIONAL. | ||
// This field overrides the general CSI size limit. | ||
// SP SHOULD support the maximum path length allowed by the operating | ||
// system/filesystem, but, at a minimum, SP MUST accept a max path | ||
// length of at least 128 bytes. | ||
string staging_target_path = 3; | ||
|
||
// Volume capability describing how the CO intends to use this volume. | ||
// This allows SP to determine if volume is being used as a block | ||
// device or mounted file system. For example - if volume is being | ||
// used as a block device the SP MAY choose to skip calling filesystem | ||
// operations to healer. If volume_capability is omitted the SP MAY | ||
// determine access_type from given volume_path for the volume and | ||
// perform healing. This is an OPTIONAL field. | ||
csi.v1.VolumeCapability volume_capability = 4; | ||
|
||
// Secrets required by plugin to complete the healer operation. | ||
// This field is OPTIONAL. | ||
map<string, string> secrets = 5 [(csi.v1.csi_secret) = true]; | ||
|
||
// Volume context as returned by SP in | ||
// CreateVolumeResponse.Volume.volume_context. | ||
// This field is OPTIONAL and MUST match the volume_context of the | ||
// volume identified by `volume_id`. | ||
map<string, string> volume_context = 6; | ||
} | ||
|
||
// NodeHealerResponse holds the information about the result of the | ||
// NodeHealerRequest call. | ||
message NodeHealerResponse { | ||
// Normal volumes are available for use and operating optimally. | ||
// An abnormal volume does not meet these criteria. | ||
// This field is REQUIRED. | ||
bool abnormal = 1; | ||
|
||
// The message describing the condition of the volume. | ||
// This field is REQUIRED. | ||
string message = 2; | ||
} | ||
``` | ||
|
||
#### NodeHealer Errors | ||
|
||
| Condition | gRPC Code | Description | Recovery Behavior | | ||
| ---------------------------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||
| Missing required field | 3 INVALID_ARGUMENT | Indicates that a required field is missing from the request. | Caller MUST fix the request by adding the missing required field before retrying. | | ||
| Volume does not exist | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. | | ||
| Call not implemented | 12 UNIMPLEMENTED | The invoked RPC is not implemented by the CSI-driver or disabled in the driver's current mode of operation. | Caller MUST NOT retry. | | ||
| Operation pending for volume | 10 ABORTED | Indicates that there is already an operation pending for the specified `volume_id`. In general the CSI-Addons CO plugin is responsible for ensuring that there is no more than one call "in-flight" per `volume_id` at a given time. However, in some circumstances, the CSI-Addons CO plugin MAY lose state (for example when the it crashes and restarts), and MAY issue multiple calls simultaneously for the same `volume_id`. The CSI-driver, SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified `volume_id`, and then retry with exponential back off. | | ||
| Not authenticated | 16 UNAUTHENTICATED | The invoked RPC does not carry secrets that are valid for authentication. | Caller SHALL either fix the secrets provided in the RPC, or otherwise regalvanize said secrets such that they will pass authentication by the Plugin for the attempted RPC, after which point the caller MAY retry the attempted RPC. | | ||
| Error is Unknown | 2 UNKNOWN | Indicates that a unknown error is generated | Caller MUST study the logs before retrying | |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
// Code generated by make; DO NOT EDIT. | ||
syntax = "proto3"; | ||
package healer; | ||
|
||
import "github.com/container-storage-interface/spec/lib/go/csi/csi.proto"; | ||
import "google/protobuf/descriptor.proto"; | ||
|
||
option go_package = "github.com/csi-addons/spec/lib/go/healer"; | ||
|
||
// HealerNode holds the RPC method for running heal operations on the | ||
// active (staged/published) volume. | ||
service HealerNode { | ||
// NodeHealer is a procedure that gets called on the CSI NodePlugin. | ||
rpc NodeHealer (NodeHealerRequest) | ||
returns (NodeHealerResponse) {} | ||
} | ||
// NodeHealerRequest contains the information needed to identify the | ||
// location where the volume is mounted so that local filesystem or | ||
// block-device operations to heal volume can be executed. | ||
message NodeHealerRequest { | ||
// The ID of the volume. This field is REQUIRED. | ||
string volume_id = 1; | ||
|
||
// The path on which volume is available. This field is REQUIRED. | ||
// This field overrides the general CSI size limit. | ||
// SP SHOULD support the maximum path length allowed by the operating | ||
// system/filesystem, but, at a minimum, SP MUST accept a max path | ||
// length of at least 128 bytes. | ||
string volume_path = 2; | ||
|
||
// The path where the volume is staged, if the plugin has the | ||
// STAGE_UNSTAGE_VOLUME capability, otherwise empty. | ||
// If not empty, it MUST be an absolute path in the root | ||
// filesystem of the process serving this request. | ||
// This field is OPTIONAL. | ||
// This field overrides the general CSI size limit. | ||
// SP SHOULD support the maximum path length allowed by the operating | ||
// system/filesystem, but, at a minimum, SP MUST accept a max path | ||
// length of at least 128 bytes. | ||
string staging_target_path = 3; | ||
|
||
// Volume capability describing how the CO intends to use this volume. | ||
// This allows SP to determine if volume is being used as a block | ||
// device or mounted file system. For example - if volume is being | ||
// used as a block device the SP MAY choose to skip calling filesystem | ||
// operations to healer. If volume_capability is omitted the SP MAY | ||
// determine access_type from given volume_path for the volume and | ||
// perform healing. This is an OPTIONAL field. | ||
csi.v1.VolumeCapability volume_capability = 4; | ||
|
||
// Secrets required by plugin to complete the healer operation. | ||
// This field is OPTIONAL. | ||
map<string, string> secrets = 5 [(csi.v1.csi_secret) = true]; | ||
|
||
// Volume context as returned by SP in | ||
// CreateVolumeResponse.Volume.volume_context. | ||
// This field is OPTIONAL and MUST match the volume_context of the | ||
// volume identified by `volume_id`. | ||
map<string, string> volume_context = 6; | ||
} | ||
|
||
// NodeHealerResponse holds the information about the result of the | ||
// NodeHealerRequest call. | ||
message NodeHealerResponse { | ||
// Normal volumes are available for use and operating optimally. | ||
// An abnormal volume does not meet these criteria. | ||
// This field is REQUIRED. | ||
bool abnormal = 1; | ||
|
||
// The message describing the condition of the volume. | ||
// This field is REQUIRED. | ||
string message = 2; | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe call this
HealVolume
?I would expect a way to check if a volume needs healing. The current
NodeHealer
call seems to do that, but also is expected to do the healing?It would probably be useful to have a check, and only perform healing if the check returns that it is needed. This then can provide better feedback and tracking of actions that were performed. It might even be useful to fallback to some other recovery/healing mechanism in case volume-healing failed (extension for
HealNode
which might do a reboot or inform medik8s or similar?)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nixpanic
NodeHealer() is called on CSI NodePlugin, so it's up to SP to implement/perform healing or not, right? if so why do we need a flag/check?
I'm probably missing something? Is this flag set as part of the request or response?
Could you please make it more clear?
Also cannot we detect healing failed from the message string which is part of the response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider two procedures:
GetHealth()
returning the state of the volume, possibly with a selection of recovery options (VolumeHealing
orNodeHealing
required)HealVolume()
doing the actual healing, if it can heal the volume at allHealNode
function (with Kubernetes medik8s might be an option)This makes it easier to follow the actions, and improve the reporting to users (Kubernetes events maybe?).
CRDs are Kubernetes specific, so such definitions should go in the kubernetes-csi-addons repository instead. There probably should be a way for the CO to configure the checking interval, maybe even per volume. When using Kubernetes, an annotation on the PVC might be suitable for that (depending on the number of options).