Updated the replication troubleshooting and recommended mode for operator upgrade #1383

Merged 2 commits on Nov 26, 2024
2 changes: 2 additions & 0 deletions content/docs/deployment/csmoperator/_index.md
@@ -350,6 +350,8 @@ The `Update approval` (**`InstallPlan`** in OLM terms) strategy plays a role during operator upgrade.

>NOTE: The recommended version of OLM for Upstream Kubernetes is **`v0.25.0`**.

>NOTE: The recommended Update Approval is **`Manual`** to prevent the installation of non-qualified versions of the operator.
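As a minimal sketch, Manual approval is set on the operator's OLM `Subscription`. The metadata, channel, and catalog source values below are placeholders, not the exact values for every cluster:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: dell-csm-operator        # placeholder name
  namespace: test-csm-operator   # placeholder namespace
spec:
  channel: stable                # placeholder channel
  name: dell-csm-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual    # upgrades wait until an admin approves the InstallPlan
```

With `Manual` approval, OLM still creates the `InstallPlan` for a pending upgrade but does not execute it; an administrator approves it by setting `spec.approved: true` on the `InstallPlan` resource (for example via `kubectl patch` or the console) after verifying the target version is qualified.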

#### Using Installation Script

1. Clone and checkout the required csm-operator version using
15 changes: 8 additions & 7 deletions content/docs/replication/troubleshooting.md
@@ -7,15 +7,16 @@ description: >
---

| Symptoms | Prevention, Resolution or Workaround |
| --- | --- |
| Persistent volumes don't get created on the target cluster. | Run `kubectl describe` on one of the pods of replication controller and see if event says `Config update won't be applied because of invalid configmap/secrets. Please fix the invalid configuration`. If it does, then ensure you correctly populated replication ConfigMap. You can check the current status by running `kubectl describe cm -n dell-replication-controller dell-replication-controller-config`. If ConfigMap is empty, please edit it yourself or use `repctl cluster inject` command. |
| Persistent volumes don't get created on the target cluster. You don't see any events on the replication-controller pod. | Check logs of replication controller by running `kubectl logs -n dell-replication-controller dell-replication-controller-manager-<generated-symbols>`. If you see `clusterId - <clusterID> not found` errors then be sure to check if you specified the same clusterIDs in both your ConfigMap and replication enabled StorageClass. |
| You apply a replication action by manually editing the ReplicationGroup resource field `spec.action` and don't see any change of ReplicationGroup state after a while. | Check events of the replication-controller pod; if they say `Cannot proceed with action <your-action>. [unsupported action]`, check the spelling of your action and consult the [Replication Actions](../replication-actions) page. Alternatively, you can use `repctl` instead of manually editing ReplicationGroup resources. |
| You execute failover action using `repctl failover` command and see `failover: error executing failover to source site`. | This means you tried to failover to a cluster that is already marked source. If you still want to execute failover for RG, just choose another cluster. |
| You've created PersistentVolumeClaim using replication enabled StorageClass but don't see any RGs created in the source cluster. | Check annotations of created PersistentVolumeClaim. If it doesn't have `annotations` that start with `replication.storage.dell.com` then please wait for a couple of minutes for them to be added and RG to be created. |
| When installing common replication controller using helm you see an error that states `invalid ownership metadata` and `missing key "app.kubernetes.io/managed-by": must be set to "Helm"` | This means that you haven't fully deleted the previous release, you can fix it by either deleting entire manifest by using `kubectl delete -f deploy/controller.yaml` or manually deleting conflicting resources (ClusterRoles, ClusterRoleBinding, etc.) |
| PV and/or PVCs are not being created at the source/target cluster. If you check the controller's logs you can see `no such host` errors| Make sure cluster-1's API is pingable from cluster-2 and vice versa. If one of your clusters is OpenShift located in a private network and needs records in /etc/hosts, `exec` into controller pod and modify `/etc/hosts` manually. |
| After upgrading to Replication v1.4.0, if `kubectl get rg` returns an error `Unable to list "replication.storage.dell.com/v1alpha1, Resource=dellcsireplicationgroups"`| This means `kubectl` still doesn't recognize the new version of CRD `dellcsireplicationgroups.replication.storage.dell.com` after upgrade. Running the command `kubectl get DellCSIReplicationGroup.v1.replication.storage.dell.com/<rg-id> -o yaml` will resolve the issue. |
| To add or delete PVs in an existing SYNC Replication Group in PowerStore, you may encounter the error `The operation is restricted as sync replication session for resource <Replication Group Name> is not paused` | To resolve this, pause the replication group (RG), add or delete the PV, and then resume the RG. The commands for the pause and resume operations are: `repctl --rg <rg-id> exec -a suspend` `repctl --rg <rg-id> exec -a resume` |
| To delete the last volume from an existing SYNC Replication Group in PowerStore, you may encounter the error `failed to remove volume from volume group: The operation cannot be completed on metro or replicated volume group because volume group will become empty after last members are removed` | To resolve this, unassign the protection policy from the corresponding volume group in the PowerStore Manager UI. After that, you can successfully delete the last volume in that SYNC Replication Group. |
| When running CSI-PowerMax with Replication in a multi-cluster configuration, the driver on the target cluster fails and the following error is seen in logs: `error="CSI reverseproxy service host or port not found, CSI reverseproxy not installed properly"` | The reverseproxy service needs to be created manually on the target cluster. Follow [the instructions here](../../deployment/csmoperator/modules/replication#configuration-steps) to create it. |
| When getting the following error for CSI-PowerScale with Replication with encryption enabled: `SyncIQ policy failed to establish an encrypted connection`, the Replication Groups and PVCs won't be created at the target cluster. | The `encryption required` flag in the SyncIQ settings is set to "yes" by default in OneFS 9.0+. To rectify this error, follow this article: <https://www.dell.com/support/kbdoc/en-us/000215174/isilon-synciq-9-0-all-policies-fail-when-source-or-target-cluster-is-on-onefs-9-0-with-no-node-on-source-cluster-was-able-to-connect-to-target-cluster> |
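The PowerStore pause/resume workaround from the table above can be sketched end to end; the RG id below is illustrative, and the surrounding commands are taken from the table:

```shell
# Illustrative RG id; list real ones with `kubectl get rg`
RG_ID="rg-example-id"

# Suspend the sync replication session before changing group membership
repctl --rg "$RG_ID" exec -a suspend

# ...add or delete the PVs belonging to the Replication Group here...

# Resume replication once membership changes are complete
repctl --rg "$RG_ID" exec -a resume
```

Attempting the membership change without the suspend step is what produces the `sync replication session ... is not paused` error described above.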