Fixes issue when multiple VRGs conflict with the PVCs being protected #1535
base: main
Conversation
Signed-off-by: Annaraya Narasagond <[email protected]>
@asn1809 Please do not refer to downstream tickets/trackers/links; access may be restricted or not generally available. Also, using the "Fixes" keyword in the commit message would cause GitHub to assume it is fixing an issue in this repository or organization, so avoiding that keyword at the bottom helps. The commit message also needs some massaging.
Force-pushed from 6e0a946 to ef385a5
Comments after discussion with @raghavendra-talur need to be addressed.
}

// NewNamespaceLock returns new NamespaceLock
func NewNamespaceLock() *NamespaceLock {
@raghavendra-talur : Pass ns string and use for struct init.
Just gave a thought to what we discussed. In volumereplicationgroup_controller, SetupWithManager is where the locks are initialized (r.locks = rmnutil.NewNamespaceLock()); ns cannot be passed here, as this call happens well before Reconcile is invoked. So instead of a single ns string, a set of (locked) namespaces would be better.
In that case, it could be
type NamespaceLock struct {
mux map[string]sync.Mutex
}
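For reference, a minimal sketch of that set-of-namespaces variant (a non-blocking try-acquire guarded by one mutex). The method names follow the PR; the internals here are illustrative only, not the final implementation:

```go
package util

import "sync"

// NamespaceLock tracks namespaces in which VRG processing is in progress.
type NamespaceLock struct {
	mux        sync.Mutex
	namespaces map[string]struct{} // set of currently locked namespaces
}

// NewNamespaceLock returns new NamespaceLock
func NewNamespaceLock() *NamespaceLock {
	return &NamespaceLock{namespaces: map[string]struct{}{}}
}

// TryToAcquireLock marks the namespace as being processed and returns true;
// if processing has already begun in the namespace, it returns false.
func (nl *NamespaceLock) TryToAcquireLock(namespace string) bool {
	nl.mux.Lock()
	defer nl.mux.Unlock()

	if _, locked := nl.namespaces[namespace]; locked {
		return false
	}

	nl.namespaces[namespace] = struct{}{}

	return true
}

// Release removes the lock on the namespace.
func (nl *NamespaceLock) Release(namespace string) {
	nl.mux.Lock()
	defer nl.mux.Unlock()
	delete(nl.namespaces, namespace)
}
```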
// TryToAcquireLock tries to acquire the lock for processing VRG in a namespace having
// multiple VRGs and returns true if successful.
// If processing has already begun in the namespace, returns false.
func (nl *NamespaceLock) TryToAcquireLock(namespace string) bool {
@raghavendra-talur : Namespace arg is not required, as it would already be acquired.
internal/controller/util/nslock.go (outdated)
// If processing has already begun in the namespace, returns false.
func (nl *NamespaceLock) TryToAcquireLock(namespace string) bool {
	nl.mux.Lock()
	defer nl.mux.Unlock()
@raghavendra-talur : This should be removed.
internal/controller/util/nslock.go (outdated)
	nl.mux.Lock()
	defer nl.mux.Unlock()

	if nl.namespace == namespace {
@raghavendra-talur : Handle this according to the struct.
internal/controller/util/nslock.go (outdated)
// Release removes lock on the namespace
func (nl *NamespaceLock) Release(namespace string) {
	nl.mux.Lock()
@raghavendra-talur : Remove lock line.
var _ = Describe("Testing Locks", func() {
	nsLock := util.NewNamespaceLock()
	Expect(nsLock.TryToAcquireLock("test")).To(BeTrue())
@raghavendra-talur : May have to use go func or anonymous functions.
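A rough sketch of how the test could exercise the lock from a second goroutine, as suggested; the ginkgo/gomega and ramen import paths are assumptions, and the test body is illustrative only:

```go
package util_test

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"

	"github.com/ramendr/ramen/internal/controller/util"
)

var _ = Describe("Testing Locks", func() {
	It("denies a second acquisition until the namespace is released", func() {
		nsLock := util.NewNamespaceLock()
		Expect(nsLock.TryToAcquireLock("test")).To(BeTrue())

		acquired := make(chan bool)
		go func() {
			// A concurrent attempt on the same namespace should fail
			// while the lock is held.
			acquired <- nsLock.TryToAcquireLock("test")
		}()
		Expect(<-acquired).To(BeFalse())

		nsLock.Release("test")
		Expect(nsLock.TryToAcquireLock("test")).To(BeTrue())
	})
})
```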
@@ -115,6 +116,8 @@ func (r *VolumeReplicationGroupReconciler) SetupWithManager(
	r.Log.Info("Kube object protection disabled; don't watch kube objects requests")
}

r.locks = rmnutil.NewNamespaceLock()
Pass the vrg ns as arg.
// if the number of vrgs in the ns is more than 1, lock is needed.
if len(vrgList.Items) > 1 {
	if isLockAcquired := v.reconciler.locks.TryToAcquireLock(ns); !isLockAcquired {
Since TryToAcquireLock is a blocking call, negating its response will not be helpful. Think of an alternative.
		// Acquiring lock failed, VRG reconcile should be requeued
		return err
	}
	defer v.reconciler.locks.Release(ns)
The release of the locks should be done once the PVC is verified for the ownership label.
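For context, a runnable sketch of the acquire / requeue / release-after-verification flow being discussed. The reconciler type here is a stand-in (only NamespaceLock matches the PR), and the import path is an assumption:

```go
package main

import (
	"fmt"

	"github.com/ramendr/ramen/internal/controller/util" // assumed path to the NamespaceLock package
)

// fakeReconciler stands in for VolumeReplicationGroupReconciler, holding only
// the per-namespace locks relevant to this sketch.
type fakeReconciler struct {
	locks *util.NamespaceLock
}

// reconcileVRG sketches how a VRG reconcile could serialize processing when
// more than one VRG exists in the namespace.
func (r *fakeReconciler) reconcileVRG(ns string, vrgCountInNS int) error {
	needLock := vrgCountInNS > 1

	if needLock && !r.locks.TryToAcquireLock(ns) {
		// Another VRG in this namespace is being processed; requeue.
		return fmt.Errorf("VRG processing in progress in namespace %s, requeue", ns)
	}

	// ... verify that the protected PVCs carry this VRG's ownership label ...

	// Per the review, release once the PVCs are verified for the ownership
	// label rather than deferring to the end of Reconcile.
	if needLock {
		r.locks.Release(ns)
	}

	// ... continue with the rest of VRG processing ...
	return nil
}

func main() {
	r := &fakeReconciler{locks: util.NewNamespaceLock()}
	fmt.Println(r.reconcileVRG("app-ns", 2)) // <nil> on first acquisition
}
```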
@@ -437,6 +440,12 @@ func (r *VolumeReplicationGroupReconciler) Reconcile(ctx context.Context, req ct
	"Please install velero/oadp and restart the operator", v.instance.Namespace, v.instance.Name)
}

err = v.vrgParallelProcessingCheck(adminNamespaceVRG)

if err != nil {
The err will never be non-nil here.
Previously we used the VolumeReplication name and namespace for updating PVC conditions, because we had a one-to-one relationship between VolumeReplication and PVC. Now that we are going to use consistency groups this handling will not be right anymore, because several PVCs can be related to one VolumeReplication. So we need to use the PVC name and namespace for updating PVC conditions. Signed-off-by: Elena Gershkovich <[email protected]>
Signed-off-by: rakeshgm <[email protected]>
Using one item per line to make future changes easier. Signed-off-by: Nir Soffer <[email protected]>
Current python on Fedora 39 and macOS is 3.12, and the 3.13 development version has been available on github actions for a while. Add the versions so we can ensure compatibility with the current and next python version. We keep python 3.9 for compatibility with systems using old python like RHEL 9. At some point we will want to drop the old version and consume newer features in current python, but we don't have anything requiring this yet, so let's try to be compatible. Signed-off-by: Nir Soffer <[email protected]>
For ramen, the capabilities for hub and cluster bundles currently state "Basic Install"; updating it to "Seamless Upgrades". Fixes: [Bug-2303823](https://bugzilla.redhat.com/show_bug.cgi?id=2303823) Signed-off-by: Abhijeet Shakya <[email protected]>
Signed-off-by: Abhijeet Shakya <[email protected]>
The changeset includes a new DeleteNamespaceManifestWork() func, which first checks if the mw.Spec has the delete option or if it already has a DeletionTimestamp. Accordingly, it proceeds to delete the namespace manifestwork. It also updates the namespace manifestwork with the deleteOption and propagationPolicy of type orphan whenever the createOrUpdateNamespaceManifest() func is called. Fixes: [Bug 2059669](https://bugzilla.redhat.com/show_bug.cgi?id=2059669) Signed-off-by: Abhijeet Shakya <[email protected]>
Signed-off-by: Abhijeet Shakya <[email protected]>
Signed-off-by: Raghavendra Talur <[email protected]>
Signed-off-by: Elena Gershkovich <[email protected]>
The newer controller-runtime version 0.19.0 adds a check for unique controller names. In our tests we are registering the same controller (drcluster controller) several times - in suite_test, in drcluster_mmode_test, and in drcluster_drcconfig_tests. As a temporary solution I added a flag for skipping the unique controller name validation. Another solution could be adding a name as a parameter to the SetupWithManager function. Signed-off-by: Elena Gershkovich <[email protected]>
The issue where the "velero.io/exclude-from-backup" label was not applied originates from our deployment workflow. Initially, we deploy the workload, followed by enabling DR. During the first deployment, the Placement Operator creates the "PlacementDecision" with default labels. However, when "Ramen" tries to add the "velero.io/exclude-from-backup" label during DR setup, it skips because the "PlacementDecision" already exists. Consequently, "Velero" backs up the "PlacementDecision". And during hub recovery, it is restored without its status, leading to the unintended deletion of the workload. This situation only occurs when the current state wasn't updated before hub recovery was applied. The fix in this PR does not address the scenario where the workload is deployed, a hub backup is taken, DR is enabled, and then the hub is recovered before another backup is created. Fixes bug: 2308801 Signed-off-by: Benamar Mekhissi <[email protected]>
Fixing this error seen in the CI:

drenv.commands.Error: Command failed:
command: ('addons/argocd/test', 'rdr-hub', 'rdr-dr1', 'rdr-dr2')
exitcode: 1
error:
Traceback (most recent call last):
...
drenv.commands.Error: Command failed:
command: ('kubectl', 'delete', '--context', 'rdr-dr2', 'namespace', 'argocd-test', '--wait=false')
exitcode: 1
error:
Error from server (NotFound): namespaces "argocd-test" not found

Signed-off-by: Nir Soffer <[email protected]>
This commit allows PVs to be resized by enabling the 'AllowVolumeExpansion' flag in the StorageClass. When this flag is set to true, users can dynamically adjust the size of PVCs by specifying the desired capacity in the PVC specification. Signed-off-by: rakeshgm <[email protected]>
Signed-off-by: rakeshgm <[email protected]>
We updated the controller SDK to the 0.19 version, moving envtest to the same version for required testing. This does not fix any test or such, it is just a hygiene change. Signed-off-by: Shyamsundar Ranganathan <[email protected]>
Developers can enable it by copying the file from the hack directory instead of copying and pasting the code from docs/devel-quick-start.md and making the hook executable. Signed-off-by: Nir Soffer <[email protected]>
k8s.io/component-base should have been a direct dependency. Fixes the go mod error. Signed-off-by: Raghavendra Talur <[email protected]>
- Integrated the latest CRD from external-snapshotter.
- Updated dependencies and relevant configuration files.
- Changed VolumeSnapshotRefList in VolumeGroupSnapshotStatus to PVCVolumeSnapshotRefList, which allows mapping a VolumeSnapshot to the application PVC.

Signed-off-by: Benamar Mekhissi <[email protected]>
Signed-off-by: Benamar Mekhissi <[email protected]>
This is mainly to ensure that the VRG on the primary and secondary are in sync in regards to labels and annotations Signed-off-by: Benamar Mekhissi <[email protected]>
This was forgotten when we replaced the minikube volumesnapshot addons. Signed-off-by: Nir Soffer <[email protected]>
The comment is still unclear, but at least easier to read. Signed-off-by: Nir Soffer <[email protected]>
- Replace "VR under deletion" with "deleted VR", matching the terminology used in the code. - Replace "detected as missing or deleted" with "is missing" when we know the VR does not exist. Signed-off-by: Nir Soffer <[email protected]>
Moved from validateVRStatus() to make room for checking the VR Validated status. This also makes the function easier to understand, keeping the same level of abstraction and getting rid of the uninteresting details. Signed-off-by: Nir Soffer <[email protected]>
Rename msg to errorMsg and document that this is an error message used if we could not get the condition value because it is missing, stale, or unknown. Signed-off-by: Nir Soffer <[email protected]>
Log after important changes to the system in the delete VR flow to make it easier to understand what the system is doing, and how ramen changed the system.

New logs:
- delete the VR resource
- remove annotations from PV

Improve logs:
- remove annotations, labels, and finalizers from PVC

Signed-off-by: Nir Soffer <[email protected]>
When deleting a primary VRG, we wait until the VR Completed condition is met. However, if a VR precondition failed, for example using a drpolicy without flattening enabled when the PVC needs flattening, the VR will never complete and the vrg and drpc deletion will never complete. Since csi-addons 0.10.0 we have a new Validated VR condition, set to true if preconditions are met, and false if not. The VR can be deleted safely in this state, since mirroring was not enabled. This change modifies deleted VRG processing to check the new VR Validated status. If the condition exists and the condition status is false, validateVRStatus() returns true, signaling that the VR is in the desired state, and ramen completes the delete flow. If the VR does not report the Validated condition (e.g. old csi-addons version) or the condition status is true (mirroring in progress), we continue in the normal flow. The VR will be deleted only when the Completed condition status is true. Tested with a discovered deployment and a vm using a pvc created from a volume snapshot. Signed-off-by: Nir Soffer <[email protected]>
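A small sketch of the decision that commit describes; the function and package names are illustrative, not ramen's exact code, and only the "Validated" condition semantics come from the text above:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// validatedConditionFailed returns true when the VR reports a Validated
// condition with status False, meaning preconditions failed and mirroring
// never started, so the VR can be deleted without waiting for Completed.
func validatedConditionFailed(conditions []metav1.Condition) bool {
	validated := meta.FindStatusCondition(conditions, "Validated")
	if validated == nil {
		// Old csi-addons: no Validated condition, fall back to waiting
		// for the Completed condition as before.
		return false
	}

	return validated.Status == metav1.ConditionFalse
}

func main() {
	conds := []metav1.Condition{{Type: "Validated", Status: metav1.ConditionFalse}}
	fmt.Println(validatedConditionFailed(conds)) // true: safe to delete the VR
}
```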
Signed-off-by: Abhijeet Shakya <[email protected]>
To refresh the cache we need to check out the ramen source and run drenv cache with the environment files. Using a workflow for this makes the job easy to implement and manage without accessing the runner directly. The job can also be run manually from the github UI. This is likely to work for people with write access. Signed-off-by: Nir Soffer <[email protected]>
Signed-off-by: Elena Gershkovich <[email protected]>
Resolved an issue causing sporadic failures in the Ramen/VolSync related unit test due to delayed creation of the PVC resource. Signed-off-by: Benamar Mekhissi <[email protected]>
Signed-off-by: Nir Soffer <[email protected]>
This is useful when you want to avoid drenv start getting stuck, and the environment does not provide a way to time out. Example run:

% drenv start --timeout 60 envs/vm.yaml
2024-09-21 04:27:43,555 INFO [vm] Starting environment
2024-09-21 04:27:43,581 INFO [cluster] Starting lima cluster
2024-09-21 04:28:43,785 ERROR [cluster] did not receive an event with the "running" status
2024-09-21 04:28:43,790 ERROR Command failed
Traceback (most recent call last):
...
drenv.commands.Error: Command failed:
command: ('limactl', '--log-format=json', 'start', '--timeout=60s', 'cluster')
exitcode: 1
error:

For lima provider, we pass the timeout to limactl. For minikube we use commands.watch() timeout. External provider does not use the timeout since we don't start the cluster.

Signed-off-by: Nir Soffer <[email protected]>
On github actions we cannot start a vm with more than one cpu, and our test cluster is very small so it should work with 1g of ram. This makes the test cluster consume less resources if a developer leaves it running for a long time.

Using 1 cpu conflicts with kubeadm preflight checks:

[init] Using Kubernetes version: v1.31.0
...
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR NumCPU]: the number of available CPUs 1 is less than the required 2
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`

Since we know what we are doing, we suppress this preflight error.

Running top shows that the cluster consumes only 9% cpu and 33.5% of the available memory.

top - 04:51:02 up 10 min, 1 user, load average: 0.60, 0.29, 0.16
Tasks: 123 total, 1 running, 122 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.2/3.1 9[||||||]
MiB Mem : 33.5/1959.9 [|||||||||||||||||||||]
MiB Swap: 0.0/0.0 []

PID   USER   PR  NI  VIRT     RES     SHR    S  %CPU  %MEM  TIME+    COMMAND
4124  root   20  0   1449480  244352  68224  S  2.3   12.2  0:18.47  kube-apiserver
4120  root   20  0   1309260  102456  60800  S  0.7   5.1   0:06.17  kube-controller
4273  root   20  0   1902692  92040   59904  S  2.0   4.6   0:12.47  kubelet
4159  root   20  0   1288880  60156   44288  S  0.3   3.0   0:02.20  kube-scheduler
3718  root   20  0   1267660  57880   31020  S  0.3   2.9   0:09.28  containerd
4453  root   20  0   1289120  53540   42880  S  0.0   2.7   0:00.15  kube-proxy
5100  65532  20  0   1284608  49656   37504  S  0.0   2.5   0:01.14  coredns
5000  65532  20  0   1284608  49528   37376  S  0.0   2.5   0:00.86  coredns
4166  root   20  0   11.2g    47360   21504  S  1.3   2.4   0:08.69  etcd

We could make the cluster even smaller, but the kubeadm preflight check requires 1700 MiB, and we don't have memory issues on github or on developers' machines.

Signed-off-by: Nir Soffer <[email protected]>
We need rosetta only for a complex setup where a component is not providing arm64 images. For testing drenv we have an empty cluster with busybox, and our busybox image is multi-arch. Disabling rosetta may speed up the cluster, and this may be significant on github actions since the runners are about 6.5 times slower compared to an M1 mac. Signed-off-by: Nir Soffer <[email protected]>
This is useful for running the tests on a github runner, and also supports one option for macOS. Signed-off-by: Nir Soffer <[email protected]>
Same optimization used in minikube. Minikube uses 2 replicas only when using an HA configuration. Signed-off-by: Nir Soffer <[email protected]>
It takes about 30 seconds until coredns is ready, but we don't depend on it in the current code. By removing this wait we can start deploying 30 seconds earlier. This reduces the time for starting a cluster from 155 seconds to 125 seconds, and the regional-dr environment from 450 to 420 seconds.

Example run with the vm environment:

% drenv start envs/vm.yaml
2024-09-22 18:39:55,726 INFO [vm] Starting environment
2024-09-22 18:39:55,743 INFO [cluster] Starting lima cluster
2024-09-22 18:41:57,818 INFO [cluster] Cluster started in 122.07 seconds
2024-09-22 18:41:57,819 INFO [cluster/0] Running addons/example/start
2024-09-22 18:42:18,966 INFO [cluster/0] addons/example/start completed in 21.15 seconds
2024-09-22 18:42:18,966 INFO [cluster/0] Running addons/example/test
2024-09-22 18:42:19,120 INFO [cluster/0] addons/example/test completed in 0.15 seconds
2024-09-22 18:42:19,121 INFO [vm] Environment started in 143.40 seconds

% drenv stop envs/vm.yaml
2024-09-22 18:42:44,244 INFO [vm] Stopping environment
2024-09-22 18:42:44,317 INFO [cluster] Stopping lima cluster
2024-09-22 18:42:44,578 WARNING [cluster] [hostagent] dhcp: unhandled message type: RELEASE
2024-09-22 18:42:49,441 INFO [cluster] Cluster stopped in 5.13 seconds
2024-09-22 18:42:49,441 INFO [vm] Environment stopped in 5.20 seconds

% drenv start envs/vm.yaml
2024-09-22 18:42:53,132 INFO [vm] Starting environment
2024-09-22 18:42:53,156 INFO [cluster] Starting lima cluster
2024-09-22 18:43:34,436 INFO [cluster] Cluster started in 41.28 seconds
2024-09-22 18:43:34,437 INFO [cluster] Looking up failed deployments
2024-09-22 18:43:34,842 INFO [cluster/0] Running addons/example/start
2024-09-22 18:43:35,208 INFO [cluster/0] addons/example/start completed in 0.37 seconds
2024-09-22 18:43:35,208 INFO [cluster/0] Running addons/example/test
2024-09-22 18:43:35,371 INFO [cluster/0] addons/example/test completed in 0.16 seconds
2024-09-22 18:43:35,372 INFO [vm] Environment started in 42.24 seconds

Signed-off-by: Nir Soffer <[email protected]>
- Use `/readyz` endpoint for waiting until kubernetes cluster is ready instead of `kubectl version`. This matches the way we wait for the cluster in drenv.
- When waiting for coredns deployment, use `kubectl rollout status`, matching other code in drenv and addons.
- Nicer probe description, visible in drenv when using --verbose.

Signed-off-by: Nir Soffer <[email protected]>
We can ignore all ports for all protocols for all addresses using a simpler rule. With this change we see this log in --verbose mode:

2024-09-23 01:12:18,130 DEBUG [cluster] [hostagent] TCP (except for SSH) and UDP port forwarding is disabled

And no "Not forwarding port ..." message. Previously we had lots of these messages[1]. This requires a lima commit[2]; pull current master if you run an older version.

[1] lima-vm/lima#2577
[2] lima-vm/lima@9a09350

Signed-off-by: Nir Soffer <[email protected]>
And remove an unneeded export - it was used to run kubectl commands before we set up root/.kube/config, but this step does not run any kubectl commands. Signed-off-by: Nir Soffer <[email protected]>
Provision scripts using `mode: system` run as root and do not need to use sudo. Looks like the scripts were copied from code not running as root. Signed-off-by: Nir Soffer <[email protected]>
This is fixed now in lima[1] so we don't need to keep the fix in drenv. Developers should pull latest lima from git to use this fix. [1] lima-vm/lima#2632 Signed-off-by: Nir Soffer <[email protected]>
Added a check to stop relocation reconciliation if one of the clusters is unreachable. This prevents potential misclassification after hub recovery, which could lead to undesired results and inconsistencies. Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2304182 Signed-off-by: Benamar Mekhissi <[email protected]>
Issue:
When multiple VRGs are created in the same namespace, PVCs having the same labels will be protected by both VRGs, which might turn into an error while restoring.
Idea in the PR is to: