nsfs - monitor only nsrs that are mounted. DFBUGS-153 #8561

alphaprinz · 2024-11-28T07:11:59Z

Explain the changes

The namespace monitor should not monitor nsfs nsrs that are not mounted on the endpoint.
Use the new NSFS_NSR_<nsr name> env variable to test whether the nsr is mounted on this endpoint.
(See "nsfs - add mounted nsr name to env. DFBUGS-153", noobaa-operator#1481.)

This commit reverts 2789d60.
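
For illustration, a minimal standalone sketch of the check described above. The function name `should_monitor_nsr`, the injected `env` argument, the sample nsr objects and the env variable value are hypothetical; only the `NSFS_NSR_` prefix and the `nsfs_config` field are taken from this PR's diff.

```javascript
// Hypothetical helper mirroring the predicate in this PR's diff.
// `env` is injected so the check can be exercised without touching process.env.
function should_monitor_nsr(nsr, env = process.env) {
    // Only nsfs resources carry nsfs_config. The operator is assumed to set
    // NSFS_NSR_<name> on endpoints that actually have the mount.
    return Boolean(nsr.nsfs_config && env['NSFS_NSR_' + nsr.name]);
}

// Sample data, invented for illustration (the real env value may differ).
const env = { NSFS_NSR_fs1: '/nsfs/fs1' };
const mounted = { name: 'fs1', nsfs_config: { fs_root_path: '/nsfs/fs1' } };
const unmounted = { name: 'fs2', nsfs_config: { fs_root_path: '/nsfs/fs2' } };
const not_nsfs = { name: 's3res' };

console.log(should_monitor_nsr(mounted, env));   // true
console.log(should_monitor_nsr(unmounted, env)); // false
console.log(should_monitor_nsr(not_nsfs, env));  // false
```

With this predicate, an endpoint that lacks the env variable for an nsr simply never runs a monitor pass on it.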

Issues: Fixed #xxx / Gap #xxx

https://issues.redhat.com/browse/DFBUGS-153

Testing Instructions:

Reduce config.NAMESPACE_MONITOR_DELAY to 1000ms.
Create an nsfs nsr.
The nsr should not be rejected (by endpoints that existed before its creation).

Doc added/updated
Tests added

romayalon · 2024-12-03T08:36:11Z

src/endpoint/endpoint.js

+            background_scheduler.register_bg_worker(new NamespaceMonitor({
+                name: 'namespace_fs_monitor',
+                client: internal_rpc_client,
+                should_monitor: nsr => Boolean(nsr.nsfs_config && process.env['NSFS_NSR_' + nsr.name]),


A few questions -

If the endpoint came up before the namespace resource was mounted, when will we next reach this flow and start monitoring? Why not add a retry after 60 seconds, as in
nsfs | wait for endpoint startup before namespace monitor registration #8474 (comment)

Why avoid monitoring altogether instead of externalizing that process.env['NSFS_NSR_' + nsr.name] is undefined, which means the PV has not been mounted yet?

An endpoint that was started before the mount will be deleted once a new endpoint with the new mount is created.
A retry will not help, because the endpoint that opened the report is removed when the new nsfs nsr mount is added (after the nsfs nsr is created in the Kubernetes cluster).

There could be other nsfs nsrs that should be monitored (correct me if I'm wrong).

Maybe I'll make the scenario more concrete:

1. The operator installs a system in a cluster.

2. There is endpoint A. It does NOT have any nsfs nsr mounts.

3. At some point, an nsfs nsr is created in the cluster.

4. In reconcile:

   a. The operator adds a mount for the nsfs nsr to the endpoints' container.

   b. The operator creates an nsr object in the system store.

5. A new endpoint B with the new mount is created by Kubernetes.

6. While B is being created, A updates its system store and reads the nsfs nsr. The new nsfs nsr is NOT mounted in A, so A reports NOENT on it. Note that this doesn't necessarily happen: with the default nsfs nsr monitoring interval, a monitor pass may not run on A during the time it takes to create the new endpoint. Reducing config.NAMESPACE_MONITOR_DELAY ensures the bug reproduces.

7. Endpoint B is ready. Endpoint A is deleted. The nsr status is stuck in rejected.
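
To make the scenario above easier to follow, here is a toy model of a single monitor pass. It is not the real NamespaceMonitor: `monitor_pass`, the endpoint objects and the `before_fix` predicate are hypothetical, and it assumes that without the env check every nsfs nsr gets monitored.

```javascript
// Simplified model of one monitor pass on an endpoint (not the real NamespaceMonitor).
// `mounts` stands in for the filesystem: the nsr names mounted on this endpoint.
function monitor_pass(endpoint, nsrs, should_monitor) {
    const reports = [];
    for (const nsr of nsrs) {
        if (!should_monitor(nsr, endpoint)) continue;
        // An endpoint without the mount cannot find the path, hence NOENT.
        const ok = endpoint.mounts.has(nsr.name);
        reports.push({ endpoint: endpoint.name, nsr: nsr.name, issue: ok ? null : 'NOENT' });
    }
    return reports;
}

const nsrs = [{ name: 'fs1', nsfs_config: {} }];
// Endpoint A existed before the nsr was created: no mount, no env variable.
const endpoint_a = { name: 'A', mounts: new Set(), env: {} };
// Endpoint B was created with the new mount and the matching env variable.
const endpoint_b = { name: 'B', mounts: new Set(['fs1']), env: { NSFS_NSR_fs1: '1' } };

// Assumed behavior without the env check: every nsfs nsr is monitored.
const before_fix = nsr => Boolean(nsr.nsfs_config);
// Behavior with the check from this PR's diff.
const after_fix = (nsr, ep) => Boolean(nsr.nsfs_config && ep.env['NSFS_NSR_' + nsr.name]);

console.log(monitor_pass(endpoint_a, nsrs, before_fix)); // A reports NOENT on fs1
console.log(monitor_pass(endpoint_a, nsrs, after_fix));  // A reports nothing
console.log(monitor_pass(endpoint_b, nsrs, after_fix));  // B reports fs1 with no issue
```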

As I mentioned on Slack, I think the correct path is not to skip monitoring a namespace resource that is not yet mounted, but to add this check to the monitoring process.

The nsr will be monitored by the new endpoint.
The old endpoint is about to be deleted.
The only difference this commit makes is that old endpoints won't mistakenly report nsr as rejected.

If you think that the about-to-be-deleted endpoint should do something about the mount it will never have (or anything else, for that matter) please specify it explicitly. The current "add this to monitoring process" is too vague. Also specify explicitly if this is an enhancement or part of the bug fix.

"About-to-be-deleted" is the happy path :)
There is also the sad path, where there is an issue with the mounting and it takes a while or never happens. That's exactly why I think it's important, and why avoiding monitoring when the nsr is not mounted is a partial solution from my perspective.

I'm not trying to solve monitoring, but rather to fix a bug in monitoring.
I'm not removing any feature that we currently have.

Again, I would like a more specific way to proceed.
If you think a different fix or an enhancement to the monitoring is needed, please specify it explicitly.

@alphaprinz
I already explained it in the above comment, but I'll be happy to summarize my comments -

My specific idea for solving it -
Instead of not monitoring unmounted namespace resources, I think you should move the new condition you added inside the monitoring check, and externalize that this is the current issue that the namespace resource has.
Comment 1, bullet 2
Comment 3

Why I think that my suggestion is a better behavior / user experience -
It will behave better in cases where the re-start of the endpoint takes time/won't happen at all.
Comment 5

How to proceed -
The above summary of comments is my opinion/suggestion/how I would fix it.
IMO, you should proceed from here as you see fit: fix it, open an issue and call it an enhancement, document this gap, or anything else you feel is appropriate.
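
A rough sketch of the suggestion summarized above, where the env check lives inside the monitoring check and a missing mount is externalized as its own issue. Everything here is an assumption made for illustration: `check_nsfs_nsr`, the issue name `NOT_MOUNTED_ON_ENDPOINT`, the injected `path_exists` callback and the use of `fs_root_path` as the path to probe. How such an issue would be reported, and whether it should lead to a rejected status, is left open.

```javascript
// Hypothetical per-nsr check. `env` and `path_exists` are injected so the
// sketch runs without a real mount; issue names are invented for illustration.
function check_nsfs_nsr(nsr, env, path_exists) {
    if (!env['NSFS_NSR_' + nsr.name]) {
        // Report the missing mount as its own condition instead of skipping the nsr.
        return { nsr: nsr.name, issue: 'NOT_MOUNTED_ON_ENDPOINT' };
    }
    if (!path_exists(nsr.nsfs_config.fs_root_path)) {
        // The mount is expected on this endpoint but the path is missing.
        return { nsr: nsr.name, issue: 'NOENT' };
    }
    return { nsr: nsr.name, issue: null };
}

// Sample data, invented for illustration.
const sample_nsr = { name: 'fs1', nsfs_config: { fs_root_path: '/nsfs/fs1' } };
const existing_paths = new Set(['/nsfs/fs1']);
const path_exists = p => existing_paths.has(p);

console.log(check_nsfs_nsr(sample_nsr, {}, path_exists));                    // not mounted here
console.log(check_nsfs_nsr(sample_nsr, { NSFS_NSR_fs1: '1' }, path_exists)); // healthy
console.log(check_nsfs_nsr(sample_nsr, { NSFS_NSR_fs1: '1' }, () => false)); // NOENT
```

Compared with skipping the nsr, this keeps the sad path visible: an endpoint that never gets its mount still produces a distinguishable signal.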

Signed-off-by: Amit Prinz Setter <[email protected]>

alphaprinz requested review from romayalon and shirady November 28, 2024 07:11

pull-request-size bot added the size/S label Nov 28, 2024

alphaprinz force-pushed the 153_nsfs_nsr_rejected_take2 branch 2 times, most recently from 3234a46 to a7c5ed9 Compare December 2, 2024 19:14

romayalon reviewed Dec 3, 2024

nsfs - monitor only nsrs that are mounted. DFBUGS-153

50c80bf

alphaprinz force-pushed the 153_nsfs_nsr_rejected_take2 branch from a7c5ed9 to 50c80bf Compare December 4, 2024 17:24

romayalon Dec 3, 2024

alphaprinz Dec 3, 2024

romayalon Dec 4, 2024

alphaprinz Dec 4, 2024 (edited)

romayalon Dec 5, 2024 (edited)

alphaprinz Dec 6, 2024

romayalon Dec 8, 2024
