nsfs - monitor only nsrs that are mounted. DFBUGS-153 #8561
base: master
3234a46 to a7c5ed9
background_scheduler.register_bg_worker(new NamespaceMonitor({
    name: 'namespace_fs_monitor',
    client: internal_rpc_client,
    should_monitor: nsr => Boolean(nsr.nsfs_config && process.env['NSFS_NSR_' + nsr.name]),
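For readers skimming the hunk, the predicate can be tried in isolation. The sketch below is only an illustration: `should_monitor_nsr`, the sample resources, and the environment value are hypothetical, and the environment is passed in as a parameter (instead of reading `process.env` directly) purely so the example can run anywhere.

```javascript
// Hypothetical standalone version of the predicate from the hunk above.
// The env argument defaults to process.env; it is injectable only for this demo.
function should_monitor_nsr(nsr, env = process.env) {
    return Boolean(nsr.nsfs_config && env['NSFS_NSR_' + nsr.name]);
}

// Assumed sample data: the value stored in the variable is made up,
// the predicate only cares that it is set and non-empty.
const sample_env = { NSFS_NSR_mounted_nsr: '/nsfs/mounted_nsr' };
const sample_nsrs = [
    { name: 'mounted_nsr', nsfs_config: { fs_root_path: '/nsfs/mounted_nsr' } },
    { name: 'unmounted_nsr', nsfs_config: { fs_root_path: '/nsfs/unmounted_nsr' } },
    { name: 'cloud_nsr' }, // not an nsfs resource, so no nsfs_config
];

const monitored = sample_nsrs
    .filter(nsr => should_monitor_nsr(nsr, sample_env))
    .map(nsr => nsr.name);

console.log(monitored);
```

Only the resource that is both nsfs-backed and has its `NSFS_NSR_` variable set passes the filter.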
A few questions -
- If the endpoint came up before the namespace resource was mounted, when will we next reach this flow to start monitoring? Why not add a retry after 60 seconds? See nsfs | wait for endpoint startup before namespace monitor registration #8474 (comment).
- Why avoid starting monitoring, instead of externalizing that process.env['NSFS_NSR_' + nsr.name] is undefined, which means that the PV was not mounted yet?
- An endpoint that was started before the mount will be deleted after a new endpoint is created with the new mount. A retry will not help, because the endpoint that opened the report is removed when the new nsfs nsr mount is added (after the nsfs nsr was created in the Kubernetes cluster).
- There could be other nsfs nsrs that should be monitored (correct me if I'm wrong).
Maybe I'll make the scenario more concrete:
- The operator installs a system in a cluster.
- There is endpoint A. It does NOT have any nsfs nsr mounts.
- At some point, an nsfs nsr is created in the cluster.
- In reconcile:
  - a. The operator adds a mount for the nsfs nsr to the endpoints' container.
  - b. The operator creates an nsr object in the system store.
- A new endpoint B with the new mount is created by Kubernetes.
- While B is being created, A updates its system store and reads the nsfs nsr. The new nsfs nsr is NOT mounted in A, so A reports ENOENT on the nsfs nsr. Depending on how the default nsfs nsr monitoring interval compares with the time it takes to create a new endpoint, this does not necessarily happen. Reducing config.NAMESPACE_MONITOR_DELAY ensures the bug reproduces.
- Endpoint B is ready. Endpoint A is deleted. The nsr status is stuck in rejected.
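The scenario above can be modelled roughly as follows. This is a simplified sketch, not the actual monitor: `monitor_pass`, the endpoint objects, and the status strings are hypothetical, and the "not mounted" failure is derived from the environment variable here, whereas in the real flow it comes from accessing the filesystem.

```javascript
// Hypothetical model of a single monitoring pass on one endpoint.
// only_mounted = false approximates the behaviour before this change,
// only_mounted = true approximates the behaviour with the should_monitor filter.
function monitor_pass(endpoint, nsrs, only_mounted) {
    const reports = [];
    for (const nsr of nsrs) {
        if (!nsr.nsfs_config) continue; // only nsfs resources are relevant here
        const mounted = Boolean(endpoint.env['NSFS_NSR_' + nsr.name]);
        if (only_mounted && !mounted) continue; // skip resources this endpoint cannot see
        reports.push({
            endpoint: endpoint.name,
            nsr: nsr.name,
            status: mounted ? 'OK' : 'ENOENT',
        });
    }
    return reports;
}

// Endpoint A predates the nsr and has no mount; endpoint B was created with it.
const endpoint_a = { name: 'A', env: {} };
const endpoint_b = { name: 'B', env: { NSFS_NSR_nsr1: 'mounted' } };
const system_store_nsrs = [{ name: 'nsr1', nsfs_config: {} }];

const before_fix_a = monitor_pass(endpoint_a, system_store_nsrs, false);
const after_fix_a = monitor_pass(endpoint_a, system_store_nsrs, true);
const after_fix_b = monitor_pass(endpoint_b, system_store_nsrs, true);

console.log({ before_fix_a, after_fix_a, after_fix_b });
```

In this model, A stops producing the misleading ENOENT report once unmounted resources are filtered out, while B still monitors the resource normally.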
As I mentioned on Slack, I think the correct path is not to skip monitoring a namespace resource that is not yet mounted, but to add this check to the monitoring process.
The nsr will be monitored by the new endpoint.
The old endpoint is about to be deleted.
The only difference this commit makes is that old endpoints won't mistakenly report nsr as rejected.
If you think that the about-to-be-deleted endpoint should do something about the mount it will never have (or anything else, for that matter) please specify it explicitly. The current "add this to monitoring process" is too vague. Also specify explicitly if this is an enhancement or part of the bug fix.
"About-to-be-deleted" is the happy path :)
There is also the sad path, where there is an issue with the mounting and it takes a while or never happens. That's exactly why I think it's important, and why skipping monitoring when the resource is not mounted is only a partial solution from my perspective.
I'm not trying to solve monitoring, but rather to fix a bug in monitoring.
I'm not removing any feature that we currently have.
Again, I would like a more specific way to proceed.
If you think a different fix or an enhancement to the monitoring is needed, please specify it explicitly.
@alphaprinz
I already explained it in the above comment, but I'll be happy to summarize my comments -
My specific idea for solving it -
Instead of not monitoring unmounted namespace resources, I think you should move the new condition you added inside the monitoring check, and externalize that this is the current issue that the namespace resource has.
Comment 1, bullet 2
Comment 3
Why I think that my suggestion is a better behavior / user experience -
It will behave better in cases where the re-start of the endpoint takes time/won't happen at all.
Comment 5
How to proceed -
The above summary of comments is my opinion/suggestion/how I would fix it.
IMO, you should proceed from here as you see fit: fix it, open an issue and call it an enhancement, document this gap, or do anything else you feel is appropriate.
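One possible reading of the suggestion summarized above, sketched under assumptions: the mount check moves inside the per-resource check and is surfaced as its own issue rather than causing the resource to be skipped. The names `check_nsr`, `probe_fs`, and the `NOT_MOUNTED_YET` issue code are hypothetical and do not come from the PR or from the codebase.

```javascript
// Hypothetical per-resource check that externalizes "not mounted yet"
// as a distinct issue instead of skipping the resource or reporting ENOENT.
function check_nsr(nsr, env, probe_fs) {
    if (!env['NSFS_NSR_' + nsr.name]) {
        return { nsr: nsr.name, issue: 'NOT_MOUNTED_YET' };
    }
    try {
        probe_fs(nsr); // assumed to throw an error with a .code on failure
        return { nsr: nsr.name, issue: null };
    } catch (err) {
        return { nsr: nsr.name, issue: err.code || 'UNKNOWN' };
    }
}

// Stand-in probes for the demo: one succeeds, one fails like a missing directory.
const probe_ok = () => undefined;
const probe_missing = () => {
    throw Object.assign(new Error('no such file or directory'), { code: 'ENOENT' });
};

const not_mounted = check_nsr({ name: 'nsr1' }, {}, probe_ok);
const healthy = check_nsr({ name: 'nsr1' }, { NSFS_NSR_nsr1: 'mounted' }, probe_ok);
const broken = check_nsr({ name: 'nsr1' }, { NSFS_NSR_nsr1: 'mounted' }, probe_missing);

console.log({ not_mounted, healthy, broken });
```

How such an issue would map onto the existing nsr status reporting is left open here, as it is in the discussion.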
Signed-off-by: Amit Prinz Setter <[email protected]>
a7c5ed9 to 50c80bf
Explain the changes
Use the new NSFS_NSR_ env variable to test whether nsr should be mounted.
(see nsfs - add mounted nsr name to env. DFBUGS-153 noobaa-operator#1481)
This commit reverts 2789d60.
Issues: Fixed #xxx / Gap #xxx
Testing Instructions:
- Reduce config.NAMESPACE_MONITOR_DELAY to 1000ms.
- Create an nsfs nsr.
- The nsr should not be rejected (by endpoints that existed before its creation).