-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NVSHAS-9189] Scan will stuck in scheduling after controller is shutdown and restarted #155
Conversation
Shouldn't the modified shared files(.proto & .pg.go files in neuvector/neuvector#1461) under neuvector/scanner/vendor/github.com/neuvector/neuvector/share be included in this PR? |
monitor/monitor.c
Outdated
if ((period = getenv(ENV_HEALTH_CHECK_PERIOD)) != NULL) { | ||
args[a ++] = "--period"; | ||
args[a ++] = period; | ||
} | ||
if ((retry_max = getenv(ENV_RETRY)) != NULL) { | ||
args[a ++] = "--retry_max"; | ||
args[a ++] = retry_max; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that environment variable will be passed to scanner, so you don't have to parse these settings in monitor. Is this not the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HEALTH_CHECK_PERIOD/MAX_RETRY can be configured in scanner yaml
monitor as the entry process in scanner container, it parses the env vars & translate them to --period/--retry_max arguments when creating scanner process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @holyspectral,
The scanner.go is triggered by monitor.c. To configure the HEALTH_CHECK_PERIOD and MAX_RETRY correctly, we need to add these values to the environment variables. This way, the scanner can read them correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @williamlin-suse,
After a quick discussion with @holyspectral, we realized that monitor.c may be shared by many components, and we're not sure if making changes might break some behavior. Would it be okay to use os.GetEnv() to read the environment variables instead of passing them through monitor.c?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add, I think monitor's design purpose is to monitor and restart its child process, so it's not a perfect place to write component-specific logic. Technically we can still put some logic into monitor if we really want to, but maybe we should only keep generic ones there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scanner repo has its own monitor code.
what do you mean "monitor.c may be shared by many components" ?
I think another PR is needed for neuvector-helm repo (github.com/neuvector/neuvector-helm/charts/core/templates/scanner-deployment.yaml) regarding the 2 new env variables added in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think env vars can be provided to scanner pod via the existing cve.scanner.env
already, but we probably should document these settings somewhere for sure. Do you have any places in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can submit a jira case to Nuno for doc update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we should export those variables unless they are useful for customers to escalate their situation. @pohanhuangtw please use a default pair of the reasonable value instead.
server.go
Outdated
|
||
// To ensure the controller's availability, periodCheckHealth use HealthCheck to periodically check if the controller is alive. | ||
// Additionally, if the controller is deleted or not responsive, the scanner will re-register. | ||
func periodCheckHealth(joinIP string, joinPort uint16, data *share.ScannerRegisterData, cb *clientCallback, healthCheckCh chan struct{}, done chan bool, period, retryMax int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: You can use context, which is the standard way to control go routine, to replace healthCheckCh
. You can also assign timeout to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I will remain the same, since we may trigger the re-register when we have the idle state.
Thus I think we should have a way to close the channel manually.
scanner.go
Outdated
@@ -98,12 +98,14 @@ func dbRead(path string, maxRetry int, output string) map[string]*share.ScanVuln | |||
} | |||
} | |||
|
|||
func connectController(path, advIP, joinIP, selfID string, advPort uint32, joinPort uint16) { | |||
func connectController(path, advIP, joinIP, selfID string, advPort uint32, joinPort uint16, period, retryMax int, doneCh chan bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional: I know we have many joinIP
everywhere, but it can actually contains service name of controller. Maybe consider renaming it to joinHost
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not a Host target. The joinServiceAddr or joinServiceIP is better. It is okay to keep joinIP, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like joinServiceAddr
. Considering scenarios like docker, how about joinAddr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please separate this PR in two parts sequentially:
(1) Merge the neuvector/share
from its main branch.
(2) The code changes at scanner side.
Most likely, it is an internal improvement. We do not need to address this timeout and retry in the documentation. The default values should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert the original variable names. Thanks.
…once, shorten the check period.
…the scanner if the scanner is not in the controller list.
a4667a5
to
3395b8a
Compare
…o avoid old controller to keep register.
3395b8a
to
bf28afe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good
@pohanhuangtw Please update the average time to trigger the recovery procedure in the JIRA case. |
Summary
Test Performed