Support collect logs for failed agents and controller for supportbundle #3659
Conversation
Fixes #3624. A few questions though:
Codecov Report

@@            Coverage Diff             @@
##             main    #3659      +/-   ##
==========================================
- Coverage    64.47%   64.45%   -0.02%
==========================================
  Files          295      294        -1
  Lines        43815    43726       -89
==========================================
- Hits         28248    28184       -64
+ Misses       13291    13252       -39
- Partials      2276     2290       +14
Do we need to also create a tar.gz for these logs? I only put the logs in a directory; I am not sure whether a tar.gz is a mandatory requirement. It seems to me that using different formats for normal nodes and failed nodes (tar.gz vs. dir) is more appropriate.
I think it's better to create a tar.gz file for the failed nodes, too. We have a failed_nodes file in the support bundle to record the failed nodes, so the user can tell which nodes failed from that file. There may also be third-party scripts that analyze the logs; using tar.gz files for all nodes makes it easier for those scripts to handle the support bundle.
Another reason is that, from the user's point of view, it is not obvious why some nodes show up as directories and some as a tar.gz. The user will wonder why the support bundle uses different files/dirs for different nodes.
Alternatively, if you want to just use a dir, add a _failed prefix/infix/suffix to the dir names. That would be more self-explanatory.
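For context, here is a minimal sketch of packing a per-node log directory into a tar.gz. It is illustrative only (the packDir name and the flat layout are assumptions, not the PR's actual helper):

package sketch

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"os"
	"path/filepath"
)

// packDir writes the contents of srcDir into a gzip-compressed tarball at dstPath,
// storing entries with paths relative to srcDir.
func packDir(srcDir, dstPath string) error {
	out, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer out.Close()

	gw := gzip.NewWriter(out)
	defer gw.Close()
	tw := tar.NewWriter(gw)
	defer tw.Close()

	return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		rel, err := filepath.Rel(srcDir, path)
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(rel) // tar entries use forward slashes
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
}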
The log file names are also different. For the failed nodes, the current format is <container-name>.log, without the log level and timestamp. Again, I am not sure whether any external tools rely on the old format?
I believe there are some tools relying on the log message format, but I don't suggest updating the log messages in the support bundle code to adapt to those tools. Let's just keep everything collected from K8s as-is: present the raw contents collected from K8s and have the third-party tools adapt to the format themselves.
@edwardbadboy All updated. The question about the log timestamp is about the log file name. Maybe I can show you the latest test results: this is the directory structure for the failed nodes, and this is the one for normal nodes.
The main difference is the timestamp part in the filename, which we cannot get from the Kubernetes API (or we can, but I'm not sure it's worth the effort).
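To make the filename point concrete, here is a hedged sketch of pulling container logs through the Kubernetes API and writing them as <container-name>.log (the dumpPodLogs name and directory layout are assumptions, not the PR's exact code). Only the container name is known at this point, which is why the fallback files carry no log level or timestamp:

package sketch

import (
	"context"
	"io"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpPodLogs streams each container's logs from the Kubernetes API into
// <dir>/<container-name>.log.
func dumpPodLogs(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, dir string) error {
	for _, c := range pod.Spec.Containers {
		req := client.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{Container: c.Name})
		stream, err := req.Stream(ctx)
		if err != nil {
			return err
		}
		f, err := os.Create(filepath.Join(dir, c.Name+".log"))
		if err != nil {
			stream.Close()
			return err
		}
		_, copyErr := io.Copy(f, stream)
		f.Close()
		stream.Close()
		if copyErr != nil {
			return copyErr
		}
	}
	return nil
}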
I think the current layout is OK. We don't have to fetch timestamp.
Looks good to me. Thanks for the change.
A reminder: it seems that the patch failed some tests; we need to fix them.
Thanks for the change.
CI is unstable now. @edwardbadboy @mengdie-song Could you help review this again? I have added support to dump agentinfo/controllerinfo when the normal collection fails. Besides that, some additional notes about this PR:
Thanks for the change. Please check the import format; other parts LGTM.
LGTM
The related unit test has failed; working on that now.
Thanks for the change. I guess there is not enough time for 1.7.0. Maybe we can put it in 1.7.1?
@edwardbadboy All updated. Please help review again, thanks!
some nit comments
@tnqn @edwardbadboy Tests passed. Please have a review again. I have copied the compress-dir function to a new place for reuse; however, updating rest.go to use this new function kept causing the unit test to fail (hash value mismatch). I'm not sure what the root cause is, so I reverted that part, and a little code duplication has been added.
This PR is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 90 days.
cc @antoninbas Can you also help review this? It's a pretty old PR, but I think it's still useful.
	}
	return nil
}(); err != nil {
	errors = append(errors, err)
shouldn't we return early here?
The error could also be caused by Marshal or WriteFile. In those cases, podRef is still valid.
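For readers following along, a hypothetical sketch of the pattern being discussed (the controllerInfo type, getControllerInfo, and the file layout are stand-ins, not the PR's actual code). The closure can fail at the fetch, at Marshal, or at WriteFile, and only the first case leaves podRef nil, which is why the caller keeps collecting logs instead of returning early:

package sketch

import (
	"context"
	"encoding/json"
	"errors"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
)

// controllerInfo is a stand-in for the real controllerinfo object; only the field used here.
type controllerInfo struct {
	PodRef corev1.ObjectReference `json:"podRef"`
}

// getControllerInfo is a hypothetical fetcher; the real code queries the Antrea API.
func getControllerInfo(ctx context.Context) (*controllerInfo, error) {
	return nil, errors.New("placeholder: fetch controllerinfo here")
}

// collectControllerInfo mirrors the closure pattern under review: the error may come from
// the fetch, from json.Marshal, or from os.WriteFile. Only a fetch failure leaves podRef
// nil, so in the other cases the caller can still fall back to collecting Pod logs.
func collectControllerInfo(ctx context.Context, dir string) (*corev1.ObjectReference, []error) {
	var errs []error
	var podRef *corev1.ObjectReference
	if err := func() error {
		info, err := getControllerInfo(ctx)
		if err != nil {
			return err
		}
		podRef = &info.PodRef
		data, err := json.Marshal(info)
		if err != nil {
			return err // podRef is already set; log collection can still proceed
		}
		return os.WriteFile(filepath.Join(dir, "controllerinfo"), data, 0o600)
	}(); err != nil {
		errs = append(errs, err)
	}
	return podRef, errs
}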
if podRef != nil {
	pod, err := k8sClient.CoreV1().Pods(podRef.Namespace).Get(ctx, podRef.Name, metav1.GetOptions{})
	if err == nil {
Shouldn't we add an error to the aggregate if podRef is nil or if the Get API call returns an error?
If the Get failed and podRef is nil, that case is already handled in the sub-function above.
		errors = append(errors, err)
	}
}
return utilerror.NewAggregate(errors)
By the way, I am actually not sure we need an aggregate for this function, since there is only one controller Pod?
In the controller case it may seem a bit complicated. I think the point of the aggregated error is to get as much information as possible from a broken environment, not only to cover the multiple-Pods case. If we can get the controllerInfo but fail to marshal it and write it to disk (a very rare case; maybe we could just return instead of aggregating the errors), that does not stop us from continuing to retrieve logs from the Pod.
Anyway, I refactored this part (both controller and agent) to just return the error instead of aggregating them in the per-Pod case.
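As a side note on the aggregate question: NewAggregate from k8s.io/apimachinery filters nil entries and returns nil for an empty list, so appending only real failures and aggregating once at the end is harmless even when nothing fails. A minimal illustration:

package sketch

import (
	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// aggregate returns nil when errs is empty, and a single Aggregate error otherwise,
// so callers that collected no failures still see a clean nil result.
func aggregate(errs []error) error {
	return utilerrors.NewAggregate(errs)
}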
}

// downloadAndPackPodLogs download pod's logs and compress them to the target dir. `tmpDir` is used to store the logs file momentarily.
func downloadAndPackPodLogs(ctx context.Context, k8sClient kubernetes.Interface, pod *corev1.Pod, dir string, tmpDir string) error {
This function is misleading IMO. It sounds like it would download the logs and then create an archive with them, but actually it packages everything that is already in tmpDir, if I understand the code correctly. It should probably be 2 separate functions (plus there is already a function in charge of downloading Pod logs, so we just need a function to create the tarball).
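A sketch of the suggested split, reusing the hypothetical dumpPodLogs and packDir helpers sketched earlier in this thread (names are illustrative, not the PR's actual functions): one step downloads the logs into tmpDir, a second step turns tmpDir into a tarball under the bundle dir.

package sketch

import (
	"context"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// collectFailedNodeLogs separates the two concerns: downloading the Pod's logs and
// archiving them.
func collectFailedNodeLogs(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, dir, tmpDir string) error {
	if err := dumpPodLogs(ctx, client, pod, tmpDir); err != nil { // download only
		return err
	}
	return packDir(tmpDir, filepath.Join(dir, pod.Spec.NodeName+".tar.gz")) // archive only
}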
updated
@antoninbas Can you take a look at this again? Thanks
expectFileName := "logs/agent/antrea-agent.log"
if node == "" {
	expectFileName = "logs/controller/antrea-controller.log"
}
It would probably be better to have a list of expected files inside the bundle, and then check that it matches the actual list of files by calling assert.ElementsMatch. Don't we have more files than that in the fallback bundle?
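For reference, a hedged sketch of the kind of check being suggested; listFilesInTarGz is a hypothetical helper returning the relative paths inside the archive, and the expected list only reflects the current single-container test setup:

package sketch

import (
	"path/filepath"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// listFilesInTarGz is a hypothetical helper that returns the relative paths of all
// files contained in the given tar.gz archive.
func listFilesInTarGz(path string) ([]string, error) { return nil, nil }

func TestFallbackBundleContents(t *testing.T) {
	got, err := listFilesInTarGz("bundle/node-1.tar.gz")
	require.NoError(t, err)

	expected := []string{
		filepath.Join("logs", "agent", "antrea-agent.log"),
		filepath.Join("logs", "ovs", "antrea-ovs.log"),
	}
	// ElementsMatch ignores ordering, so the archive's walk order does not matter.
	assert.ElementsMatch(t, expected, got)
}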
It depends on the test Pod's container settings; currently there is only one.
Comparing all paths would be ideal, including all filenames and sub-dirs, but I'm not sure it's worth the effort. I already added a new function UnpackDir, whose only use case is this test; more helper functions would be needed if we want a full comparison.
It depends on the test Pod's container settings; currently there is only one.
Don't we also have "agentinfo" / "controllerinfo"?
Updated with more checks.
Signed-off-by: Hang Yan <[email protected]>
"": { | ||
filepath.Join("logs", "controller", "antrea-controller.log"), | ||
}, | ||
"node-1": { | ||
"agentinfo", | ||
filepath.Join("logs", "ovs", "antrea-ovs.log"), | ||
filepath.Join("logs", "agent", "antrea-agent.log"), | ||
}, |
I still don't see "agentinfo" / "controllerinfo" in the list, is this expected?
I'm not sure what went wrong, but I did add agentinfo to the list.
As for controllerinfo, it's a single file at the top level, not bundled in the controller tar.gz, so I added an extra check for it instead of putting it in the expected list.
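A small sketch of that extra check (bundleDir is a hypothetical test variable): controllerinfo is asserted as a plain top-level file rather than an entry inside the controller tar.gz.

package sketch

import (
	"os"
	"path/filepath"
	"testing"

	"github.com/stretchr/testify/assert"
)

// assertControllerInfoAtTopLevel checks that controllerinfo is written as a top-level
// file of the bundle directory, not packed into the controller's tar.gz.
func assertControllerInfoAtTopLevel(t *testing.T, bundleDir string) {
	_, err := os.Stat(filepath.Join(bundleDir, "controllerinfo"))
	assert.NoError(t, err, "controllerinfo should exist at the top level of the bundle")
}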
Signed-off-by: Hang Yan <[email protected]>
LGTM
/test-all
When the normal supportbundle API fails for some Nodes or the Controller,
use the Kubernetes API instead to collect logs. Also, in either case,
clusterinfo will always be gathered first.
Signed-off-by: Hang Yan [email protected]
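To tie the commit message together, a conceptual sketch of the described flow (all function names are illustrative stand-ins, not the PR's actual API): clusterinfo is gathered first, the normal supportbundle API is tried per Node, and the Kubernetes API is used only where that fails.

package sketch

import "context"

// Hypothetical stand-ins for the collection steps described in the commit message.
func gatherClusterInfo(ctx context.Context) error                       { return nil }
func collectViaSupportBundleAPI(ctx context.Context, node string) error { return nil }
func collectViaKubernetesAPI(ctx context.Context, node string) error    { return nil }
func recordFailedNode(node string, err error)                           {}

// collectBundle mirrors the described ordering: clusterinfo first, then the normal
// supportbundle API per Node, with the Kubernetes API as a fallback for failures.
func collectBundle(ctx context.Context, nodes []string) error {
	if err := gatherClusterInfo(ctx); err != nil { // always gathered first, in either case
		return err
	}
	var failed []string
	for _, node := range nodes {
		if err := collectViaSupportBundleAPI(ctx, node); err != nil {
			failed = append(failed, node)
		}
	}
	for _, node := range failed {
		if err := collectViaKubernetesAPI(ctx, node); err != nil {
			recordFailedNode(node, err) // still recorded in the failed_nodes file
		}
	}
	return nil
}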