Support collect logs for failed agents and controller for supportbundle #3659

Open
wants to merge 2 commits into
base: main
from

Conversation

hangyan
Member

@hangyan hangyan commented Apr 19, 2022

When the normal supportbundle API fails for some Nodes or for the controller,
use the Kubernetes API instead to collect logs. In either case,
clusterinfo is always gathered first.

Signed-off-by: Hang Yan [email protected]

@hangyan
Member Author

hangyan commented Apr 19, 2022

Fix #3624

@edwardbadboy

A few questions though:

  1. Do we also need to create a tar.gz for these logs? I only put the logs in a dir, and I am not sure whether tar.gz is a mandatory requirement. Using different formats for normal nodes and failed nodes (tar.gz and dir, respectively) seems more appropriate.
  2. The log file names are also different. For the failed nodes, the current format is <container-name>.log, which does not include the log level and timestamp. Again, I am not sure whether any external tools rely on the old format.

@codecov-commenter

codecov-commenter commented Apr 19, 2022

Codecov Report

Merging #3659 (684dca3) into main (42162ce) will decrease coverage by 0.01%.
The diff coverage is 39.58%.

❗ Current head 684dca3 differs from pull request most recent head 26c7a5d. Consider uploading reports for the commit 26c7a5d to get more accurate results

@@            Coverage Diff             @@
##             main    #3659      +/-   ##
==========================================
- Coverage   64.47%   64.45%   -0.02%     
==========================================
  Files         295      294       -1     
  Lines       43815    43726      -89     
==========================================
- Hits        28248    28184      -64     
+ Misses      13291    13252      -39     
- Partials     2276     2290      +14     

Flag            Coverage Δ
kind-e2e-tests  50.64% <45.78%> (-0.46%) ⬇️
unit-tests      44.20% <34.37%> (-0.07%) ⬇️

Impacted Files                                         Coverage Δ
...icluster/cmd/multicluster-controller/controller.go  7.86% <0.00%> (-0.47%) ⬇️
multicluster/cmd/multicluster-controller/main.go       0.00% <0.00%> (ø)
pkg/agent/util/ndp/ndp.go                              0.00% <0.00%> (ø)
...lers/multicluster/commonarea/remote_common_area.go  27.45% <25.45%> (+1.44%) ⬆️
...g/agent/cniserver/interface_configuration_linux.go  16.66% <26.66%> (-0.29%) ⬇️
...llers/multicluster/member_clusterset_controller.go  16.45% <27.47%> (+6.88%) ⬆️
pkg/agent/agent.go                                     53.11% <30.00%> (-0.11%) ⬇️
.../cmd/multicluster-controller/clusterset_webhook.go  59.37% <59.37%> (ø)
...llers/multicluster/leader_clusterset_controller.go  64.02% <60.00%> (ø)
...agent/flowexporter/connections/deny_connections.go  84.94% <87.50%> (+1.03%) ⬆️
... and 39 more

Contributor

@edwardbadboy edwardbadboy left a comment

Do we also need to create a tar.gz for these logs? I only put the logs in a dir, and I am not sure whether tar.gz is a mandatory requirement. Using different formats for normal nodes and failed nodes (tar.gz and dir, respectively) seems more appropriate.

I think it's better to create a tar.gz file for the failed nodes, too. We have a failed_nodes file in the support bundle that records the failed nodes, so the user can tell which nodes failed from that file. There may be third-party scripts that analyse the logs, so we'd better use tar.gz files for all nodes to make the support bundle easier for those scripts to handle.

Another reason is that, from the user's point of view, it is not obvious why some nodes are in dirs and others are in a tar.gz. The user will wonder why the support bundle uses different files/dirs for different nodes.

Alternatively, if you want to just use a dir, add a _failed prefix/infix/suffix to the dir names. This will be more self-explanatory.

The log file names are also different. For the failed nodes, the current format is <container-name>.log, which does not include the log level and timestamp. Again, I am not sure whether any external tools rely on the old format.

I believe there are some tools relying on the log message format, but I don't suggest updating the log messages in the support bundle code to adapt to those tools. Let's keep everything collected from K8s as-is: present the raw contents collected from K8s, and have the third-party tools adapt to the format themselves.

@hangyan
Member Author

hangyan commented May 25, 2022

@edwardbadboy All updated. The question about the log timestamp concerns the log file name. Maybe I can show you with the latest test results:

logs/
├── agent
│   └── antrea-agent.log
├── install-cni
│   └── install-cni.log
└── ovs
    └── antrea-ovs.log

This is the directory structure for the failed nodes. And this is the normal ones:

logs/
├── agent
│   ├── antrea-agent.u2.root.log.ERROR.20220525-083747.1
│   ├── antrea-agent.u2.root.log.ERROR.20220525-084258.1
│   ├── antrea-agent.u2.root.log.ERROR.20220525-084759.1
│   ├── antrea-agent.u2.root.log.ERROR.20220525-085212.1
│   ├── antrea-agent.u2.root.log.FATAL.20220525-083459.1
│   ├── antrea-agent.u2.root.log.FATAL.20220525-083747.1
│   ├── antrea-agent.u2.root.log.FATAL.20220525-084258.1
│   ├── antrea-agent.u2.root.log.FATAL.20220525-084759.1
│   ├── antrea-agent.u2.root.log.INFO.20220525-083747.1
│   ├── antrea-agent.u2.root.log.INFO.20220525-084258.1
│   ├── antrea-agent.u2.root.log.INFO.20220525-084759.1
│   ├── antrea-agent.u2.root.log.INFO.20220525-085210.1
│   ├── antrea-agent.u2.root.log.WARNING.20220525-083747.1
│   ├── antrea-agent.u2.root.log.WARNING.20220525-084258.1
│   ├── antrea-agent.u2.root.log.WARNING.20220525-084759.1
│   └── antrea-agent.u2.root.log.WARNING.20220525-085212.1
└── ovs
    ├── ovsdb-server.log
    └── ovs-vswitchd.log

The main difference is the timestamp part of the filename, which we cannot get from the Kubernetes API. Or we can, but I'm not sure it's worth the effort.
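
For readers or tools that depend on the file names of normal nodes, the sketch below splits such a name into its parts. It assumes the names follow the glog/klog pattern `<program>.<host>.<user>.log.<severity>.<yyyymmdd-hhmmss>.<pid>`, which matches the listing above; `parseLogFileName` and the `logFileName` struct are hypothetical and written only for this note.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// logFileName holds the fields encoded in a glog/klog style file name.
type logFileName struct {
	Program, Host, User, Severity, PID string
	Created                            time.Time
}

// parseLogFileName parses names such as
// "antrea-agent.u2.root.log.ERROR.20220525-083747.1".
// Fields are read from the right so that a host name containing dots
// still parses.
func parseLogFileName(name string) (logFileName, error) {
	parts := strings.Split(name, ".")
	n := len(parts)
	// Expect at least: program, host, user, "log", severity, timestamp, pid.
	if n < 7 || parts[n-4] != "log" {
		return logFileName{}, fmt.Errorf("%q is not in the expected format", name)
	}
	created, err := time.Parse("20060102-150405", parts[n-2])
	if err != nil {
		return logFileName{}, fmt.Errorf("bad timestamp in %q: %w", name, err)
	}
	return logFileName{
		Program:  parts[0],
		Host:     strings.Join(parts[1:n-5], "."),
		User:     parts[n-5],
		Severity: parts[n-3],
		PID:      parts[n-1],
		Created:  created,
	}, nil
}

func main() {
	for _, name := range []string{
		"antrea-agent.u2.root.log.ERROR.20220525-083747.1",
		"antrea-agent.log", // fallback name: no severity or timestamp
	} {
		parsed, err := parseLogFileName(name)
		fmt.Println(parsed, err)
	}
}
```

A tool built on such a parser would reject the `<container-name>.log` files of failed nodes, which is the compatibility question raised earlier in this thread.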

Contributor

@edwardbadboy edwardbadboy left a comment

I think the current layout is OK. We don't have to fetch timestamp.

edwardbadboy previously approved these changes May 26, 2022
Contributor

@edwardbadboy edwardbadboy left a comment

Looks good to me. Thanks for the change.

A reminder: it seems that the patch failed some tests; we need to fix them.

Contributor

@mengdie-song mengdie-song left a comment

Thanks for the change.

@hangyan
Member Author

hangyan commented Jun 2, 2022

CI is unstable now.

@edwardbadboy @mengdie-song Could you help review this again? I have added support for dumping agentinfo/controllerinfo on failure. Besides that, some additional notes about this PR:

  1. When dumping the agentinfo, the current agent Pod and the podRef in the agentinfo may not match: when an error happens, a new agent Pod is created, but the agentinfo is not up to date. In that case, we still dump the agentinfo, matching by host.
  2. The current supportbundle process relies heavily on agentinfo/controllerinfo existing. When Antrea fails to start in the first place, no agentinfo exists, so no data is collected at all, even though we could still retrieve the related info from Kubernetes. This can be improved in the future by removing the dependency on agentinfo/controllerinfo entirely: when something goes wrong and we cannot fetch this information, we would get the target info from Kubernetes only.
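
The "match by host" behaviour in note 1 could look roughly like the sketch below. The `pod` and `agentInfo` structs are simplified, hypothetical stand-ins for the Kubernetes Pod and the AntreaAgentInfo resource; only the lookup order is the point.

```go
package main

import "fmt"

// pod and agentInfo are simplified stand-ins that keep only the fields
// needed for the lookup.
type pod struct {
	Name     string
	NodeName string
}

type agentInfo struct {
	PodRefName string // may be stale after the agent Pod was recreated
	NodeName   string
}

// findAgentPod prefers an exact podRef match and falls back to the Pod
// running on the same Node when the podRef no longer exists.
func findAgentPod(info agentInfo, pods []pod) (pod, bool) {
	for _, p := range pods {
		if p.Name == info.PodRefName {
			return p, true
		}
	}
	for _, p := range pods {
		if p.NodeName == info.NodeName {
			return p, true
		}
	}
	return pod{}, false
}

func main() {
	pods := []pod{{Name: "antrea-agent-new", NodeName: "node-1"}}
	stale := agentInfo{PodRefName: "antrea-agent-old", NodeName: "node-1"}
	p, ok := findAgentPod(stale, pods)
	fmt.Println(p.Name, ok)
}
```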

Contributor

@mengdie-song mengdie-song left a comment

Thanks for the change, please check the import format, other parts LGTM.

mengdie-song previously approved these changes Jun 6, 2022
Contributor

@mengdie-song mengdie-song left a comment

LGTM

@hangyan
Member Author

hangyan commented Jun 8, 2022

the related unit test has failed. working on that now.

Contributor

@edwardbadboy edwardbadboy left a comment

Thanks for the change. I guess there's not enough time for 1.7.0. Maybe we can put it in 1.7.1?

@hangyan
Member Author

hangyan commented Jun 15, 2022

@edwardbadboy All updated. Please help review again, thanks!

Member

@tnqn tnqn left a comment

some nit comments

@hangyan hangyan force-pushed the supportbundle-for-failed-nodes-controller branch 5 times, most recently from 31e529e to 26c7a5d Compare July 13, 2022 09:19
@hangyan
Member Author

hangyan commented Jul 14, 2022

@tnqn @edwardbadboy Tests passed. Please review again. I copied the compress-dir function to a new place for reuse. However, updating rest.go to use this new function keeps causing the unit test to fail (hash value mismatch). I'm not sure what the root cause is, so I reverted that part, which leaves a little code duplication.

@github-actions
Contributor

This PR is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 26, 2022
@github-actions github-actions bot closed this Jan 25, 2023
@hangyan hangyan reopened this Nov 15, 2024
@hangyan hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from 26c7a5d to 40c9d09 Compare November 15, 2024 03:35
@luolanzone luolanzone removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 15, 2024
@hangyan hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from 933a339 to 16d109f Compare November 20, 2024 06:39
@hangyan hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from 16d109f to 91ec08b Compare November 20, 2024 06:40
@hangyan hangyan requested a review from antoninbas November 20, 2024 06:51
@hangyan
Member Author

hangyan commented Nov 20, 2024

cc @antoninbas Can you also help review this? It's a pretty old PR, but I think it's still useful.
cc @luolanzone

@hangyan hangyan requested a review from tnqn November 20, 2024 09:25
@hangyan hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from c40484a to 4092ff6 Compare November 20, 2024 09:25
@antoninbas antoninbas added this to the Antrea v2.3 release milestone Nov 20, 2024
@hangyan hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from f1295d3 to 4d58ac0 Compare December 16, 2024 10:31
}
return nil
}(); err != nil {
errors = append(errors, err)
Contributor

shouldn't we return early here?

Member Author

The error could also be caused by Marshal or WriteFile. In these cases, podRef is still valid.

Comment on lines 757 to 759
if podRef != nil {
	pod, err := k8sClient.CoreV1().Pods(podRef.Namespace).Get(ctx, podRef.Name, metav1.GetOptions{})
	if err == nil {
Contributor

shouldn't we add an error to the aggregate if podRef is nil or if the Get API call returns an error?

Member Author

If get failed and podRef is nil, it's already handled in the sub-function above.

errors = append(errors, err)
}
}
return utilerror.NewAggregate(errors)
Contributor

By the way, I am actually not sure we need an aggregate for this function, since there is only one controller Pod?

Member Author

@hangyan hangyan Dec 17, 2024

In the controller case it may seem a bit complicated. I think the purpose of the aggregated error is to get as much information as possible in a broken environment, not only to handle the multiple-Pod case. If we can get the controllerInfo but fail to marshal it and write it to disk (a very rare case, where we could arguably just return instead of aggregating the errors), this shouldn't stop us from continuing to retrieve logs from the Pod.

Anyway, I refactored this part (both controller and agent) to just return the error instead of aggregating the errors for each Pod.

}

// downloadAndPackPodLogs downloads the Pod's logs and compresses them into the target dir. `tmpDir` is used to store the log files temporarily.
func downloadAndPackPodLogs(ctx context.Context, k8sClient kubernetes.Interface, pod *corev1.Pod, dir string, tmpDir string) error {
Contributor

this function is misleading IMO. It sounds like it would download the logs and then create an archive with them. But actually it packages everything that is already in tmpDir if I understand the code correctly. It should probably be 2 separate functions (plus there is already a function in charge of downloading Pod logs, so we just need a function to create the tarball).

Member Author

updated

@hangyan
Member Author

hangyan commented Dec 24, 2024

@antoninbas Can you take a look at this again? Thanks

Comment on lines 430 to 468
expectFileName := "logs/agent/antrea-agent.log"
if node == "" {
	expectFileName = "logs/controller/antrea-controller.log"
}
Contributor

it would probably be better to have a list of expected files inside the bundle, and then check that it matches the actual list of files by calling assert.ElementsMatch. Don't we have more files than that in the fallback bundle?

Member Author

@hangyan hangyan Jan 8, 2025

It depends on the test Pod's container settings; currently there is only one container.

Comparing all paths, including all filenames and sub-dirs, would be ideal, but I'm not sure it's worth the effort. I already added a new function 'UnpackDir', whose only use case is this test. More helper functions would be needed if we wanted a full comparison.
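
As a rough idea of what a full comparison involves, the sketch below lists the regular files in a tar.gz stream and compares them with an expected list, ignoring order, which is what `assert.ElementsMatch` would do in the real test. All three helpers are hypothetical and use only the standard library; `listArchive` reads names without unpacking to disk, so it is not the PR's `UnpackDir`.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"sort"
)

// listArchive returns the names of all regular files in a tar.gz stream.
func listArchive(r io.Reader) ([]string, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	tr := tar.NewReader(gz)
	var names []string
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag == tar.TypeReg {
			names = append(names, hdr.Name)
		}
	}
	return names, nil
}

// sameElements reports whether a and b hold the same names, ignoring order.
func sameElements(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// buildArchive creates a small in-memory tar.gz for the demo.
func buildArchive(files map[string]string) (*bytes.Buffer, error) {
	buf := &bytes.Buffer{}
	gz := gzip.NewWriter(buf)
	tw := tar.NewWriter(gz)
	for name, body := range files {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(body)), Typeflag: tar.TypeReg}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(body)); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	archive, err := buildArchive(map[string]string{
		"agentinfo":                   "{}",
		"logs/agent/antrea-agent.log": "agent log\n",
		"logs/ovs/antrea-ovs.log":     "ovs log\n",
	})
	if err != nil {
		panic(err)
	}
	names, err := listArchive(archive)
	if err != nil {
		panic(err)
	}
	expected := []string{"logs/ovs/antrea-ovs.log", "logs/agent/antrea-agent.log", "agentinfo"}
	fmt.Println("bundle matches expected list:", sameElements(names, expected))
}
```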

Contributor

It depends on the test Pod's container settings; currently there is only one container.

Don't we also have "agentinfo" / "controllerinfo"?

Member Author

Updated with more checks.

@hangyan hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from b0a69ea to 1782904 Compare January 9, 2025 03:40
Comment on lines +423 to +430
"": {
	filepath.Join("logs", "controller", "antrea-controller.log"),
},
"node-1": {
	"agentinfo",
	filepath.Join("logs", "ovs", "antrea-ovs.log"),
	filepath.Join("logs", "agent", "antrea-agent.log"),
},
Contributor

I still don't see "agentinfo" / "controllerinfo" in the list, is this expected?

Member Author

@hangyan hangyan Jan 14, 2025

I'm not sure what went wrong, but I did add agentinfo to the list.

As for controllerinfo, it's a single file at the top level, not bundled in the controller tar.gz, so I added an extra check instead of putting it in the expected list.

Signed-off-by: Hang Yan <[email protected]>
@hangyan hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from 7d990f7 to 1159ad6 Compare January 14, 2025 07:35
Contributor

@antoninbas antoninbas left a comment

LGTM

@antoninbas
Contributor

/test-all

7 participants