Support collect logs for failed agents and controller for supportbundle #3659

hangyan · 2022-04-19T02:52:59Z

When the normal supportbundle API fails for some Nodes or for the controller,
use the Kubernetes API to collect logs instead. In either case,
clusterinfo is always gathered first.

Signed-off-by: Hang Yan [email protected]

hangyan · 2022-04-19T02:57:29Z

Fix #3624

@edwardbadboy

A few questions though:

Do we also need to create a tar.gz for these logs? I only put the logs in a directory, and I am not sure whether tar.gz is a mandatory requirement. Using different formats for normal nodes and failed nodes (directory vs. tar.gz) seems more appropriate.
The log file names are also different. For failed nodes, the current format is <container-name>.log, which does not include the log level or timestamp. Again, I am not sure whether any external tools rely on the old format.

codecov-commenter · 2022-04-19T05:34:05Z

Codecov Report

Merging #3659 (684dca3) into main (42162ce) will decrease coverage by 0.01%.
The diff coverage is 39.58%.

❗ Current head 684dca3 differs from pull request most recent head 26c7a5d. Consider uploading reports for the commit 26c7a5d to get more accurate results

@@            Coverage Diff             @@
##             main    #3659      +/-   ##
==========================================
- Coverage   64.47%   64.45%   -0.02%     
==========================================
  Files         295      294       -1     
  Lines       43815    43726      -89     
==========================================
- Hits        28248    28184      -64     
+ Misses      13291    13252      -39     
- Partials     2276     2290      +14

Flag	Coverage Δ
kind-e2e-tests	`50.64% <45.78%> (-0.46%)`	⬇️
unit-tests	`44.20% <34.37%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
...icluster/cmd/multicluster-controller/controller.go	`7.86% <0.00%> (-0.47%)`	⬇️
multicluster/cmd/multicluster-controller/main.go	`0.00% <0.00%> (ø)`
pkg/agent/util/ndp/ndp.go	`0.00% <0.00%> (ø)`
...lers/multicluster/commonarea/remote_common_area.go	`27.45% <25.45%> (+1.44%)`	⬆️
...g/agent/cniserver/interface_configuration_linux.go	`16.66% <26.66%> (-0.29%)`	⬇️
...llers/multicluster/member_clusterset_controller.go	`16.45% <27.47%> (+6.88%)`	⬆️
pkg/agent/agent.go	`53.11% <30.00%> (-0.11%)`	⬇️
.../cmd/multicluster-controller/clusterset_webhook.go	`59.37% <59.37%> (ø)`
...llers/multicluster/leader_clusterset_controller.go	`64.02% <60.00%> (ø)`
...agent/flowexporter/connections/deny_connections.go	`84.94% <87.50%> (+1.03%)`	⬆️
... and 39 more

edwardbadboy

Do we also need to create a tar.gz for these logs? I only put the logs in a directory, and I am not sure whether tar.gz is a mandatory requirement. Using different formats for normal nodes and failed nodes (directory vs. tar.gz) seems more appropriate.

I think it's better to create a tar.gz file for the failed nodes, too. We have a failed_nodes file in the support bundle that records the failed nodes, so the user can tell from that file which nodes failed. There may be third-party scripts that analyze the logs, so we'd better use tar.gz files for all nodes to make the support bundle easier for those scripts to handle.

Another reason is that, from the user's point of view, it is not obvious why some nodes are in directories and others are in a tar.gz. The user will wonder why the support bundle uses a different file/directory layout for different nodes.

Alternatively, if you want to just use directories, add a _failed prefix/infix/suffix to the directory names. This will be more self-explanatory.

The log file names are also different. For failed nodes, the current format is <container-name>.log, which does not include the log level or timestamp. Again, I am not sure whether any external tools rely on the old format.

I believe there are some tools that rely on the log message format, but I don't suggest changing the log messages in the support bundle code to accommodate those tools. Let's keep everything collected from K8s as-is. It is probably better to present the raw contents collected from K8s and have the third-party tools adapt to the format themselves.

pkg/antctl/raw/supportbundle/command.go

hangyan · 2022-05-25T10:18:29Z

@edwardbadboy All updated. The question about the log timestamp concerns the log file name. Let me show you the latest test results:

logs/
├── agent
│   └── antrea-agent.log
├── install-cni
│   └── install-cni.log
└── ovs
    └── antrea-ovs.log

This is the directory structure for failed nodes, and this is the one for normal nodes:

logs/
├── agent
│   ├── antrea-agent.u2.root.log.ERROR.20220525-083747.1
│   ├── antrea-agent.u2.root.log.ERROR.20220525-084258.1
│   ├── antrea-agent.u2.root.log.ERROR.20220525-084759.1
│   ├── antrea-agent.u2.root.log.ERROR.20220525-085212.1
│   ├── antrea-agent.u2.root.log.FATAL.20220525-083459.1
│   ├── antrea-agent.u2.root.log.FATAL.20220525-083747.1
│   ├── antrea-agent.u2.root.log.FATAL.20220525-084258.1
│   ├── antrea-agent.u2.root.log.FATAL.20220525-084759.1
│   ├── antrea-agent.u2.root.log.INFO.20220525-083747.1
│   ├── antrea-agent.u2.root.log.INFO.20220525-084258.1
│   ├── antrea-agent.u2.root.log.INFO.20220525-084759.1
│   ├── antrea-agent.u2.root.log.INFO.20220525-085210.1
│   ├── antrea-agent.u2.root.log.WARNING.20220525-083747.1
│   ├── antrea-agent.u2.root.log.WARNING.20220525-084258.1
│   ├── antrea-agent.u2.root.log.WARNING.20220525-084759.1
│   └── antrea-agent.u2.root.log.WARNING.20220525-085212.1
└── ovs
    ├── ovsdb-server.log
    └── ovs-vswitchd.log

The main difference is the timestamp part of the file name, which we cannot get from the Kubernetes API. Or perhaps we can, but I'm not sure it's worth the effort.
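
To make the failed-node layout concrete, here is a small sketch of how a container name could map to a path inside the bundle. The rule used here (drop an "antrea-" prefix to get the directory name) is only inferred from the failed-node listing above; `fallbackLogPath` is a hypothetical helper and not a function from this PR.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// fallbackLogPath returns the in-bundle path for a container log fetched
// through the Kubernetes API. Assumption: the directory is the container name
// without its "antrea-" prefix, as the failed-node listing suggests.
func fallbackLogPath(container string) string {
	dir := strings.TrimPrefix(container, "antrea-")
	return path.Join("logs", dir, container+".log")
}

func main() {
	for _, c := range []string{"antrea-agent", "install-cni", "antrea-ovs"} {
		fmt.Println(fallbackLogPath(c))
	}
}
```

With this rule there is exactly one file per container and no log level or timestamp in the name, which is the difference discussed in this comment.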

edwardbadboy

I think the current layout is OK. We don't have to fetch timestamp.

pkg/antctl/raw/supportbundle/command.go

pkg/util/compress/compress.go

edwardbadboy

Looks good to me. Thanks for the change.

A reminder: it seems that the patch fails some tests; we need to fix them.

mengdie-song

Thanks for the change.

pkg/antctl/raw/supportbundle/command.go

hangyan · 2022-06-02T05:41:35Z

CI is unstable now.

@edwardbadboy @mengdie-song Could you help review this again? I have added support for dumping agentinfo/controllerinfo on failure. Besides that, some additional notes about this PR:

When dumping the agentinfo, the current agent Pod and the podRef in the agentinfo may not match: when an error happens, a new agent Pod is created, but the agentinfo is not up to date. In that case, we still dump the agentinfo, matching by host.
The current supportbundle process relies heavily on agentinfo/controllerinfo existing. When Antrea fails to start in the first place, no agentinfo exists, so no data is collected at all, even though we could still retrieve the related information from Kubernetes. This can be improved in the future by removing the need for agentinfo/controllerinfo entirely when something goes wrong and we cannot fetch it, and getting the target information from Kubernetes alone.
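
The "match by host" idea from the first note can be sketched as below. The `agentInfo` and `pod` types are trimmed-down, hypothetical stand-ins for the real AntreaAgentInfo and Pod objects, and `findAgentInfo` is not a function from this PR.

```go
package main

import "fmt"

// agentInfo is a hypothetical stand-in for the agentinfo fields used here.
type agentInfo struct {
	NodeName string
	PodName  string // Pod recorded in the (possibly stale) agentinfo
}

// pod is a hypothetical stand-in for the Pod fields used here.
type pod struct {
	Name     string
	NodeName string
}

// findAgentInfo looks up the agentinfo by the Node the Pod runs on. It also
// reports whether the recorded Pod name no longer matches the current Pod.
func findAgentInfo(infos []agentInfo, p pod) (info agentInfo, found bool, stale bool) {
	for _, i := range infos {
		if i.NodeName == p.NodeName {
			return i, true, i.PodName != p.Name
		}
	}
	return agentInfo{}, false, false
}

func main() {
	infos := []agentInfo{{NodeName: "node-1", PodName: "antrea-agent-old"}}
	current := pod{Name: "antrea-agent-new", NodeName: "node-1"}
	info, found, stale := findAgentInfo(infos, current)
	fmt.Println(info.NodeName, found, stale)
}
```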

mengdie-song

Thanks for the change, please check the import format, other parts LGTM.

pkg/apiserver/registry/system/supportbundle/rest.go

pkg/antctl/raw/supportbundle/command.go

mengdie-song

LGTM

hangyan · 2022-06-08T06:59:03Z

The related unit test has failed. Working on that now.

edwardbadboy

Thanks for the change. I guess there's not enough time for 1.7.0. Maybe we can put it in 1.7.1?

pkg/antctl/raw/supportbundle/command.go

pkg/util/compress/compress.go

hangyan · 2022-06-15T10:53:17Z

@edwardbadboy All updated. Please help review again, thanks!

tnqn

some nit comments

pkg/antctl/raw/supportbundle/command.go

hangyan · 2022-07-14T05:51:15Z

@tnqn @edwardbadboy Tests passed. Please review again. I copied the compress-dir function to a new place for reuse. However, updating rest.go to use the new function kept causing the unit test to fail (hash value mismatch). I am not sure what the root cause is, so I reverted that part, which leaves a little code duplication.

pkg/util/compress/compress.go

pkg/antctl/raw/supportbundle/command.go

github-actions · 2022-10-26T00:47:53Z

This PR is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

hangyan · 2024-11-20T07:02:33Z

cc @antoninbas Can you also help review this? It's a pretty old PR, but I think it's still useful.
cc @luolanzone

pkg/antctl/raw/supportbundle/command.go

pkg/antctl/raw/supportbundle/command_test.go

pkg/util/k8s/pod.go

pkg/antctl/raw/supportbundle/command.go

antoninbas · 2024-12-16T21:11:51Z

pkg/antctl/raw/supportbundle/command.go

+		}
+		return nil
+	}(); err != nil {
+		errors = append(errors, err)


shouldn't we return early here?

The error could also be caused by Marshal or WriteFile. In these cases, podRef is still valid.

pkg/antctl/raw/supportbundle/command.go

antoninbas · 2024-12-16T21:12:52Z

pkg/antctl/raw/supportbundle/command.go

+	if podRef != nil {
+		pod, err := k8sClient.CoreV1().Pods(podRef.Namespace).Get(ctx, podRef.Name, metav1.GetOptions{})
+		if err == nil {


shouldn't we add an error to the aggregate if podRef is nil or if the Get API call returns an error?

If Get failed and podRef is nil, that case is already handled in the sub-function above.

antoninbas · 2024-12-16T21:14:04Z

pkg/antctl/raw/supportbundle/command.go

+			errors = append(errors, err)
+		}
+	}
+	return utilerror.NewAggregate(errors)


By the way, I am actually not sure we need an aggregate for this function, since there is only one controller Pod?

In the controller case it may seem a bit complicated. I think the point of the aggregated error is to get as much information as possible in a broken environment, not only to handle the multiple-Pod case. If we can get the controllerInfo but fail to marshal it and write it to disk (a very rare case, where we could perhaps just return instead of aggregating the errors), this should not stop us from retrieving logs from the Pod.

Anyway, I refactored this part (both controller and agent) to just return the error instead of aggregating errors for each Pod.

pkg/antctl/raw/supportbundle/command.go

antoninbas · 2024-12-16T21:21:51Z

pkg/antctl/raw/supportbundle/command.go

+}
+
+// downloadAndPackPodLogs download pod's logs and compress them to the target dir. `tmpDir` is used to store the logs file momentarily.
+func downloadAndPackPodLogs(ctx context.Context, k8sClient kubernetes.Interface, pod *corev1.Pod, dir string, tmpDir string) error {


this function is misleading IMO. It sounds like it would download the logs and then create an archive with them. But actually it packages everything that is already in tmpDir if I understand the code correctly. It should probably be 2 separate functions (plus there is already a function in charge of downloading Pod logs, so we just need a function to create the tarball).

pkg/antctl/raw/supportbundle/command_test.go

hangyan · 2024-12-24T03:19:54Z

@antoninbas Can you take a look at this again? Thanks

pkg/antctl/raw/supportbundle/command.go

pkg/util/k8s/pod.go

pkg/util/compress/compress.go

pkg/antctl/raw/supportbundle/command_test.go

pkg/antctl/raw/supportbundle/command.go

antoninbas · 2025-01-02T20:10:19Z

pkg/antctl/raw/supportbundle/command_test.go

+					expectFileName := "logs/agent/antrea-agent.log"
+					if node == "" {
+						expectFileName = "logs/controller/antrea-controller.log"
+					}


it would probably be better to have a list of expected files inside the bundle, and then check that it matches the actual list of files by calling assert.ElementsMatch. Don't we have more files than that in the fallback bundle?

That depends on the test Pod's container settings; currently there is only one.

Comparing all paths would be ideal, including all file names and subdirectories, but I'm not sure it is worth the effort. I already added a new function, 'UnpackDir', whose only use case is this test. More helper functions would be needed for a full comparison.

That depends on the test Pod's container settings; currently there is only one.

Don't we also have "agentinfo" / "controllerinfo"?

Updated with more checks.
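
For reference, the order-insensitive comparison suggested in this thread (assert.ElementsMatch from testify) behaves roughly like the stdlib-only sketch below. `sameElements` is a hypothetical helper and a stand-in for that assertion, not code from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// sameElements reports whether a and b contain the same strings, ignoring
// order but respecting duplicates. The inputs are not modified.
func sameElements(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

func main() {
	expected := []string{"agentinfo", "logs/ovs/antrea-ovs.log", "logs/agent/antrea-agent.log"}
	actual := []string{"logs/agent/antrea-agent.log", "agentinfo", "logs/ovs/antrea-ovs.log"}
	fmt.Println(sameElements(expected, actual))
}
```

Comparing the whole list this way also catches unexpected extra files in the bundle, which a check for a single expected file name would miss.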

Signed-off-by: Hang Yan <[email protected]>

pkg/antctl/raw/supportbundle/command_test.go

antoninbas · 2025-01-10T18:58:00Z

pkg/antctl/raw/supportbundle/command_test.go

+				"": {
+					filepath.Join("logs", "controller", "antrea-controller.log"),
+				},
+				"node-1": {
+					"agentinfo",
+					filepath.Join("logs", "ovs", "antrea-ovs.log"),
+					filepath.Join("logs", "agent", "antrea-agent.log"),
+				},


I still don't see "agentinfo" / "controllerinfo" in the list, is this expected?

I'm not sure what went wrong, but I did add agentinfo to the list.

As for controllerinfo, it is a single file at the top level, not bundled in the controller tar.gz, so I added an extra check for it instead of putting it in the expected list.

Signed-off-by: Hang Yan <[email protected]>

antoninbas

LGTM

antoninbas · 2025-01-16T22:32:46Z

/test-all

edwardbadboy requested changes May 20, 2022

edwardbadboy reviewed May 25, 2022

edwardbadboy previously approved these changes May 26, 2022

mengdie-song reviewed May 26, 2022

hangyan dismissed edwardbadboy’s stale review via a550b11 June 2, 2022 03:14

mengdie-song reviewed Jun 2, 2022

mengdie-song previously approved these changes Jun 6, 2022

edwardbadboy reviewed Jun 8, 2022

hangyan dismissed mengdie-song’s stale review via 9f92351 June 15, 2022 08:35

tnqn reviewed Jun 20, 2022

hangyan force-pushed the supportbundle-for-failed-nodes-controller branch 5 times, most recently from 31e529e to 26c7a5d July 13, 2022 09:19

tnqn reviewed Jul 27, 2022

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 26, 2022

github-actions bot closed this Jan 25, 2023

hangyan reopened this Nov 15, 2024

hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from 26c7a5d to 40c9d09 November 15, 2024 03:35

luolanzone removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 15, 2024

hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from 933a339 to 16d109f November 20, 2024 06:39

hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from 16d109f to 91ec08b November 20, 2024 06:40

hangyan requested a review from antoninbas November 20, 2024 06:51

hangyan requested a review from tnqn November 20, 2024 09:25

hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from c40484a to 4092ff6 November 20, 2024 09:25

antoninbas added this to the Antrea v2.3 release milestone Nov 20, 2024

antoninbas reviewed Nov 20, 2024

antoninbas reviewed Nov 21, 2024

hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from f1295d3 to 4d58ac0 December 16, 2024 10:31

antoninbas reviewed Dec 16, 2024

hangyan mentioned this pull request Dec 24, 2024

Support antctl command for packetcapture #6884

Open

antoninbas reviewed Jan 2, 2025

Support collect logs for failed agents and controller for supportbundle

1782904

Signed-off-by: Hang Yan <[email protected]>

hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from b0a69ea to 1782904 January 9, 2025 03:40

antoninbas reviewed Jan 10, 2025

test

1159ad6

Signed-off-by: Hang Yan <[email protected]>

hangyan force-pushed the supportbundle-for-failed-nodes-controller branch from 7d990f7 to 1159ad6 January 14, 2025 07:35

antoninbas approved these changes Jan 16, 2025
