Delete stale metric labels from BPF maps #99
base: main
@@ -367,6 +367,7 @@ func (l *loader) startHashMap(
select {
case <-ticker.C:
mapIter := liveMap.Iterate()
labels := make([]map[string]string, 0)
Does the map have a Len function to use here?
very unfortunately it does not
Okie dokie
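Since the map exposes no Len, the label slice cannot be pre-sized, and the entry count is only known after a full pass over the iterator. A minimal sketch of that pattern, assuming a hypothetical entryIterator interface that stands in for whatever liveMap.Iterate() returns (the interface, the fake sliceIterator, and the "key" label name are illustrative, not taken from this PR):

```go
package main

import "fmt"

// entryIterator is a hypothetical stand-in for the iterator returned by
// liveMap.Iterate(); only a Next method is assumed here.
type entryIterator interface {
	Next(key, value *uint64) bool
}

// sliceIterator is a fake iterator backed by in-memory data, for illustration.
type sliceIterator struct {
	keys, values []uint64
	pos          int
}

func (it *sliceIterator) Next(key, value *uint64) bool {
	if it.pos >= len(it.keys) {
		return false
	}
	*key, *value = it.keys[it.pos], it.values[it.pos]
	it.pos++
	return true
}

// collectLabels walks the iterator once and returns one label set per entry.
// The slice starts empty and grows with append because the number of entries
// is not known up front.
func collectLabels(it entryIterator) []map[string]string {
	var labels []map[string]string
	var key, value uint64
	for it.Next(&key, &value) {
		labels = append(labels, map[string]string{"key": fmt.Sprint(key)})
	}
	return labels
}

func main() {
	it := &sliceIterator{keys: []uint64{10, 20, 30}, values: []uint64{1, 2, 3}}
	labels := collectLabels(it)
	fmt.Println(len(labels), labels[0]["key"]) // prints "3 10"
}
```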
LGTM just 1 nit
Thanks for taking this on @tjons!
Code looks good to me, just curious how you tested?
No problem @lgadban. I ran tcpconnect to verify that the code didn't break anything, but I would like to hold off merging this until I finish adding another example program that performs map deletion. I'm working on it today/this evening.
@@ -80,6 +80,7 @@ exit_tcp_connect(struct pt_regs *ctx, int ret)
val = 1;
}
else {
bpf_map_delete_elem(&sockets, &tid);
Is this a permanent change, and is it required? For the long term, I think it would be nice if we altered the upstream examples only as much as absolutely needed for maintainability purposes.
@krisztianfekete no this is just for testing, I'm working off a VM in gcloud so have been pushing changes for tests. Will remove before merge.
@tjons this PR looks good. Can you clean up so we can merge?
Adding this so we don't forget, we need to test this change before merging
@lgadban is correct, we need to do some testing. I'm working on other issues but will make some time this week to revisit this again.
We cherry-picked this commit and tested it. With the new code, the data you get from the Prometheus endpoint is not aligned with the data in the eBPF map. The eBPF map shows the correct data (you can inspect it either from the CLI GUI or by looking at the pinned map under the /sys/fs/bpf folder). The Prometheus endpoint, on the other hand, is cleared every second: it shows only the delta from the previous reading (1 second tick). If you use a hash counter, it shows the count for each key incremented in the last second or so.
@andrea-tomassi this is super helpful, thanks for the feedback! We will take a closer look at the implementation and rework it as needed.
@lgadban We did some more accurate testing on that, and I think we isolated the bug more precisely. It looks like the actual problem is having a char array as a field of the struct we use as the key for the BPF map. In fact we use the following:
Output (both in /sys/fs/bpf/sl_process_map and in the CLI) is the following:
As you can see, some of the keys are identical, which is not the intended behavior for a hash map. Now, if we remove the char glcomm[TASK_COMM_LEN]; from the struct, the key duplication disappears and everything seems to work properly (our test code both inserts and deletes keys). We did not test a u32[] array, so I cannot say whether the problem is the array itself or the "char" data type. Hope this helps... Andrea T
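One possible explanation for keys that look identical, which is not confirmed anywhere in this thread: a BPF hash map compares the full key byte for byte, so two char buffers that hold the same string but differ in the bytes after the NUL terminator are distinct keys, yet they print identically once decoded as C strings. The sketch below only illustrates that effect in user space; commString and the sample buffers are hypothetical, and only the glcomm field and TASK_COMM_LEN come from the report above.

```go
package main

import (
	"bytes"
	"fmt"
)

const taskCommLen = 16 // value of TASK_COMM_LEN in the Linux kernel

// commString decodes a fixed-size comm buffer the way a C string is read:
// everything from the first NUL byte onward is dropped.
func commString(comm [taskCommLen]byte) string {
	if i := bytes.IndexByte(comm[:], 0); i >= 0 {
		return string(comm[:i])
	}
	return string(comm[:])
}

func main() {
	var a, b [taskCommLen]byte
	copy(a[:], "bash")
	copy(b[:], "bash")
	b[8] = 'x' // leftover byte after the terminator, e.g. from a buffer that was not zeroed

	fmt.Println(a == b)                         // false: the raw key bytes differ
	fmt.Println(commString(a) == commString(b)) // true: both decode to "bash"
}
```

If this is what is happening, zeroing the key struct in the BPF program before filling it would be one thing to try; that is an assumption to verify, not a diagnosis.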
Sorry, I didn't specify some important details: this behaves the same way in all lines of code. We specifically tested the "master" branch, and key deletion works there as long as you don't use char[]. On the Prometheus endpoint side we still get an issue even without the char[] field: the keys are not properly deleted. It seems that keys (or maybe some of them) are not properly deleted in userspace.
@andrea-tomassi thank you so much for the detailed report! We will use this to recreate this issue.
When an entry is removed from a map, we need to delete the associated metric.
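The idea in the description can be sketched as a set difference between the label sets exported so far and the keys seen in the latest map iteration. This is a rough illustration, not the code in this PR: staleKeys, the string-encoded label sets, and the "pid=..." values are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// staleKeys returns the entries of exported that are absent from live,
// in sorted order. Both sets are keyed by an encoded label set.
func staleKeys(exported, live map[string]struct{}) []string {
	var stale []string
	for k := range exported {
		if _, ok := live[k]; !ok {
			stale = append(stale, k)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// Label sets exported on earlier ticks.
	exported := map[string]struct{}{"pid=1": {}, "pid=2": {}, "pid=3": {}}
	// Label sets found in the map on the current tick.
	live := map[string]struct{}{"pid=1": {}, "pid=3": {}}

	for _, k := range staleKeys(exported, live) {
		delete(exported, k) // a real exporter would drop the matching metric here
		fmt.Println("deleted", k)
	}
	fmt.Println(len(exported)) // prints 2
}
```

Only series missing from the map are removed; series that are still present keep their current values and are not reset.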