Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add capability to disable specific alert rules Loki #11241

Merged
merged 36 commits into from
Jan 17, 2024

Conversation

Daniel-Vaz
Copy link
Contributor

What this PR does / why we need it:

Currently using the Loki Chart we can only either enable\disable ALL alert rules.
For specific environments and use-cases sometimes not all alert Rules are useful to have enabled.

With this PR change, we can cleanly and through the Chart values disable specific Alerts.

Special notes for your reviewer:

Checklist

Extra Note:

This is my first time doing a PR for this project. I forked the main branch and implemented these changes. Did not change the Chart.yaml version neither the Changelog, but if needed (or any other missing action) just request me and I will do it.

Thank you in advance for all your awesome work.

@Daniel-Vaz Daniel-Vaz requested a review from a team as a code owner November 16, 2023 08:49
@CLAassistant
Copy link

CLAassistant commented Nov 16, 2023

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Daniel-Vaz

I appreciate what you're trying to do but I think we need a more generic solution. Can we take a step back and talk about the problem you're trying to solve? Why do you want to disable these specific alerts?

@Daniel-Vaz
Copy link
Contributor Author

Daniel-Vaz commented Nov 16, 2023

Hi @dannykopping , thank you for the feedback !

Can we take a step back and talk about the problem you're trying to solve? Why do you want to disable these specific alerts?

So currently I'm a consumer of this Loki Chart, and since the clusters where I have Loki deployed I also have the Prom Operator, I want to enable the Chart suggested Scrapes\Rules\Alerts.

The Problem I'm facing, is that in some of my clusters indeed we have some "issues" in the overall architecture that result on Loki reporting quite frequently some LokiRequestLatency.

We could just silence them.
Or even we could just disable all the Chart default alerts, and copy them on our side.

But I believe that giving the flexibility to the Chart consumer to pick and chose which alerts they want to have enabled or not would be a nice feature.

For example, on the Kube Prometheus Stack Chart they support this disabling specific alerts capability.

Do you have any other suggestion on how this capability could be implemented ?

Copy link
Contributor

github-actions bot commented Nov 16, 2023

Trivy scan found the following vulnerabilities:

  • HIGH, Target: docker.io/grafana/loki:main-3cb2a6b (alpine 3.18.4), Type: alpine openssl: Incorrect cipher key and IV length processing in libcrypto3 v3.1.3-r0. Fixed in v3.1.4-r0
  • HIGH, Target: docker.io/grafana/loki:main-3cb2a6b (alpine 3.18.4), Type: alpine openssl: Incorrect cipher key and IV length processing in libssl3 v3.1.3-r0. Fixed in v3.1.4-r0
    \nTo see more details on these vulnerabilities, and how/where to fix them, please run docker build -t grafana/loki:main-3cb2a6b -f cmd/loki/Dockerfile .
    trivy i grafana/loki:main-3cb2a6b on your branch. If these were not introduced by your PR, please considering fixing them in via a subsequent PR. Thanks!

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, that sounds like a good enough reason to me - and good to see some prior art where this is done the same way. Thanks for that

LGTM! I'll just request a PR from a colleague who knows Helm before we merge.

@shantanualsi
Copy link
Contributor

@Daniel-Vaz have you tried testing when all alerts are set to false and the rules block under loki_alerts is empty?

@Daniel-Vaz
Copy link
Contributor Author

LGTM! I'll just request a PR from a colleague who knows Helm before we merge.

Thank you very much @dannykopping !

@Daniel-Vaz have you tried testing when all alerts are set to false and the rules block under loki_alerts is empty?

@shantanualsi Indeed in that situation the object will fail to be applied because it as no rules under the groups section.
I assumed that it's not likely that people will disable all alerts one by one. In those situations they will just set the monitoring.rules.alerting value to false.

Either way I think that if we find a way to address this corner-case it would indeed increase the reliability of the Chart.
I'm just a bit unsure how could it be implemented :/ any ideas ?

@shantanualsi
Copy link
Contributor

shantanualsi commented Nov 17, 2023

In those situations they will just set the monitoring.rules.alerting value to false.

in our case, you can technically have enabled: true and LokiCanaryLatency: true and the chart will still break. I'm still thinking if there's a better way to address this. The kube-prometheus-stack you referenced in one of the comments above has a python script under hack folder that creates yaml files.. do you know if it addresses this particular case?

I'm also thinking that rules section now has enabled which is a boolean and disabled which is a map of <rules>: true. This can be confusing to other users..

@Daniel-Vaz
Copy link
Contributor Author

Daniel-Vaz commented Nov 17, 2023

in our case, you can technically have enabled: true and LokiCanaryLatency: true and the chart will still break

The way I implemented it is not the case I believe. Since for the LokiCanaryLatency the If condition encapsulates the whole loki_canaries_alerts rules group section.

{{- if not (.Values.monitoring.rules.disabled.LokiCanaryLatency | default false) }}
  - name: "loki_canaries_alerts"
    rules:
      - alert: "LokiCanaryLatency"
        annotations:
          message: |
            {{`{{`}} $labels.job {{`}}`}} is experiencing {{`{{`}} printf "%.2f" $value {{`}}`}}s 99th percentile latency.
        expr: |
          histogram_quantile(0.99, sum(rate(loki_canary_response_latency_seconds_bucket[5m])) by (le, namespace, job)) > 5
        for: "15m"
        labels:
          severity: "warning"
{{- if .Values.monitoring.rules.additionalRuleLabels }}
{{ toYaml .Values.monitoring.rules.additionalRuleLabels | indent 10 }}
{{- end }}
{{- end }}

I did this because that rules group section had only that single alert. If more are added in he future, then that needs to be updated indeed.


The kube-prometheus-stack you referenced in one of the comments above has a python script under hack folder that creates yaml files.. do you know if it addresses this particular case?

I believe that hack script is used in order to have a sync from a external place where alerts are defined. They dont write their own alerts definitions inside the Kube-Prometheus-Stack Chart, they instead just use those script to sync the rules from a external place and format them accordingly.

The usage for those scripts is written here.


I'm also thinking that rules section now has enabled which is a boolean and disabled which is a map of : true. This can be confusing to other users.

I kind of agree with you on this one.
But at the same time the name of those values is quite self explanatory no ?

  • If you want to enable alerting then monitoring.rules.alerting: true
  • If you need to disable a specific rule from the whole group, then : monitoring.rules.disabled: {}

We could also move the disabled to be nested under alerting.
Or even under alerting instead of just supporting <true/false> values, we can change the structure, and there specify the list of rules names, to be enabled\disabled, for example:

monitoring:
  rules:
    alerting:
      LokiTooManyCompactorsRunning: true
      LokiRequestLatency: false
      
# By default all values should be true

@shantanualsi
Copy link
Contributor

shantanualsi commented Nov 20, 2023

Sorry for the delay.

Since for the LokiCanaryLatency the If condition encapsulates the whole loki_canaries_alerts rules group section.

Apologies, I wasn't clear enough here.. in case if all rules under loki_alerts are disabled and rules under loki_canaries_alerts are enabled.

Can you please add some documentation on the values.yaml for the edge case mentioning to set monitoring.rules.alerting to false explicitly if all rules are disabled? Thanks for your patience! :)

@Daniel-Vaz
Copy link
Contributor Author

@shantanualsi Thank you :D !

Updated the values.yaml comments, whenever you want feel free to have a look.

@JStickler
Copy link
Contributor

Also, please note that when you update the Helm charts, you have to bump the version of the Helm Chart. This is mentioned in the PR checklist.

For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR

@shantanualsi
Copy link
Contributor

@Daniel-Vaz can you also please increment the helm version and add a changelog as suggested by @JStickler

@Daniel-Vaz
Copy link
Contributor Author

@JStickler @shantanualsi

Done, Thank you for the warning and sorry for missing that part previously.

auto-merge was automatically disabled December 12, 2023 11:11

Head branch was pushed to by a user without write access

@dannykopping
Copy link
Contributor

@Daniel-Vaz there seem to be some CI failures (the helm linter and the helm docs checks, specifically).

Please address these, and then we can get this merged.

@Daniel-Vaz
Copy link
Contributor Author

I updated the version reference to be 5.41.6.
I ran helm-docs without issues.
But based on the drone ci workflow error I tried to run the command make -BC docs sources/setup/install/helm/reference.md, but I'm getting the following error:

$ make -BC docs sources/setup/install/helm/reference.md
make: Entering directory '/home/user/repos/loki/docs'
podman run --rm --volume "/home/user/repos/loki:/helm-docs" -u "$(id -u)" "docker.io/jnorwood/helm-docs:v1.11.0" \
        -c /helm-docs/production/helm/ \
        -t reference.md.gotmpl \
        -o reference.md
mv "../production/helm/loki/reference.md" "sources/setup/install/helm/reference.md"
time="2024-01-10T09:33:13Z" level=info msg="Found Chart directories [loki]"
time="2024-01-10T09:33:13Z" level=info msg="Generating README Documentation for chart /helm-docs/production/helm/loki"
time="2024-01-10T09:33:13Z" level=warning msg="Could not open chart README file /helm-docs/production/helm/loki/README.md, skipping chart"
mv: cannot stat '../production/helm/loki/reference.md': No such file or directory
make: *** [Makefile:15: sources/setup/install/helm/reference.md] Error 1
make: Leaving directory '/home/user/repos/loki/docs'

@dannykopping
Copy link
Contributor

I updated the version reference to be 5.41.6. I ran helm-docs without issues. But based on the drone ci workflow error I tried to run the command make -BC docs sources/setup/install/helm/reference.md, but I'm getting the following error:

$ make -BC docs sources/setup/install/helm/reference.md
make: Entering directory '/home/user/repos/loki/docs'
podman run --rm --volume "/home/user/repos/loki:/helm-docs" -u "$(id -u)" "docker.io/jnorwood/helm-docs:v1.11.0" \
        -c /helm-docs/production/helm/ \
        -t reference.md.gotmpl \
        -o reference.md
mv "../production/helm/loki/reference.md" "sources/setup/install/helm/reference.md"
time="2024-01-10T09:33:13Z" level=info msg="Found Chart directories [loki]"
time="2024-01-10T09:33:13Z" level=info msg="Generating README Documentation for chart /helm-docs/production/helm/loki"
time="2024-01-10T09:33:13Z" level=warning msg="Could not open chart README file /helm-docs/production/helm/loki/README.md, skipping chart"
mv: cannot stat '../production/helm/loki/reference.md': No such file or directory
make: *** [Makefile:15: sources/setup/install/helm/reference.md] Error 1
make: Leaving directory '/home/user/repos/loki/docs'

Hhmm, i just ran it and it worked fine; did you run this from the root of the project?

You can always copy the changes in manually from this commit:
377c7f8

@pull-request-size pull-request-size bot added size/M and removed size/S labels Jan 16, 2024
@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Jan 16, 2024
@Daniel-Vaz
Copy link
Contributor Author

@dannykopping
I thinks its ready to be merged now.
Thanks

@dannykopping
Copy link
Contributor

Oh man what a nightmare, this CI is failing on whitespace now 💀 🙃
Thank you for your patience here @Daniel-Vaz! I think we're very nearly there.

@Daniel-Vaz
Copy link
Contributor Author

@dannykopping
No worries 😅 I'm, the one who is sorry for this mess.

I see the workflow failing indeed on what it seems to be a removed line\whitespace....
But on my side running helm-docs and helm-lint dont result in any change:

$ helm-docs
INFO[2024-01-17T07:53:09Z] Found Chart directories [production/helm/loki]
INFO[2024-01-17T07:53:09Z] Generating README Documentation for chart production/helm/loki

$ make helm-lint
make -BC production/helm/loki lint
make[1]: Entering directory '/home/user/repos/loki/production/helm/loki'
yamllint -c /home/user/repos/loki/production/helm/loki/src/.yamllint.yaml /home/user/repos/loki/production/helm/loki/src
make[1]: Leaving directory '/home/user/repos/loki/production/helm/loki'

$ git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

@dannykopping
Copy link
Contributor

@dannykopping No worries 😅 I'm, the one who is sorry for this mess.

I see the workflow failing indeed on what it seems to be a removed line\whitespace.... But on my side running helm-docs and helm-lint dont result in any change:

$ helm-docs
INFO[2024-01-17T07:53:09Z] Found Chart directories [production/helm/loki]
INFO[2024-01-17T07:53:09Z] Generating README Documentation for chart production/helm/loki

$ make helm-lint
make -BC production/helm/loki lint
make[1]: Entering directory '/home/user/repos/loki/production/helm/loki'
yamllint -c /home/user/repos/loki/production/helm/loki/src/.yamllint.yaml /home/user/repos/loki/production/helm/loki/src
make[1]: Leaving directory '/home/user/repos/loki/production/helm/loki'

$ git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

I think just manually delete that line then, and let's get this merged.
No need to keep merging main in, by the way.

@Daniel-Vaz
Copy link
Contributor Author

Fingers crossed 🤞

@dannykopping dannykopping merged commit dc1cb52 into grafana:main Jan 17, 2024
10 checks passed
@dannykopping
Copy link
Contributor

😌 thanks for your patience @Daniel-Vaz! We appreciate the contribution a lot!

rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024
**What this PR does / why we need it**:

Currently using the Loki Chart we can only either enable\disable ALL
alert rules.
For specific environments and use-cases sometimes not all alert Rules
are useful to have enabled.

With this PR change, we can cleanly and through the Chart values disable
specific Alerts.
mraboosk pushed a commit to mraboosk/loki that referenced this pull request Oct 7, 2024
**What this PR does / why we need it**:

Currently using the Loki Chart we can only either enable\disable ALL
alert rules.
For specific environments and use-cases sometimes not all alert Rules
are useful to have enabled.

With this PR change, we can cleanly and through the Chart values disable
specific Alerts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/helm size/M type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants