[CRITICAL BUG] Grafana Promtail - CRITICAL issue - initial scrape config causes 100% CPU- and Memory load #11398

Closed
janfickler opened this issue Dec 6, 2023 · 13 comments
Labels
help wanted: We would love help on these issues. Please come help us!
type/bug: Something is not working as expected

@janfickler

Describe the bug

The RPM packages for "promtail" from the official Grafana OSS repository include an initial scrape config that drives CPU and memory usage to 100% immediately after RPM installation and the automatic service start (system behaviour).

This makes the systems completely unmanageable. SSH still works, but under this load it is not usable.

Affected configuration - /etc/promtail/config.yml

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
```

Once the service has been stopped in some way and the config has been deleted or changed, the application works properly. The initial config, however, tails every file matching "/var/log/*log", which puts systems under enormous pressure.

To Reproduce

  • Install the latest promtail RPM from the Grafana OSS RPM repository with the included config.yml

(Tested on CentOS 7 and AlmaLinux 8 systems, so all Red Hat systems, or all systems that use RPM packages, are affected.)

Expected behavior

  • The service starts properly with the initial config.yml

Possible Solution

  • Create a new RPM package version of the application with an adjusted initial config. The latest is currently 2.9.2, so the next would be 2.9.3. The adjusted config file should just point to a dedicated file such as /var/log/messages.

Affected systems

  • RPM-based systems

Criticality

  • Extremely critical on production systems: it leaves systems unusable!
@janfickler changed the title from "Grafana Promtail - CRITICAL issue - initial scrape config causes 100% CPU- and Memory load" to "[CRITICAL BUG] Grafana Promtail - CRITICAL issue - initial scrape config causes 100% CPU- and Memory load" on Dec 8, 2023
@JStickler added the type/bug (Something is not working as expected) label on Dec 11, 2023
@janfickler
Author

janfickler commented Dec 14, 2023

@JStickler,
the issue still persists in the latest package, promtail-2.9.3.x86_64.rpm

```yaml
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
```

As noted in the issue description, the initial config could point to a dedicated existing file (e.g. /var/log/messages) or a non-existing file (e.g. /var/log/file.log).

Please create a new RPM package with one of these options. This would fix the issue.

@JStickler added the help wanted (We would love help on these issues. Please come help us!) label on Dec 14, 2023
@JStickler
Contributor

@janfickler I'm a technical writer, not a developer. But I'll see if anyone on the team has time to look at this issue.

@kavirajk
Contributor

@janfickler thanks for reporting the issue. It's unfortunate that it's impacting your production systems.

May I know how many files we are talking about in /var/log/*log? I'm trying to understand whether a huge number of files on the target path is causing this, or whether it is a memory issue in Promtail itself. Are you able to reproduce it with an empty /var/log/ target, or one with fewer files? You could also try multiple targets for /var/log/, each with a different set of subdirectories or files. Can you reproduce it that way?

> As noted in the issue description, the initial config could point to a dedicated existing file (e.g. /var/log/messages) or a non-existing file (e.g. /var/log/file.log).

I don't understand what you mean here. Can you elaborate? Thanks.
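
One way to answer the file-count question on an affected host is a small script along these lines. It is only a sketch: `summarize_glob` is a hypothetical helper name, and Python's `glob` module is used as an approximation of how Promtail expands `__path__` (Promtail's own matcher may differ in edge cases).

```python
import glob
import os


def summarize_glob(pattern):
    """Return (file_count, total_bytes) for regular files matching pattern.

    Hypothetical helper for illustration; it uses Python's glob rules,
    which only approximate how Promtail expands a __path__ pattern.
    """
    count = 0
    total_bytes = 0
    for path in glob.glob(pattern):
        # Skip directories and other non-regular entries.
        if not os.path.isfile(path):
            continue
        try:
            total_bytes += os.path.getsize(path)
        except OSError:
            # The file vanished or cannot be stat'ed; leave it out.
            continue
        count += 1
    return count, total_bytes
```

Calling `summarize_glob("/var/log/*log")` on the host would then give the number of matched files and their combined size, which could be compared between a freshly installed system and a long-running one.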

@janfickler
Author

janfickler commented Dec 14, 2023

@kavirajk,
the initial config needs to point to a dedicated file instead of /var/log/*log, because that pattern is what causes the issue. It picks up every matching file in the folder, and the folder is never empty: it is a system folder that comes with the installation and also holds the system logs.

As I said, this affects all RPM- and DEB-based systems, because the folder /var/log/ is never empty, so the initial config should point to a dedicated file or a non-existing file. This affects not only our systems; it should affect all customers.

I tested this: with the initial install, systems become overloaded immediately (CPU 100% / memory 100%).

The solution would be easy.

Change this in config.yml:

```yaml
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
```

to:

```yaml
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/file.log
```

That is, change /var/log/*log to /var/log/file.log, /var/log/messages, or /var/log/kern.log.

Then everything should be fine, and you can build the new RPM / DEB package from that.
This should solve all the initial problems, and afterwards customers can create their own config without an unstable or overloaded system.

I also checked the platform: KVM, VMware, vSphere, and Azure-based VMs all had the same issue with Ubuntu, CentOS 7 / 8, AlmaLinux 8 / 9, etc.

By the way, I tested afterwards with my own config pointing to separate dedicated files, and this works without problems.
It is only /var/log/*log that overloads systems on initial installation of the package, because systems start the promtail service immediately after install (system behaviour, which can't be changed).

@kavirajk
Contributor

kavirajk commented Dec 15, 2023

Thanks for the explanation @janfickler. It sounds very reasonable that the initial config should not overload the system at startup.

Now we need to figure out which file or files (instead of /var/log/*log) would work across all the different platforms (should that even be a goal?).

This also made me think: the problem can still happen if a user deliberately adds /var/log/*log to a single target and starts the service.

So there is a long-term opportunity to add some kind of upper bound on the number of files (it could also be based on file size) that Promtail scrapes at once, so that it does not overload the whole system. Currently it starts tailing all the matched files at once, and each file is handled by a separate goroutine.

My proposal is the following:

  1. Change the initial config so that it points to a single file instead of /var/log/*log (basically what @janfickler proposed).
  2. [For the long term] Add an upper bound on the number of files a single target can process at the same time without overloading the system.

With (2), we also have to make sure this change is available in Grafana Agent as well.
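
The upper bound in (2) could look roughly like the following. This is a Python illustration of the idea only, not Promtail code (Promtail is written in Go and uses goroutines); `tail_with_limit`, `read_file`, and `max_concurrent` are hypothetical names, and handling a file once stands in for tailing it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def tail_with_limit(paths, read_file, max_concurrent=4):
    """Process every path, but never more than max_concurrent at once.

    read_file is a caller-supplied function (hypothetical here) that
    handles one file; a fixed-size worker pool provides the upper bound.
    Returns the peak number of files that were in flight together.
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def worker(path):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            read_file(path)
        finally:
            with lock:
                in_flight -= 1

    # The pool size is the bound: extra paths wait in the queue.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(worker, paths))
    return peak
```

Real tailing is long-lived rather than a one-shot read, so an actual limit would also have to decide which files wait and for how long; the sketch only shows the bounding mechanism.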

@janfickler
Author

janfickler commented Dec 15, 2023

I think you could try a file like /var/log/messages, or something else that is always present on a system after the initial OS installation.

In my experience, RPM- and DEB-based systems normally have the same files in /var/log/.

One more piece of information:
the folder /var/log/ also contains files with binary content, like "/var/log/lastlog".

It could be that the pattern /var/log/*log also matches this file, which adds to the problem.
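
That is easy to check with a quick sketch. The file listing below is a hypothetical example of a typical /var/log, and Python's `fnmatch` is used as a stand-in for Promtail's glob matching, so treat the result as an approximation.

```python
import fnmatch

# Hypothetical directory listing; real hosts will differ.
names = ["messages", "secure", "lastlog", "tallylog", "boot.log",
         "yum.log", "kern.log", "wtmp", "cron"]

# "*log" matches any name that merely ends in "log", not just "*.log".
matched = [n for n in names if fnmatch.fnmatchcase(n, "*log")]
print(matched)
```

With this listing, `lastlog` and `tallylog` are matched alongside the `.log` files, while `messages`, `secure`, `wtmp`, and `cron` are not.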

I think option 1 would be a good fix as a workaround for now. Option 2 is a good way to make it bulletproof, but it should be tested more deeply whether the issue depends only on files with binary content; maybe Promtail should ignore a file if it detects binary content. I suggest developing this in the long term.
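
A check of that kind could be as simple as the heuristic below. This is an assumption about how such detection might work, not something Promtail is known to do; `looks_binary` and the 8 KiB sample size are made up for illustration.

```python
def looks_binary(path, sample_size=8192):
    """Heuristic: treat a file as binary if its first bytes contain NUL.

    Hypothetical helper; plain-text logs rarely contain NUL bytes,
    while fixed-record binary files typically do.
    """
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    return b"\x00" in sample
```

The heuristic has known blind spots: UTF-16 text would be flagged as binary, and a binary file whose first 8 KiB happen to contain no NUL byte would pass as text.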

kavirajk added a commit that referenced this issue Dec 18, 2023
Related issue: #11398

This minimal config scrape only single file thus not overloading the systems as described in the issue

Signed-off-by: Kaviraj <[email protected]>
@janfickler
Author

@kavirajk, I guess you are on holiday :-)
Have a nice Christmas. Maybe you can finish the packaging when you are back next year?

@janfickler
Author

janfickler commented Jan 3, 2024

Any update, @kavirajk?
@JStickler, could you push this a little with the development team? :-)

@janfickler
Author

@JStickler, is there any way to push this along a little?
It looks like the build job from @kavirajk is not working properly, and nothing has happened for 3 weeks.

@JStickler
Contributor

@janfickler end of December is when a lot of the team takes vacation because if they don't use their time off, they lose it. So it's understandable that not much has happened in the past two or three weeks. But people are coming back to work, and I see that @kavirajk has already commented on the PR that he's taking a look at it.

kavirajk added a commit that referenced this issue Jan 14, 2024
Related issue: #11398

This minimal config scrape only single file thus not overloading the
systems as described in the issue
grafanabot pushed a commit that referenced this issue Jan 14, 2024
Related issue: #11398

This minimal config scrape only single file thus not overloading the
systems as described in the issue

(cherry picked from commit 86f2001)
poyzannur added a commit that referenced this issue Jan 17, 2024
…ackaging. (#11676)

Backport 86f2001 from #11511

---

**What this PR does / why we need it**:
Related issue: #11398

This minimal config scrape only single file thus not overloading the
systems as described in the issue

**Which issue(s) this PR fixes**:
Fixes #<issue number>

**Special notes for your reviewer**:

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add
`add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/setup/upgrade/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
- [ ] If the change is deprecating or removing a configuration option,
update the `deprecated-config.yaml` and `deleted-config.yaml` files
respectively in the `tools/deprecated-config-checker` directory.
[Example
PR](0d4416a)

Co-authored-by: Kaviraj Kanagaraj <[email protected]>
Co-authored-by: Poyzan <[email protected]>
@janfickler
Author

@poyzannur / @kavirajk - has the new version 2.9.4 still not been released?

@janfickler
Author

janfickler commented Jan 25, 2024

@poyzannur, I can confirm that the RPM/DEB package is now available, thanks a lot :-)

https://github.com/grafana/loki/releases/tag/v2.9.4

```
Resolving Dependencies
--> Running transaction check
---> Package promtail.x86_64 0:2.9.2-1 will be updated
---> Package promtail.x86_64 0:2.9.4-1 will be an update
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package        Arch         Version        Repository        Size

Updating:
 promtail       x86_64       2.9.4-1        grafana           26 M

Transaction Summary

Upgrade  1 Package
```

@janfickler
Author

Thanks for the support, guys :-)

rhnasc pushed a commit to inloco/loki that referenced this issue Apr 12, 2024
…na#11511)

Related issue: grafana#11398

This minimal config scrape only single file thus not overloading the
systems as described in the issue