Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FATAL: pgnodemx: expected 1, got 0, lines from file /sys/fs/cgroup/user.slice/user-1000.slice/session-c3.scope/cgroup.controllers #26

Open
df7cb opened this issue Sep 14, 2024 · 4 comments

Comments

@df7cb
Copy link
Contributor

df7cb commented Sep 14, 2024

In Debian's CI environment, the pgnodemx regression tests fail:

### PostgreSQL 17 installcheck ###
Creating new PostgreSQL cluster 17/regress ...
Error: /usr/lib/postgresql/17/bin/pg_ctl /usr/lib/postgresql/17/bin/pg_ctl start -D /tmp/pg_virtualenv.eo7Un1/data/17/regress -l /tmp/pg_virtualenv.eo7Un1/log/postgresql-17-regress.log -c -s -o  -c config_file="/tmp/pg_virtualenv.eo7Un1/postgresql/17/regress/postgresql.conf"  exited with status 1: 
2024-09-14 18:00:33.606 UTC [3233] FATAL:  pgnodemx: expected 1, got 0, lines from file /sys/fs/cgroup/user.slice/user-1000.slice/session-c3.scope/cgroup.controllers
2024-09-14 18:00:33.606 UTC [3233] LOG:  database system is shut down
pg_ctl: could not start server

https://salsa.debian.org/postgresql/pgnodemx/-/jobs/6281060

Perhaps that problem should be a WARNING instead of preventing startup?

@keithf4
Copy link
Collaborator

keithf4 commented Sep 16, 2024

Is this only happening on 17?

@gregscds
Copy link

The most likely situation I've found where this file could be empty is if a Docker environment is setup as Rootless, which seems more likely to be a thing that could have changed most recently rather than Keith's guess that it's PG17. pgnodemx is expecting a line like this:

cpu memory pids

And in a rootless environment, not only can you sometimes not monitor those things, you can not even necessarily see what could be monitored. Our code expects the things to monitor to vary, but no one considered an environment so locked down the list was empty.

I'm not sure if just degrading this to a WARNING gets us to an ideal place. The whole point of pgnodemx is to collect data like this, so if there's nothing there to collect, there's nothing for the program to do. Poking around at what's happening and what some other projects do, there seem a few equally sensible options with good and bad implications:

  • Drop the error to a WARNING, just don't really do anything, and trust the program will be reaped by the container cleanup. That will let cases like the heavily locked down Debian CI environment validate the build works and no one has to do anything else. But CI systems will have to note the warning to even begin to realize that nothing was actually tested, and those are not programs whose output is typically reviewed to that level.

  • Exit with some comment when this happens but don't throw an error code, just point out the file is empty. That will get types of some CI tests to pass, others will note the program exited immediately and at least realize something is off. Like just warning, no one will still actually be testing anything, but that the program built and only started momentarily should flag more people to notice the problem exists via the quick exit.

  • Disable this part of CI testing for the module to explicitly document there's an issue. That will require everyone doing builds to individually do testing adjustments to their respective build scripts, but at least no one will be surprised about whether testing does or doesn't happen. For example, in the similar situation at cgroups: use cgroup.controllers to read controllers containers/podman#11935 I see "we have disabled this test in the CI because it is difficult to know what controllers are going to be enabled for rootless under all conditions we test"; that rings true. I don't think it's necessarily in anyone's best interests to force that level of policy today, on everyone, just to package our relatively simple program.

  • Document the problem and at least one simple solution, which seems to be delegating enough rights to the environment that pgnodemx finds one entry it can check it out; see https://rootlesscontaine.rs/getting-started/common/cgroup2/#enabling-cpu-cpuset-and-io-delegation for example. This will heavily push packagers toward a real test run of the program. That seems to me an even more inappropriately burdensome approach for us to dictate at this point in time.

We should probably provide a simple solution that doesn't punish Christoph for being the once to spot the problem; have our build/package group setup our own Rootless test environment to do further development; and then do the work to document The Right Way for CI testing that packagers should adopt. And if that goes well, then maybe we start removing ways to bypass the testing.

(I hope I'm not wrong about the root cause altogether, because that would mean I just wasted a lot of typing)

@df7cb
Copy link
Contributor Author

df7cb commented Sep 17, 2024

Thanks for the investigation.

In fact, the test never worked before:
https://salsa.debian.org/postgresql/pgnodemx/-/pipelines
https://salsa.debian.org/postgresql/pgnodemx/-/jobs/5836544 - that's with 16 and the same FATAL.

At the moment the problem isn't critical for me, the CI tests I care about are running on apt.postgresql.org, and there they work. The CI pipeline on salsa.debian.org is just a nice extra to have even more package-related checks running.

What made me write the issue is that it seems to prevent startup. Wouldn't it make more sense to throw an ERROR only when someone queries the stats? Not starting up could be a bad time bomb, perhaps people upgrade the kernel or some kernel settings change, and then on the next reboot or crash, the database suddenly doesn't start anymore.

@gregscds
Copy link

Additional context appreciated. Knowing we're not causing you a serious CI issue is a relief.

I think the question you're right to raise is what about the person who starts their database without the stats there, then someone fixes the problem by granting the right permissions. Shouldn't the stats then start to work? They might not even be able to manually restart their pgnodemx if it gave up and died.

Since the implications of these rootless changes slipped by as something no one ever considered before, I think we need a little design review session that reconsiders error handling for a few of these use cases. Maybe even adjust our idle/sleeping behavior. Thanks for the input, we'll tag the issue when we do something about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants