nginx_stage nginx_clean cleans up active PUN #3111

Closed
robinkar opened this issue Oct 6, 2023 · 11 comments · Fixed by #3511

Comments

@robinkar
Contributor

robinkar commented Oct 6, 2023

We are having a strange bug in one of our OOD environments, where active PUNs are being cleaned up when running /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean.
This happens even when, for example, a file transfer (using Rclone in this case) is in progress and the user's browser is actively querying its progress. The cleanup then causes the file transfer to be aborted. The OOD versions we are using are 3.0.1 and 3.0.2.
We run OOD in a container, and even though two of the environments are running exactly the same image, it only happens in one of them.
Having the terminal (shell app) open in OOD prevents the PUN from being cleaned up.

With the dashboard open and an active transfer using Rclone, nginx_clean and the lsof command it uses give the following output:

[root@ood-testing /]# date ; /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean --skip-nginx; echo ; lsof -F piu /var/run/ondemand-nginx/robkarls/passenger.sock
Fri Oct  6 09:27:17 UTC 2023
robkarls

p2708
u0
f5
i412849896
p2734
u10018239
f5
i412849896

Here the PUN is considered inactive by nginx_stage. In the lsof -F output, p lines are PIDs, u lines are UIDs, f lines are file descriptors and i lines are inode numbers; process 2708 is the nginx: master process (robkarls) and 2734 is the nginx: worker process.

In the other OOD environment, where it is working as expected, this is the output:

[root@ood-future /]# date ; /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean --skip-nginx ; echo ;lsof -F piu /var/run/ondemand-nginx/robkarls/passenger.sock
Fri Oct  6 10:06:49 UTC 2023

p758
u0
f5
i12826345

Process 758 here is nginx: master process (robkarls).

After opening the shell app in both of these environments, these are the outputs from the commands:

[root@ood-testing /]# date ; /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean --skip-nginx ; echo ;lsof -F piu /var/run/ondemand-nginx/robkarls/passenger.sock
Fri Oct  6 10:07:29 UTC 2023

p2708
u0
f5
i412849896
p2734
u10018239
f3
i413408799
f5
i412849896
[root@ood-future /]# date ; /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean --skip-nginx ; echo ;lsof -F piu /var/run/ondemand-nginx/robkarls/passenger.sock
Fri Oct  6 10:06:49 UTC 2023

p758
u0
f5
i12826345

The logic for counting active sessions seems to happen here. In the lsof output from the problematic OOD environment (ood-testing), without the shell app open, the inode 412849896 occurs twice, so it is filtered out by .select { |k, v| v.size == 1 }, which seems wrong; the active session count in that case is 0. Later, when the shell app is open, the inode 413408799 appears only once, passes that check and is counted as an active session.
There seem to be a few things that could be wrong here, although I am not sure what exactly. Is the nginx worker process supposed to have the socket open at all, given that this only happens in the problematic environment? The comment for that line of code also seems outdated: these sockets are opened by the root nginx process, not by the Apache proxy as the comment indicates. Is the logic still up to date after 7 years?
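
For illustration, here is a minimal sketch of that counting logic as I understand it (this is an assumption about how the check behaves, not the actual nginx_stage code): parse the lsof -F piu output, group the open descriptors by inode, and keep only inodes held open by a single process.

require 'open3'

# Sketch only: count "active sessions" on a PUN socket the way the current
# check appears to, keeping only inodes that are open in exactly one process.
def active_sessions(socket_file)
  out, _err, _status = Open3.capture3('lsof', '-F', 'piu', socket_file)
  pid = nil
  inodes = Hash.new { |h, k| h[k] = [] }
  out.each_line(chomp: true) do |line|
    case line[0]
    when 'p' then pid = line[1..]          # process ID
    when 'i' then inodes[line[1..]] << pid # inode number
    end
  end
  # Inode 412849896 above is open in both 2708 and 2734, so it gets dropped
  # here and the session count ends up as 0.
  inodes.select { |_inode, pids| pids.size == 1 }.size
end

With the ood-testing output above this returns 0 even though the PUN is busy, which matches what we are seeing.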

@osc-bot osc-bot added this to the Backlog milestone Oct 6, 2023
@johrstrom johrstrom modified the milestones: Backlog, 3.1 Oct 6, 2023
@johrstrom
Contributor

Thank you for the very detailed description and debugging. I'll have to dig in myself and see what's what.

@robinkar
Contributor Author

robinkar commented Oct 9, 2023

We thought a bit more about how this could be solved.
One potential solution could be to use the Passenger instance files more directly in nginx_clean, rather than counting the number of processes that have the Passenger socket open using lsof.
By default the sockets for the Passenger apps live under /tmp/passenger.*/apps.s/ and are removed once Passenger kills the app for being idle (passenger_pool_idle_time). So, to determine whether a user is active, one could look for active Passenger app sockets and check their owners:

[robkarls@ood-testing ~]$ find /tmp -type s -path "/tmp/passenger*/apps.s/*" 2>/dev/null
/tmp/passenger.hseDzvr/apps.s/wsgi.685faddd1f7e34ee
/tmp/passenger.hseDzvr/apps.s/ruby.OkGSUWNyoW0WthRS2mpAAEmkMgaQqF6EyuHSlN4g0lqTF8TMsdxDqXAPEOvdwp9
/tmp/passenger.hseDzvr/apps.s/node.30lgjd
/tmp/passenger.hseDzvr/apps.s/node.127ock

Some additional checks that the /tmp/passenger.*/{core,watchdog}.pid processes are still alive would probably be needed.

As the /tmp/passenger.* directory can be customized using passenger_instance_registry_dir, that would probably also have to be taken into account, although I doubt customization of that is common.

This is just an idea though, there could always be some edge cases with how Passenger behaves that I am not aware of.
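
As a rough sketch of this idea (a hypothetical helper, assuming the default /tmp instance registry dir; the core/watchdog PID liveness checks mentioned above are left out), it could look something like:

require 'etc'

# Hypothetical sketch: treat a user's PUN as active if any Passenger app
# socket under the instance registry dir is owned by that user.
def passenger_apps_running?(user, registry_dir = '/tmp')
  uid = Etc.getpwnam(user).uid
  Dir.glob(File.join(registry_dir, 'passenger.*', 'apps.s', '*')).any? do |path|
    File.socket?(path) && File.stat(path).uid == uid
  end
rescue ArgumentError
  false # unknown user
end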

@robinkar
Contributor Author

It seems like nginx_stage nginx_clean does not do anything in the environments unaffected by this bug. Even though we have PUNs that have been inactive (no connections, no Passenger apps running) for over 12 hours, their processes (Passenger+nginx) are still running and are not seen as inactive by nginx_stage nginx_clean.

@johrstrom
Contributor

environments unaffected by this bug

What's the difference between these environments?

@robinkar
Contributor Author

environments unaffected by this bug

What's the difference between these environments?

We haven't been able to pinpoint any meaningful difference that would explain this. For every difference between ood-testing and ood-future, we have the same config or setup in some other environment where the problem does not occur.

In the environment where the problem occurs (ood-testing), we have sandbox apps enabled, while in ood-future we don't, but at the same time we have another environment with sandbox apps enabled where the problem does not happen.
As for the OOD configuration, the environment where it does occur (ood-testing) has passenger_friendly_error_pages: "on" set in nginx_stage.yml, which the other environment (ood-future) does not. ood-testing also has app_root explicitly defined to the default value, but that should not have any practical impact at all.
One difference is that ood-future and ood-testing run on different hosts, although we also have another environment running on the same host as ood-testing, which is not affected by the bug. Other than that, the environments are identical, running the same container image.

@robinkar
Contributor Author

robinkar commented Jan 29, 2024

This is causing more problems than initially thought, so I have investigated further. In some environments the bug manifests as active PUNs being force-cleaned, while in others PUNs are never cleaned because they are considered to always be active. I have not yet found any environment where nginx_clean works as I would expect.
A solution to this would be very nice, as this is causing quite a few problems now:

  • The force-killing of PUNs terminates any file transfer the user has in progress.
  • No cleanup of PUNs leads to two problems:
    • User groups are inherited from the PUN, so if the user is added to a new group they won't be able to access that group's files in the files browser until they manually restart their PUN or the OOD instance is restarted.
    • An excessive number of processes are left running, which has caused us to hit limits on the number of PIDs in the containers, even though the limits are already quite high. This also consumes more resources than OOD would otherwise need.

The difference that determines which way it goes is whether the nginx worker process has passenger.sock open or not. Changing from RHEL 8 to CentOS 7 seems to change that behaviour, but it might just be coincidence.

With 3.0.3 RPMs and a rockylinux:8 container, the nginx worker process seems to have the socket open (like ood-testing above, all sessions considered inactive), while on a centos:7 container, only the master process seems to have that socket open (like ood-future above, all sessions considered active).
I also attempted to see if the behaviour has been different with other Passenger versions, but at least in OOD 1.6 (CentOS 7, Passenger 5), the nginx master process was the only one that had the socket open.

The key thing here, however, is that none of these environments behave correctly since the activity of the user does not seem to be reflected at all in the open file descriptors (other than running the shell app).

@johrstrom
Contributor

I'll have to look at everywhere socket_file is being used, but for nginx_clean it's here:

I think a naive implementation may be something like this. Attached to that class, you probably wouldn't need to pass user into the method; I just did that while hacking around.

def sessions(user)
  `ps  -o cmd -u #{user}`.split("\n").select do |command|
    command.match?(/Passenger [\w]+App:/)
  end.count
end

The nginx processes will always be running, but the Passenger processes only kick in once you have an active session. Passenger apps stop themselves, but nginx does not - hence this nginx_clean action here.

So my theory is: if you have Passenger processes (that haven't stopped themselves), you have active sessions. If you don't have any Passenger processes (because they've stopped themselves), we can go ahead and stop nginx too.
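
If that theory holds, the cleanup decision could be as simple as the sketch below (stop_pun and active_users are made-up placeholders for whatever nginx_stage actually uses to stop a PUN and to enumerate users with a running PUN):

# Sketch only: stop the per-user nginx when no Passenger app processes
# are left for that user.
active_users.each do |user|
  stop_pun(user) if sessions(user).zero?
end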

@robinkar
Contributor Author

It seems like that implementation would work well enough. From what I've seen, the PUNs work according to the theory you've described. There is a difference, though, between the core Passenger processes (core+watchdog), which do not seem to shut down, and the Passenger apps, i.e. dashboard or shell, which do get cleaned up. Once the Passenger apps have been cleaned up, it should be safe to also kill the core Passenger processes and nginx.

We also looked at some potential alternatives, but using ps seems much simpler.
Passenger does have some tools for checking what Passenger apps/processes are running.
For example, passenger-config list-instances --json would list all Passenger instances running, i.e. one instance per user.
Then, for each instance, you'd get information about its running apps/processes using e.g. passenger-status --no-header --show=xml <instance name>.
However, this seems a bit overkill for now, though it might be worth considering in the future.
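
For reference, a rough sketch of that alternative (using the commands and flags mentioned above; the exact JSON and XML output formats are not shown here, so treat this as untested):

require 'json'
require 'open3'

# Sketch: list running Passenger instances, then query each one for its status.
def passenger_instances
  out, _status = Open3.capture2('passenger-config', 'list-instances', '--json')
  JSON.parse(out)
end

def instance_status_xml(instance_name)
  out, _status = Open3.capture2('passenger-status', '--no-header', '--show=xml', instance_name)
  out
end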

One thing to consider is that the ps solution relies on passenger_pool_idle_time, i.e. Passenger's own cleanup, which does not take background tasks into account, such as a file transfer that keeps running after the user has closed the browser tab. Long term that would probably be good to solve, but it seems to be a much more difficult problem.

@CSC-swesters
Contributor

Thanks for fixing this! We tried out the new nginx_stage nginx_clean and the first impression is that it seems to do what it is supposed to do. We can't say anything about how it scales yet, but theoretically it should be better than before 🙂

@johrstrom is this something that could be backported to the 3.1.x branch? We're interested in taking the fixed version into production as soon as possible, since for now we've disabled the old version from running entirely.

@johrstrom
Contributor

@johrstrom is this something which could be backported to the 3.1.x

Yes, this'll be in 3.1.5.

@johrstrom
Contributor

Just to update: 3.1.5 is in the latest repo right now. We haven't installed it at OSC yet, which is why we haven't promoted it to the 3.1 repo.
