determine inactive users and clean their PUNs too #3942

Merged: 6 commits merged into master from nginx-clean-disabled-users on Dec 17, 2024

Conversation

johrstrom
Contributor

Fixes #3879 by generating a list of inactive/disabled users and cleaning up their PUNs and PUN-related config files as well.

pid_path = PidFile.new NginxStage.pun_pid_path(user: u)

`kill -s TERM #{pid_path.pid}`
FileUtils.rm_rf(Pathname.new(pid_path.to_s).parent)
Contributor

This doesn't need to be an rm_rf() call. I'd prefer to keep it as an rmdir() instead, since the nginx process will clean up its PID file a little while after receiving the signal. I feel like an rm_rf() would be at risk of causing unfortunate side effects in the future, if some refactoring mistakes are made, or such.

I'll suggest a slightly different approach separately from this comment.
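The safety property the reviewer is relying on can be illustrated with a small standalone Ruby sketch (temporary paths, not the actual nginx_stage directories): rmdir() refuses to remove a directory that still has contents, while rm_rf() silently deletes the directory and everything underneath it.

```ruby
require 'fileutils'
require 'tmpdir'

# Standalone illustration: rmdir refuses to remove a non-empty
# directory, rm_rf does not. Paths here are temporary stand-ins.
dir = Dir.mktmpdir
File.write(File.join(dir, 'passenger.pid'), '1234')

begin
  FileUtils.rmdir(dir)        # raises: the PID file is still inside
rescue Errno::ENOTEMPTY
  puts "rmdir refused to remove non-empty #{dir}"
end

FileUtils.rm_rf(dir)          # removes the directory and its contents
puts Dir.exist?(dir)          # => false
```

This is why rmdir() fails closed if anything unexpected is still in the directory, whereas rm_rf() run as root would delete it regardless.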

@@ -192,6 +192,21 @@ def self.active_users
   end.compact
 end
 
+# List of inactive users.
+# @return [Array<String>] the list of inactive users.
+def self.inactive_users
Contributor

Having a separate function for enumerating inactive_users causes duplicate work. It should be possible to gather both lists of results in one single walk-through, i.e., put a name in valid_users if the User.new(name) call doesn't raise an error, or put it in an invalid_users list if it raises. Then return a list of two lists.

I suppose this approach would require a few code changes in other places, but in my opinion, that would be okay. I.e., we would have:

# Pseudo ruby, not sure if this is usable:

currently_active_users = NginxStage.active_users
# sanity check that the length of currently_active_users is 2.
valid_users = currently_active_users[0]
invalid_users = currently_active_users[1]
# Now handle valid and invalid users separately
# ...
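The single walk-through being suggested could look roughly like this (a hypothetical sketch: `partition_users` is an illustrative name, and `Etc.getpwnam` stands in for nginx_stage's `User.new` check, since it likewise raises when the account no longer exists):

```ruby
require 'etc'

# Hypothetical sketch of the one-pass approach: partition candidate
# usernames into valid and invalid in a single walk. Etc.getpwnam
# raises ArgumentError when the account does not exist, standing in
# for the User.new check in the real code.
def partition_users(candidate_names)
  valid = []
  invalid = []
  candidate_names.each do |name|
    begin
      Etc.getpwnam(name)
      valid << name
    rescue ArgumentError
      invalid << name
    end
  end
  [valid, invalid]
end

# Callers would then destructure the pair:
#   valid_users, invalid_users = partition_users(names)
```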

Contributor Author

I'm not sure I should be changing the API as it's being used in other places. I guess that's the risk that I'm trading for duplication. I.e., I'd rather have duplication than possible errors/mistakes.

Contributor

I understand your concern. However, I think with sufficient test coverage and carefulness, a change like this shouldn't be an issue. nginx_stage is still quite limited in scope compared to the main ondemand application.

It's not a hill I'll die on though, so I'm willing to let this slide. Perhaps one could make a separate issue for it, so that future refactoring can take it into account?

Contributor Author

Yea the issue is with the 4.0 timeline so I don't have time to write tests or be super careful. If you're happy to defer it, I am too.

Contributor

Sounds good 👍 My other review comment I'm more concerned about, so I'm looking forward to your thoughts on that. Hopefully the suggested commit helps.

@CSC-swesters
Contributor

CSC-swesters commented Nov 11, 2024

I tested whether the FileUtils.rm_rf(Pathname.new(pid_path.to_s).parent) line could be changed to use rmdir() instead, and noticed (as you probably also did) that the PID file takes a moment to get removed by nginx, so the rmdir() raises an Errno::ENOTEMPTY error. While it would work to add a "long enough" (always debatable what this is 🙂 ) static sleep, I wanted to avoid that, in the interest of performance.

Some initial ideas that I considered:

  • Move the removal of the pun_pid_path parent to a later stage, to give the nginx process a bit more time to process its signal.
  • If this is not enough, rescue an Errno::ENOTEMPTY once (don't get caught in a wait loop), sleep for a moment (50 ms seemed to work for me, as an initial guess), and then try again.

While testing this, I noticed that the rescue will be required many, if not all, times. So it could be better to just gather a list of paths to rmdir() later, and do that at the end of the :delete_puns_of_users_with_no_sessions do block. This way, most of the nginx processes will have time to process their SIGTERM signals before we attempt to rmdir() the PID file's parent directories.

Here's a suggestion to fix that:

    add_hook :delete_puns_of_users_with_no_sessions do
      # ... SNIP
      pid_parent_dirs_to_remove_later = []
      NginxStage.inactive_users.each do |u|
        begin
          puts "#{u} (disabled)"
          pid_path = PidFile.new NginxStage.pun_pid_path(user: u)

          # Send a SIGTERM to the master nginx process to kill the PUN.
          # 'nginx stop' won't work, since getpwnam(3) will cause an error.
          `kill -s TERM #{pid_path.pid}`
          FileUtils.rm(NginxStage.pun_secret_key_base_path(user: u).to_s)
          FileUtils.rm(NginxStage.pun_config_path(user: u).to_s)
          pid_path_parent_dir = Pathname.new(pid_path.to_s).parent
          pid_parent_dirs_to_remove_later.push(pid_path_parent_dir)
        rescue StandardError => e
          warn "Error trying to clean up disabled user #{u}: #{e.message}"
        end
      end

      # Remove the PID path parent directories now that the nginx processes have
      # had time to clean up their Passenger PID file and socket.
      pid_parent_dirs_to_remove_later.each do |dir|
        begin
          begin
            FileUtils.rmdir(dir)
          rescue Errno::ENOTEMPTY
            # Wait for a short time, while Nginx cleans up its PID file.
            sleep(0.05)
            # Then try again once.
            FileUtils.rmdir(dir)
          end
        rescue StandardError => e
          warn "Error trying to clean up the PID file directory of disabled user #{u}: #{e.message}"
        end
      end
    end

Executing that code works as intended:

[root@ondemand /]# find /var/run/ondemand-nginx/
/var/run/ondemand-nginx/

# Log into OOD with redacted_user

[root@ondemand /]# pstree -u redacted_user
PassengerAgent─┬─PassengerAgent─┬─ruby───2*[{ruby}]
               │                └─10*[{PassengerAgent}]
               └─4*[{PassengerAgent}]

nginx

# Remove the user account

[root@ondemand /]# getent passwd redacted_user ; echo $?
2
[root@ondemand /]# find /var/run/ondemand-nginx/
/var/run/ondemand-nginx/
/var/run/ondemand-nginx/redacted_user
/var/run/ondemand-nginx/redacted_user/passenger.pid
/var/run/ondemand-nginx/redacted_user/passenger.sock
[root@ondemand /]# /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean
missing PID file: /var/run/ondemand-nginx/true/passenger.pid
redacted_user (disabled)
[root@ondemand /]# find /var/run/ondemand-nginx/
/var/run/ondemand-nginx/

The benefit of this approach is that we'll probably sleep only a few times, compared to doing this inside the NginxStage.inactive_users.each do |u| loop, where we would probably need to sleep in each iteration, which adds up.

@CSC-swesters
Contributor

I put the code above into my own fork of your PR branch, so that it's easier to cherry-pick, if you wish: CSC-swesters@bdea796

@johrstrom
Contributor Author

Yea I'm going to think on that sleeping and so on. That loop could be infinite so you'd likely need to count the tries and kick out after so long.

It's just complexity that I'm not so keen to take on, so I need just a minute to consider it.

@CSC-swesters
Contributor

> Yea I'm going to think on that sleeping and so on. That loop could be infinite so you'd likely need to count the tries and kick out after so long.
>
> It's just complexity that I'm not so keen to take on, so I need just a minute to consider it.

I consciously tried to avoid causing any loops. Have you found an example case where it would happen? I would think that in case of rmdir() failures, my commit sleeps once, and tries again once, and if that fails, it is caught by the outer begin .. rescue block, and skipped over with a warning log. Then the pid_parent_dirs_to_remove_later.each do loop goes on.

@johrstrom
Contributor Author

> I consciously tried to avoid causing any loops.

I see, sorry, I didn't read it fully. At a glance, I thought it looped until it could remove the directory. Not just one single retry after a half a second sleep.

@CSC-swesters
Contributor

> after a half a second sleep.

It's actually only 50 ms, since that seemed to be enough on my system for removing a single user's PUN and its related directory. If there are more users being cleaned up, the individual PUN instances will have extra time to clean themselves up while the rest of the PUNs are still receiving their SIGTERM signals. Also, all of the PUN directories waiting to be removed will benefit from each sleep that happens, making it less likely that subsequent rmdir() calls cause an ENOTEMPTY.

CSC-swesters and others added 2 commits December 12, 2024 15:51
- Gather a list of directories which will be empty soon, and rmdir() them
  after sending the SIGTERM signal to all invalid user PUNs. This will result
  in the least sleeping, and will avoid using `rm_rf()` as root, which seems
  very risky.
@johrstrom
Contributor Author

I've taken your patch, though I increased the sleep duration to a half second just to be a bit more conservative in giving nginx enough time to shut down.

@@ -111,7 +111,7 @@ class NginxCleanGenerator < Generator
           FileUtils.rmdir(dir)
         end
       rescue StandardError => e
-        warn "Error trying to clean up the PID file directory of disabled user #{u}: #{e.message}"
+        warn "Error trying to clean up the PID file #{dir} of disabled user: #{e.message}"
Contributor

I think the word "directory" should be retained, but otherwise this is okay.

I.e. "Error trying to clean up the PID file directory #{dir} of disabled user: #{e.message}"

Contributor Author

Sure, I can update this.

@CSC-swesters
Contributor

> I've taken your patch, though I increased the sleep duration to a half second just to be a bit more conservative in giving nginx enough time to shut down.

Thanks for advancing this PR! I had one additional comment on the error message, but otherwise, I think it should be fine to merge 👍

@johrstrom johrstrom requested a review from HazelGrant December 13, 2024 17:38
@johrstrom johrstrom merged commit f893270 into master Dec 17, 2024
25 checks passed
@johrstrom johrstrom deleted the nginx-clean-disabled-users branch December 17, 2024 15:40
Successfully merging this pull request may close these issues.

nginx_stage nginx_clean: Also clean up PUNs of non-existing users
4 participants