determine inactive users and clean their PUNs too #3942

Merged: 6 commits merged into master from nginx-clean-disabled-users on Dec 17, 2024

Conversation

johrstrom
Contributor

Fixes #3879 by generating a list of inactive/disabled users and cleaning up their PUNs and PUN-related config files as well.

pid_path = PidFile.new NginxStage.pun_pid_path(user: u)

`kill -s TERM #{pid_path.pid}`
FileUtils.rm_rf(Pathname.new(pid_path.to_s).parent)
Contributor

This doesn't need to be an rm_rf() call. I'd prefer to keep it as an rmdir() instead, since the nginx process will clean up its PID file a little while after receiving the signal. I feel like an rm_rf() would be at risk of causing unfortunate side effects in the future, if some refactoring mistakes are made, or such.

I'll suggest a slightly different approach separately from this comment.
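The safety property the reviewer is relying on can be illustrated with a small standalone Ruby sketch (temporary paths, not the actual nginx_stage directories): rmdir() refuses to remove a directory that still has contents, while rm_rf() silently deletes the directory and everything underneath it.

```ruby
require 'fileutils'
require 'tmpdir'

# Standalone illustration: rmdir refuses to remove a non-empty
# directory, rm_rf does not. Paths here are temporary stand-ins.
dir = Dir.mktmpdir
File.write(File.join(dir, 'passenger.pid'), '1234')

begin
  FileUtils.rmdir(dir)        # raises: the PID file is still inside
rescue Errno::ENOTEMPTY
  puts "rmdir refused to remove non-empty #{dir}"
end

FileUtils.rm_rf(dir)          # removes the directory and its contents
puts Dir.exist?(dir)          # => false
```

This is why rmdir() fails closed if anything unexpected is still in the directory, whereas rm_rf() run as root would delete it regardless.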

@@ -192,6 +192,21 @@ def self.active_users
   end.compact
 end
 
+# List of inactive users.
+# @return [Array<String>] the list of inactive users.
+def self.inactive_users
Contributor

Having a separate function for enumerating inactive_users causes duplicate work. It should be possible to gather both lists of results in one single walk-through, i.e., put a name in valid_users if the User.new(name) call doesn't raise an error, or put it in an invalid_users list if it raises. Then return a list of two lists.

I suppose this approach would require a few code changes in other places, but in my opinion, that would be okay. I.e., we would have:

# Pseudo ruby, not sure if this is usable:

currently_active_users = NginxStage.active_users
# sanity check that the length of currently_active_users is 2.
valid_users = currently_active_users[0]
invalid_users = currently_active_users[1]
# Now handle valid and invalid users separately
# ...
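The single walk-through being suggested could look roughly like this (a hypothetical sketch: `partition_users` is an illustrative name, and `Etc.getpwnam` stands in for nginx_stage's `User.new` check, since it likewise raises when the account no longer exists):

```ruby
require 'etc'

# Hypothetical sketch of the one-pass approach: partition candidate
# usernames into valid and invalid in a single walk. Etc.getpwnam
# raises ArgumentError when the account does not exist, standing in
# for the User.new check in the real code.
def partition_users(candidate_names)
  valid = []
  invalid = []
  candidate_names.each do |name|
    begin
      Etc.getpwnam(name)
      valid << name
    rescue ArgumentError
      invalid << name
    end
  end
  [valid, invalid]
end

# Callers would then destructure the pair:
#   valid_users, invalid_users = partition_users(names)
```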

Contributor Author

I'm not sure I should be changing the API as it's being used in other places. I guess that's the risk that I'm trading for duplication. I.e., I'd rather have duplication than possible errors/mistakes.

Contributor

I understand your concern. However, I think with sufficient test coverage and carefulness, a change like this shouldn't be an issue. nginx_stage is still quite limited in scope compared to the main ondemand application.

It's not a hill I'll die on though, so I'm willing to let this slide. Perhaps one could make a separate issue for it, so that future refactoring can take it into account?

Contributor Author

Yea the issue is with the 4.0 timeline so I don't have time to write tests or be super careful. If you're happy to defer it, I am too.

Contributor

Sounds good 👍 My other review comment I'm more concerned about, so I'm looking forward to your thoughts on that. Hopefully the suggested commit helps.

@CSC-swesters
Contributor

CSC-swesters commented Nov 11, 2024

I tested whether the FileUtils.rm_rf(Pathname.new(pid_path.to_s).parent) line could be changed to use rmdir() instead, and noticed (as you probably also did) that the PID file takes a moment to get removed by nginx, so the rmdir() raises an Errno::ENOTEMPTY error. While it would work to add a "long enough" (always debatable what this is 🙂 ) static sleep, I wanted to avoid that, in the interest of performance.

Some initial ideas that I considered:

  • Move the removal of the pun_pid_path parent to a later stage, to give the nginx process a bit more time to process its signal.
  • If this is not enough, rescue an Errno::ENOTEMPTY once (don't get caught in a wait loop), sleep for a moment (50 ms seemed to work for me, as an initial guess), and then try again.

While testing this, I noticed that the rescue will be required many, if not all, times. So it could be better to just gather a list of paths to rmdir() later, and do that at the end of the :delete_puns_of_users_with_no_sessions do block. This way, most of the nginx processes will have time to process their SIGTERM signals before we attempt to rmdir() the PID file's parent directories.

Here's a suggestion to fix that:

    add_hook :delete_puns_of_users_with_no_sessions do
      # ... SNIP
      pid_parent_dirs_to_remove_later = []
      NginxStage.inactive_users.each do |u|
        begin
          puts "#{u} (disabled)"
          pid_path = PidFile.new NginxStage.pun_pid_path(user: u)

          # Send a SIGTERM to the master nginx process to kill the PUN.
          # 'nginx stop' won't work, since getpwnam(3) will cause an error.
          `kill -s TERM #{pid_path.pid}`
          FileUtils.rm(NginxStage.pun_secret_key_base_path(user: u).to_s)
          FileUtils.rm(NginxStage.pun_config_path(user: u).to_s)
          pid_path_parent_dir = Pathname.new(pid_path.to_s).parent
          pid_parent_dirs_to_remove_later.push(pid_path_parent_dir)
        rescue StandardError => e
          warn "Error trying to clean up disabled user #{u}: #{e.message}"
        end
      end

      # Remove the PID path parent directories now that the nginx processes have
      # had time to clean up their Passenger PID file and socket.
      pid_parent_dirs_to_remove_later.each do |dir|
        begin
          begin
            FileUtils.rmdir(dir)
          rescue Errno::ENOTEMPTY
            # Wait for a short time, while Nginx cleans up its PID file.
            sleep(0.05)
            # Then try again once.
            FileUtils.rmdir(dir)
          end
        rescue StandardError => e
          warn "Error trying to clean up the PID file directory of disabled user #{u}: #{e.message}"
        end
      end
    end

Executing that code works as intended:

[root@ondemand /]# find /var/run/ondemand-nginx/
/var/run/ondemand-nginx/

# Log into OOD with redacted_user

[root@ondemand /]# pstree -u redacted_user
PassengerAgent─┬─PassengerAgent─┬─ruby───2*[{ruby}]
               │                └─10*[{PassengerAgent}]
               └─4*[{PassengerAgent}]

nginx

# Remove the user account

[root@ondemand /]# getent passwd redacted_user ; echo $?
2
[root@ondemand /]# find /var/run/ondemand-nginx/
/var/run/ondemand-nginx/
/var/run/ondemand-nginx/redacted_user
/var/run/ondemand-nginx/redacted_user/passenger.pid
/var/run/ondemand-nginx/redacted_user/passenger.sock
[root@ondemand /]# /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean
missing PID file: /var/run/ondemand-nginx/true/passenger.pid
redacted_user (disabled)
[root@ondemand /]# find /var/run/ondemand-nginx/
/var/run/ondemand-nginx/

The benefit of this approach is that we'll probably sleep only a few times, compared to doing this inside the NginxStage.inactive_users.each do |u| loop, where we would probably need to sleep in each iteration, which adds up.

@CSC-swesters
Contributor

I put the code above into my own fork of your PR branch, so that it's easier to cherry-pick, if you wish: CSC-swesters@bdea796

@johrstrom
Contributor Author

Yea I'm going to think on that sleeping and so on. That loop could be infinite so you'd likely need to count the tries and kick out after so long.

It's just complexity that I'm not so keen to take on, so I need just a minute to consider it.

@CSC-swesters
Contributor

> Yea I'm going to think on that sleeping and so on. That loop could be infinite so you'd likely need to count the tries and kick out after so long.
>
> It's just complexity that I'm not so keen to take on, so I need just a minute to consider it.

I consciously tried to avoid causing any loops. Have you found an example case where it would happen? I would think that in case of rmdir() failures, my commit sleeps once, and tries again once, and if that fails, it is caught by the outer begin .. rescue block, and skipped over with a warning log. Then the pid_parent_dirs_to_remove_later.each do loop goes on.

@johrstrom
Contributor Author

> I consciously tried to avoid causing any loops.

I see, sorry, I didn't read it fully. At a glance, I thought it looped until it could remove the directory. Not just one single retry after a half a second sleep.

@CSC-swesters
Contributor

> after a half a second sleep.

It's actually only 50 ms, since that seemed to be enough on my system for removing a single user's PUN and its related directory. If there are more users being cleaned up, the individual PUN instances will have extra time to clean themselves up while the rest of the PUNs are still receiving their SIGTERM signals. Also, all of the PUN directories waiting to be removed will benefit from each sleep that happens, making it less likely that subsequent rmdir() calls cause an ENOTEMPTY.

CSC-swesters and others added 2 commits December 12, 2024 15:51
- Gather a list of directories which will be empty soon, and rmdir() them
  after sending the SIGTERM signal to all invalid user PUNs. This will result
  in the least sleeping, and will avoid using `rm_rf()` as root, which seems
  very risky.
@johrstrom
Contributor Author

I've taken your patch, though I increased the sleep duration to a half second just to be a bit more conservative in giving nginx enough time to shut down.

@@ -111,7 +111,7 @@ class NginxCleanGenerator < Generator
           FileUtils.rmdir(dir)
         end
       rescue StandardError => e
-        warn "Error trying to clean up the PID file directory of disabled user #{u}: #{e.message}"
+        warn "Error trying to clean up the PID file #{dir} of disabled user: #{e.message}"
Contributor

I think the word "directory" should be retained, but otherwise this is okay.

I.e. "Error trying to clean up the PID file directory #{dir} of disabled user: #{e.message}"

Contributor Author

Sure, I can update this.

@CSC-swesters
Contributor

> I've taken your patch, though I increased the sleep duration to a half second just to be a bit more conservative in giving nginx enough time to shut down.

Thanks for advancing this PR! I had one additional comment on the error message, but otherwise, I think it should be fine to merge 👍

@johrstrom johrstrom requested a review from HazelGrant December 13, 2024 17:38
@johrstrom johrstrom merged commit f893270 into master Dec 17, 2024
25 checks passed
@johrstrom johrstrom deleted the nginx-clean-disabled-users branch December 17, 2024 15:40
Successfully merging this pull request may close these issues.

nginx_stage nginx_clean: Also clean up PUNs of non-existing users
4 participants