Determine inactive users and clean their PUNs too #3942
Conversation
pid_path = PidFile.new NginxStage.pun_pid_path(user: u)
`kill -s TERM #{pid_path.pid}`
FileUtils.rm_rf(Pathname.new(pid_path.to_s).parent)
This doesn't need to be an `rm_rf()` call. I'd prefer to keep it as an `rmdir()` instead, since the nginx process will clean up its PID file a little while after receiving the signal. I feel like an `rm_rf()` would be at risk of causing unfortunate side effects in the future, if some refactoring mistakes are made, or such.

I'll suggest a slightly different approach separately from this comment.
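As a throwaway illustration of why `rmdir()` is the safer tool here (hypothetical paths, not from this PR): `rmdir()` refuses to remove a directory that still has contents, while `rm_rf()` deletes the directory and everything underneath it without complaint.

require 'fileutils'
require 'tmpdir'

Dir.mktmpdir do |tmp|
  dir = File.join(tmp, 'pun_dir')
  FileUtils.mkdir(dir)
  FileUtils.touch(File.join(dir, 'passenger.pid'))

  begin
    # rmdir only removes empty directories; a leftover PID file makes it fail loudly.
    FileUtils.rmdir(dir)
  rescue Errno::ENOTEMPTY => e
    puts "rmdir refused: #{e.message}"
  end

  # rm_rf would silently remove the directory and anything inside it.
  FileUtils.rm_rf(dir)
  puts File.exist?(dir) # => false
end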
@@ -192,6 +192,21 @@ def self.active_users
     end.compact
   end

+  # List of inactive users.
+  # @return [Array<String>] the list of inactive users.
+  def self.inactive_users
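The diff is truncated here, so for context, a rough sketch of the idea under discussion (hypothetical body, not the PR's actual implementation; assumes the per-user PUN configs can be globbed via `pun_config_path(user: '*')` with a `.conf` extension, and that `User.new(name)` raises when getpwnam(3) no longer knows the name):

# Rough sketch only -- not the PR's actual implementation.
def self.inactive_users
  Dir[pun_config_path(user: '*')].map do |file|
    name = File.basename(file, '.conf')
    begin
      User.new(name) # raises for names that no longer resolve
      nil            # still a valid system user, so not inactive
    rescue StandardError
      name
    end
  end.compact
end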
Having a separate function for enumerating `inactive_users` causes duplicate work. It would probably be possible to gather two lists of results in one single walk-through? I.e., put a name in a `valid_users` list if `User.new(name)` doesn't raise an error, or in an `invalid_users` list if it raises. Then return a list of two lists.
I suppose this approach would require a few code changes in other places, but in my opinion, that would be okay. I.e., we would have:

# Pseudo-Ruby, not sure if this is usable:
currently_active_users = NginxStage.active_users
# Sanity check that we really got the two lists back.
raise 'expected a [valid, invalid] pair' unless currently_active_users.length == 2
valid_users, invalid_users = currently_active_users
# Now handle valid and invalid users separately
# ...
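A single-pass version of that idea could look something like the following (hypothetical helper name, not from this PR; assumes `User.new(name)` raises for names that no longer resolve):

# One walk over the candidate names, two result lists.
def self.classify_users(names)
  valid_users = []
  invalid_users = []
  names.each do |name|
    begin
      User.new(name) # assumed to raise when getpwnam(3) fails
      valid_users << name
    rescue StandardError
      invalid_users << name
    end
  end
  [valid_users, invalid_users]
end

Callers could then destructure the pair: valid_users, invalid_users = NginxStage.classify_users(names).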
I'm not sure I should be changing the API as it's being used in other places. I guess that's the risk that I'm trading for duplication. I.e., I'd rather have duplication than possible errors/mistakes.
I understand your concern. However, I think that with sufficient test coverage and carefulness, a change like this shouldn't be an issue. `nginx_stage` is still quite limited in scope compared to the main ondemand application.

It's not a hill I'll die on though, so I'm willing to let this slide. Perhaps one could make a separate issue for it, so that future refactoring can take it into account?
Yea the issue is with the 4.0 timeline so I don't have time to write tests or be super careful. If you're happy to defer it, I am too.
Sounds good 👍 I'm more concerned about my other review comment, so I'm looking forward to your thoughts on that. Hopefully the suggested commit helps.
I tested whether the `rescue` was actually needed. Some initial ideas that I considered:

- Gather a list of directories which will be empty soon, and `rmdir()` them after sending the SIGTERM signal to all invalid user PUNs. This will result in the least sleeping, and will avoid using `rm_rf()` as root, which seems very risky.

While testing this, I noticed that the rescue will be required many, if not all, times. So it could be better to just gather a list of paths to remove later. Here's a suggestion to fix that:

add_hook :delete_puns_of_users_with_no_sessions do
  # ... SNIP
  pid_parent_dirs_to_remove_later = []
  NginxStage.inactive_users.each do |u|
    begin
      puts "#{u} (disabled)"
      pid_path = PidFile.new NginxStage.pun_pid_path(user: u)
      # Send a SIGTERM to the master nginx process to kill the PUN.
      # 'nginx stop' won't work, since getpwnam(3) will cause an error.
      `kill -s TERM #{pid_path.pid}`
      FileUtils.rm(NginxStage.pun_secret_key_base_path(user: u).to_s)
      FileUtils.rm(NginxStage.pun_config_path(user: u).to_s)
      pid_path_parent_dir = Pathname.new(pid_path.to_s).parent
      pid_parent_dirs_to_remove_later.push(pid_path_parent_dir)
    rescue StandardError => e
      warn "Error trying to clean up disabled user #{u}: #{e.message}"
    end
  end
  # Remove the PID path parent directories now that the nginx processes have
  # had time to clean up their Passenger PID file and socket.
  pid_parent_dirs_to_remove_later.each do |dir|
    begin
      begin
        FileUtils.rmdir(dir)
      rescue Errno::ENOTEMPTY
        # Wait for a short time, while Nginx cleans up its PID file.
        sleep(0.05)
        # Then try again once.
        FileUtils.rmdir(dir)
      end
    rescue StandardError => e
      warn "Error trying to clean up the PID file directory of disabled user #{u}: #{e.message}"
    end
  end
end

Executing that code works as intended.
The benefit of this approach is that we'll probably sleep only a few times, compared to doing this inside the per-user loop.
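As a side note on the snippet above, the backtick shell-out could in principle be replaced by Ruby's built-in `Process.kill`, which avoids spawning a subshell (sketch only; assumes `pid_path.pid` converts cleanly to an integer PID):

# Equivalent of `kill -s TERM #{pid_path.pid}` without a subshell:
Process.kill('TERM', pid_path.pid.to_i)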
I put the code above into my own fork of your PR branch, so that it's easier to cherry-pick, if you wish: CSC-swesters@bdea796
Yea, I'm going to think on that sleeping and so on. That loop could be infinite, so you'd likely need to count the tries and kick out after so long. It's just complexity that I'm not so keen to take on, so I need just a minute to consider it.
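For reference, "count the tries and kick out" would only take a few extra lines; a hypothetical bounded-retry wrapper (not part of this PR):

require 'fileutils'

# Give nginx a few chances to remove its PID file, then give up and re-raise.
def rmdir_with_retries(dir, attempts: 3, wait: 0.05)
  tries = 0
  begin
    FileUtils.rmdir(dir)
  rescue Errno::ENOTEMPTY
    tries += 1
    raise if tries >= attempts
    sleep(wait)
    retry
  end
end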
I consciously tried to avoid causing any loops. Have you found an example case where it would happen? I would think that in case of `Errno::ENOTEMPTY` being raised again on the single retry, the outer rescue would catch it and print a warning rather than looping.
I see, sorry, I didn't read it fully. At a glance, I thought it looped until it could remove the directory, not just one single retry after a half-second sleep.
It's actually only 50 ms, since that seemed to be enough on my system for removing a single user's PUN and its related directory. If there are more users being cleaned up, the individual PUN instances will have extra time to clean themselves up while the rest of the PUNs are still receiving their SIGTERM signals. Also, all of the PUN directories waiting to be removed will benefit from each sleep that happens, making it less likely that subsequent `rmdir()` calls will need to retry at all.
I've taken your patch, though I increased the sleep duration to a half second just to be a bit more conservative in giving nginx enough time to shut down.
@@ -111,7 +111,7 @@ class NginxCleanGenerator < Generator
         FileUtils.rmdir(dir)
       end
     rescue StandardError => e
-      warn "Error trying to clean up the PID file directory of disabled user #{u}: #{e.message}"
+      warn "Error trying to clean up the PID file #{dir} of disabled user: #{e.message}"
I think the word "directory" should be retained, but otherwise this is okay. I.e., "Error trying to clean up the PID file directory #{dir} of disabled user: #{e.message}"
Sure, I can update this.
Thanks for advancing this PR! I had one additional comment on the error message, but otherwise, I think it should be fine to merge 👍
Fixes #3879 by generating a list of inactive/disabled users and cleaning up their PUNs and PUN-related config files as well.