-
Notifications
You must be signed in to change notification settings - Fork 600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AC/FC integration #2995
base: dev
Are you sure you want to change the base?
AC/FC integration #2995
Conversation
When the agent recognizes it is running in an agent control environment, it will start automatic health checks that will create a new file at a configured destination at a given frequency that provides details about the last reported status of the agent. When the agent is not seen within an agent control environment, files will not be created.
1870dc2
to
671edf2
Compare
- **Feature: Add health checks when the agent runs within Agent Control** | ||
|
||
When the agent is started with a within an agent control environment, automatic health check files will be created within the configured file destination at the configured frequency. [PR#2995](https://github.com/newrelic/newrelic-ruby-agent/pull/2995) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting to hear if there's a product-approved changelog entry, so this may need to be updated.
lib/new_relic/agent/health_check.rb
Outdated
APP_NAME_EXCEEDED = {healthy: false, last_error: 'NR-APM-006', message: 'The maximum number of configured app names (3) exceeded'} | ||
PROXY_CONFIG_ERROR = {healthy: false, last_error: 'NR-APM-007', message: 'HTTP Proxy configuration error; response code [%s]'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec defines these two error codes, but our agent doesn't have any behavior to recognize when these problems occur. As things stand currently, we don't need to update our agent to record these states. I left them here to make sure we match the spec, but would also be open to removing them, since the status will never be updated to use these constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to leave them, to be complete. Plus, if we ever update the agent to handle those things, then these will already be here ready to use. And if we never do make those updates, at least in the future when we're comparing the code to the spec to look into something we won't be confused by why some would be missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Do you think we would benefit from a comment stating they're not called anywhere?
health_check.send(:write_file) | ||
|
||
assert File.directory?('./health'), 'Directory not found' | ||
assert File.exist?('./health/health-abc123.yml'), 'File not found' # rubocop:disable Minitest/AssertPathExists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why, but the assert path exists assertion would fail here despite the file existing. I decided to just disable the linter.
@@ -198,6 +198,7 @@ def connect(options = {}) | |||
rescue NewRelic::Agent::UnrecoverableAgentException => e | |||
handle_unrecoverable_agent_error(e) | |||
rescue StandardError, Timeout::Error, NewRelic::Agent::ServerConnectionException => e | |||
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've struggled to write tests to verify the status is updated when we expect it to be updated. If we think tests are valuable for these updates, I could use some help!
lib/new_relic/agent/health_check.rb
Outdated
end | ||
|
||
HEALTHY = {healthy: true, last_error: 'NR-APM-000', message: 'Healthy'} | ||
INVALID_LICENSE_KEY = {healthy: false, last_error: 'NR-APM-001', message: 'Invalid liense key (HTTP status code 401)'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are any of these errors recoverable for the Ruby Agent? i.e. where we could potentially need to set the status back to HEALTHY
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great question! Some of these errors are recoverable. From what I'm seeing in the code, a successful HTTP request made by the NewRelicService
class to an external endpoint would erase the error state. As of now, this is the only place where I'll update the status to HEALTHY
:
https://github.com/newrelic/newrelic-ruby-agent/pull/2995/files#diff-ecdbe8dde33106ec28a206dcdbfca4ea2c647caff89504a82c58dbceaf749e9aR644
However, I may have missed some scenarios and there could be other places where explicitly setting a HEALTHY
status would demonstrate recovery.
Instead of creating a new file at the interval, reuse the same file for the life of the process.
The array was not being correctly destructured, which would raise an error when the status was HTTP_ERROR
The health check status may be updated for other reasons on the CI, which may cause the message to be inaccurate by the time the result is accessed from the hash
There was a bug related to the HTTP_ERROR constant, where the sprintf string manipulation changed the constant to equal the first value it came across. By freezing the constants and dup'ing the status before assigning it, we can avoid this problem.
SimpleCov Report
|
Previously the invalid license key status would be set, but the health check files would not be written until the first request to the web app was made. Now, health checks will begin before the agent officially starts.
…gent into sa-health-check
Resolves #2859