AC/FC integration #2995

Open
wants to merge 11 commits into
base: dev

kaylareopelle
Contributor

Resolves #2859

When the agent recognizes it is running in an agent control
environment, it will start automatic health checks that will
create a new file at a configured destination at a given
frequency that provides details about the last reported status
of the agent.

When the agent is not seen within an agent control environment,
files will not be created.
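
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the agent's implementation: the `HealthCheckSketch` class, its keyword arguments, the `written_at` key, and the file-naming scheme are all assumptions made for the example. Only the status hash shape (`healthy`, `last_error`, `message`) comes from the constants shown in this PR.

```ruby
require 'yaml'
require 'fileutils'
require 'securerandom'

# Hypothetical sketch of a periodic health check writer.
# Names and options here are illustrative and are not the agent's API.
class HealthCheckSketch
  HEALTHY = {healthy: true, last_error: 'NR-APM-000', message: 'Healthy'}.freeze

  def initialize(enabled:, directory:, frequency:)
    @enabled = enabled       # true only when an agent control environment is detected
    @directory = directory   # configured file destination
    @frequency = frequency   # seconds between writes
    @status = HEALTHY.dup
    # One file name for the life of the process (assumed naming scheme)
    @file_name = "health-#{SecureRandom.uuid.delete('-')}.yml"
  end

  def update_status(status)
    @status = status.dup
  end

  # Writes the last reported status; returns the path, or nil when disabled.
  def write_file
    return nil unless @enabled

    FileUtils.mkdir_p(@directory)
    path = File.join(@directory, @file_name)
    contents = {
      'healthy' => @status[:healthy],
      'last_error' => @status[:last_error],
      'message' => @status[:message],
      'written_at' => Time.now.to_i # hypothetical key, not taken from the spec
    }
    File.write(path, contents.to_yaml)
    path
  end

  # Starts a background loop that rewrites the file at the configured frequency.
  def start
    return nil unless @enabled

    Thread.new do
      loop do
        write_file
        sleep(@frequency)
      end
    end
  end
end
```

When `enabled` is false, neither `write_file` nor `start` touches the file system, which mirrors the statement that no files are created outside an agent control environment.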
@kaylareopelle changed the title from "WIP: AC/FC integration" to "AC/FC integration" on Jan 7, 2025
Comment on lines +9 to +11
- **Feature: Add health checks when the agent runs within Agent Control**

When the agent is started within an agent control environment, automatic health check files will be created within the configured file destination at the configured frequency. [PR#2995](https://github.com/newrelic/newrelic-ruby-agent/pull/2995)
Contributor Author

Waiting to hear if there's a product-approved changelog entry, so this may need to be updated.

lib/new_relic/agent/configuration/default_source.rb (outdated, resolved)
lib/new_relic/agent/health_check.rb (outdated, resolved)
Comment on lines 25 to 26
APP_NAME_EXCEEDED = {healthy: false, last_error: 'NR-APM-006', message: 'The maximum number of configured app names (3) exceeded'}
PROXY_CONFIG_ERROR = {healthy: false, last_error: 'NR-APM-007', message: 'HTTP Proxy configuration error; response code [%s]'}
Contributor Author

The spec defines these two error codes, but our agent doesn't have any behavior to recognize when these problems occur. As things stand currently, we don't need to update our agent to record these states. I left them here to make sure we match the spec, but would also be open to removing them, since the status will never be updated to use these constants.

Contributor

I think it makes sense to leave them, to be complete. Plus, if we ever update the agent to handle those things, then these will already be here ready to use. And if we never do make those updates, at least in the future when we're comparing the code to the spec to look into something we won't be confused by why some would be missing.

Contributor Author

Great! Do you think we would benefit from a comment stating they're not called anywhere?

health_check.send(:write_file)

assert File.directory?('./health'), 'Directory not found'
assert File.exist?('./health/health-abc123.yml'), 'File not found' # rubocop:disable Minitest/AssertPathExists
Contributor Author

I don't understand why, but the `assert_path_exists` assertion would fail here even though the file exists, so I decided to disable the linter rule for this line.

@@ -198,6 +198,7 @@ def connect(options = {})
rescue NewRelic::Agent::UnrecoverableAgentException => e
handle_unrecoverable_agent_error(e)
rescue StandardError, Timeout::Error, NewRelic::Agent::ServerConnectionException => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
Contributor Author

I've struggled to write tests to verify the status is updated when we expect it to be updated. If we think tests are valuable for these updates, I could use some help!
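
One possible way to approach such tests is to hand the code under test a spy that records every call to `update_status`, then force the failure path. The sketch below is self-contained and uses stand-ins: `HealthCheckSpy`, `ConnectorSketch`, and the `:failed_to_connect` value are hypothetical and do not reflect how the agent wires up `NewRelic::Agent.agent.health_check`. In the real suite the spy would have to be substituted through whatever stubbing mechanism the tests already use.

```ruby
# Spy that records every status passed to update_status.
class HealthCheckSpy
  attr_reader :statuses

  def initialize
    @statuses = []
  end

  def update_status(status)
    @statuses << status
  end
end

# Placeholder for NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT (value assumed).
FAILED_TO_CONNECT_SKETCH = :failed_to_connect

# Hypothetical stand-in for the connect path shown in the diff above.
class ConnectorSketch
  def initialize(health_check:, service:)
    @health_check = health_check
    @service = service # any callable that performs the connection attempt
  end

  def connect
    @service.call
    true
  rescue StandardError
    # Mirrors the rescue branch: record the failure before giving up.
    @health_check.update_status(FAILED_TO_CONNECT_SKETCH)
    false
  end
end

# Force the failure path and inspect what the spy recorded.
failing_spy = HealthCheckSpy.new
failing = ConnectorSketch.new(health_check: failing_spy, service: -> { raise IOError, 'no network' })
failing.connect
failing_spy.statuses # contains FAILED_TO_CONNECT_SKETCH exactly once
```

The same spy can assert the negative case: a service that does not raise should leave `statuses` empty.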

@kaylareopelle kaylareopelle marked this pull request as ready for review January 7, 2025 01:16
end

HEALTHY = {healthy: true, last_error: 'NR-APM-000', message: 'Healthy'}
INVALID_LICENSE_KEY = {healthy: false, last_error: 'NR-APM-001', message: 'Invalid license key (HTTP status code 401)'}
Member

Are any of these errors recoverable for the Ruby Agent? i.e. where we could potentially need to set the status back to HEALTHY?

Contributor Author

Great question! Some of these errors are recoverable. From what I'm seeing in the code, a successful HTTP request made by the NewRelicService class to an external endpoint would erase the error state. As of now, this is the only place where I'll update the status to HEALTHY:
https://github.com/newrelic/newrelic-ruby-agent/pull/2995/files#diff-ecdbe8dde33106ec28a206dcdbfca4ea2c647caff89504a82c58dbceaf749e9aR644

However, I may have missed some scenarios and there could be other places where explicitly setting a HEALTHY status would demonstrate recovery.

Instead of creating a new file at the interval, reuse the same file for
the life of the process.

The array was not being correctly destructured, which would raise an
error when the status was HTTP_ERROR.

The health check status may be updated for other reasons on the CI,
which may cause the message to be inaccurate by the time the result is
accessed from the hash.

There was a bug related to the HTTP_ERROR constant, where the sprintf
string manipulation changed the constant to equal the first value it
came across. By freezing the constants and dup'ing the status before
assigning it, we can avoid this problem.
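
The constant-mutation bug and its fix can be reproduced in isolation. This is a hedged reconstruction, not the agent's code: the module, method names, the `NR-APM-XXX` placeholder code, and the message template are assumptions chosen to show the mechanism.

```ruby
# Hypothetical reconstruction of the bug and the fix described above.
module StatusSketch
  # Unfrozen constant: the hash can be changed through any reference to it.
  UNSAFE_HTTP_ERROR = {healthy: false, last_error: 'NR-APM-XXX', message: 'HTTP error response code [%s]'}

  # Frozen constant: attempts to modify the hash raise FrozenError.
  SAFE_HTTP_ERROR = {healthy: false, last_error: 'NR-APM-XXX', message: 'HTTP error response code [%s]'}.freeze

  module_function

  # Buggy: `status` is the constant itself, so the first formatted message
  # overwrites the template and later calls have no %s left to fill in.
  def buggy_status(code)
    status = UNSAFE_HTTP_ERROR
    status[:message] = format(status[:message], code)
    status
  end

  # Fixed: work on an unfrozen shallow copy and leave the template intact.
  def safe_status(code)
    status = SAFE_HTTP_ERROR.dup
    status[:message] = format(status[:message], code)
    status
  end
end

StatusSketch.buggy_status(500)[:message] # 'HTTP error response code [500]'
StatusSketch.buggy_status(503)[:message] # still reports 500: the template was overwritten
StatusSketch.safe_status(500)[:message]  # 'HTTP error response code [500]'
StatusSketch.safe_status(503)[:message]  # 'HTTP error response code [503]'
```

`dup` returns an unfrozen copy of a frozen hash, so assigning the formatted message to the copy is allowed while the constant keeps its `%s` placeholder.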
Contributor

github-actions bot commented Jan 7, 2025

SimpleCov Report

Line coverage: 93.56% (threshold: 93%)

kaylareopelle and others added 3 commits January 7, 2025 16:45
Previously the invalid license key status would be set, but the
health check files would not be written until the first request to the
web app was made.

Now, health checks will begin before the agent officially starts.
Development

Successfully merging this pull request may close these issues.

AC/FC Integration Support
3 participants