Skip to content

Commit

Permalink
Merge pull request #2995 from newrelic/sa-health-check
Browse files Browse the repository at this point in the history
AC/FC integration
  • Loading branch information
kaylareopelle authored Jan 23, 2025
2 parents 3d4264e + 87a0909 commit a811aa1
Show file tree
Hide file tree
Showing 17 changed files with 689 additions and 3 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,10 @@

Previously, when a customer installed the Ruby agent via [Kubernetes APM auto-attach](https://docs.newrelic.com/docs/kubernetes-pixie/kubernetes-integration/installation/k8s-agent-operator/) and also had the Ruby agent listed in their `Gemfile`, the agent version in `Gemfile` would take precedence. Now, the agent version installed by auto-attach takes priority. [PR#3018](https://github.com/newrelic/newrelic-ruby-agent/pull/3018)

- **Feature: Add health checks when the agent runs within Agent Control**

When the agent is started within an [Agent Control](https://docs-preview.newrelic.com/docs/new-relic-agent-control) environment, a health check file will be created at the configured file location for every agent process. By default, this location is: '/newrelic/apm/health'. The health check files will be updated at the configured frequency, which defaults to every five seconds. [PR#2995](https://github.com/newrelic/newrelic-ruby-agent/pull/2995)

- **Bugfix: Stop emitting inaccurate debug-level log about deprecated configuration options**

In the previous major release, we dropped support for `disable_<library_name>` configuration options in favor of `instrumentation.<library_name>`. Previously, a DEBUG level log warning appeared whenever `disable_*` options were set to `true`, even for libraries (e.g. Action Dispatch) without equivalent `instrumentation.*` options:
Expand Down
4 changes: 4 additions & 0 deletions lib/new_relic/agent/agent.rb
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
require 'new_relic/coerce'
require 'new_relic/agent/autostart'
require 'new_relic/agent/harvester'
require 'new_relic/agent/health_check'
require 'new_relic/agent/hostname'
require 'new_relic/agent/new_relic_service'
require 'new_relic/agent/pipe_service'
Expand Down Expand Up @@ -88,6 +89,7 @@ def init_basics
end

def init_components
@health_check = HealthCheck.new
@service = NewRelicService.new
@events = EventListener.new
@stats_engine = StatsEngine.new
Expand Down Expand Up @@ -139,6 +141,8 @@ def instance
# Holds all the methods defined on NewRelic::Agent::Agent
# instances
module InstanceMethods
# the agent control health check file generator
attr_reader :health_check
# the statistics engine that holds all the timeslice data
attr_reader :stats_engine
# the transaction sampler that handles recording transactions
Expand Down
3 changes: 3 additions & 0 deletions lib/new_relic/agent/agent_helpers/connect.rb
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,7 @@ def log_error(error)
def handle_license_error(error)
::NewRelic::Agent.logger.error(error.message,
'Visit newrelic.com to obtain a valid license key, or to upgrade your account.')
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::INVALID_LICENSE_KEY)
disconnect
end

Expand Down Expand Up @@ -191,13 +192,15 @@ def connect(options = {})
@connected_pid = $$
@connect_state = :connected
signal_connected
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::HEALTHY)
rescue NewRelic::Agent::ForceDisconnectException => e
handle_force_disconnect(e)
rescue NewRelic::Agent::LicenseException => e
handle_license_error(e)
rescue NewRelic::Agent::UnrecoverableAgentException => e
handle_unrecoverable_agent_error(e)
rescue StandardError, Timeout::Error, NewRelic::Agent::ServerConnectionException => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
retry if retry_from_error?(e, opts)
rescue Exception => e
::NewRelic::Agent.logger.error('Exception of unexpected type during Agent#connect():', e)
Expand Down
3 changes: 3 additions & 0 deletions lib/new_relic/agent/agent_helpers/harvest.rb
Original file line number Diff line number Diff line change
Expand Up @@ -119,6 +119,7 @@ def send_data_to_endpoint(endpoint, payload, container)
rescue UnrecoverableServerException => e
NewRelic::Agent.logger.warn("#{endpoint} data was rejected by remote service, discarding. Error: ", e)
rescue ServerConnectionException => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
log_remote_unavailable(endpoint, e)
container.merge!(payload)
rescue => e
Expand All @@ -133,9 +134,11 @@ def check_for_and_handle_agent_commands
rescue ForceRestartException, ForceDisconnectException
raise
rescue UnrecoverableServerException => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
NewRelic::Agent.logger.warn('get_agent_commands message was rejected by remote service, discarding. ' \
'Error: ', e)
rescue ServerConnectionException => e
NewRelic::Agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
log_remote_unavailable(:get_agent_commands, e)
rescue => e
NewRelic::Agent.logger.info('Error during check_for_and_handle_agent_commands, will retry later: ', e)
Expand Down
3 changes: 3 additions & 0 deletions lib/new_relic/agent/agent_helpers/shutdown.rb
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,9 @@ def shutdown
revert_to_default_configuration

@started = nil

NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::SHUTDOWN) if NewRelic::Agent.agent.health_check.healthy?

Control.reset
end

Expand Down
1 change: 1 addition & 0 deletions lib/new_relic/agent/agent_helpers/start_worker_thread.rb
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,7 @@ def handle_force_restart(error)
# is the worker thread that gathers data and talks to the
# server.
def handle_force_disconnect(error)
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FORCED_DISCONNECT)
::NewRelic::Agent.logger.warn('Agent received a ForceDisconnectException from the server, disconnecting. ' \
"(#{error.message})")
disconnect
Expand Down
7 changes: 7 additions & 0 deletions lib/new_relic/agent/agent_helpers/startup.rb
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,9 @@ def start
# setting up the worker thread and the exit handler to shut
# down the agent
def check_config_and_start_agent
# some health statuses, such as invalid license key, are ran before
# the agent officially starts
@health_check.create_and_run_health_check_loop
return unless monitoring? && has_correct_license_key?
return if using_forking_dispatcher?

Expand Down Expand Up @@ -129,6 +132,7 @@ def monitoring?
if Agent.config[:monitor_mode]
true
else
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::AGENT_DISABLED)
::NewRelic::Agent.logger.warn('Agent configured not to send data in this environment.')
false
end
Expand All @@ -140,6 +144,7 @@ def has_license_key?
if Agent.config[:license_key] && Agent.config[:license_key].length > 0
true
else
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::MISSING_LICENSE_KEY)
::NewRelic::Agent.logger.warn('No license key found. ' +
'This often means your newrelic.yml file was not found, or it lacks a section for the running ' \
"environment, '#{NewRelic::Control.instance.env}'. You may also want to try linting your newrelic.yml " \
Expand All @@ -160,6 +165,7 @@ def correct_license_length
if key.length == 40
true
else
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::INVALID_LICENSE_KEY)
::NewRelic::Agent.logger.error("Invalid license key: #{key}")
false
end
Expand All @@ -180,6 +186,7 @@ def agent_should_start?
end

unless app_name_configured?
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::MISSING_APP_NAME)
NewRelic::Agent.logger.error('No application name configured.',
'The agent cannot start without at least one. Please check your ',
'newrelic.yml and ensure that it is valid and has at least one ',
Expand Down
22 changes: 22 additions & 0 deletions lib/new_relic/agent/configuration/default_source.rb
Original file line number Diff line number Diff line change
Expand Up @@ -2188,6 +2188,28 @@ def self.notify
:transform => DefaultSource.method(:convert_to_constant_list),
:description => 'Specify a list of exceptions you do not want the agent to strip when [strip_exception_messages](#strip_exception_messages-enabled) is `true`. Separate exceptions with a comma. For example, `"ImportantException,PreserveMessageException"`.'
},
# Agent Control
:'agent_control.enabled' => {
:default => false,
:public => false,
:type => Boolean,
:allowed_from_server => false,
:description => 'Boolean value that denotes whether Agent Control functionality should be enabled. At the moment, this functionality is limited to whether agent health should be reported. This configuration will be set using an environment variable by Agent Control, or one of its components, prior to agent startup.'
},
:'agent_control.health.delivery_location' => {
:default => '/newrelic/apm/health',
:public => false,
:type => String,
:allowed_from_server => false,
:description => 'A `file:` URI that specifies the fully qualified directory path for health file(s) to be written to. This defaults to: `file:///newrelic/apm/health`. This configuration will be set using an environment variable by Agent Control, or one of its components, prior to agent startup.'
},
:'agent_control.health.frequency' => {
:default => 5,
:public => false,
:type => Integer,
:allowed_from_server => false,
:description => 'The interval, in seconds, of how often the health file(s) will be written to. This configuration will be set using an environment variable by Agent Control, or one of its components, prior to agent startup.'
},
# Thread profiler
:'thread_profiler.enabled' => {
:default => DefaultSource.thread_profiler_enabled,
Expand Down
2 changes: 2 additions & 0 deletions lib/new_relic/agent/configuration/yaml_source.rb
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ def initialize(path, env)
erb_file = process_erb(raw_file)
config = process_yaml(erb_file, env, config, @file_path)
rescue ScriptError, StandardError => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_PARSE_CONFIG)
log_failure("Failed to read or parse configuration file at #{path}", e)
end

Expand Down Expand Up @@ -99,6 +100,7 @@ def process_erb(file)
file.gsub!(/^\s*#.*$/, '#')
ERB.new(file).result(binding)
rescue ScriptError, StandardError => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_PARSE_CONFIG)
message = 'Failed ERB processing configuration file. This is typically caused by a Ruby error in <% %> templating blocks in your newrelic.yml file.'
failure_array = [message, e]
failure_array << e.backtrace[0] if Gem::Version.new(RUBY_VERSION) >= Gem::Version.new('3.4.0')
Expand Down
136 changes: 136 additions & 0 deletions lib/new_relic/agent/health_check.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
# This file is distributed under New Relic's license terms.
# See https://github.com/newrelic/newrelic-ruby-agent/blob/main/LICENSE for complete details.
# frozen_string_literal: true

module NewRelic
module Agent
class HealthCheck
def initialize
@start_time = nano_time
@continue = true
@status = HEALTHY
# the following assignments may set @continue = false if they are invalid
set_enabled
set_delivery_location
set_frequency
end

HEALTHY = {healthy: true, last_error: 'NR-APM-000', message: 'Healthy'}.freeze
INVALID_LICENSE_KEY = {healthy: false, last_error: 'NR-APM-001', message: 'Invalid license key (HTTP status code 401)'}.freeze
MISSING_LICENSE_KEY = {healthy: false, last_error: 'NR-APM-002', message: 'License key missing in configuration'}.freeze
FORCED_DISCONNECT = {healthy: false, last_error: 'NR-APM-003', message: 'Forced disconnect received from New Relic (HTTP status code 410)'}.freeze
HTTP_ERROR = {healthy: false, last_error: 'NR-APM-004', message: 'HTTP error response code [%s] recevied from New Relic while sending data type [%s]'}.freeze
MISSING_APP_NAME = {healthy: false, last_error: 'NR-APM-005', message: 'Missing application name in agent configuration'}.freeze
APP_NAME_EXCEEDED = {healthy: false, last_error: 'NR-APM-006', message: 'The maximum number of configured app names (3) exceeded'}.freeze
PROXY_CONFIG_ERROR = {healthy: false, last_error: 'NR-APM-007', message: 'HTTP Proxy configuration error; response code [%s]'}.freeze
AGENT_DISABLED = {healthy: false, last_error: 'NR-APM-008', message: 'Agent is disabled via configuration'}.freeze
FAILED_TO_CONNECT = {healthy: false, last_error: 'NR-APM-009', message: 'Failed to connect to New Relic data collector'}.freeze
FAILED_TO_PARSE_CONFIG = {healthy: false, last_error: 'NR-APM-010', message: 'Agent config file is not able to be parsed'}.freeze
SHUTDOWN = {healthy: true, last_error: 'NR-APM-099', message: 'Agent has shutdown'}.freeze

def create_and_run_health_check_loop
return unless health_checks_enabled? && @continue

NewRelic::Agent.logger.debug('Agent Control health check conditions met. Starting health checks.')
NewRelic::Agent.record_metric('Supportability/AgentControl/Health/enabled', 1)

Thread.new do
while @continue
begin
sleep @frequency
write_file
@continue = false if @status == SHUTDOWN
rescue StandardError => e
NewRelic::Agent.logger.error("Aborting Agent Control health check. Error raised: #{e}")
@continue = false
end
end
end
end

def update_status(status, options = [])
return unless @continue

@status = status.dup
update_message(options) unless options.empty?
end

def healthy?
@status == HEALTHY
end

private

def set_enabled
@enabled = if ENV['NEW_RELIC_AGENT_CONTROL_ENABLED'] == 'true'
true
else
NewRelic::Agent.logger.debug('NEW_RELIC_AGENT_CONTROL_ENABLED not true, disabling health checks')
@continue = false
false
end
end

def set_delivery_location
@delivery_location = if ENV['NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION']
# The spec states file paths for the delivery location will begin with file://
# This does not create a valid path in Ruby, so remove the prefix when present
ENV['NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION']&.gsub('file://', '')
else
# The spec default is 'file:///newrelic/apm/health', but since we're just going to remove it anyway...
'/newrelic/apm/health'
end
end

def set_frequency
@frequency = ENV['NEW_RELIC_AGENT_CONTROL_HEALTH_FREQUENCY'] ? ENV['NEW_RELIC_AGENT_CONTROL_HEALTH_FREQUENCY'].to_i : 5

if @frequency <= 0
NewRelic::Agent.logger.debug('NEW_RELIC_AGENT_CONTROL_HEALTH_FREQUENCY zero or less, disabling health checks')
@continue = false
end
end

def contents
<<~CONTENTS
healthy: #{@status[:healthy]}
status: #{@status[:message]}#{last_error}
start_time_unix_nano: #{@start_time}
status_time_unix_nano: #{nano_time}
CONTENTS
end

def last_error
@status[:healthy] ? '' : "\nlast_error: #{@status[:last_error]}"
end

def nano_time
Process.clock_gettime(Process::CLOCK_REALTIME, :nanosecond)
end

def file_name
"health-#{NewRelic::Agent::GuidGenerator.generate_guid(32)}.yml"
end

def write_file
@file ||= "#{@delivery_location}/#{file_name}"

File.write(@file, contents)
rescue StandardError => e
NewRelic::Agent.logger.error("Agent Control health check raised an error while writing a file: #{e}")
@continue = false
end

def health_checks_enabled?
@enabled && @delivery_location && @frequency > 0
end

def update_message(options)
@status[:message] = sprintf(@status[:message], *options)
rescue StandardError => e
NewRelic::Agent.logger.debug("Error raised while updating Agent Control health check message: #{e}." \
"options = #{options}, @status[:message] = #{@status[:message]}")
end
end
end
end
10 changes: 8 additions & 2 deletions lib/new_relic/agent/new_relic_service.rb
Original file line number Diff line number Diff line change
Expand Up @@ -455,6 +455,8 @@ def attempt_request(request, opts)
end

def handle_error_response(response, endpoint)
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::HTTP_ERROR, [response.code, endpoint])

case response
when Net::HTTPRequestTimeOut,
Net::HTTPTooManyRequests,
Expand Down Expand Up @@ -637,9 +639,13 @@ def check_post_size(post_string, endpoint)
def send_request(opts)
request = prep_request(opts)
response = relay_request(request, opts)
return response if response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPAccepted)

handle_error_response(response, opts[:endpoint])
if response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPAccepted)
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::HEALTHY)
response
else
handle_error_response(response, opts[:endpoint])
end
end

def log_response(response)
Expand Down
7 changes: 7 additions & 0 deletions test/agent_helper.rb
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,13 @@ def assert_log_contains(log, message)
"Could not find message: '#{message.inspect}'. Log contained: #{lines.join("\n")}"
end

def refute_log_contains(log, message)
lines = log.array

refute (lines.any? { |line| line.match(message) }),
"Found message: '#{message.inspect}'. Log contained: #{lines.join("\n")}"
end

def assert_audit_log_contains(audit_log_contents, needle)
# Original request bodies dumped to the log have symbol keys, but once
# they go through a dump/load, they're strings again, so we strip
Expand Down
Loading

0 comments on commit a811aa1

Please sign in to comment.