- How does this differ from the old collectd agent?
- What if I am currently using the old collectd agent?
- How can I see the datapoints emitted by the agent to troubleshoot issues?
- How can I see what services the agent has discovered?
- Why do other pods in my Kubernetes cluster get stuck terminating?
- How do I monitor CPU usage for Kubernetes pods that have CPU limits?
Upon its initial release, the new agent, called the SignalFx Smart Agent, is essentially a wrapper application around collectd that adds service discovery and automatic configuration of collectd based on the discovered services. Most system metrics, as well as most application metrics, are generated by collectd. Configuration of collectd monitors is largely a passthrough to collectd config options, but in YAML format instead of collectd's custom syntax.
The first major foray outside of collectd was the Kubernetes integration, which uses monitors and observers written purely in Go that run completely independently of collectd. New monitors can now be written independently of collectd to overcome some of that tool's limitations.
The Smart Agent comes with all of its dependencies bundled, so you will not
need a prior collectd installation. If you are currently using the old collectd
agent, uninstall it before installing the Smart Agent. To minimize the load on
your host, make sure the old collectd instance does not run alongside the new
agent; running both wastes resources.
If you have your own homegrown collectd plugins, you can still use them with
the Smart Agent via the `collectd/custom` monitor. You can reuse the
configuration files from your original collectd `managed_config` directory by
adding the following monitor:
```yaml
monitors:
  - type: collectd/custom
    templates:
      - {"#from": "/etc/collectd/managed_config/*.conf", flatten: true, raw: true}
```
We run collectd-python linked against Python 2.7, so any Python plugins must be compatible with Python 2.7.
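If you would rather inline a plugin's configuration instead of pulling in files with `#from`, a minimal sketch might look like the following (the ping plugin and host here are purely illustrative):

```yaml
monitors:
  - type: collectd/custom
    templates:
      - |
        LoadPlugin "ping"
        <Plugin ping>
          Host "example.com"
        </Plugin>
```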
There are two ways: you can either set a config option in the agent to dump
datapoints to the agent logs, or you can use the `signalfx-agent tap-dps`
subcommand to stream them to a separate console.
To dump datapoints to the logs, set the following config in the agent.yaml config file:
```yaml
logging:
  level: debug
writer:
  logDatapoints: true
```
You can also dump a stream of datapoints to a separate console by running the
`signalfx-agent tap-dps` command on the same host as the running agent. Run
`signalfx-agent tap-dps -h` for more information.
Run the following command on the host with the agent. (If you are using the
containerized agent, you don't need to use `sudo`.)
```sh
$ sudo signalfx-agent status endpoints
```
This command dumps out some text listing the discovered service endpoints that the agent knows about.
When running the agent in K8s, we have seen issues where the prescribed host
filesystem mount to /hostfs
inside the agent pod will prevent termination of
other pods on the same node. It appears to be the same issue described in
https://bugzilla.redhat.com/show_bug.cgi?id=1437952 with fluentd containers.
The best thing to do in this case is to unmount Docker/K8s-related mounts
inside the agent container by using the following container command in the
agent's DaemonSet instead of the default `/bin/signalfx-agent`, as well as by
adding the `SYS_ADMIN` capability to the agent container:
```yaml
...
    containers:
    - command:
      - /bin/bash
      - -c
      - /bin/umount-hostfs-non-persistent; exec /bin/signalfx-agent
      name: signalfx-agent
      securityContext:
        capabilities:
          add:
          - SYS_ADMIN
      ...
    ...
...
```
The source for the script `/bin/umount-hostfs-non-persistent` can be found
here, but basically it just runs `umount` on all of the potentially problematic
mounts that we know of. You can add arguments to the script invocation for
additional directories to unmount if necessary.
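For example, if there were an additional problematic mount at a hypothetical path such as `/hostfs/var/lib/example`, you could append it to the script invocation in the DaemonSet command shown above:

```yaml
    - command:
      - /bin/bash
      - -c
      - /bin/umount-hostfs-non-persistent /hostfs/var/lib/example; exec /bin/signalfx-agent
```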
Note that in order to unmount filesystems, you must have the SYS_ADMIN
capability. Because it requires such a broad capability, we don't do the
unmounting by default in order to keep the agent's permissions limited.
We need to mount the host filesystem into the agent pod in order to get disk usage metrics for each disk individually on the node, but unfortunately there is no way to be more selective with what gets mounted by K8s.
CPU usage in Kubernetes (or really any environment where process/container CPU
throttling is active) can be a bit tricky since the usual metrics giving a
container's CPU utilization are absolute values of CPU consumed (e.g. the
docker cpu.percent
metric or the container_cpu_utilization
metric from
cAdvisor), without regard for cgroup limits set by K8s and Docker.
See Resource Quality of Service in Kubernetes for an explanation of requests and limits and how they work in K8s. See CFS Bandwidth Control for more low-level information on how K8s limits are imposed via the Linux kernel.
The primary metrics for container CPU limits are:
- `container_cpu_cfs_throttled_time`: The amount of time (in nanoseconds) that a container's processes have spent throttled.
- `container_cpu_usage_seconds_total`: The total amount of time (in nanoseconds) that a container's processes have spent executing -- this metric is equivalent to `container_cpu_utilization * 10,000,000`.
- `container_spec_cpu_period`: The CFS period length (in microseconds) -- the length of time for which the CFS scheduler considers process usage. This is typically 100,000 microseconds, or 0.1 seconds. This value cannot exceed 1 second.
- `container_spec_cpu_quota`: The CFS quota (in microseconds) -- a process can run for this amount of time within a given CFS period. The value for a given container is derived by dividing the millicore limit value by 1000 and multiplying by the CFS period (e.g. a K8s limit of `500m` would translate to a quota of 50,000 microseconds, assuming the period were 100,000 microseconds).
The first two metrics are cumulative counters that keep growing, so the easiest way to use them is to look at how much they change per second (the default rollup when you look at the metrics in SignalFx). The second two are gauges and generally don't change for the lifetime of the container.
The maximum percentage of time a process can execute in a given second is equal
to container_spec_cpu_quota
/container_spec_cpu_period
. For example, a
process with a quota of 50,000µs and a period of 100,000µs could execute for no
more than half a second, every second. More specifically, within each discrete
100ms window within that second, the process can execute no more than 50ms. In
other words, the rate/sec rollup of container_cpu_usage_seconds_total
should
never exceed 500,000,000
nanoseconds with such a limit. Note that the quota
can be larger than the period, which means that a process could consume more
than an entire core's worth of execution per period.
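To make the arithmetic concrete, here is the example above worked out (using the `500m` limit and default 100,000µs period from earlier):

```
quota          = (500 / 1000) * 100,000µs = 50,000µs per period
max usage rate = (quota / period) * 1e9   = (50,000 / 100,000) * 1e9
               = 500,000,000 ns of execution per second
```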
There are two ways that a container process might be exceeding its limit:
1. The process is being throttled continually and does not have enough CPU to accomplish everything it needs to. The value of `container_cpu_usage_seconds_total` is maxed out for long periods of time based on the formula above. This is a starving process.

2. The process is bursty and needs a lot of CPU for short periods, so it might get throttled within a short time window but is always able to complete execution without backing up indefinitely. The process could do things faster if it had a higher limit, but is not starving for CPU.
Case #1 is almost always a bad situation that should be remedied by some combination of 1) optimizing the application, 2) launching more instances of it if the workload can be distributed (horizontal scaling), or 3) increasing the CPU limit (and potentially the CPU request) on the container (vertical scaling). Case #2 may or may not be bad depending on how time-sensitive its workload is.
To monitor case #1, you can use the formula
`(container_cpu_usage_seconds_total/10000000)/(container_spec_cpu_quota/container_spec_cpu_period)`
to get the percentage of CPU used compared to the limit (0 - 100+). This value can actually exceed 100 because the sampling by the agent is not on a perfectly exact interval.
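As a purely illustrative example, suppose a container with the `500m` limit from above (quota 50,000µs, period 100,000µs) is consuming 400,000,000 ns of CPU per second -- a made-up rate:

```
(400,000,000 / 10,000,000) / (50,000 / 100,000) = 40 / 0.5 = 80
```

That container is running at roughly 80% of its limit.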
For case #2 you need to factor in the container_cpu_cfs_throttled_time
metric. The above metric showing usage relative to the limit will be under 100
in this case but that doesn't mean throttling isn't happening. You can simply
look at container_cpu_cfs_throttled_time
using its default rollup of
rate/sec
which will tell you the raw amount of time a container is spending
throttled. If you have many processes/threads in a container, this number
could be very high. You could compare throttle time to usage time with the
formula
`container_cpu_cfs_throttled_time/container_cpu_usage_seconds_total`
or the equivalent
`container_cpu_cfs_throttled_time/(container_cpu_utilization*10000000)`,
which will tell you the ratio of time the container's processes spent waiting to execute versus the time spent actually executing. Anything over 1 means that the process is spending more time waiting than executing.
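For instance, with made-up per-second rates of 250,000,000 ns spent throttled and 500,000,000 ns spent executing:

```
container_cpu_cfs_throttled_time / container_cpu_usage_seconds_total
  = 250,000,000 / 500,000,000
  = 0.5
```

meaning the container's processes spent half as much time waiting as they did executing -- throttled, but not starving.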