---

copyright:
  years: 2014, 2019
lastupdated: "2019-06-12"

keywords: kubernetes, iks, logmet, logs, metrics

subcollection: containers

---

{:new_window: target="_blank"} {:shortdesc: .shortdesc} {:screen: .screen} {:pre: .pre} {:table: .aria-labeledby="caption"} {:codeblock: .codeblock} {:tip: .tip} {:note: .note} {:important: .important} {:deprecated: .deprecated} {:download: .download} {:preview: .preview}

Logging and monitoring

{: #health}

Set up logging and monitoring in {{site.data.keyword.containerlong}} to help you troubleshoot issues and improve the health and performance of your Kubernetes clusters and apps. {: shortdesc}

Continuous monitoring and logging are key to detecting attacks on your cluster and troubleshooting issues as they arise. By continuously monitoring your cluster, you can better understand your cluster capacity and the resources that are available to your app. With this insight, you can prepare to protect your apps against downtime. Note: To configure logging and monitoring, you must use a standard cluster in {{site.data.keyword.containerlong_notm}}.

Choosing a logging solution

{: #logging_overview}

By default, logs are generated and written locally for all of the following {{site.data.keyword.containerlong_notm}} cluster components: worker nodes, containers, applications, persistent storage, Ingress application load balancer, Kubernetes API, and the kube-system namespace. Several logging solutions are available to collect, forward, and view these logs. {: shortdesc}

You can choose your logging solution based on which cluster components you need to collect logs for. A common implementation is to choose a logging service that you prefer based on its analysis and interface capabilities, such as {{site.data.keyword.loganalysisfull}}, {{site.data.keyword.la_full}}, or a third-party service. You can then use {{site.data.keyword.cloudaccesstrailfull}} to audit user activity in the cluster and back up cluster master logs to {{site.data.keyword.cos_full}}. Note: To configure logging, you must have a standard Kubernetes cluster.

{{site.data.keyword.la_full_notm}}

Manage pod container logs by deploying LogDNA as a third-party service to your cluster. To use {{site.data.keyword.la_full_notm}}, you must deploy a logging agent to every worker node in your cluster. This agent collects logs with the extension `*.log` and extensionless files that are stored in the `/var/log` directory of your pod from all namespaces, including `kube-system`. The agent then forwards the logs to the {{site.data.keyword.la_full_notm}} service. For more information about the service, see the [{{site.data.keyword.la_full_notm}}](/docs/services/Log-Analysis-with-LogDNA?topic=LogDNA-about) documentation. To get started, see [Managing Kubernetes cluster logs with {{site.data.keyword.loganalysisfull_notm}} with LogDNA](/docs/services/Log-Analysis-with-LogDNA/tutorials?topic=LogDNA-kube#kube).

Fluentd with {{site.data.keyword.loganalysisfull_notm}}

Previously, you could create a logging configuration to forward logs that are collected by the Fluentd cluster component to {{site.data.keyword.loganalysisfull_notm}}. As of 30 April 2019, you cannot provision new {{site.data.keyword.loganalysisshort_notm}} instances, and all Lite plan instances are deleted. Existing premium plan instances are supported until 30 September 2019. To continue collecting logs for your cluster, you must set up {{site.data.keyword.la_full_notm}} or change your configuration to forward logs to an external server.

Fluentd with an external server

To collect, forward, and view logs for a cluster component, you can create a logging configuration by using Fluentd. When you create a logging configuration, the [Fluentd ![External link icon](../icons/launch-glyph.svg "External link icon")](https://www.fluentd.org/) cluster component collects logs from the paths for a specified source. Fluentd can then forward these logs to an external server that accepts a syslog protocol. To get started, see [Understanding cluster and app log forwarding to syslog](#logging).

{{site.data.keyword.cloudaccesstrailfull_notm}}

To monitor user-initiated administrative activity made in your cluster, you can collect and forward audit logs to {{site.data.keyword.cloudaccesstrailfull_notm}}. Clusters generate two types of {{site.data.keyword.cloudaccesstrailshort}} events.

Cluster management events are automatically generated and forwarded to {{site.data.keyword.cloudaccesstrailshort}}.
Kubernetes API server audit events are automatically generated, but you must [create a logging configuration](#api_forward) so that Fluentd can forward these logs to {{site.data.keyword.cloudaccesstrailshort}}.

For more information about the types of {{site.data.keyword.containerlong_notm}} events that you can track, see [Activity Tracker events](/docs/containers?topic=containers-at_events). For more information about the service, see the [Activity Tracker](/docs/services/cloud-activity-tracker?topic=cloud-activity-tracker-getting-started) documentation.

{{site.data.keyword.containerlong_notm}} is currently not configured to use {{site.data.keyword.at_full}}. To manage cluster management events and Kubernetes API audit logs, continue to use {{site.data.keyword.cloudaccesstrailfull_notm}} with LogAnalysis.

{{site.data.keyword.cos_full_notm}}

To collect, forward, and view logs for your cluster's Kubernetes master, you can take a snapshot of your master logs at any point in time to collect in an {{site.data.keyword.cos_full_notm}} bucket. The snapshot includes anything that is sent through the API server, such as pod scheduling, deployments, or RBAC policies. To get started, see [Collecting master logs](#collect_master).

Third-party services

If you have special requirements, you can set up your own logging solution. Check out third-party logging services that you can add to your cluster in [Logging and monitoring integrations](/docs/containers?topic=containers-supported_integrations#health_services). You can collect container logs from the `/var/log/pods/` path.

Forwarding cluster and app logs to {{site.data.keyword.la_full_notm}}

{: #logdna}

Manage pod container logs by deploying LogDNA as a third-party service to your cluster. {: shortdesc}

To use {{site.data.keyword.la_full_notm}}, you must deploy a logging agent to every worker node in your cluster. This agent collects logs with the extension `*.log` and extensionless files that are stored in the `/var/log` directory of your pod from all namespaces, including `kube-system`. The agent then forwards the logs to the {{site.data.keyword.la_full_notm}} service. For more information about the service, see the {{site.data.keyword.la_full_notm}} documentation. To get started, see Managing Kubernetes cluster logs with {{site.data.keyword.loganalysisfull_notm}} with LogDNA.
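
To preview which files on a machine fit that description, the following minimal sketch lists files that either end in `.log` or have no extension. It only approximates the rule stated above; the helper name `list_collectable_logs` is hypothetical, and the agent's actual matching logic might differ.

```sh
# Hypothetical helper: list files that end in .log or have no extension.
# This approximates the selection rule described above; it is not the agent's code.
list_collectable_logs() {
  dir="${1:-/var/log}"
  find "$dir" -type f \( -name '*.log' -o ! -name '*.*' \) | sort
}

# Example: preview matches in /var/log on the current machine.
# list_collectable_logs /var/log
```
{: codeblock}

Files such as `archive.gz` are skipped because they have an extension other than `.log`.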

Deprecated: Forwarding cluster, app, and Kubernetes API audit logs to {{site.data.keyword.loganalysisfull_notm}}

{: #loga}

Previously, you could create a logging configuration to forward logs that are collected by the Fluentd cluster component to {{site.data.keyword.loganalysisfull_notm}}. As of 30 April 2019, {{site.data.keyword.loganalysisfull_notm}} is deprecated. You cannot provision new {{site.data.keyword.loganalysisshort_notm}} instances, and all Lite plan instances are deleted. Existing premium plan instances are supported until 30 September 2019. {: deprecated}

To continue collecting logs for your cluster, you have the following options:

Set up {{site.data.keyword.la_full_notm}}. For more information, see Transitioning to {{site.data.keyword.la_full_notm}}.
Change your configuration to forward logs to an external server.

For more information about existing {{site.data.keyword.loganalysisshort_notm}} instances, see the {{site.data.keyword.loganalysisshort_notm}} documentation.

Forwarding cluster, app, and Kubernetes API audit logs to an external server

{: #configuring}

Configure log forwarding for {{site.data.keyword.containerlong_notm}} standard clusters to an external server. {: shortdesc}

Understanding log forwarding to an external server

{: #logging}

By default, logs are collected by the Fluentd add-on in your cluster. When you create a logging configuration for a source in your cluster such as a container, the logs that Fluentd collects from that source's paths are forwarded to an external server. The traffic from the source to the logging service on the ingestion port is encrypted. {: shortdesc}

What are the sources that I can configure log forwarding for?

You can configure log forwarding for the following sources. Each entry lists the paths that logs are collected from.

worker: Information that is specific to the infrastructure configuration that you have for your worker node. Worker logs are captured in syslog and contain operating system events. In auth.log you can find information on the authentication requests that are made to the OS.
Paths:
- /var/log/syslog
- /var/log/auth.log
container: Information that is logged by a running container.
Paths: Anything that is written to STDOUT or STDERR.
application: Information about events that occur at the application level. This could be a notification that an event took place such as a successful login, a warning about storage, or other operations that can be performed at the app level.
Paths: You can set the paths that your logs are forwarded to. However, in order for logs to be sent, you must use an absolute path in your logging configuration or the logs cannot be read. If your path is mounted to your worker node, it might have created a symlink. Example: If the specified path is /usr/local/spark/work/app-0546/0/stderr but the logs actually go to /usr/local/spark-1.0-hadoop-1.2/work/app-0546/0/stderr, then the logs cannot be read.
storage: Information about persistent storage that is set up in your cluster. Storage logs can help you set up problem determination dashboards and alerts as part of your DevOps pipeline and production releases. Note: The paths /var/log/kubelet.log and /var/log/syslog also contain storage logs, but logs from these paths are collected by the kubernetes and worker log sources.
Paths:
- /var/log/ibmc-s3fs.log
- /var/log/ibmc-block.log

Pods:
- portworx-***
- ibmcloud-block-storage-attacher-***
- ibmcloud-block-storage-driver-***
- ibmcloud-block-storage-plugin-***
- ibmcloud-object-storage-plugin-***

kubernetes: Information from the kubelet, the kube-proxy, and other Kubernetes events that happen in the kube-system namespace of the worker node.
Paths:
- /var/log/kubelet.log
- /var/log/kube-proxy.log
- /var/log/event-exporter/1..log
kube-audit: Information about cluster-related actions that is sent to the Kubernetes API server, including the time, the user, and the affected resource.
ingress: Information about the network traffic that comes into a cluster through the Ingress ALB.
Paths:
- /var/log/alb/ids/*.log
- /var/log/alb/ids/*.err
- /var/log/alb/customerlogs/*.log
- /var/log/alb/customerlogs/*.err
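
The symlink caveat for the `application` source can be checked before you create a logging configuration. The sketch below compares a configured path with its fully resolved form; if the two differ, a symlink is involved and logs at the configured path might not be read. The helper name `check_app_path` is hypothetical, and the sketch assumes that `readlink -f` is available where you run it (for example, inside the app container).

```sh
# Hypothetical check: does the configured app path resolve to itself?
# Prints "ok" when no symlink is involved, otherwise prints the resolved path.
# Assumes `readlink -f` is available (GNU coreutils).
check_app_path() {
  configured="$1"
  resolved=$(readlink -f "$configured") || return 1
  if [ "$resolved" = "$configured" ]; then
    echo "ok"
  else
    echo "resolves to $resolved"
  fi
}

# Example with the path from the text above:
# check_app_path /usr/local/spark/work/app-0546/0/stderr
```
{: codeblock}

If the helper reports a resolved path, consider using that resolved path for `--app-paths` instead.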

What configuration options do I have?

The following table shows the different options that you have when you configure logging and their descriptions.

Understanding logging configuration options

Option	Description
`<cluster_name_or_ID>`	The name or ID of the cluster.
`--logsource`	The source that you want to forward logs from. Accepted values are `container`, `application`, `worker`, `kubernetes`, `ingress`, `storage`, and `kube-audit`. This argument supports a comma-separated list of log sources to apply to the configuration. If you do not provide a log source, logging configurations are created for `container` and `ingress` log sources.
`--type syslog`	The value `syslog` forwards your logs to an external server.
`--namespace`	Optional: The Kubernetes namespace that you want to forward logs from. Log forwarding is not supported for the `ibm-system` and `kube-system` Kubernetes namespaces. This value is valid only for the `container` log source. If you do not specify a namespace, then all namespaces in the cluster use this configuration.
`--hostname`	For {{site.data.keyword.loganalysisshort_notm}}, use the [ingestion URL](/docs/services/CloudLogAnalysis?topic=cloudloganalysis-log_ingestion#log_ingestion_urls). If you do not specify an ingestion URL, the endpoint for the region in which you created your cluster is used. For syslog, specify the host name or IP address of the log collector service.
`--port`	The ingestion port. If you do not specify a port, then the standard port `9091` is used. For syslog, specify the port of the log collector server. If you do not specify a port, then the standard port `514` is used.
`--app-containers`	Optional: To forward logs from apps, you can specify the name of the container that contains your app. You can specify more than one container by using a comma-separated list. If no containers are specified, logs are forwarded from all of the containers that contain the paths that you provided.
`--app-paths`	The path on a container that the apps log to. To forward logs with source type `application`, you must provide a path. To specify more than one path, use a comma-separated list. Example: `/var/log/myApp1/,/var/log/myApp2/`
`--syslog-protocol`	When the logging type is `syslog`, the transport layer protocol. You can use the following protocols: `udp`, `tls`, or `tcp`. When forwarding to an rsyslog server with the `udp` protocol, logs that are over 1 KB are truncated.
`--ca-cert`	Required: When the logging type is `syslog` and the protocol is `tls`, the Kubernetes secret name that contains the Certificate Authority certificate.
`--verify-mode`	When the logging type is `syslog` and the protocol is `tls`, the verification mode. Supported values are `verify-peer` and the default `verify-none`.
`--skip-validation`	Optional: Skip the validation of the org and space names when they are specified. Skipping validation decreases processing time, but an invalid logging configuration will not correctly forward logs.
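
The 1 KB truncation limit noted for the `udp` protocol can be checked against sample log lines before you choose a protocol. The sketch below measures a line's size in bytes. It assumes that 1 KB means 1024 bytes, and the helper name `exceeds_udp_limit` is hypothetical.

```sh
# Hypothetical helper: succeed (exit status 0) when a log line exceeds the limit.
# Assumes "1 KB" means 1024 bytes; adjust LIMIT if your server differs.
LIMIT=1024
exceeds_udp_limit() {
  # Arithmetic expansion strips any padding that wc adds around the count.
  size=$(( $(printf '%s' "$1" | wc -c) ))
  [ "$size" -gt "$LIMIT" ]
}

# Example:
# exceeds_udp_limit "$line" && echo "line would be truncated over udp"
```
{: codeblock}

If many of your log lines exceed the limit, `tcp` or `tls` might be a better fit than `udp`.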

Am I responsible for keeping Fluentd updated?

In order to change your logging or filter configurations, the Fluentd logging add-on must be at the latest version. By default, automatic updates to the add-on are enabled. To disable automatic updates, see Updating cluster add-ons: Fluentd for logging.

Can I forward some logs, but not others, from one source in my cluster?

Yes. For example, if you have a particularly chatty pod, you might want to prevent logs from that pod from taking up log storage space, while still allowing other pods' logs to be forwarded. To prevent logs from a specific pod from being forwarded, see Filtering logs.

Forwarding cluster and app logs

{: #enable-forwarding}

Create a configuration for cluster and app logging. You can differentiate between the different logging options by using flags. {: shortdesc}

Forwarding logs to your own server over the udp or tcp protocols

Ensure that you have the Editor or Administrator {{site.data.keyword.cloud_notm}} IAM platform role.
For the cluster where the log source is located: Log in to your account. If applicable, target the appropriate resource group. Set the context for your cluster.
Set up a server that accepts a syslog protocol in 1 of 2 ways:

Set up and manage your own server or have a provider manage it for you. If a provider manages the server for you, get the logging endpoint from the logging provider.
Run syslog from a container. For example, you can use this deployment .yaml file to fetch a Docker public image that runs a container in your cluster. The image publishes the port 514 on the public cluster IP address, and uses this public cluster IP address to configure the syslog host.

You can see your logs as valid JSON by removing syslog prefixes. To do so, add the following lines to the top of the `/etc/rsyslog.conf` file on the machine where your rsyslog server is running:
`$template customFormat,"%msg%\n"`
`$ActionFileDefaultTemplate customFormat`
{: tip}

Create a log forwarding configuration.

ibmcloud ks logging-config-create --cluster <cluster_name_or_ID> --logsource <log_source> --namespace <kubernetes_namespace> --hostname <log_server_hostname_or_IP> --port <log_server_port> --type syslog --app-containers <containers> --app-paths <paths_to_logs> --syslog-protocol <protocol> --skip-validation

{: pre}
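
As an illustration of how the flags combine, the following sketch assembles a command that forwards `container` logs from one namespace to a syslog server over `tcp`. Every value is hypothetical (the cluster name, the namespace, and the documentation-range IP address), and the function only prints the command so that you can review it before you run it.

```sh
# Hypothetical values for illustration only.
CLUSTER="mycluster"
NAMESPACE="default"
SYSLOG_HOST="203.0.113.10"   # documentation-range IP address, not a real server
SYSLOG_PORT="514"

# Print the command instead of running it so that you can review it first.
build_syslog_config_cmd() {
  printf '%s\n' "ibmcloud ks logging-config-create --cluster $CLUSTER --logsource container --namespace $NAMESPACE --hostname $SYSLOG_HOST --port $SYSLOG_PORT --type syslog --syslog-protocol tcp"
}

build_syslog_config_cmd
```
{: codeblock}

The `--app-containers` and `--app-paths` flags are left out because they apply to the `application` log source.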

Forwarding logs to your own server over the tls protocol

The following steps are general instructions. Before you use the container in a production environment, be sure that any security requirements that you have are met. {: tip}

Ensure that you have the following {{site.data.keyword.cloud_notm}} IAM roles:
- Editor or Administrator platform role for the cluster
- Writer or Manager service role for the kube-system namespace
For the cluster where the log source is located: Log in to your account. If applicable, target the appropriate resource group. Set the context for your cluster.
Set up a server that accepts a syslog protocol in 1 of 2 ways:

Set up and manage your own server or have a provider manage it for you. If a provider manages the server for you, get the logging endpoint from the logging provider.
Run syslog from a container. For example, you can use this deployment .yaml file to fetch a Docker public image that runs a container in your cluster. The image publishes the port 514 on the public cluster IP address, and uses this public cluster IP address to configure the syslog host. You need to inject the relevant Certificate Authority and server-side certificates and update the syslog.conf to enable tls on your server.

Save your Certificate Authority certificate to a file named ca-cert. It must be that exact name.
Create a secret in the kube-system namespace for the ca-cert file. When you create your logging configuration, use the secret name for the --ca-cert flag.
```
kubectl -n kube-system create secret generic <secret_name> --from-file=ca-cert
```
{: pre}

Create a log forwarding configuration.

ibmcloud ks logging-config-create --cluster <cluster_name_or_ID> --logsource <log_source> --type syslog --syslog-protocol tls --hostname <log_server_hostname_or_IP> --port <log_server_port> --ca-cert <secret_name> --verify-mode <verify_mode>

{: pre}
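
Because the certificate file must be named exactly `ca-cert`, a small preflight check can catch a misnamed file before you create the secret. The following sketch is one way to do that; the helper name `ca_secret_cmd` and the secret name `syslog-ca` are hypothetical, and the function prints the `kubectl` command rather than running it.

```sh
# Hypothetical preflight: the CA certificate file must be named exactly "ca-cert".
# Prints the kubectl command to run; SECRET_NAME is an illustrative value.
SECRET_NAME="syslog-ca"
ca_secret_cmd() {
  cert_dir="${1:-.}"
  if [ ! -f "$cert_dir/ca-cert" ]; then
    echo "error: $cert_dir/ca-cert not found" >&2
    return 1
  fi
  printf '%s\n' "kubectl -n kube-system create secret generic $SECRET_NAME --from-file=$cert_dir/ca-cert"
}

# Example: check the current directory.
# ca_secret_cmd .
```
{: codeblock}

The same secret name is then the value for the `--ca-cert` flag in the logging configuration.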

Forwarding Kubernetes API audit logs

{: #audit_enable}

To audit any events that are passed through your Kubernetes API server, you can create a configuration to forward events to your external server. {: shortdesc}

For more information about Kubernetes audit logs, see the auditing topic in the Kubernetes documentation.

Currently, a default audit policy is used for all clusters with this logging configuration.
Currently, filters are not supported.
There can be only one kube-audit configuration per cluster, but you can forward logs to {{site.data.keyword.cloudaccesstrailshort}} and an external server by creating a logging configuration and a webhook.
You must have the Administrator {{site.data.keyword.cloud_notm}} IAM platform role for the cluster.

Before you begin

Set up a remote logging server where you can forward the logs. For example, you can use Logstash with Kubernetes to collect audit events.
For the cluster that you want to collect API server audit logs from: Log in to your account. If applicable, target the appropriate resource group. Set the context for your cluster.

To forward Kubernetes API audit logs:

Set up the webhook. If you do not provide any information in the flags, a default configuration is used.

ibmcloud ks apiserver-config-set audit-webhook <cluster_name_or_ID> --remoteServer <server_URL_or_IP> --caCert <CA_cert_path> --clientCert <client_cert_path> --clientKey <client_key_path>

{: pre}

Understanding this command's components

`<cluster_name_or_ID>`	The name or ID of the cluster.
`<server_URL_or_IP>`	The URL or IP address for the remote logging service that you want to send logs to. Certificates are ignored if you provide an unsecure server URL.
`<CA_cert_path>`	The file path for the CA certificate that is used to verify the remote logging service.
`<client_cert_path>`	The file path for the client certificate that is used to authenticate against the remote logging service.
`<client_key_path>`	The file path for the corresponding client key that is used to connect to the remote logging service.

Verify that log forwarding was enabled by viewing the URL for the remote logging service.
```
ibmcloud ks apiserver-config-get audit-webhook <cluster_name_or_ID>
```
{: pre}

Example output:
```
OK
Server:			https://8.8.8.8
```
{: screen}
Apply the configuration update by restarting the Kubernetes master.
```
ibmcloud ks apiserver-refresh --cluster <cluster_name_or_ID>
```
{: pre}
Optional: If you want to stop forwarding audit logs, you can disable your configuration.
1. For the cluster that you want to stop collecting API server audit logs from: Log in to your account. If applicable, target the appropriate resource group. Set the context for your cluster.
2. Disable the webhook back-end configuration for the cluster's API server.
```
ibmcloud ks apiserver-config-unset audit-webhook <cluster_name_or_ID>
```
  {: pre}
3. Apply the configuration update by restarting the Kubernetes master.
```
ibmcloud ks apiserver-refresh --cluster <cluster_name_or_ID>
```
  {: pre}
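
Because certificates are ignored when the `--remoteServer` value is an unsecure URL, it can help to classify the URL before you set the webhook. The sketch below is a minimal check that assumes "unsecure" means anything that does not start with `https://`; that interpretation and the helper name `webhook_url_kind` are assumptions, not documented behavior.

```sh
# Hypothetical helper: classify the remote server URL before you set the webhook.
# Assumption: any value that does not start with https:// counts as unsecure.
webhook_url_kind() {
  case "$1" in
    https://*) echo "secure" ;;
    *)         echo "unsecure" ;;
  esac
}

# Example:
# webhook_url_kind "https://logs.example.com"
```
{: codeblock}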

Filtering logs that are forwarded

{: #filter-logs}

You can choose which logs to forward to your external server by filtering out specific logs for a period of time. You can differentiate between the different filtering options by using flags. {: shortdesc}

Understanding the options for log filtering

Understanding log filtering options
`<cluster_name_or_ID>`	Required: The name or ID of the cluster that you want to filter logs for.
`<log_type>`	The type of logs that you want to apply the filter to. Currently `all`, `container`, and `host` are supported.
`<configs>`	Optional: A comma-separated list of your logging configuration IDs. If not provided, the filter is applied to all of the cluster logging configurations that are passed to the filter. You can view log configurations that match the filter by using the `--show-matching-configs` option.
`<kubernetes_namespace>`	Optional: The Kubernetes namespace that you want to forward logs from. This flag applies only when you are using log type `container`.
`<container_name>`	Optional: The name of the container from which you want to filter logs.
`<logging_level>`	Optional: Filters out logs that are at the specified level and less. Acceptable values in their canonical order are `fatal`, `error`, `warn/warning`, `info`, `debug`, and `trace`. For example, if you filter logs at the `info` level, `debug` and `trace` logs are also filtered. Note: You can use this flag only when log messages are in JSON format and contain a level field. To display your messages in JSON, append the `--json` flag to the command.
`<message>`	Optional: Filters out logs that contain a specified message that is written as a regular expression.
`<filter_ID>`	Optional: The ID of the log filter.
`--show-matching-configs`	Optional: Show the logging configurations that each filter applies to.
`--all`	Optional: Delete all of your log forwarding filters.
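
The "specified level and less" rule for the logging level can be made concrete with a short sketch. It walks the canonical order that is listed in the table and prints every level that a filter at a given level removes. The helper name `filtered_levels` is hypothetical, and the sketch models only the ordering, not the service's implementation.

```sh
# Canonical order from most to least severe, as listed in the table above.
LEVELS="fatal error warn info debug trace"

# Hypothetical helper: print the levels that a filter at the given level drops,
# that is, the given level and every level after it in the canonical order.
filtered_levels() {
  target="$1"
  [ "$target" = "warning" ] && target="warn"
  found=0
  out=""
  for lvl in $LEVELS; do
    [ "$lvl" = "$target" ] && found=1
    [ "$found" -eq 1 ] && out="$out $lvl"
  done
  # Trim the leading space before printing.
  echo "${out# }"
}

filtered_levels info
```
{: codeblock}

For `info`, the sketch prints `info debug trace`, which matches the example in the table.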

Create a logging filter.

ibmcloud ks logging-filter-create --cluster <cluster_name_or_ID> --type <log_type> --logging-configs <configs> --namespace <kubernetes_namespace> --container <container_name> --level <logging_level> --regex-message <message>

{: pre}

View the log filter that you created.

ibmcloud ks logging-filter-get --cluster <cluster_name_or_ID> --id <filter_ID> --show-matching-configs

{: pre}

Update the log filter that you created.

ibmcloud ks logging-filter-update --cluster <cluster_name_or_ID> --id <filter_ID> --type <log_type> --logging-configs <configs> --namespace <kubernetes_namespace> --container <container_name> --level <logging_level> --regex-message <message>

{: pre}

Delete a log filter that you created.

ibmcloud ks logging-filter-rm --cluster <cluster_name_or_ID> --id <filter_ID> [--all]

{: pre}
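
Before you create a filter with `--regex-message`, you might want to preview its effect on a sample log file. The following sketch drops every line that matches a pattern and prints the lines that remain. It uses `grep -E`, which only approximates whatever regular expression engine the service applies, and the helper name `preview_regex_filter` is hypothetical.

```sh
# Hypothetical preview: show which lines of a sample log would remain
# after a --regex-message filter drops the lines that match the pattern.
# grep -E only approximates the regular expression engine used by the service.
preview_regex_filter() {
  pattern="$1"
  file="$2"
  grep -Ev "$pattern" "$file"
}

# Example: see what is left after health-check lines are filtered out.
# preview_regex_filter 'healthz' sample.log
```
{: codeblock}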

Verifying, updating, and deleting log forwarding

{: #verifying-log-forwarding}

Verifying
You can verify that your configuration is set up correctly in 1 of 2 ways:

To list all of the logging configurations in a cluster:

ibmcloud ks logging-config-get --cluster <cluster_name_or_ID>

{: pre}

To list the logging configurations for one type of log source:

ibmcloud ks logging-config-get --cluster <cluster_name_or_ID> --logsource <source>

{: pre}

Updating
You can update a logging configuration that you already created:

ibmcloud ks logging-config-update --cluster <cluster_name_or_ID> --id <log_config_id> --namespace <namespace> --type <server_type> --syslog-protocol <protocol> --logsource <source> --hostname <hostname_or_ingestion_URL> --port <port> --space <cluster_space> --org <cluster_org> --app-containers <containers> --app-paths <paths_to_logs>

{: pre}

Deleting
You can stop forwarding logs for one or all of the logging configurations for a cluster:

To delete one logging configuration:

ibmcloud ks logging-config-rm --cluster <cluster_name_or_ID> --id <log_config_ID>

{: pre}

To delete all of the logging configurations:

ibmcloud ks logging-config-rm --cluster <cluster_name_or_ID> --all

{: pre}

Forwarding Kubernetes API audit logs to {{site.data.keyword.cloudaccesstrailfull_notm}}

{: #api_forward}

Kubernetes automatically audits any events that are passed through your Kubernetes API server. You can forward the events to {{site.data.keyword.cloudaccesstrailfull_notm}}. {: shortdesc}

For more information about Kubernetes audit logs, see the auditing topic in the Kubernetes documentation.

Currently, a default audit policy is used for all clusters with this logging configuration.
Currently, filters are not supported.
There can be only one kube-audit configuration per cluster, but you can forward logs to {{site.data.keyword.cloudaccesstrailshort}} and an external server by creating a logging configuration and a webhook.
You must have the Administrator {{site.data.keyword.cloud_notm}} IAM platform role for the cluster.

{{site.data.keyword.containerlong_notm}} is currently not configured to use {{site.data.keyword.at_full}}. To manage Kubernetes API audit logs, continue to use {{site.data.keyword.cloudaccesstrailfull_notm}} with LogAnalysis. {: note}

Before you begin

Verify permissions. If you specified a space when you created the cluster, then both the account owner and {{site.data.keyword.containerlong_notm}} key owner need Manager, Developer, or Auditor permissions in that space.
For the cluster that you want to collect API server audit logs from: Log in to your account. If applicable, target the appropriate resource group. Set the context for your cluster.

Forwarding logs

Create a logging configuration.

ibmcloud ks logging-config-create --cluster <cluster_name_or_ID> --logsource kube-audit --space <cluster_space> --org <cluster_org> --hostname <ingestion_URL> --type ibm

{: pre}

Example command and output:

ibmcloud ks logging-config-create --cluster myCluster --logsource kube-audit
Creating logging configuration for kube-audit logs in cluster myCluster...
OK
Id                                     Source      Namespace   Host                                   Port     Org    Space   Server Type   Protocol  Application Containers   Paths
14ca6a0c-5bc8-499a-b1bd-cedcf40ab850   kube-audit    -         ingest-au-syd.logging.bluemix.net✣    9091✣     -       -         ibm          -              -                  -

✣ Indicates the default endpoint for the {{site.data.keyword.loganalysisshort_notm}} service.

{: screen}

Understanding this command's components

`<cluster_name_or_ID>`	The name or ID of the cluster.
`<ingestion_URL>`	The endpoint where you want to forward logs. If you do not specify an [ingestion URL](/docs/services/CloudLogAnalysis?topic=cloudloganalysis-log_ingestion#log_ingestion_urls), the endpoint for the region in which you created your cluster is used.
`<cluster_space>`	Optional: The name of the Cloud Foundry space that you want to send logs to. When forwarding logs to {{site.data.keyword.loganalysisshort_notm}}, the space and org are specified in the ingestion point. If you do not specify a space, logs are sent to the account level.
`<cluster_org>`	The name of the Cloud Foundry org that the space is in. This value is required if you specified a space.
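
The dependency between the space and org values can be validated before you run the command. The sketch below encodes only the rule from the table (an org is required whenever a space is specified); the helper name `validate_space_org` is hypothetical.

```sh
# Hypothetical check: an org is required whenever a space is specified.
validate_space_org() {
  space="$1"
  org="$2"
  if [ -n "$space" ] && [ -z "$org" ]; then
    echo "error: --org is required when --space is set" >&2
    return 1
  fi
  return 0
}

# Example:
# validate_space_org "dev" "" || echo "fix the flags before you continue"
```
{: codeblock}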

View your cluster logging configuration to verify that it was implemented the way that you intended.

ibmcloud ks logging-config-get --cluster <cluster_name_or_ID>

{: pre}

Example command and output:

ibmcloud ks logging-config-get --cluster myCluster
Retrieving cluster myCluster logging configurations...
OK
Id                                     Source        Namespace   Host                                 Port    Org   Space   Server Type  Protocol  Application Containers   Paths
a550d2ba-6a02-4d4d-83ef-68f7a113325c   container     *           ingest-au-syd.logging.bluemix.net✣  9091✣   -     -         ibm           -          -              -
14ca6a0c-5bc8-499a-b1bd-cedcf40ab850   kube-audit    -           ingest-au-syd.logging.bluemix.net✣  9091✣   -     -         ibm           -          -              -

{: screen}

To view the Kubernetes API audit events that you forward:
Log in to your {{site.data.keyword.cloud_notm}} account.
From the catalog, provision an instance of the {{site.data.keyword.cloudaccesstrailshort}} service in the same account as your instance of {{site.data.keyword.containerlong_notm}}.
On the Manage tab of the {{site.data.keyword.cloudaccesstrailshort}} dashboard, select the account or space domain.
- Account logs: Cluster management events and Kubernetes API server audit events are available in the account domain for the {{site.data.keyword.cloud_notm}} region where the events are generated.
- Space logs: If you specified a space when you created your logging configuration, these events are available in the space domain that is associated with the Cloud Foundry space where the {{site.data.keyword.cloudaccesstrailshort}} service is provisioned.
Click View in Kibana.
Set the time frame that you want to view logs for. The default is 24 hours.
To narrow your search, you can click the edit icon for ActivityTracker_Account_Search_in_24h and add fields in the Available Fields column.

To let other users view account and space events, see Granting permissions to see account events. {: tip}

Collecting master logs in an {{site.data.keyword.cos_full_notm}} bucket

{: #collect_master}

With {{site.data.keyword.containerlong_notm}}, you can take a snapshot of your master logs at any point in time to collect in an {{site.data.keyword.cos_full_notm}} bucket. The snapshot includes anything that is sent through the API server, such as pod scheduling, deployments, or RBAC policies. {: shortdesc}

Because Kubernetes API Server logs are automatically streamed, they're also automatically deleted to make room for the new logs coming in. By keeping a snapshot of logs at a specific point in time, you can better troubleshoot issues, look into usage differences, and find patterns to help maintain more secure applications.

Before you begin

Provision an instance of {{site.data.keyword.cos_short}} from the {{site.data.keyword.cloud_notm}} catalog.
Ensure that you have the Administrator {{site.data.keyword.cloud_notm}} IAM platform role for the cluster.

Creating a snapshot

Create an Object Storage bucket through the {{site.data.keyword.cloud_notm}} console by following this getting started tutorial.
Generate HMAC service credentials in the bucket that you created.
In the Service Credentials tab of the {{site.data.keyword.cos_short}} dashboard, click New Credential.
Give the HMAC credentials the Writer service role.
In the Add Inline Configuration Parameters field, specify {"HMAC":true}.
Through the CLI, make a request for a snapshot of your master logs.

ibmcloud ks logging-collect --cluster <cluster_name_or_ID> --cos-bucket <COS_bucket_name> --cos-endpoint <location_of_COS_bucket> --hmac-key-id <HMAC_access_key_ID> --hmac-key <HMAC_access_key>

{: pre}

Understanding this command's components

`--cluster <cluster_name_or_ID>`	The name or ID of the cluster.
`--cos-bucket <COS_bucket_name>`	The name of the {{site.data.keyword.cos_short}} bucket that you want to store your logs in.
`--cos-endpoint <location_of_COS_bucket>`	The regional, cross regional, or single data center {{site.data.keyword.cos_short}} endpoint for the bucket that you are storing your logs in. For available endpoints, see [Endpoints and storage locations](/docs/services/cloud-object-storage/basics?topic=cloud-object-storage-endpoints) in the {{site.data.keyword.cos_short}} documentation.
`--hmac-key-id <HMAC_access_key_ID>`	The unique ID for your HMAC credentials for your {{site.data.keyword.cos_short}} instance.
`--hmac-key <HMAC_access_key>`	The HMAC key for your {{site.data.keyword.cos_short}} instance.

Example command and response:

ibmcloud ks logging-collect --cluster mycluster --cos-bucket mybucket --cos-endpoint s3-api.us-geo.objectstorage.softlayer.net --hmac-key-id e2e7f5c9fo0144563c418dlhi3545m86 --hmac-key c485b9b9fo4376722f692b63743e65e1705301ab051em96j
There is no specified log type. The default master will be used.
Submitting log collection request for master logs for cluster mycluster...
OK
The log collection request was successfully submitted. To view the status of the request run ibmcloud ks logging-collect-status mycluster.

{: screen}

Check the status of your request. The snapshot can take some time to complete, but you can check whether your request is completing successfully. The response includes the name of the file that contains your master logs, and you can use the {{site.data.keyword.cloud_notm}} console to download the file.

ibmcloud ks logging-collect-status --cluster <cluster_name_or_ID>

{: pre}

Example output:

ibmcloud ks logging-collect-status --cluster mycluster
Getting the status of the last log collection request for cluster mycluster...
OK
State     Start Time             Error   Log URLs
success   2018-09-18 16:49 PDT   - s3-api.us-geo.objectstorage.softlayer.net/mybucket/master-0-0862ae70a9ae6c19845ba3pc0a2a6o56-1297318756.tgz
s3-api.us-geo.objectstorage.softlayer.net/mybucket/master-1-0862ae70a9ae6c19845ba3pc0a2a6o56-1297318756.tgz
s3-api.us-geo.objectstorage.softlayer.net/mybucket/master-2-0862ae70a9ae6c19845ba3pc0a2a6o56-1297318756.tgz

{: screen}
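The status check can also be scripted. The following is a minimal sketch, assuming that the `ibmcloud ks logging-collect-status` output keeps the layout shown in the example output: a `State` header row followed by the data row, and log URLs that end in `.tgz`. The function names and the retry defaults are hypothetical. Only the `success` state from the example is treated as complete, because other state values are not listed in this topic.

```shell
# Print the value of the State column: the first field of the row that
# follows the "State ..." header in the status output read from stdin.
collect_state() {
  awk 'found { print $1; exit } /^State[[:space:]]/ { found = 1 }'
}

# Print every snapshot archive location (tokens that end in .tgz), one per line.
collect_urls() {
  grep -Eo '[^[:space:]]+\.tgz' || true
}

# Poll until the state is "success" or the attempts run out.
# Usage: wait_for_snapshot <cluster_name_or_ID> [attempts] [delay_seconds]
# The defaults (20 attempts, 30 seconds apart) are illustrative assumptions.
wait_for_snapshot() {
  local cluster="$1" attempts="${2:-20}" delay="${3:-30}" out i=0
  while [ "$i" -lt "$attempts" ]; do
    out="$(ibmcloud ks logging-collect-status --cluster "$cluster")" || return 1
    if [ "$(printf '%s\n' "$out" | collect_state)" = "success" ]; then
      printf '%s\n' "$out" | collect_urls
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for_snapshot mycluster` would print the archive locations after the request reports `success`, and return a nonzero exit code if the request does not complete within the allowed attempts.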

Choosing a monitoring solution

{: #view_metrics}

Metrics help you monitor the health and performance of your clusters. You can use the standard Kubernetes and container runtime features to monitor the health of your clusters and apps. Note: Monitoring is supported only for standard clusters. {:shortdesc}

Does IBM monitor my cluster?

Every Kubernetes master is continuously monitored by IBM. {{site.data.keyword.containerlong_notm}} automatically scans every node where the Kubernetes master is deployed for vulnerabilities that are found in Kubernetes and OS-specific security fixes. If vulnerabilities are found, {{site.data.keyword.containerlong_notm}} automatically applies fixes and resolves vulnerabilities on behalf of the user to ensure master node protection. You are responsible for monitoring and analyzing the logs for the rest of your cluster components.

To avoid conflicts when using metrics services, be sure that clusters across resource groups and regions have unique names. {: tip}

{{site.data.keyword.mon_full_notm}}

Gain operational visibility into the performance and health of your apps by deploying Sysdig as a third-party service to your worker nodes to forward metrics to {{site.data.keyword.monitoringlong}}. For more information, see [Analyzing metrics for an app that is deployed in a Kubernetes cluster](/docs/services/Monitoring-with-Sysdig/tutorials?topic=Sysdig-kubernetes_cluster#kubernetes_cluster).

Kubernetes dashboard

The Kubernetes dashboard is an administrative web interface where you can review the health of your worker nodes, find Kubernetes resources, deploy containerized apps, and troubleshoot apps with logging and monitoring information. For more information about how to access your Kubernetes dashboard, see [Launching the Kubernetes dashboard for {{site.data.keyword.containerlong_notm}}](/docs/containers?topic=containers-app#cli_dashboard).

Deprecated: Metrics dashboard in cluster overview page of {{site.data.keyword.cloud_notm}} console and output of ibmcloud ks cluster-get

{{site.data.keyword.containerlong_notm}} provides information about the health and capacity of your cluster and the usage of your cluster resources. You can use this console to scale out your cluster, work with your persistent storage, and add more capabilities to your cluster through {{site.data.keyword.cloud_notm}} service binding. To view metrics, go to the **Kubernetes** > **Clusters** dashboard, select a cluster, and click the **Metrics** link.

The link to the metrics dashboard in the cluster overview page of the {{site.data.keyword.cloud_notm}} console and in the output of `ibmcloud ks cluster-get` is deprecated. Clusters that are created after 03 May 2019 are not created with the metrics dashboard link. Clusters that are created on or before 03 May 2019 continue to have the link to the metrics dashboard.

{{site.data.keyword.monitoringlong_notm}}

Metrics for standard clusters are located in the {{site.data.keyword.cloud_notm}} account that was logged in to when the Kubernetes cluster was created. If you specified an {{site.data.keyword.cloud_notm}} space when you created the cluster, then metrics are located in that space. Container metrics are collected automatically for all containers that are deployed in a cluster. These metrics are sent and are made available through Grafana. For more information about metrics, see [Monitoring for the {{site.data.keyword.containerlong_notm}}](/docs/services/cloud-monitoring/containers?topic=cloud-monitoring-monitoring_bmx_containers_ov#monitoring_bmx_containers_ov).

To access the Grafana dashboard, go to one of the following URLs and select the {{site.data.keyword.cloud_notm}} account or space where you created the cluster.

IP addresses to open for monitoring traffic

{{site.data.keyword.containerlong_notm}} region	Monitoring address	Monitoring subnets
EU Central	`metrics.eu-de.bluemix.net`	`158.177.65.80/30`
UK South	`metrics.eu-gb.bluemix.net`	`169.50.196.136/29`
US East, US South, AP North, AP South	`metrics.ng.bluemix.net`	`169.47.204.128/29`

Other health monitoring tools

{: #health_tools}

You can configure other tools for more monitoring capabilities.

Prometheus: Prometheus is an open source monitoring, logging, and alerting tool that was designed for Kubernetes. The tool retrieves detailed information about the cluster, worker nodes, and deployment health based on the Kubernetes logging information. For more information about the setup, see the [CoreOS instructions ![External link icon](../icons/launch-glyph.svg "External link icon")](https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus).

Viewing cluster states

{: #states}

Review the state of a Kubernetes cluster to get information about the availability and capacity of the cluster, and potential problems that might occur. {:shortdesc}

To view information about a specific cluster, such as its zones, service endpoint URLs, Ingress subdomain, version, and owner, use the ibmcloud ks cluster-get --cluster <cluster_name_or_ID> command. Include the --showResources flag to view more cluster resources such as add-ons for storage pods or subnet VLANs for public and private IPs.

You can review information about the overall cluster, the IBM-managed master, and your worker nodes. To troubleshoot your cluster and worker nodes, see Troubleshooting clusters.

Cluster states

{: #states_cluster}

You can view the current cluster state by running the ibmcloud ks clusters command and locating the State field. {: shortdesc}

Cluster states

Cluster state	Description
`Aborted`	The deletion of the cluster is requested by the user before the Kubernetes master is deployed. After the deletion of the cluster is completed, the cluster is removed from your dashboard. If your cluster is stuck in this state for a long time, open an [{{site.data.keyword.cloud_notm}} support case](/docs/containers?topic=containers-cs_troubleshoot#ts_getting_help).
`Critical`	The Kubernetes master cannot be reached or all worker nodes in the cluster are down.
`Delete failed`	The Kubernetes master or at least one worker node cannot be deleted.
`Deleted`	The cluster is deleted but not yet removed from your dashboard. If your cluster is stuck in this state for a long time, open an [{{site.data.keyword.cloud_notm}} support case](/docs/containers?topic=containers-cs_troubleshoot#ts_getting_help).
`Deleting`	The cluster is being deleted and cluster infrastructure is being dismantled. You cannot access the cluster.
`Deploy failed`	The deployment of the Kubernetes master could not be completed. You cannot resolve this state. Contact IBM Cloud support by opening an [{{site.data.keyword.cloud_notm}} support case](/docs/containers?topic=containers-cs_troubleshoot#ts_getting_help).
`Deploying`	The Kubernetes master is not fully deployed yet. You cannot access your cluster. Wait until your cluster is fully deployed to review the health of your cluster.
`Normal`	All worker nodes in a cluster are up and running. You can access the cluster and deploy apps to the cluster. This state is considered healthy and does not require an action from you. Although the worker nodes might be normal, other infrastructure resources, such as [networking](/docs/containers?topic=containers-cs_troubleshoot_network) and [storage](/docs/containers?topic=containers-cs_troubleshoot_storage), might still need attention. If you just created the cluster, some parts of the cluster that are used by other services such as Ingress secrets or registry image pull secrets, might still be in process.
`Pending`	The Kubernetes master is deployed. The worker nodes are being provisioned and are not available in the cluster yet. You can access the cluster, but you cannot deploy apps to the cluster.
`Requested`	A request to create the cluster and order the infrastructure for the Kubernetes master and worker nodes is sent. When the deployment of the cluster starts, the cluster state changes to `Deploying`. If your cluster is stuck in the `Requested` state for a long time, open an [{{site.data.keyword.cloud_notm}} support case](/docs/containers?topic=containers-cs_troubleshoot#ts_getting_help).
`Updating`	The Kubernetes API server that runs in your Kubernetes master is being updated to a new Kubernetes API version. During the update, you cannot access or change the cluster. Worker nodes, apps, and resources that the user deployed are not modified and continue to run. Wait for the update to complete to review the health of your cluster.
`Unsupported`	The [Kubernetes version](/docs/containers?topic=containers-cs_versions#cs_versions) that the cluster runs is no longer supported. Your cluster's health is no longer actively monitored or reported. Additionally, you cannot add or reload worker nodes. To continue receiving important security updates and support, you must update your cluster. Review the [version update preparation actions](/docs/containers?topic=containers-cs_versions#prep-up), then [update your cluster](/docs/containers?topic=containers-update#update) to a supported Kubernetes version. Clusters that are three or more versions behind the oldest supported version cannot be updated. To avoid this situation, you can update the cluster to a Kubernetes version less than three ahead of the current version, such as 1.12 to 1.14. Further, if your cluster runs version 1.5, 1.7, or 1.8, then the version is too far behind to update. Instead, you must [create a cluster](/docs/containers?topic=containers-clusters#clusters) and [deploy your apps](/docs/containers?topic=containers-app#app) to the cluster.
`Warning`	At least one worker node in the cluster is not available, but other worker nodes are available and can take over the workload.
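If you check cluster states from a script, it can help to reduce the states in the table to a few coarse categories. The following sketch shows one possible grouping. The function name and the category labels (`healthy`, `in-progress`, `attention`, `failed`) are hypothetical and are not part of the `ibmcloud` CLI. Only the state names come from the table.

```shell
# Map a cluster state, as listed in the cluster states table, to a coarse
# category. The grouping is an illustrative assumption, not CLI behavior.
cluster_state_category() {
  case "$1" in
    Normal)
      echo healthy ;;
    Requested|Deploying|Pending|Updating|Deleting|Aborted|Deleted)
      # Transitional states: wait, and open a support case only if stuck.
      echo in-progress ;;
    Warning|Critical|Unsupported)
      # States where worker nodes or the cluster version need a review.
      echo attention ;;
    "Deploy failed"|"Delete failed")
      echo failed ;;
    *)
      echo unknown ;;
  esac
}
```

For example, `cluster_state_category "Deploy failed"` prints `failed`, which a script might use as the trigger to open a support case.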

Master states

{: #states_master}

Your {{site.data.keyword.containerlong_notm}} includes an IBM-managed master with highly available replicas, automatic security patch updates applied for you, and automation in place to recover in case of an incident. You can check the health, status, and state of the cluster master by running ibmcloud ks cluster-get --cluster <cluster_name_or_ID>. {: shortdesc}

Master Health
The Master Health reflects the state of master components and notifies you if something needs your attention. The health might be one of the following:

error: The master is not operational. IBM is automatically notified and takes action to resolve this issue. You can continue monitoring the health until the master is normal.
normal: The master is operational and healthy. No action is required.
unavailable: The master might not be accessible, which means some actions such as resizing a worker pool are temporarily unavailable. IBM is automatically notified and takes action to resolve this issue. You can continue monitoring the health until the master is normal.
unsupported: The master runs an unsupported version of Kubernetes. You must update your cluster to return the master to normal health.

Master Status and State
The Master Status provides details of what operation from the master state is in progress. The status includes a timestamp of how long the master has been in the same state, such as Ready (1 month ago). The Master State reflects the lifecycle of possible operations that can be performed on the master, such as deploying, updating, and deleting. Each state is described in the following table.

Master states

Master state	Description
`deployed`	The master is successfully deployed. Check the status to verify that the master is `Ready` or to see if an update is available.
`deploying`	The master is currently deploying. Wait for the state to become `deployed` before working with your cluster, such as adding worker nodes.
`deploy_failed`	The master failed to deploy. IBM Support is notified and works to resolve the issue. Check the Master Status field for more information, or wait for the state to become `deployed`.
`deleting`	The master is currently deleting because you deleted the cluster. You cannot undo a deletion. After the cluster is deleted, you can no longer check the master state because the cluster is completely removed.
`delete_failed`	The master failed to delete. IBM Support is notified and works to resolve the issue. You cannot resolve the issue by trying to delete the cluster again. Instead, check the Master Status field for more information, or wait for the cluster to delete.
`updating`	The master is updating its Kubernetes version. The update might be a patch update that is automatically applied, or a minor or major version that you applied by updating the cluster. During the update, your highly available master can continue processing requests, and your app workloads and worker nodes continue to run. After the master update is complete, you can [update your worker nodes](/docs/containers?topic=containers-update#worker_node). If the update is unsuccessful, the master returns to a `deployed` state and continues running the previous version. IBM Support is notified and works to resolve the issue. You can check if the update failed in the Master Status field.

Worker node states

{: #states_workers}

You can view the current worker node state by running the ibmcloud ks workers --cluster <cluster_name_or_ID> command and locating the State and Status fields. {: shortdesc}

Worker node states

Worker node state	Description
`Critical`	A worker node can go into a Critical state for many reasons: You initiated a reboot for your worker node without cordoning and draining your worker node. Rebooting a worker node can cause data corruption in `containerd`, `kubelet`, `kube-proxy`, and `calico`. The pods that are deployed to your worker node do not use resource limits for [memory ![External link icon](../icons/launch-glyph.svg "External link icon")](https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/) and [CPU ![External link icon](../icons/launch-glyph.svg "External link icon")](https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/). Without resource limits, pods can consume all available resources, leaving no resources for other pods to run on this worker node. This overcommitment of workload causes the worker node to fail. `containerd`, `kubelet`, or `calico` went into an unrecoverable state after it ran hundreds or thousands of containers over time. You set up a Virtual Router Appliance for your worker node that went down and cut off the communication between your worker node and the Kubernetes master. Current networking issues in {{site.data.keyword.containerlong_notm}} or IBM Cloud infrastructure (SoftLayer) cause the communication between your worker node and the Kubernetes master to fail. Your worker node ran out of capacity. Check the Status of the worker node to see whether it shows Out of disk or Out of memory. If your worker node is out of capacity, consider either reducing the workload on your worker node or adding a worker node to your cluster to help load balance the workload. The device was powered off from the [{{site.data.keyword.cloud_notm}} console resource list ![External link icon](../icons/launch-glyph.svg "External link icon")](https://cloud.ibm.com/resources). Open the resource list and find your worker node ID in the Devices list. In the action menu, click Power On. 
In many cases, [reloading](/docs/containers?topic=containers-cli-plugin-kubernetes-service-cli#cs_worker_reload) your worker node can solve the problem. When you reload your worker node, the latest [patch version](/docs/containers?topic=containers-cs_versions#version_types) is applied to your worker node. The major and minor version is not changed. Before you reload your worker node, make sure to cordon and drain your worker node to ensure that the existing pods are terminated gracefully and rescheduled onto remaining worker nodes. If reloading the worker node does not resolve the issue, go to the next step to continue troubleshooting your worker node. Tip: You can [configure health checks for your worker node and enable Autorecovery](/docs/containers?topic=containers-health#autorecovery). If Autorecovery detects an unhealthy worker node based on the configured checks, Autorecovery triggers a corrective action like an OS reload on the worker node. For more information about how Autorecovery works, see the [Autorecovery blog ![External link icon](../icons/launch-glyph.svg "External link icon")](https://www.ibm.com/blogs/bluemix/2017/12/autorecovery-utilizes-consistent-hashing-high-availability/).
`Deployed`	Updates are successfully deployed to your worker node. After updates are deployed, {{site.data.keyword.containerlong_notm}} starts a health check on the worker node. After the health check is successful, the worker node goes into a `Normal` state. Worker nodes in a `Deployed` state usually are ready to receive workloads, which you can check by running `kubectl get nodes` and confirming that the status shows `Ready`.
`Deploying`	When you update the Kubernetes version of your worker node, your worker node is redeployed to install the updates. If you reload or reboot your worker node, the worker node is redeployed to automatically install the latest patch version. If your worker node is stuck in this state for a long time, continue with the next step to see whether a problem occurred during the deployment.
`Normal`	Your worker node is fully provisioned and ready to be used in the cluster. This state is considered healthy and does not require an action from the user. Note: Although the worker nodes might be normal, other infrastructure resources, such as [networking](/docs/containers?topic=containers-cs_troubleshoot_network) and [storage](/docs/containers?topic=containers-cs_troubleshoot_storage), might still need attention.
`Provisioning`	Your worker node is being provisioned and is not available in the cluster yet. You can monitor the provisioning process in the Status column of your CLI output. If your worker node is stuck in this state for a long time, continue with the next step to see whether a problem occurred during the provisioning.
`Provision_failed`	Your worker node could not be provisioned. Continue with the next step to find the details for the failure.
`Reloading`	Your worker node is being reloaded and is not available in the cluster. You can monitor the reloading process in the Status column of your CLI output. If your worker node is stuck in this state for a long time, continue with the next step to see whether a problem occurred during the reloading.
`Reloading_failed`	Your worker node could not be reloaded. Continue with the next step to find the details for the failure.
`Reload_pending`	A request to reload or to update the Kubernetes version of your worker node is sent. When the worker node is being reloaded, the state changes to `Reloading`.
`Unknown`	The Kubernetes master is not reachable for one of the following reasons: You requested an update of your Kubernetes master. The state of the worker node cannot be retrieved during the update. If the worker node remains in this state for an extended period of time even after the Kubernetes master is successfully updated, try to [reload](/docs/containers?topic=containers-cli-plugin-kubernetes-service-cli#cs_worker_reload) the worker node. You might have another firewall that is protecting your worker nodes, or changed firewall settings recently. {{site.data.keyword.containerlong_notm}} requires certain IP addresses and ports to be opened to allow communication from the worker node to the Kubernetes master and vice versa. For more information, see [Firewall prevents worker nodes from connecting](/docs/containers?topic=containers-cs_troubleshoot_clusters#cs_firewall). The Kubernetes master is down. Contact {{site.data.keyword.cloud_notm}} support by opening an [{{site.data.keyword.cloud_notm}} support case](/docs/containers?topic=containers-cs_troubleshoot#ts_getting_help).
`Warning`	Your worker node is reaching the limit for memory or disk space. You can either reduce the workload on your worker node or add a worker node to your cluster to help load balance the workload.

Configuring health monitoring for worker nodes with Autorecovery

{: #autorecovery}

The Autorecovery system uses various checks to query worker node health status. If Autorecovery detects an unhealthy worker node based on the configured checks, Autorecovery triggers a corrective action like an OS reload on the worker node. Only one worker node undergoes a corrective action at a time. The worker node must successfully complete the corrective action before any other worker node undergoes a corrective action. For more information, see this Autorecovery blog post. {: shortdesc}

Autorecovery requires at least one healthy node to function properly. Configure Autorecovery with active checks only in clusters with two or more worker nodes. {: note}

Before you begin:

Ensure that you have the following {{site.data.keyword.cloud_notm}} IAM roles:
- Administrator platform role for the cluster
- Writer or Manager service role for the kube-system namespace
Log in to your account. If applicable, target the appropriate resource group. Set the context for your cluster.

To configure Autorecovery:

Follow the instructions to install the Helm client on your local machine, install the Helm server (tiller) with a service account, and add the {{site.data.keyword.cloud_notm}} Helm repository.

Verify that tiller is installed with a service account.

kubectl get serviceaccount -n kube-system | grep tiller

{: pre}

Example output:

NAME                                 SECRETS   AGE
tiller                               1         2m

{: screen}

Create a configuration map file that defines your checks in JSON format. For example, the following YAML file defines three checks: an HTTP check and two Kubernetes API server checks. Refer to the tables following the example YAML file for information about the three kinds of checks and information about the individual components of the checks.

**Tip:** Define each check as a unique key in the `data` section of the configuration map.

kind: ConfigMap
apiVersion: v1
metadata:
  name: ibm-worker-recovery-checks
  namespace: kube-system
data:
  checknode.json: |
    {
      "Check":"KUBEAPI",
      "Resource":"NODE",
      "FailureThreshold":3,
      "CorrectiveAction":"RELOAD",
      "CooloffSeconds":1800,
      "IntervalSeconds":180,
      "TimeoutSeconds":10,
      "Enabled":true
    }
  checkpod.json: |
    {
      "Check":"KUBEAPI",
      "Resource":"POD",
      "PodFailureThresholdPercent":50,
      "FailureThreshold":3,
      "CorrectiveAction":"RELOAD",
      "CooloffSeconds":1800,
      "IntervalSeconds":180,
      "TimeoutSeconds":10,
      "Enabled":true
    }
  checkhttp.json: |
    {
      "Check":"HTTP",
      "FailureThreshold":3,
      "CorrectiveAction":"REBOOT",
      "CooloffSeconds":1800,
      "IntervalSeconds":180,
      "TimeoutSeconds":10,
      "Port":80,
      "ExpectedStatus":200,
      "Route":"/myhealth",
      "Enabled":false
    }

{:codeblock}

Understanding the configmap components

`name`	The configuration name `ibm-worker-recovery-checks` is a constant and cannot be changed.
`namespace`	The `kube-system` namespace is a constant and cannot be changed.
`checknode.json`	Defines a Kubernetes API node check that checks whether each worker node is in the `Ready` state. The check for a specific worker node counts as a failure if the worker node is not in the `Ready` state. The check in the example YAML runs every 3 minutes. If it fails three consecutive times, the worker node is reloaded. This action is equivalent to running `ibmcloud ks worker-reload`. The node check is enabled until you set the Enabled field to `false` or remove the check.
`checkpod.json`	Defines a Kubernetes API pod check that checks the total percentage of `NotReady` pods on a worker node based on the total pods that are assigned to that worker node. The check for a specific worker node counts as a failure if the total percentage of `NotReady` pods is greater than the defined `PodFailureThresholdPercent`. The check in the example YAML runs every 3 minutes. If it fails three consecutive times, the worker node is reloaded. This action is equivalent to running `ibmcloud ks worker-reload`. For example, the default `PodFailureThresholdPercent` is 50%. If the percentage of `NotReady` pods is greater than 50% three consecutive times, the worker node is reloaded. By default, pods in all namespaces are checked. To restrict the check to only pods in a specified namespace, add the `Namespace` field to the check. The pod check is enabled until you set the Enabled field to `false` or remove the check.
`checkhttp.json`	Defines an HTTP check that checks if an HTTP server that runs on your worker node is healthy. To use this check, you must deploy an HTTP server on every worker node in your cluster by using a [daemon set ![External link icon](../icons/launch-glyph.svg "External link icon")](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/). You must implement a health check that is available at the `/myhealth` path and that can verify whether your HTTP server is healthy. You can define other paths by changing the `Route` parameter. If the HTTP server is healthy, you must return the HTTP response code that is defined in `ExpectedStatus`. The HTTP server must be configured to listen on the private IP address of the worker node. You can find the private IP address by running `kubectl get nodes`. For example, consider two nodes in a cluster that have the private IP addresses 10.10.10.1 and 10.10.10.2. In this example, two routes are checked for a 200 HTTP response: `http://10.10.10.1:80/myhealth` and `http://10.10.10.2:80/myhealth`. The check in the example YAML runs every 3 minutes. If it fails three consecutive times, the worker node is rebooted. This action is equivalent to running `ibmcloud ks worker-reboot`. The HTTP check is disabled until you set the Enabled field to `true`.
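Before you enable the HTTP check, you might want to try your health endpoint by hand. The following sketch builds the URL from the worker node's private IP address, the `Port`, and the `Route`, and then compares the response code with `ExpectedStatus`. The function names are hypothetical, and the probe illustrates the behavior described in the table rather than Autorecovery's own implementation. It assumes that `curl` is installed and that you run it from a machine that can reach the private IP address.

```shell
# Build the URL that an HTTP check requests for one worker node.
# Usage: health_check_url <private_ip> <port> <route>
health_check_url() {
  printf 'http://%s:%s%s\n' "$1" "$2" "$3"
}

# Request the health URL and succeed only if the HTTP status code equals
# the expected status. The timeout plays the role of TimeoutSeconds.
# Usage: probe_health <private_ip> <port> <route> <expected_status> <timeout_seconds>
probe_health() {
  local url status
  url="$(health_check_url "$1" "$2" "$3")"
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
  status="$(curl -s -o /dev/null -w '%{http_code}' --max-time "$5" "$url")"
  [ "$status" = "$4" ]
}
```

With the values from the example YAML, `probe_health 10.10.10.1 80 /myhealth 200 10` would succeed only if the server on that worker node answers with a 200 response within 10 seconds.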

Understanding the individual components of checks

`Check`	Enter the type of check that you want Autorecovery to use. `HTTP`: Autorecovery calls HTTP servers that run on each node to determine whether the nodes are running properly. `KUBEAPI`: Autorecovery calls the Kubernetes API server and reads the health status data reported by the worker nodes.
`Resource`	When the check type is `KUBEAPI`, enter the type of resource that you want Autorecovery to check. Accepted values are `NODE` or `POD`.
`FailureThreshold`	Enter the threshold for the number of consecutive failed checks. When this threshold is met, Autorecovery triggers the specified corrective action. For example, if the value is 3 and Autorecovery fails a configured check three consecutive times, Autorecovery triggers the corrective action that is associated with the check.
`PodFailureThresholdPercent`	When the resource type is `POD`, enter the threshold for the percentage of pods on a worker node that can be in a [`NotReady` ![External link icon](../icons/launch-glyph.svg "External link icon")](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes) state. This percentage is based on the total number of pods that are scheduled to a worker node. When a check determines that the percentage of unhealthy pods is greater than the threshold, the check counts as one failure.
`CorrectiveAction`	Enter the action to run when the failure threshold is met. A corrective action runs only while no other workers are being repaired and when this worker node is not in a cool-off period from a previous action. `REBOOT`: Reboots the worker node. `RELOAD`: Reloads all of the necessary configurations for the worker node from a clean OS.
`CooloffSeconds`	Enter the number of seconds Autorecovery must wait to issue another corrective action for a node that was already issued a corrective action. The cool off period starts at the time a corrective action is issued.
`IntervalSeconds`	Enter the number of seconds in between consecutive checks. For example, if the value is 180, Autorecovery runs the check on each node every 3 minutes.
`TimeoutSeconds`	Enter the maximum number of seconds that a check call to the database takes before Autorecovery terminates the call operation. The value for `TimeoutSeconds` must be less than the value for `IntervalSeconds`.
`Port`	When the check type is `HTTP`, enter the port that the HTTP server must bind to on the worker nodes. This port must be exposed on the IP of every worker node in the cluster. Autorecovery requires a constant port number across all nodes for checking servers. Use [daemon sets ![External link icon](../icons/launch-glyph.svg "External link icon")](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) when you deploy a custom server into a cluster.
`ExpectedStatus`	When the check type is `HTTP`, enter the HTTP server status that you expect to be returned from the check. For example, a value of 200 indicates that you expect an `OK` response from the server.
`Route`	When the check type is `HTTP`, enter the path that is requested from the HTTP server. This value is typically the metrics path for the server that is running on all of the worker nodes.
`Enabled`	Enter `true` to enable the check or `false` to disable the check.
`Namespace`	Optional: To restrict `checkpod.json` to checking only pods in one namespace, add the `Namespace` field and enter the namespace.
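
The rules in the table can be sanity-checked before you apply the configuration map. The following Python code is a minimal sketch, not part of Autorecovery: `validate_check` and `pod_check_fails` are hypothetical helper names, the sample values are illustrative only, and the 0 to 100 range for `PodFailureThresholdPercent` is an assumption that follows from the value being a percentage.

```python
import json


def validate_check(check):
    """Return a list of problems found in one check definition."""
    problems = []
    interval = check.get("IntervalSeconds")
    timeout = check.get("TimeoutSeconds")
    # The table states that TimeoutSeconds must be less than IntervalSeconds.
    if interval is not None and timeout is not None and timeout >= interval:
        problems.append("TimeoutSeconds must be less than IntervalSeconds")
    # The table lists REBOOT and RELOAD as the corrective actions.
    if check.get("CorrectiveAction") not in ("REBOOT", "RELOAD"):
        problems.append("CorrectiveAction must be REBOOT or RELOAD")
    # Assumption: a percentage threshold is expected to fall within 0 to 100.
    threshold = check.get("PodFailureThresholdPercent")
    if threshold is not None and not 0 <= threshold <= 100:
        problems.append("PodFailureThresholdPercent must be between 0 and 100")
    return problems


def pod_check_fails(unhealthy_pods, scheduled_pods, threshold_percent):
    """Return True when the share of unhealthy pods is greater than the threshold."""
    if scheduled_pods == 0:
        # Assumption: a node with no scheduled pods does not count as a failure.
        return False
    return 100 * unhealthy_pods / scheduled_pods > threshold_percent


# Illustrative values only; they are not recommended settings.
sample = json.loads("""
{
  "PodFailureThresholdPercent": 50,
  "CorrectiveAction": "RELOAD",
  "CooloffSeconds": 1800,
  "IntervalSeconds": 180,
  "TimeoutSeconds": 10,
  "Enabled": true
}
""")

print(validate_check(sample))
print(pod_check_fails(6, 10, sample["PodFailureThresholdPercent"]))
```
{: codeblock}

In this sketch, a check counts as one failure only when the percentage of unhealthy pods is strictly greater than the threshold, which mirrors the wording for `PodFailureThresholdPercent`.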

Create the configuration map in your cluster.
```
kubectl apply -f ibm-worker-recovery-checks.yaml
```
{: pre}
Verify that you created the configuration map with the name `ibm-worker-recovery-checks` in the `kube-system` namespace with the proper checks.
```
kubectl -n kube-system get cm ibm-worker-recovery-checks -o yaml
```
{: pre}

Deploy Autorecovery into your cluster by installing the `ibm-worker-recovery` Helm chart.
```
helm install --name ibm-worker-recovery iks-charts/ibm-worker-recovery --namespace kube-system
```
{: pre}

After a few minutes, you can check the `Events` section in the output of the following command to see activity on the Autorecovery deployment.
```
kubectl -n kube-system describe deployment ibm-worker-recovery
```
{: pre}
If you do not see activity on the Autorecovery deployment, you can check the Helm deployment by running the tests that are included in the Autorecovery chart definition.
```
helm test ibm-worker-recovery
```
{: pre}
