Kubernetes Monitoring Tools

Discover how to monitor Kubernetes clusters with tools such as Prometheus and Grafana, and how they can help you manage the intricacies of your containerized infrastructure effectively.

Managing Kubernetes clusters comes with its own unique set of challenges. Observability tools such as Prometheus make these complexities easier to handle. They provide insight into system health and status, which makes Kubernetes easier to use and operate. They can help improve the availability of Kubernetes clusters by alerting cluster administrators when the system encounters problems, and they provide valuable data to help administrators troubleshoot issues promptly.

Monitoring Kubernetes clusters is a challenging task, and several components must be considered. To monitor the health of a Kubernetes cluster, cluster administrators must watch various metrics of containers, nodes, services, and clusters. Kubernetes does not include a monitoring tool, but the de facto standard is Prometheus.

This article will introduce essential metrics to monitor in a Kubernetes cluster and show example metrics in Prometheus. We will also review several ideas to consider when deciding on a Kubernetes monitoring tool.

Important Considerations for Selecting a Kubernetes Monitoring Tool

Consideration	Description
Three Pillars of Observability	The three pillars of observability are metric monitoring, data tracing, and log analysis.
Self-Hosted vs. SaaS-Based Products	Monitoring tools fall into two broad categories: SaaS and self-managed. Examples of SaaS-based monitoring tools include New Relic, Datadog, SolarWinds, Dynatrace, and Kubecost. Examples of self-hosted monitoring tools include Prometheus, kubernetes-dashboard, metrics-server, Jaeger, and Sloop.
Essential Metrics to Monitor in a Kubernetes Cluster	The tool selected needs to have the ability to monitor several types of essential metrics, including application metrics, cluster metrics, and load balancer metrics.
Alerts and Notifications	A monitoring system should be able to send alerts when the system is near or in an error state.
Ease of Management	Several monitoring tools require extensive effort to install and configure, so ease of management is important.
Existing Monitoring Solutions	Special consideration should be made for any existing monitoring systems.
Integrations with Other Tools	The monitoring tools should integrate with other tools in use at the organization.
System Availability	Monitoring systems should be fault-tolerant.
Support	It’s essential to consider the source of support for the system.

Three Pillars of Observability

Kubernetes is not an all-inclusive product, so users can integrate the monitoring solutions of their choice. Regardless of the tool chosen for Kubernetes monitoring, the system should provide solutions for all three pillars of observability: metric monitoring, data tracing, and log analysis.

All three pillars are essential for managing Kubernetes clusters, and complete visibility requires implementing each of them.

Metric Monitoring

Monitoring systems collect, measure, and visualize system performance data to generate insights about the health of systems. Thresholds and alerts notify engineering teams about potential issues with applications and the underlying infrastructure.

Data Tracing

Data tracing systems show how multiple services connect and how data flows between them. This data helps engineering teams detect and track issues and solve problems at the root level.

Log Analysis

Log analysis systems collect log data from disparate systems into a single location, where it can be searched, analyzed, and visualized in real time.

Self-Hosted vs. SaaS-Based Products

Kubernetes monitoring tools fall into two broad categories: those based on software as a service (SaaS) and those that are self-managed. SaaS-based tools are managed by a vendor, whereas self-hosted tools require hands-on setup and maintenance.

SaaS-Based Monitoring Tools

SaaS Kubernetes monitoring tools provide monitoring services without the need to manage the monitoring system infrastructure. With most SaaS services, you subscribe to the monitoring service rather than purchase it. The vendor is responsible for system reliability and guarantees the monitoring service via SLAs, freeing the system administrator from needing to deal with these issues.

The potential downside to SaaS is that most systems monitored by the service must be exposed to the SaaS service via the internet. This exposure can be a security compliance issue for backend servers that are not generally exposed to the internet.

SaaS-based monitoring tools include New Relic, Datadog, SolarWinds, Dynatrace, and Kubecost.

Self-Hosted Monitoring Tools

These are traditional IT systems installed on the corporate intranet that are either purchased or open-source and managed by internal IT staff. The benefit of self-managed tools is that these products can be customized and configured to meet specific organizational needs. Most of these tools are open source, so they have little or no licensing cost.

One of the drawbacks of self-hosted systems is that they often take extensive setup and configuration to be highly available (fault tolerant). It is essential that monitoring systems be highly available and deployed in a different data center from the one hosting the systems they monitor. This separation prevents the scenario where a data center outage takes down the application infrastructure and the monitoring/alerting system simultaneously.

Self-hosted monitoring tools include Prometheus, kubernetes-dashboard, metrics-server, Jaeger, and Sloop. In addition, Lens is a popular open-source desktop application for managing Kubernetes clusters. Lens includes several Kubernetes observability features, such as increased visibility, real-time statistics, log streams, and hands-on troubleshooting capabilities.

Essential Metrics to Monitor in a Kubernetes Cluster

The Kubernetes ecosystem is increasing in breadth, and many tools and services are already available for it. However, the de facto standard monitoring system for Kubernetes is Prometheus. Several essential metrics that need to be tracked are listed below. We have also included example Prometheus metric information and the corresponding Prometheus exporter that creates the metric data.

Application Metrics

To ensure that the Kubernetes services are functioning correctly, you also need to keep an eye on a few key application metrics: application-specific metrics, application error rates, and application performance. Kubernetes keeps track of the current state of deployments, which is important for identifying unhealthy applications.

The following are categories of application-related metrics:

Application deployments:
- The health status or current condition of a deployment
- Metric: kube_deployment_status_condition
- Prometheus Exporter: kube-state-metrics
Application performance:
- How quickly the application responds to HTTP requests
- Metric: probe_duration_seconds
- Prometheus Exporter: blackbox-exporter
Application health:
- Verifies that application health check endpoints are responding correctly. For example, a request to an endpoint returns an HTTP status code from 200 to 399.
- Metric: probe_success (for example, behind an alert rule named TargetUrlDown)
- Prometheus Exporter: blackbox-exporter
Application logs:
- The rates of error or success messages in logs generated by the application
- Metric: errors_total
- Prometheus Exporter: grok-exporter
  - Application log messages are collected by a system other than Prometheus, such as ELK or Loki, for further analysis and anomaly detection.
Container resource utilization:
- How much CPU and memory the containers are using
- Metric: container_cpu_load_average_10s
- Prometheus Exporter: cAdvisor
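
As a rough illustration of how the deployment metric above might be consumed, the sketch below builds an instant-query URL for the Prometheus HTTP API and picks out deployments whose Available condition is false. The server address and the sample response are made up for the example, and the parsing assumes the standard instant-query response shape.

```python
import json
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace it with your own server.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"

# Series with value 1 here mean the Available condition is currently false.
QUERY = 'kube_deployment_status_condition{condition="Available",status="false"}'


def build_query_url(promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return f"{PROMETHEUS_URL}/api/v1/query?{params}"


def unavailable_deployments(response_body: str) -> list[str]:
    """Return 'namespace/deployment' for every series whose value is 1."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    names = []
    for series in payload["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # the sample value arrives as a string
        if float(value) == 1.0:
            names.append(f"{labels['namespace']}/{labels['deployment']}")
    return sorted(names)


# Made-up response in the shape returned by /api/v1/query.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"namespace": "shop", "deployment": "checkout"},
             "value": [1700000000, "1"]},
            {"metric": {"namespace": "shop", "deployment": "cart"},
             "value": [1700000000, "0"]},
        ],
    },
})

print(build_query_url(QUERY))
print(unavailable_deployments(SAMPLE_RESPONSE))
```

In practice the response body would come from an HTTP GET against the generated URL; the canned response keeps the example runnable without a cluster.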

Cluster Metrics

A good picture of the deployed workload can be obtained through an overview of the number of active nodes, pods, and containers; this will also reveal the resource capacity. CPU, memory, network I/O pressure, and disk consumption are crucial cluster metrics that show whether the cluster is properly utilizing its resources. These metrics are central to monitoring Kubernetes with Prometheus.

Each Kubernetes node has finite resources that the running pods may use, so these metrics must be closely monitored:

Cluster health:
- The rate of Kubernetes errors in the event logs
- Metric: kube_event_count
- Prometheus Exporter: kubernetes-event-exporter
Cluster node health:
- The condition or health of the underlying cluster nodes
- Metric: kube_node_status_condition
- Prometheus Exporter: kube-state-metrics
Cluster resource utilization:
- The ratio of current utilization to available resources (CPU, memory, storage, etc.)
- Metric: node_memory_MemFree_bytes
- Prometheus Exporter: node-exporter
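
To make the resource utilization metric concrete, here is a minimal sketch that reads two unlabelled node-exporter gauges from text in the Prometheus exposition format and derives the fraction of memory in use. The sample values are invented, and pairing node_memory_MemFree_bytes with node_memory_MemTotal_bytes is an assumption of this example; many setups prefer node_memory_MemAvailable_bytes, which accounts for reclaimable caches.

```python
# Made-up scrape output in the Prometheus text exposition format.
SAMPLE_SCRAPE = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8e+09
# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 2e+09
"""


def parse_unlabelled_samples(text: str) -> dict[str, float]:
    """Collect 'name value' lines, skipping comments and labelled series."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        name, value = line.split()[:2]
        samples[name] = float(value)
    return samples


def memory_used_ratio(samples: dict[str, float]) -> float:
    """Fraction of total memory that is not reported as free."""
    total = samples["node_memory_MemTotal_bytes"]
    free = samples["node_memory_MemFree_bytes"]
    return (total - free) / total


samples = parse_unlabelled_samples(SAMPLE_SCRAPE)
print(f"memory in use: {memory_used_ratio(samples):.0%}")
```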

Load Balancer Metrics

Modern software systems are accessed via HTTP, with traffic routed through a load balancer. Load balancers are important to monitor because the traffic flow to the application endpoints can provide important health metrics of requests, errors, successful requests, and healthy/unhealthy endpoints.

The following are some important metrics to monitor:

Load balancer performance:
- The current total of incoming and outgoing bytes
- Metrics: haproxy_server_bytes_in_total and haproxy_server_bytes_out_total
- Prometheus Exporter: haproxy-exporter
Load balancer health:
- The rate of HTTP errors processed by the load balancer
- Metric: haproxy_server_check_failures_total
- Prometheus Exporter: haproxy-exporter
HTTP requests per second:
- The current number of sessions per second over the last elapsed second
- Metric: haproxy_server_current_session_rate
- Prometheus Exporter: haproxy-exporter
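
Counters such as haproxy_server_bytes_in_total only ever grow, so they are usually read as a per-second rate. The sketch below shows one simplified way to compute such a rate from raw samples, treating any drop in value as a counter reset. It is an illustration only: the function name and sample data are hypothetical, and the PromQL rate() function additionally extrapolates to the window boundaries, which this version does not.

```python
def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter given (timestamp, value) pairs sorted by time."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example, after a restart),
        # so the new value is counted as growth from zero.
        increase += current - previous if current >= previous else current
    return increase / elapsed


# Made-up samples scraped every 15 seconds.
steady = [(0.0, 1000.0), (15.0, 2500.0), (30.0, 4000.0)]
with_reset = [(0.0, 1000.0), (15.0, 200.0), (30.0, 500.0)]

print(counter_rate(steady))      # bytes per second with no reset
print(counter_rate(with_reset))  # bytes per second across a reset
```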

Other Important Metrics

Kubernetes monitoring tools are often extensible, especially open-source, self-managed tools. They can be used to monitor components that are not directly related to the application or the underlying cluster.

The following are less common metrics that can provide great value:

Job status:
- Rate of failed and successful completion of Kubernetes jobs
- Metric: kube_job_complete
- Prometheus Exporter: kube-state-metrics
SSL lifetime:
- SSL certificate expiration date or the number of days until expiration
- Metric: probe_ssl_earliest_cert_expiry
- Prometheus Exporter: blackbox-exporter
Cost analysis prediction:
- An estimate of the predicted cost of running resources based on trend analysis
- See the “Cost Management” section for advanced cost management features beyond this metric
- Metric: node_total_hourly_cost
- Prometheus Exporter: opencost-exporter
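
The certificate metric above is reported as a Unix timestamp, so it is typically converted into days remaining before it is compared against a renewal window. A minimal sketch of that arithmetic follows; the 30-day window and the function names are assumptions made for illustration.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def days_until_expiry(expiry_unix: float, now_unix: float) -> float:
    """Convert an expiry timestamp into days remaining (negative if already expired)."""
    return (expiry_unix - now_unix) / SECONDS_PER_DAY


def needs_renewal(expiry_unix: float, now_unix: float, warn_days: float = 30.0) -> bool:
    """True when the certificate expires within the warning window."""
    return days_until_expiry(expiry_unix, now_unix) < warn_days


# Fixed "now" so the example is reproducible; a real check would use the current time.
now = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
expiry = now + 10 * SECONDS_PER_DAY  # made-up value of probe_ssl_earliest_cert_expiry

print(days_until_expiry(expiry, now))
print(needs_renewal(expiry, now))
```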

Additional Considerations

Alerts and Notifications

A monitoring system should be capable of sending alerts when the system is near or in an error state. The tool selected needs to have the capability to set thresholds on metrics, so when the specified threshold is exceeded, the system sends an automated notification to the proper notification channel to be reviewed and acted upon.
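
The threshold behavior described above can be sketched as a small state machine. This is a simplified illustration rather than how any particular tool implements alerting: the class name, the hold period, and the three states are assumptions, loosely modeled on the common practice of requiring a breach to persist before notifying anyone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Fires only after a value stays above the threshold for hold_seconds."""
    threshold: float
    hold_seconds: float
    breached_since: Optional[float] = None

    def evaluate(self, timestamp: float, value: float) -> str:
        """Return 'ok', 'pending', or 'firing' for the newest sample."""
        if value <= self.threshold:
            self.breached_since = None  # the breach is over, so reset the timer
            return "ok"
        if self.breached_since is None:
            self.breached_since = timestamp  # first sample above the threshold
        if timestamp - self.breached_since >= self.hold_seconds:
            return "firing"
        return "pending"


# Made-up CPU utilization samples taken every 150 seconds.
alert = ThresholdAlert(threshold=0.90, hold_seconds=300)
for ts, cpu in [(0, 0.95), (150, 0.97), (300, 0.96), (450, 0.50)]:
    print(ts, alert.evaluate(ts, cpu))
```

Holding an alert in a pending state for a few minutes is one way to avoid notifications for short spikes that resolve on their own.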

Ease of Management

Some tools require extensive skills to install; if this is a new tool, it may take time for the team to learn it. The tool should make it easy to add new metrics and modify existing metrics as new issues arise or new monitoring needs are identified. The tool should also include dashboards for you to query, display, monitor, understand, and share your data.

Existing Monitoring Solutions

If your organization already has a monitoring system that can monitor Kubernetes, it may be desirable to use it for this purpose. Using the existing system would enable your team to have a single tool (single pane of glass) to monitor all systems. Also, if your team is already familiar with the tool, they would not need time to learn a new tool.

Integrations with Other Tools

The monitoring tools should integrate with other tools in the organization. For example, alerts should be sent to the primary communication channel for the team, such as Slack, MS Teams, SMS, or email. Additionally, diverse resource utilization metrics provided by Prometheus can be used by tools like Kubecost in order to provide cost management and usage analysis.
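
As one possible shape for such an integration, the sketch below turns an alert notification into a chat message. It assumes the incoming body follows the Alertmanager webhook format (an "alerts" list whose entries carry "status", "labels", and "annotations") and that the chat tool accepts a JSON object with a "text" field, as Slack incoming webhooks do. The alert contents are made up, and nothing is actually sent.

```python
import json


def chat_message_from_alerts(webhook_body: str) -> dict:
    """Summarize each alert in the payload as one line of chat text."""
    payload = json.loads(webhook_body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unknown")
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "")
        lines.append(f"[{status}] {name} ({severity}): {summary}")
    return {"text": "\n".join(lines)}


# Made-up notification body for illustration.
SAMPLE_WEBHOOK = json.dumps({
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "TargetUrlDown", "severity": "critical"},
            "annotations": {"summary": "checkout health check is failing"},
        }
    ],
})

print(json.dumps(chat_message_from_alerts(SAMPLE_WEBHOOK), indent=2))
```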

Cost Management

As discussed before, Prometheus is a rich data source with a diversity of metrics, some of which can be leveraged by tools such as Kubecost. Kubecost consumes utilization metrics (such as CPU, memory, GPU, storage, and network) and uses that data to provide insights into the efficiency, cost, and system health of Kubernetes clusters.

Kubecost can then segregate and allocate the costs across all Kubernetes resources such as namespaces, DaemonSets, pods, and even labels. The results are available in the form of dashboards and reports. Kubecost alerts notify administrators of a sudden drop in efficiency or capacity headroom, or of a budget overrun.
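
To give a feel for what cost allocation involves, here is a deliberately simplified sketch that splits one node's hourly cost across namespaces in proportion to their CPU requests and reports the unrequested share as idle. This is not Kubecost's algorithm; the function name, the "__idle__" key, and all figures are hypothetical, and a real allocation would also weigh memory, GPU, storage, and network usage.

```python
def allocate_node_cost(
    node_hourly_cost: float,
    node_cpu_cores: float,
    cpu_requests_by_namespace: dict[str, float],
) -> dict[str, float]:
    """Split a node's hourly cost by each namespace's share of CPU cores."""
    if node_cpu_cores <= 0:
        raise ValueError("node_cpu_cores must be positive")
    costs = {
        namespace: node_hourly_cost * cores / node_cpu_cores
        for namespace, cores in cpu_requests_by_namespace.items()
    }
    # Capacity that nobody requested is reported separately as idle cost.
    idle_cores = max(node_cpu_cores - sum(cpu_requests_by_namespace.values()), 0.0)
    costs["__idle__"] = node_hourly_cost * idle_cores / node_cpu_cores
    return costs


# Made-up figures: a 4-core node at $0.40 per hour.
allocation = allocate_node_cost(0.40, 4.0, {"shop": 2.0, "monitoring": 1.0})
for namespace, cost in allocation.items():
    print(f"{namespace}: ${cost:.2f}/hour")
```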

Kubecost is available as a download and can be used for free on one cluster.

System Availability

The monitoring system must be deployable in a highly available (fault-tolerant) configuration in a different data center from the one used by the systems it is monitoring. If there is a system outage and the monitoring system is also down, the monitoring system is useless.
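
One common safeguard against this failure mode is a heartbeat check that runs outside the monitored data center and raises an alarm when the monitoring system itself goes quiet. The sketch below shows only the staleness test; the function name and the two-minute window are assumptions for illustration.

```python
def monitoring_is_silent(
    last_heartbeat_unix: float,
    now_unix: float,
    max_silence_seconds: float = 120.0,
) -> bool:
    """True when no heartbeat has arrived within the allowed window."""
    return now_unix - last_heartbeat_unix > max_silence_seconds


# Made-up timestamps: the last heartbeat arrived five minutes before "now".
now = 1_700_000_300.0
last_heartbeat = 1_700_000_000.0

if monitoring_is_silent(last_heartbeat, now):
    print("monitoring system has stopped reporting; notify the on-call engineer")
```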

Support

It’s essential to consider the source of support for the system. If the monitoring system was purchased from a vendor, you may need to open a support ticket when problems arise. If it’s an open-source product, you will need to rely on the developer community.

Conclusion

Monitoring and observability tools provide insight into Kubernetes clusters. They help make the complexities of managing Kubernetes easier by providing a comprehensive view of the workloads and the underlying infrastructure.

Kubernetes does not include a monitoring system, so cluster administrators can choose the solution that works best for them. There are several important factors to consider when selecting a monitoring tool.

Several technologies related to the use of Kubernetes must be monitored to create more resilient systems: the applications, the containers running the applications, the infrastructure supporting the containers, and Kubernetes itself.
