
Kubernetes Monitoring | Tools & Techniques You Should Know

February 2, 2025

Introduction

In today's cloud industry, Kubernetes has proven to be a strong choice for orchestrating the containers in your cloud infrastructure. Some DevOps engineers even consider Kubernetes the gold standard for automating, scaling, and managing containerized applications. As reliance on Kubernetes grows, applications and their setups have become more complex than ever, and so has the need for monitoring.

Kubernetes monitoring involves collecting and analyzing metrics, logs, and events. Together they provide insight into the health and performance of the clusters and the applications running on them. A well-designed monitoring system helps you detect and diagnose issues, and its historical metrics support better resource planning, optimization, and improvements to system reliability.

In this article, we will cover some important monitoring concepts and introduce popular monitoring tools along with their main features, so that you can pick the right tool for your self-hosted clusters or your work. Whether you are new to Kubernetes or an experienced administrator, choosing the right tool matters.

Something to Know About Monitoring

Monitoring is a basic practice, but it is not trivial: there are many types of metrics available, and deciding what to monitor takes experience.

Here are the main components you should monitor:

Nodes. Nodes are the core of any Kubernetes cluster. They host system Pods, application Pods, and possibly controller Pods, so it is important to know whether they are working as expected.
Pods. Pods are what run your applications, so you need to know exactly what is happening inside them. Cluster-level metrics tell you how far you can scale, and application-level metrics give insight into how the applications behave.
Ingress. Ingresses route external traffic to Services and, through them, to Pods. Their statistics help with troubleshooting network-related issues. As the entry point to your applications, an Ingress should be available at all times and keep detailed records of incoming requests.
Persistent Storage. Storage is where your data is kept. You should make sure the volumes are well planned and their capacity is used appropriately. In some cases, unavailable storage causes downtime for your applications.
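For Nodes, a quick first check is the list of conditions the kubelet reports in each Node's status. The Python sketch below assumes you already have a Node object as a dictionary (for example, parsed from `kubectl get node <name> -o json`) and flags conditions whose status differs from the healthy value. The set of condition types checked here is a choice made for illustration, not an exhaustive list.

```python
# Assumed subset of Node condition types and the status each has on a
# healthy node: Ready should be "True", the pressure conditions "False".
EXPECTED_STATUS = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
}


def unhealthy_conditions(node: dict) -> list[str]:
    """Return 'Type=Status' for each checked condition that is not healthy."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        expected = EXPECTED_STATUS.get(cond.get("type"))
        # Condition types outside EXPECTED_STATUS are ignored in this sketch.
        if expected is not None and cond.get("status") != expected:
            problems.append(f'{cond["type"]}={cond.get("status")}')
    return problems
```

A Ready condition with status `Unknown` (which appears when the control plane loses contact with the kubelet) is reported as `Ready=Unknown`, which is often the first sign that a node needs attention.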

Top Tools You Should Know

Metrics Server

The Metrics Server is a scalable, efficient source of container resource metrics for the Kubernetes built-in autoscaling pipelines. It collects resource metrics from the kubelets and exposes them through the Kubernetes API server as the Metrics API (`metrics.k8s.io`). It keeps only recent values in memory, so it is not a source of historical data. Because the metrics are served through the API server, you can retrieve them by running `kubectl top` in any cluster that has the Metrics Server installed.
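To see what `kubectl top pod` summarizes, you can fetch the raw Metrics API object with `kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod>`. The Python sketch below sums the per-container usage in such a `PodMetrics` object. Its quantity parser is a simplifying assumption: it handles only common suffixes (`n`, `u`, `m`, `k`, `M`, `G`, `Ki`, `Mi`, `Gi`), not the full Kubernetes quantity grammar.

```python
from decimal import Decimal

# Simplified subset of Kubernetes quantity suffixes (assumption: enough for
# the CPU values like "12345678n" and memory values like "20480Ki" that
# Metrics Server usually returns).
_SUFFIXES = {
    "n": Decimal("1e-9"), "u": Decimal("1e-6"), "m": Decimal("1e-3"),
    "k": Decimal(10**3), "M": Decimal(10**6), "G": Decimal(10**9),
    "Ki": Decimal(2**10), "Mi": Decimal(2**20), "Gi": Decimal(2**30),
}


def parse_quantity(q: str) -> Decimal:
    """Convert a quantity string such as '250m' or '128Mi' to a plain number."""
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):  # two-letter suffixes first
        if q.endswith(suffix):
            return Decimal(q[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(q)  # no suffix: already in base units


def pod_usage(pod_metrics: dict) -> tuple[float, float]:
    """Sum container usage in a PodMetrics object as (CPU cores, memory MiB)."""
    containers = pod_metrics["containers"]
    cpu = sum((parse_quantity(c["usage"]["cpu"]) for c in containers), Decimal(0))
    mem = sum((parse_quantity(c["usage"]["memory"]) for c in containers), Decimal(0))
    return float(cpu), float(mem / 2**20)
```

For a Pod whose two containers report `250000000n` and `5m` of CPU and `131072Ki` and `16Mi` of memory, `pod_usage` returns `(0.255, 144.0)`, that is, 0.255 cores and 144 MiB.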

It is also a core component if you use the Horizontal Pod Autoscaler (HPA) or the Vertical Pod Autoscaler (VPA): when they scale on CPU or memory usage, both rely on the container-level metrics that the Metrics Server collects.
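The Kubernetes documentation describes the core HPA rule as `desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]`, where for CPU the metric is the average utilization as a percentage of the containers' requests. A minimal Python sketch of that rule, including the default tolerance of 0.1, could look like this. It deliberately omits what the real controller also handles, such as Pods with missing metrics, Pods that are not ready, min/max replica bounds, and scale-down stabilization.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, tolerance: float = 0.1) -> int:
    """Core HPA scaling rule (simplified sketch).

    desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
    The replica count is left unchanged while the ratio stays within
    `tolerance` of 1.0, which avoids flapping on small fluctuations.
    """
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

With a 60% CPU target, 4 replicas averaging 90% scale to 6, 4 replicas at 63% stay at 4 because the ratio is within the tolerance, and 4 replicas at 30% scale down to 2.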

The open-source Metrics Server can be deployed to any Kubernetes cluster with the provided YAML manifests or the official Helm Chart.

Official website: https://github.com/kubernetes-sigs/metrics-server

Kube-State-Metrics

Kube-State-Metrics (KSM) is a service that listens to the Kubernetes API server and generates metrics about the state of Kubernetes objects, such as Deployments, Nodes, and Pods. It exposes the object data without modification, so the metrics reflect exactly what is stored in the API server.

KSM is designed to be consumed by Prometheus or any scraper that can read a Prometheus client endpoint. If you already use Prometheus as your metrics collector, you should probably include KSM in your monitoring stack.

Like the Metrics Server, KSM is open source and can be deployed with either YAML manifests or a Helm Chart.

Official website: https://github.com/kubernetes/kube-state-metrics

Kubernetes Dashboard

The Kubernetes Dashboard is a general-purpose, web-based UI for Kubernetes clusters. It visualizes metrics as colored graphs and charts. It also acts as a web console for your cluster: you can manage and troubleshoot applications from it, performing many of the actions you would otherwise do with the `kubectl` command.

From a monitoring perspective, it turns raw metrics into meaningful information, for example the current CPU and memory usage of each Pod, the number of running Pods, and the status of Deployments.

Like the Metrics Server and KSM, it is open source and community-maintained, but its deployment method differs slightly: it supports installation only through its Helm Chart.

Figure 2 - Kubernetes Dashboard | Example UI, source: https://github.com/kubernetes/dashboard.

Official website: https://github.com/kubernetes/dashboard

Prometheus

Prometheus, a Cloud Native Computing Foundation project, is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are met. Thanks to its native Kubernetes service discovery and its reliability, scalability, and flexibility, it has become a popular choice for Kubernetes monitoring.

Prometheus collects metrics over HTTP or HTTPS and stores them as time series, meaning each sample is stored together with its timestamp. Using a pull-based approach, it scrapes the targets and keeps the samples in a time-series database optimized for high performance and efficient storage. It comes with a query language named PromQL for querying metrics data, and it sends alerts to a separate component, the Alertmanager, which deduplicates, groups, and routes them as notifications.
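Because each series is a list of (timestamp, value) samples, PromQL functions such as `rate()` operate directly on those samples. The Python sketch below shows the idea behind `rate()` for a counter, including how a drop in value is treated as a counter reset. It is a simplification: unlike Prometheus, it does not extrapolate to the edges of the range window, so real query results can differ slightly.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    Like PromQL's rate(), a decrease is treated as a counter reset; unlike
    rate(), this sketch does not extrapolate to the range boundaries.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the new value is the increase.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed
```

With samples every 15 seconds of `100, 130, 160, 30, 60`, the drop from 160 to 30 counts as a reset, the total increase is 120 over 60 seconds, and the result is `2.0` per second.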

Here are some common sources of Prometheus metrics in Kubernetes:

Node exporter, for node-related metrics like CPU and memory.
Kube-state-metrics, for cluster-related metrics like Deployment status and Pod scheduling.
Control plane metrics, for metrics related to the core components like `kubelet`, `kube-dns`, `scheduler`, and `etcd`.
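An exporter is essentially an HTTP endpoint that serves metrics in the Prometheus text format for Prometheus to pull. As a sketch of that idea, here is a tiny exporter built only on Python's standard library. The metric name `myapp_requests_total`, its labels, and the port are hypothetical; for real applications, the official `prometheus_client` library is the usual choice.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application counters that this exporter reports.
REQUESTS_TOTAL = {"GET": 0, "POST": 0}


def render_metrics(counters: dict[str, int]) -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP myapp_requests_total Requests handled, by HTTP method.",
        "# TYPE myapp_requests_total counter",
    ]
    for method, value in sorted(counters.items()):
        lines.append(f'myapp_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and serve /metrics so that a Prometheus scrape job can pull from it."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve(8000)` blocks and answers `GET /metrics`; a Prometheus scrape job pointed at that host and port would then pull the counters on every scrape interval.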

Among the popular monitoring tools on the market, Prometheus is one of the best options. It mainly works as a metrics collector with a simple built-in user interface, and it is commonly paired with Grafana for dashboards: once Prometheus is added as a Grafana data source, Grafana queries it for data.
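Behind the scenes, a Grafana panel backed by Prometheus issues range queries against the Prometheus HTTP API endpoint `/api/v1/query_range`, passing `query`, `start`, `end`, and `step`. The Python sketch below builds such a URL; the host name `prometheus:9090` and the example query are hypothetical.

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, promql: str, start: float, end: float, step: str) -> str:
    """Build a Prometheus HTTP API range-query URL, like the requests a Grafana panel sends."""
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Hypothetical example: CPU usage of the "default" namespace over one hour, 60 s resolution.
url = query_range_url(
    "http://prometheus:9090/",
    'sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m]))',
    1738454400, 1738458000, "60s",
)
```

Fetching that URL (for example with `urllib.request.urlopen`) returns JSON whose `data.result` field holds the matrix of time series that Grafana plots.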

Figure 3 - Prometheus | Built-in user interface.

Figure 4 - Prometheus | Grafana dashboard with Prometheus data source, source: https://prometheus.io.

For deployment, Prometheus provides official Docker images on Quay.io and Docker Hub that you can pull and run locally. If you prefer not to run it in a container, precompiled binaries are available, and since the project is open source, you can also build it from source. In addition, a community-maintained Helm Chart supports installation with Helm.

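Additionally, with proper configuration, Prometheus metrics can drive autoscaling in Kubernetes, for example through the Prometheus Adapter, which exposes them to the Horizontal Pod Autoscaler via the custom metrics API.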

Official website: https://prometheus.io

Grafana

Grafana is an open-source platform for monitoring, visualization, and analytics. It lets users build dashboards that visualize time-series data from different sources, including databases, APIs, logs, and external monitoring systems. Grafana itself does not collect metrics; it focuses on presenting data through its dashboards. Because it supports a wide range of data sources, including Kubernetes-oriented monitoring tools like Prometheus, it has also become a popular option among Kubernetes users.

Its dynamic dashboards and query editors are its signature features. Dashboards can be customized by adding panels, graphs, tables, and many other visual components, and many ready-made dashboard templates are available, both official and from the community. The query editors let users query multiple data sources, which makes data exploration and analysis much easier.

By integrating different metrics collectors for Kubernetes, Grafana can monitor all Kubernetes resources, including Nodes, Pods, and containers, so you can see the necessary information for one or multiple clusters on a single platform. With its built-in alerting configured, you can get notified whenever a metric crosses an unexpected threshold.
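Alert rules usually combine a threshold with a pending period, so that a short spike does not notify anyone; Grafana alerting and Prometheus's `for` clause both follow this idea. The Python sketch below is a simplified model of it, not Grafana's implementation (which also has states such as No Data and Error): it evaluates a hypothetical `value > threshold` rule once per sample and reports `Normal`, `Pending`, or `Firing`.

```python
def alert_states(values: list[float], threshold: float, pending_intervals: int) -> list[str]:
    """Evaluate a 'value > threshold' rule once per value (simplified model).

    The alert becomes Pending on the first breach and Firing only after the
    condition has held for `pending_intervals` further evaluations; any
    evaluation below the threshold resets it to Normal.
    """
    states, breaches = [], 0
    for v in values:
        breaches = breaches + 1 if v > threshold else 0
        if breaches == 0:
            states.append("Normal")
        elif breaches > pending_intervals:
            states.append("Firing")
        else:
            states.append("Pending")
    return states
```

With a threshold of 80 and a pending period of two evaluation intervals, the values `50, 85, 90, 95, 70` produce `Normal, Pending, Pending, Firing, Normal`.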

Figure 5 - Grafana | Example dashboard for Kubernetes cluster monitoring, source: https://grafana.com.

Figure 6 - Grafana | Example dashboard of container insights, source: https://grafana.com.

Figure 7 - Grafana | Example dashboard of Pod logs, source: https://grafana.com.

Like Prometheus, Grafana can be installed from binaries or Docker images, or on a Kubernetes cluster with YAML manifests or a Helm Chart. Its Helm Chart is also community-maintained.

Official website: https://grafana.com

EFK Stack

The EFK stack includes three tools, which are Elasticsearch, Fluentd, and Kibana. Each tool plays a specific role to process and visualize the data. The whole stack is a comprehensive solution for logs and metrics monitoring and visualization.

Elasticsearch is a RESTful search engine that is designed for real-time search and complex data analysis. It stores data in the form of JSON documents and provides full-text search capabilities.

Fluentd is an open-source, scalable collector and forwarder for logs and metrics. It collects data from the Kubernetes Pods and sends it to Elasticsearch for indexing.
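On each node, container runtimes such as containerd write Pod logs to files under `/var/log/containers/` in the CRI format `<timestamp> <stream> <P|F> <message>`, and Fluentd typically runs as a DaemonSet that tails those files. The Python sketch below only illustrates the data flow, not how Fluentd is implemented: it parses CRI log lines and builds a body for the Elasticsearch `_bulk` API (newline-delimited JSON, sent with `Content-Type: application/x-ndjson`). The index name `k8s-logs` and the record field names are hypothetical.

```python
import json


def parse_cri_line(line: str) -> dict:
    """Split a CRI-format container log line: <timestamp> <stream> <P|F> <message>."""
    timestamp, stream, tag, message = line.rstrip("\n").split(" ", 3)
    # Tag "P" marks a partial line that continues in the next entry; "F" is a full line.
    return {"@timestamp": timestamp, "stream": stream, "partial": tag == "P", "log": message}


def bulk_body(records: list[dict], index: str) -> str:
    """Build an Elasticsearch _bulk body: an action line, then the document, per record."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"  # the _bulk API requires a trailing newline
```

A real Fluentd setup handles the same steps with its tail input and Elasticsearch output plugins, plus buffering, retries, and Kubernetes metadata enrichment that this sketch leaves out.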

Kibana provides a web interface for the data stored in Elasticsearch. It visualizes the data in graphs, charts, dashboards, and other views.

From the Kubernetes monitoring perspective, the EFK stack is mainly a log management solution. It ingests large volumes of application logs from the various Pods in your clusters, giving you real-time events that you can use to detect system issues and troubleshoot them.

Deploying the whole stack is much more complicated than deploying the tools introduced earlier. You can write your own YAML manifests to deploy it to your Kubernetes cluster, or use the managed services offered by Elastic or by some cloud providers.

Official websites:

Elasticsearch: https://www.elastic.co

Fluentd: https://github.com/fluent/fluentd

Kibana: https://github.com/elastic/kibana