5 Kubernetes Monitoring To-Dos

May 1st, 2018 8:59am by Steven Czerwinski

Steven Czerwinski, Scalyr Head of Engineering

Steven Czerwinski is the co-founder and head of engineering at Scalyr. Prior to Scalyr, he spent 8 years at Google, specializing in building distributed database systems for consumer apps. He was the backend tech lead for Picasa Web Albums, tech lead for Cosmo, and tech lead for Stratus, the Cosmo replacement that drove further unification, including for Google Plus. Prior to Google, Steven was in the Ph.D. program at UC Berkeley, specializing in distributed and mobile systems. He successfully defended his dissertation, but left ABD to start his job at Google.

If you’re on the DevOps front line, Kubernetes is fast becoming an essential part of your production cloud environment. Since container orchestration is critical to deploying, scaling, and managing your containerized applications, monitoring Kubernetes needs to be a big part of your monitoring strategy.

Container environments don’t operate like traditional ones. So, if you are monitoring your applications and infrastructure, you also need to be thoughtful about how you monitor the container environment they run in. Here are five best practices to inform your strategy:

Centralize your logs and metrics. Orchestrating your containerized services and workloads through Kubernetes brings order to the chaos, but your environment is still spread across many nodes, pods, and containers. You will give yourself a fighting chance if you collect your logs and metrics in one central place where you can search and correlate them.
Account for ephemeral containers. The beauty of container orchestration is that it’s easy to start, stop, kill, and clean up your containers in short order. Monitoring them, however, may not be so easy. You still need to debug problems and monitor cluster activity even as services come and go. The trick is to capture the logs and metrics before the containers that produced them are gone. If you don’t, your metrics will look more like the graph on the left than the one on the right.
Simplify, simplify, simplify. With all of the moving pieces in your container environment (services, APIs, containers, and the orchestration tool), you need to monitor without introducing unneeded complexity. Rather than bloating your containers with various monitoring agents, each requiring updates on its own schedule, keep your monitoring and management tools separate from what you’re monitoring and managing. This also helps your engineers focus on building and delivering software, not on operating the delivery platform.
Monitor each layer explicitly. You will need to collect logs and monitor for errors, failures, and performance issues at each layer of your environment: the pod, the container, and the controller manager. For example, you’ll need to be able to troubleshoot pod issues, confirm that each container is working, and collect runtime metrics from the controller manager.
Ensure data consistency across your layers. For fast, accurate debugging, you need consistent data across all of the layers in your container environment. Accurate timestamps, consistent units of measurement (for example, always milliseconds rather than a mix of milliseconds and seconds), and a common set of metrics and logs across applications and components all help you troubleshoot and debug quickly and accurately across every layer.
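To see why short-lived containers are easy to miss, here is a small Python simulation. It assumes a pull-based collector that samples the cluster at a fixed interval, and a hypothetical workload of twelve containers that each live 25 seconds; a container only appears in the data if it is alive at some sample time. The numbers are illustrative, not measurements.

```python
from typing import List, Tuple


def scrape_times(horizon: int, interval: int) -> List[int]:
    """Seconds at which a pull-based collector samples the cluster."""
    return list(range(0, horizon, interval))


def observed(containers: List[Tuple[int, int]], interval: int, horizon: int) -> int:
    """Count containers alive (start <= t < end) at one or more sample times."""
    times = scrape_times(horizon, interval)
    return sum(
        1 for start, end in containers
        if any(start <= t < end for t in times)
    )


# Hypothetical workload: 12 containers, one started every 20 s, each living 25 s.
containers = [(5 + 20 * i, 30 + 20 * i) for i in range(12)]

for interval in (60, 10):
    seen = observed(containers, interval, horizon=240)
    print(f"sample every {interval}s: {seen}/{len(containers)} containers seen")
# sample every 60s: 3/12 containers seen
# sample every 10s: 12/12 containers seen
```

Collecting continuously from each node, rather than relying on a coarse sampling loop alone, is one way to close this gap.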
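For the "monitor each layer explicitly" to-do, the sketch below inspects the pod and container layers of a single Pod object. It assumes the JSON shape returned by `kubectl get pod <name> -o json` and reads only a few standard fields (`status.phase` and, per container, `ready`, `restartCount`, and `state.waiting.reason`); the helper `pod_issues` and the sample pod are hypothetical. Controller-manager metrics would need their own collection path and are not covered here.

```python
from typing import Any, Dict, List


def pod_issues(pod: Dict[str, Any]) -> List[str]:
    """List problems visible at the pod and container layers of one Pod object.

    `pod` is assumed to have the shape of `kubectl get pod <name> -o json`
    output; only a handful of standard status fields are inspected.
    """
    name = pod["metadata"]["name"]
    status = pod.get("status", {})
    issues: List[str] = []

    # Pod layer: the overall lifecycle phase.
    phase = status.get("phase", "Unknown")
    if phase not in ("Running", "Succeeded"):
        issues.append(f"pod {name}: phase is {phase}")

    # Container layer: readiness, waiting reasons, and restarts.
    for cs in status.get("containerStatuses", []):
        cid = f"{name}/{cs['name']}"
        if not cs.get("ready", False):
            issues.append(f"container {cid}: not ready")
        waiting = cs.get("state", {}).get("waiting")
        if waiting:
            issues.append(f"container {cid}: waiting ({waiting.get('reason', 'unknown')})")
        if cs.get("restartCount", 0) > 0:
            issues.append(f"container {cid}: restarted {cs['restartCount']} times")
    return issues


# Hypothetical pod whose phase is still Running while one container crash-loops.
example = {
    "metadata": {"name": "web-7d9f"},
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "app", "ready": False, "restartCount": 4,
             "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            {"name": "log-agent", "ready": True, "restartCount": 0,
             "state": {"running": {"startedAt": "2018-05-01T08:59:00Z"}}},
        ],
    },
}

for issue in pod_issues(example):
    print(issue)
# container web-7d9f/app: not ready
# container web-7d9f/app: waiting (CrashLoopBackOff)
# container web-7d9f/app: restarted 4 times
```

The example pod's phase is still `Running`; the problem only surfaces at the container layer, which is why each layer is worth checking on its own.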
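Data consistency is easiest to enforce when records are ingested. This minimal sketch assumes each record carries an ISO-8601 timestamp with an explicit UTC offset, a latency value, and a unit; the field names (`ts`, `latency`, `unit`) and the unit table are illustrative assumptions, not a standard schema. It converts every timestamp to UTC and every latency to milliseconds so records from different components can be compared and ordered directly (Python 3.7+ for `datetime.fromisoformat`).

```python
from datetime import datetime, timezone

# Illustrative conversion table from a reported unit to milliseconds.
TO_MS = {"s": 1000.0, "ms": 1.0}


def normalize(record: dict) -> dict:
    """Return a copy of `record` with a UTC timestamp and latency in ms.

    Assumes `ts` is an ISO-8601 string with an explicit UTC offset, plus
    `latency` and `unit` fields. These field names are hypothetical.
    """
    ts = datetime.fromisoformat(record["ts"]).astimezone(timezone.utc)
    out = {k: v for k, v in record.items() if k not in ("ts", "latency", "unit")}
    out["ts"] = ts.isoformat(timespec="milliseconds")
    out["latency_ms"] = float(record["latency"]) * TO_MS[record["unit"]]
    return out


records = [
    # The ingress logs UTC in milliseconds; the app logs local time in seconds.
    {"source": "ingress", "ts": "2018-05-01T08:59:00.300+00:00", "latency": 95, "unit": "ms"},
    {"source": "app", "ts": "2018-05-01T10:59:00.250+02:00", "latency": 0.25, "unit": "s"},
]

# With one timestamp format, all in UTC, string order matches time order.
for r in sorted((normalize(rec) for rec in records), key=lambda n: n["ts"]):
    print(r["ts"], r["source"], r["latency_ms"])
# 2018-05-01T08:59:00.250+00:00 app 250.0
# 2018-05-01T08:59:00.300+00:00 ingress 95.0
```

Before normalization, the app's local-time stamp (10:59 at +02:00) would sort after the ingress record even though it happened 50 ms earlier.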

One best practice for accomplishing these to-dos in a simple, straightforward manner is to monitor the containers in your Kubernetes environment without touching your application containers. Do this by introducing a DaemonSet (one agent pod per node) or, alternatively, a sidecar (an agent container inside each application pod) into your Kubernetes environment(s), running alongside your containerized services and hosting your logging and metrics collection agent. Deploying this way ensures consistent data collection, minimizes the changes required to your application containers, and, most importantly, eliminates blind spots where some production services go unmonitored.
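As a concrete illustration of the DaemonSet approach, the Python function below builds a minimal `apps/v1` DaemonSet manifest as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts. The image `example.com/log-agent:1.0` and the names are placeholders; a real agent needs its own image, configuration, and usually RBAC permissions. The two read-only `hostPath` mounts assume nodes keep container logs in Docker's default locations.

```python
import json


def log_agent_daemonset(name: str, image: str, namespace: str = "kube-system") -> dict:
    """Build a DaemonSet manifest that runs one agent pod on every node.

    The agent reads container logs from the node's filesystem through
    read-only hostPath mounts, so application containers are not modified.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace, "labels": dict(labels)},
        "spec": {
            # The selector must match the pod template's labels.
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "volumeMounts": [
                            {"name": "varlog", "mountPath": "/var/log", "readOnly": True},
                            {"name": "dockercontainers",
                             "mountPath": "/var/lib/docker/containers", "readOnly": True},
                        ],
                    }],
                    "volumes": [
                        {"name": "varlog", "hostPath": {"path": "/var/log"}},
                        {"name": "dockercontainers",
                         "hostPath": {"path": "/var/lib/docker/containers"}},
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder image name; substitute your actual logging/metrics agent.
    manifest = log_agent_daemonset("log-agent", "example.com/log-agent:1.0")
    print(json.dumps(manifest, indent=2))
```

Because the agent reads logs from the node, the application containers stay untouched. A sidecar would instead add a second container to each application's Pod spec, trading one agent per node for one agent per pod.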

A few ways to implement this include:

Introduce a DaemonSet with the Fluentd logging agent (this will give you logging but not metrics). If you already have an ELK cluster configured, this is probably the option for you.
Introduce a DaemonSet or sidecar with the Prometheus metrics agent (CoreOS has done an excellent job of integrating Prometheus and Kubernetes). Running Prometheus on your Kubernetes cluster will give you metrics instrumentation, querying, and alerting.
A variety of metrics and performance monitoring tools, including Heapster, Datadog, cAdvisor, New Relic, Weave/VMware, and several others, also offer DaemonSet or sidecar options for Kubernetes monitoring.
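On each node, a Fluentd DaemonSet typically tails the files under `/var/log/containers`, where the kubelet names each file `<pod>_<namespace>_<container>-<container-id>.log`, and with Docker's `json-file` logging driver every line is a JSON object with `log`, `stream`, and `time` fields. The Python sketch below is not Fluentd; under those assumptions it only shows how one such line can become a record tagged with pod, namespace, and container before being shipped to a central store such as Elasticsearch. The helper `parse_line` is hypothetical.

```python
import json
import os
import re
from typing import Dict

# Assumed kubelet file-name pattern: <pod>_<namespace>_<container>-<64-hex id>.log
_NAME_RE = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)"
    r"-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_line(path: str, line: str) -> Dict[str, str]:
    """Combine one Docker json-file log line with metadata from its file name."""
    match = _NAME_RE.match(os.path.basename(path))
    if match is None:
        raise ValueError(f"unexpected log file name: {path}")
    entry = json.loads(line)
    record = match.groupdict()
    record.update({
        "time": entry["time"],
        "stream": entry["stream"],
        # Docker keeps the trailing newline inside the "log" field.
        "message": entry["log"].rstrip("\n"),
    })
    return record


cid = "0f" * 32  # placeholder 64-hex container ID
line = r'{"log":"GET /healthz 200\n","stream":"stdout","time":"2018-05-01T08:59:00.123456789Z"}'
print(parse_line(f"/var/log/containers/web-7d9f_default_app-{cid}.log", line))
```

Tagging every line with its pod, namespace, and container at collection time is what makes logs from short-lived containers searchable after those containers are gone.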
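Prometheus works by scraping an HTTP endpoint (conventionally `/metrics`) that serves samples in its text exposition format. In practice you would use an official client library such as `prometheus_client` for Python; the stdlib-only sketch below renders a single counter family by hand just to show what a scrape returns. The metric name and labels are hypothetical examples.

```python
from typing import Dict, Tuple

# A label set is an ordered tuple of (label name, label value) pairs.
LabelSet = Tuple[Tuple[str, str], ...]


def _escape(value: str) -> str:
    """Escape backslashes, double quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: Dict[LabelSet, float]) -> str:
    """Render one counter family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{_escape(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}" if label_str else f"{name} {value}")
    return "\n".join(lines) + "\n"


print(render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    {(("code", "200"), ("pod", "web-7d9f")): 1027,
     (("code", "500"), ("pod", "web-7d9f")): 3},
))
```

The output is a `# HELP` line, a `# TYPE` line, and one sample line per label set, such as `http_requests_total{code="200",pod="web-7d9f"} 1027`; Prometheus stores each scrape as a time series keyed by the metric name and labels.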
