What is Monitoring? #


  • Monitoring: Detects when a Pod, Node, or application crashes or becomes unresponsive, enabling faster recovery
  • Identify and Resolve Issues Quickly: Real-time alerts help detect anomalies like memory leaks, traffic spikes, or API failures before users are affected
  • Track Resource Usage Over Time: Tools like Prometheus collect CPU, memory, and disk usage data to optimize resource allocation and reduce cost
  • Visualize Cluster Performance Trends: Dashboards (e.g., Grafana) provide easy-to-understand visuals for cluster-wide behavior and app performance
  • Support Auto-Scaling Decisions: Horizontal Pod Autoscalers (HPA) rely on metrics (like CPU usage) to decide when to add or remove Pods (see the example after this list)
  • Aid in Capacity Planning: Historical monitoring data helps plan for future growth and avoid bottlenecks
  • Strengthen Security Posture: Alerting on unusual patterns (e.g., sudden spike in traffic) can surface potential security breaches or attacks
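
To make the auto-scaling point concrete, here is a minimal HorizontalPodAutoscaler sketch. The Deployment name my-app, the replica range, and the 70% CPU target are assumptions for illustration, and a metrics source such as metrics-server must be running for the HPA to receive CPU data.

# WHAT: Scale a Deployment up/down based on CPU usage
# WHY:  Shows how monitoring metrics drive scaling decisions
# NOTE: Sketch only; assumes a Deployment named "my-app" exists
#       and a metrics source (e.g., metrics-server) is installed
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # Add Pods when average CPU utilization exceeds 70%
          averageUtilization: 70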

What is Observability? #


  • Observability: Lets you ask questions about your system’s current state without having to pre-define every scenario
  • Go Beyond Monitoring: While monitoring alerts on known issues, observability helps investigate unknown or unexpected problems
  • Correlate Metrics, Logs, and Traces: Combines performance data (metrics), detailed events (logs), and request flows (traces) to give a full picture of what’s happening
  • Diagnose Complex Failures Faster: Helps identify root causes across distributed systems by following request paths and spotting anomalies
  • Improve Reliability and User Experience: Enables teams to detect, debug, and resolve issues before users are impacted
  • Enable Proactive Operations: Allows engineers to explore trends and detect degraded performance early—even before alerts are triggered
  • Essential for Microservices and Cloud-Native Apps: With many moving parts, observability is critical for understanding interactions and performance across services

Monitoring vs Observability #


Aspect | Monitoring | Observability
------ | ---------- | -------------
Definition | Tracking known metrics and events against predefined thresholds | Ability to ask arbitrary questions about system state using metrics, logs, and traces
Goal | Detect and alert on known problems or symptoms | Understand why something went wrong and explore the unknown
Data Sources | Primarily metrics (CPU, memory, request rate, error rate) | Metrics + Logs + Traces (The Three Pillars of Observability)
Focus | Alert on predefined conditions (e.g., CPU > 80%) | Explore, correlate, and investigate unpredictable behavior across distributed systems
Scope | Point-in-time status of specific components or metrics | End-to-end system state across multiple layers, services, and time
Mindset | Known-knowns: What we expect to happen and want to monitor | Unknown-unknowns: What we don’t know yet, but want to uncover with flexible queries and correlation
Example | Alert me if memory usage > 90% on node X | Why did service A’s latency spike only when talking to service B last Thursday?
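
To make the Example row concrete on the monitoring side, the sketch below expresses a "memory usage > 90%" alert as a PrometheusRule. It assumes the Prometheus Operator (which provides the PrometheusRule CRD) and node-exporter metrics are available; the rule name, namespace, and threshold are illustrative.

# WHAT: A predefined threshold alert (the "monitoring" mindset)
# WHY:  Fires when node memory usage crosses a known limit
# NOTE: Sketch only; assumes the Prometheus Operator and
#       node-exporter metrics are installed in the cluster
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-memory-alert
  namespace: monitoring
spec:
  groups:
    - name: node-alerts
      rules:
        - alert: NodeMemoryHigh
          # Memory usage above 90% for 5 minutes
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Memory usage > 90% on {{ $labels.instance }}"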

What is OpenTelemetry? #


  • Understanding the Need for Consistent Observability Standards: Modern apps span multiple services, languages, and environments, making it hard to trace performance, debug issues, and understand system behavior using isolated tools
  • What is OpenTelemetry: An open-source, vendor-neutral standard for collecting metrics, logs, and traces from applications
  • Collects Telemetry Data in One Format: Unifies different types of data across services, so you don't have to use multiple tools or agents for different data types
  • Supports Multiple Programming Languages: Enables consistent observability whether your microservices are in Java, Python, Go, or Node.js
  • Sends Data to Any Backend: Export telemetry to systems like Prometheus, Jaeger, Grafana, Datadog, or Elastic without vendor lock-in
  • Enables End-to-End Visibility: Helps correlate latency spikes, failed requests, or memory usage with specific services or traces across the entire system
  • Simplifies Debugging and Root-Cause Analysis: Makes it easier to find where and why failures happen—even in complex distributed environments
  • Backed by CNCF and Major Cloud Providers: Ensures it’s a trusted, long-term standard with active community and enterprise support

How well is OpenTelemetry supported in Kubernetes? #


  • Designed for Cloud-Native Environments: OpenTelemetry is designed to work seamlessly with platforms like Kubernetes
  • Supports All Telemetry Types: Collects metrics, logs, and traces from Pods, Nodes, and Services
  • Kubernetes Metadata Enrichment: Automatically adds Pod names, Namespaces, Node info, and Labels to telemetry data, making debugging easier
  • Collector Works as a DaemonSet or Sidecar: The OpenTelemetry Collector can run as:
    • A Sidecar to collect application-specific telemetry
    • A DaemonSet to collect node-level and container-level metrics
  • Supports Export Transformations: The Collector allows you to filter, batch, transform, and export data to one or more backends (a minimal configuration sketch follows this list)
  • Integration with Popular Tools: Easily integrates with Prometheus, Grafana, Jaeger, Zipkin, Datadog, Elastic, etc., for backend analysis
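
As a sketch of the filter/batch/export pipeline mentioned above, here is a minimal Collector configuration. It assumes the contrib distribution of the Collector (which ships the k8sattributes processor) and an OTLP-capable backend reachable at the illustrative address otel-backend:4317.

# WHAT: Minimal OpenTelemetry Collector pipeline
# WHY:  Receive OTLP data, enrich it with Kubernetes metadata,
#       batch it, and export it to a backend
# NOTE: Sketch only; assumes the contrib Collector image (for
#       k8sattributes) and an OTLP backend at otel-backend:4317
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  # Adds Pod name, Namespace, Node info, and Labels to telemetry
  k8sattributes: {}
  # Groups data into batches before export
  batch: {}

exporters:
  otlp:
    endpoint: otel-backend:4317
    tls:
      insecure: true   # for the sketch only; enable TLS in real clusters

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp]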

What is the need for a Service Mesh? #


  • Service Mesh: A service mesh handles complex networking logic like retries, timeouts, and load balancing between microservices without modifying application code
  • Enable Secure Traffic Between Services: Automatically encrypt and authenticate service communication using mTLS, improving security within the cluster
  • Gain Deep Visibility into Traffic Flow: Provides observability features like metrics, logs, and distributed tracing to understand performance and troubleshoot issues
  • Support Progressive Delivery Techniques: Enables canary deployments, blue-green rollouts, and traffic mirroring for safer releases (see the traffic-splitting sketch after this list)
  • Improve Resilience and Fault Tolerance: Automatically retry failed requests, route around unhealthy instances, and isolate failures to minimize impact
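
As a sketch of the traffic-splitting side of progressive delivery, the example below uses Istio's VirtualService to send 10% of requests to a canary version. The host my-service, the v1/v2 subsets, and the 90/10 split are assumptions; a matching DestinationRule must define the subsets, and other meshes have their own equivalents.

# WHAT: Weighted traffic split between two versions (canary release)
# WHY:  Send a small share of traffic to the new version first
# NOTE: Istio-specific sketch; assumes a DestinationRule defines
#       the "v1" and "v2" subsets for the my-service host
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: v1
          weight: 90   # 90% of traffic stays on the stable version
        - destination:
            host: my-service
            subset: v2
          weight: 10   # 10% goes to the canary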

How does a Service Mesh Work? #


  • Inject Sidecar Proxies into Pods: Each service Pod runs alongside a sidecar proxy (like Envoy) that intercepts all incoming and outgoing traffic
  • Control Plane Configures the Mesh: A central control plane (e.g., Istio, Linkerd) manages policies, security settings, and traffic rules across proxies
  • Inter-Service Communication Goes Through Proxies: Instead of talking directly, services route requests through their sidecars, which handle encryption, retries, and load balancing
  • Observability Data Is Collected Automatically: Proxies capture metrics, logs, and traces without changes to application code, enabling rich observability
  • Security Is Enforced Transparently: Sidecars use mTLS to verify identities and encrypt data in transit between services
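
As a sketch of how transparently this security is enforced, the example below uses Istio's PeerAuthentication to require mTLS for all workloads in a namespace. The payments namespace is an assumption for illustration, and Linkerd and other meshes offer equivalent settings.

# WHAT: Require mTLS for all service-to-service traffic in a namespace
# WHY:  Sidecars then encrypt and authenticate traffic transparently
# NOTE: Istio-specific sketch; the "payments" namespace is illustrative
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT   # reject plain-text traffic between sidecars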

List a Few Observability Tools in Kubernetes #


Tool | Category | Description
---- | -------- | -----------
OpenTelemetry | Telemetry Instrumentation | Open standard to collect metrics, logs, and traces from apps and infrastructure for consistent observability
Istio | Service Mesh + Telemetry | Adds telemetry (metrics, logs, traces) by automatically injecting sidecars, and provides deep observability into service communication
Prometheus | Metrics | Open-source system that collects and stores time-series metrics to monitor resource usage, app health, and service performance
Kube-State-Metrics | Cluster State Metrics | Exposes Kubernetes object states (Deployments, Nodes, etc.) as metrics for Prometheus
Grafana | Visualization & Dashboards | Provides interactive dashboards by visualizing metrics, logs, and traces in real time for better decision-making
Jaeger | Distributed Tracing | Traces requests across microservices to identify latency bottlenecks and dependency issues
Grafana Tempo | Distributed Tracing | Scalable, minimal-ops distributed tracing backend designed to integrate deeply with Grafana
Elasticsearch | Log Storage & Search | Distributed search engine that indexes logs and enables quick querying and analysis of system/application logs
Kibana | Log Visualization | UI tool for Elasticsearch that lets teams explore and visualize logs and analytics data
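
Tying the Prometheus entry above back to Kubernetes, the sketch below shows one common way to tell Prometheus which Services to scrape when the Prometheus Operator is installed (it provides the ServiceMonitor CRD). The label selector, namespace, and port name are assumptions for the example.

# WHAT: Declarative scrape target for Prometheus (via the Operator)
# WHY:  Lets Prometheus discover application metrics endpoints
# NOTE: Sketch only; assumes the Prometheus Operator is installed and
#       a Service labeled app: my-app exposes a port named "metrics"
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics      # name of the Service port to scrape
      interval: 30s
      path: /metrics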

What is the need for Probes? #


  • Probes: Help Kubernetes know whether your application is working correctly and ready to serve traffic
  • Enable Smooth Rolling Updates: Probes ensure new Pods are healthy before terminating old ones, reducing downtime during deployments

Probe Type | Purpose | Use Case
---------- | ------- | --------
Startup Probe | Checks if the container application has started up | Used to delay liveness and readiness checks until the app has fully started
Liveness Probe | Checks if the container is alive and running properly | Used to restart the container if it has failed
Readiness Probe | Checks if the container is ready to serve requests (Sometimes, applications are temporarily unable to serve traffic. In such cases, you don't want to kill the application, but you don't want to send it requests either.) | Used to control traffic, ensuring only ready containers get traffic (If a readiness probe fails repeatedly, Kubernetes will NOT restart the container. However, the pod will not receive any traffic. As soon as the probe succeeds, the pod is marked ready again, and traffic resumes.)
apiVersion: v1
kind: Pod
metadata:
  name: app-with-probes
  labels:
    app: probe-demo
spec:
  containers:
    - name: my-app
      image: nginx
      ports:
        - containerPort: 80

      # WHAT: Startup Probe
      # WHY:  Allows app more time to start
      # WHEN: Blocks liveness and readiness checks until startup passes
      startupProbe:
        httpGet:
          path: /
          port: 80
        # Allow up to 30 consecutive failures (30 x 5s = 150s)
        # before the container is restarted
        failureThreshold: 30
        # Run every 5 seconds
        periodSeconds: 5

      # WHAT: Readiness Probe
      # WHY:  Checks if app is ready to receive traffic
      # WHEN: Only sends traffic after this passes
      readinessProbe:
        httpGet:
          path: /ready
          port: 80
        # Waits 5 seconds after container starts
        # before performing the first probe
        initialDelaySeconds: 5
        # How often to execute
        periodSeconds: 5

      # WHAT: Liveness Probe
      # WHY:  Checks if the app is alive
      # WHEN: If it fails repeatedly, the container is restarted
      livenessProbe:
        httpGet:
          path: /healthy
          port: 80
        # Wait 10 seconds after the container starts 
        # before performing the first health check
        initialDelaySeconds: 10
        # How often to execute
        periodSeconds: 10
        # Number of consecutive failures 
        # before marking unhealthy
        failureThreshold: 3

      # NOTE:
      # You can use `exec` instead of `httpGet`:
      #
      # livenessProbe:
      #   exec:
      #     command:
      #       - cat
      #       - /tmp/healthy
      #

What is the need for a DaemonSet? #


  • Deploy One Pod Per Node Automatically: A DaemonSet ensures that a copy of a Pod runs on every Node (or selected Nodes) in the cluster without manual intervention
  • Collect Logs and Metrics Consistently: Commonly used to run system and application observability pods on every node in the cluster to enable centralized observability
  • Enable Node-Level Monitoring and Security: Ideal for running node-level Pods such as security agents or storage drivers
  • Control Pod Placement Using Node Selectors or Tolerations: Can be configured to run only on specific node types (e.g., GPU nodes, tainted nodes)
# WHAT: Run a Pod on every Node in the cluster
# WHY:  For node-level tasks like monitoring/logging
# WHEN: You need to ensure the same Pod runs
#       on all current and future Nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-monitor
    # WHAT: Links the DaemonSet to Pods it manages

  template:
    metadata:
      labels:
        app: node-monitor
    spec:
      containers:
        - name: monitor-agent
          image: busybox
          command:
            - /bin/sh
            - -c
            - >
              while true; do
              echo "Monitoring node...";
              sleep 60;
              done
          # WHAT: Basic infinite loop for demo
          # WHY: Represents a monitoring process

# -----------------------------------------------
# VARIATION: Run only on specific Nodes
# Add nodeSelector to limit to node types
#
# spec:
#   template:
#     spec:
#       nodeSelector:
#         disktype: ssd
#
# WHY: Useful if only some nodes need this agent
# -----------------------------------------------
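# -----------------------------------------------
# VARIATION: Also run on tainted Nodes (sketch)
# DaemonSet Pods are skipped on tainted Nodes unless
# they tolerate the taint; the control-plane taint
# below is a common default, adjust to your cluster
#
# spec:
#   template:
#     spec:
#       tolerations:
#         - key: node-role.kubernetes.io/control-plane
#           operator: Exists
#           effect: NoSchedule
# -----------------------------------------------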