What is Monitoring? #


  • Monitoring: Detects when a Pod, Node, or application crashes or becomes unresponsive, enabling faster recovery
  • Identify and Resolve Issues Quickly: Real-time alerts help detect anomalies like memory leaks, traffic spikes, or API failures before users are affected
  • Track Resource Usage Over Time: Tools like Prometheus collect CPU, memory, and disk usage data to optimize resource allocation and reduce cost
  • Visualize Cluster Performance Trends: Dashboards (e.g., Grafana) provide easy-to-understand visuals for cluster-wide behavior and app performance
  • Support Auto-Scaling Decisions: Horizontal Pod Autoscalers (HPA) rely on metrics (like CPU usage) to decide when to add or remove Pods (see the example after this list)
  • Aid in Capacity Planning: Historical monitoring data helps plan for future growth and avoid bottlenecks
  • Strengthen Security Posture: Alerting on unusual patterns (e.g., sudden spike in traffic) can surface potential security breaches or attacks
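
To make the auto-scaling point concrete, here is a minimal HorizontalPodAutoscaler sketch. The Deployment name my-app, the replica range, and the 70% CPU target are assumptions for illustration, and a metrics source such as metrics-server must be running for the HPA to receive CPU data.

# WHAT: Scale a Deployment up/down based on CPU usage
# WHY:  Shows how monitoring metrics drive scaling decisions
# NOTE: Sketch only; assumes a Deployment named "my-app" exists
#       and a metrics source (e.g., metrics-server) is installed
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # Add Pods when average CPU utilization exceeds 70%
          averageUtilization: 70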

What is Observability? #


  • Observability: Lets you ask questions about your system’s current state without having to pre-define every scenario
  • Go Beyond Monitoring: While monitoring alerts on known issues, observability helps investigate unknown or unexpected problems
  • Correlate Metrics, Logs, and Traces: Combines performance data (metrics), detailed events (logs), and request flows (traces) to give a full picture of what’s happening
  • Diagnose Complex Failures Faster: Helps identify root causes across distributed systems by following request paths and spotting anomalies
  • Improve Reliability and User Experience: Enables teams to detect, debug, and resolve issues before users are impacted
  • Enable Proactive Operations: Allows engineers to explore trends and detect degraded performance early—even before alerts are triggered
  • Essential for Microservices and Cloud-Native Apps: With many moving parts, observability is critical for understanding interactions and performance across services

Monitoring vs Observability #


Aspect | Monitoring | Observability
------ | ---------- | -------------
Definition | Tracking known metrics and events against predefined thresholds | Ability to ask arbitrary questions about system state using metrics, logs, and traces
Goal | Detect and alert on known problems or symptoms | Understand why something went wrong and explore the unknown
Data Sources | Primarily metrics (CPU, memory, request rate, error rate) | Metrics + Logs + Traces (The Three Pillars of Observability)
Focus | Alert on predefined conditions (e.g., CPU > 80%) | Explore, correlate, and investigate unpredictable behavior across distributed systems
Scope | Point-in-time status of specific components or metrics | End-to-end system state across multiple layers, services, and time
Mindset | Known-knowns: What we expect to happen and want to monitor | Unknown-unknowns: What we don’t know yet, but want to uncover with flexible queries and correlation
Example | Alert me if memory usage > 90% on node X | Why did service A’s latency spike only when talking to service B last Thursday?
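
To make the Example row concrete on the monitoring side, the sketch below expresses a "memory usage > 90%" alert as a PrometheusRule. It assumes the Prometheus Operator (which provides the PrometheusRule CRD) and node-exporter metrics are available; the rule name, namespace, and threshold are illustrative.

# WHAT: A predefined threshold alert (the "monitoring" mindset)
# WHY:  Fires when node memory usage crosses a known limit
# NOTE: Sketch only; assumes the Prometheus Operator and
#       node-exporter metrics are installed in the cluster
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-memory-alert
  namespace: monitoring
spec:
  groups:
    - name: node-alerts
      rules:
        - alert: NodeMemoryHigh
          # Memory usage above 90% for 5 minutes
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Memory usage > 90% on {{ $labels.instance }}"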

What is OpenTelemetry? #


  • Understanding the Need for Consistent Observability Standards: Modern apps span multiple services, languages, and environments, making it hard to trace performance, debug issues, and understand system behavior using isolated tools
  • What is OpenTelemetry: An open-source, vendor-neutral standard for collecting metrics, logs, and traces from applications
  • Collects Telemetry Data in One Format: Unifies different types of data across services, so you don't have to use multiple tools or agents for different data types
  • Supports Multiple Programming Languages: Enables consistent observability whether your microservices are in Java, Python, Go, or Node.js
  • Sends Data to Any Backend: Export telemetry to systems like Prometheus, Jaeger, Grafana, Datadog, or Elastic without vendor lock-in
  • Enables End-to-End Visibility: Helps correlate latency spikes, failed requests, or memory usage with specific services or traces across the entire system
  • Simplifies Debugging and Root-Cause Analysis: Makes it easier to find where and why failures happen—even in complex distributed environments
  • Backed by CNCF and Major Cloud Providers: Ensures it’s a trusted, long-term standard with active community and enterprise support

How well is OpenTelemetry supported in Kubernetes? #


  • Designed for Cloud-Native Environments: OpenTelemetry is designed to work seamlessly with platforms like Kubernetes
  • Supports All Telemetry Types: Collects metrics, logs, and traces from Pods, Nodes, and Services
  • Kubernetes Metadata Enrichment: Automatically adds Pod names, Namespaces, Node info, and Labels to telemetry data, making debugging easier
  • Collector Works as a DaemonSet or Sidecar: The OpenTelemetry Collector can run as:
    • A Sidecar to collect application-specific telemetry
    • A DaemonSet to collect node-level and container-level metrics
  • Supports Export Transformations: The Collector allows you to filter, batch, transform, and export data to one or more backends (a minimal configuration sketch follows this list)
  • Integration with Popular Tools: Easily integrates with Prometheus, Grafana, Jaeger, Zipkin, Datadog, Elastic, etc., for backend analysis
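
As a sketch of the filter/batch/export pipeline mentioned above, here is a minimal Collector configuration. It assumes the contrib distribution of the Collector (which ships the k8sattributes processor) and an OTLP-capable backend reachable at the illustrative address otel-backend:4317.

# WHAT: Minimal OpenTelemetry Collector pipeline
# WHY:  Receive OTLP data, enrich it with Kubernetes metadata,
#       batch it, and export it to a backend
# NOTE: Sketch only; assumes the contrib Collector image (for
#       k8sattributes) and an OTLP backend at otel-backend:4317
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  # Adds Pod name, Namespace, Node info, and Labels to telemetry
  k8sattributes: {}
  # Groups data into batches before export
  batch: {}

exporters:
  otlp:
    endpoint: otel-backend:4317
    tls:
      insecure: true   # for the sketch only; enable TLS in real clusters

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp]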

What is the need for a Service Mesh? #


  • Service Mesh: A service mesh handles complex networking logic like retries, timeouts, and load balancing between microservices without modifying application code
  • Enable Secure Traffic Between Services: Automatically encrypt and authenticate service communication using mTLS, improving security within the cluster
  • Gain Deep Visibility into Traffic Flow: Provides observability features like metrics, logs, and distributed tracing to understand performance and troubleshoot issues
  • Support Progressive Delivery Techniques: Enables canary deployments, blue-green rollouts, and traffic mirroring for safer releases (see the traffic-splitting sketch after this list)
  • Improve Resilience and Fault Tolerance: Automatically retry failed requests, route around unhealthy instances, and isolate failures to minimize impact
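
As a sketch of the traffic-splitting side of progressive delivery, the example below uses Istio's VirtualService to send 10% of requests to a canary version. The host my-service, the v1/v2 subsets, and the 90/10 split are assumptions; a matching DestinationRule must define the subsets, and other meshes have their own equivalents.

# WHAT: Weighted traffic split between two versions (canary release)
# WHY:  Send a small share of traffic to the new version first
# NOTE: Istio-specific sketch; assumes a DestinationRule defines
#       the "v1" and "v2" subsets for the my-service host
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: v1
          weight: 90   # 90% of traffic stays on the stable version
        - destination:
            host: my-service
            subset: v2
          weight: 10   # 10% goes to the canary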

How does a Service Mesh Work? #


  • Inject Sidecar Proxies into Pods: Each service Pod runs alongside a sidecar proxy (like Envoy) that intercepts all incoming and outgoing traffic
  • Control Plane Configures the Mesh: A central control plane (e.g., Istio, Linkerd) manages policies, security settings, and traffic rules across proxies
  • Inter-Service Communication Goes Through Proxies: Instead of talking directly, services route requests through their sidecars, which handle encryption, retries, and load balancing
  • Observability Data Is Collected Automatically: Proxies capture metrics, logs, and traces without changes to application code, enabling rich observability
  • Security Is Enforced Transparently: Sidecars use mTLS to verify identities and encrypt data in transit between services
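
As a sketch of how transparently this security is enforced, the example below uses Istio's PeerAuthentication to require mTLS for all workloads in a namespace. The payments namespace is an assumption for illustration, and Linkerd and other meshes offer equivalent settings.

# WHAT: Require mTLS for all service-to-service traffic in a namespace
# WHY:  Sidecars then encrypt and authenticate traffic transparently
# NOTE: Istio-specific sketch; the "payments" namespace is illustrative
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT   # reject plain-text traffic between sidecars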

List a Few Observability Tools in Kubernetes #


Tool | Category | Description
---- | -------- | -----------
OpenTelemetry | Telemetry Instrumentation | Open standard to collect metrics, logs, and traces from apps and infrastructure for consistent observability
Istio | Service Mesh + Telemetry | Adds telemetry (metrics, logs, traces) by automatically injecting sidecars, and provides deep observability into service communication
Prometheus | Metrics | Open-source system that collects and stores time-series metrics to monitor resource usage, app health, and service performance
Kube-State-Metrics | Cluster State Metrics | Exposes Kubernetes object states (Deployments, Nodes, etc.) as metrics for Prometheus
Grafana | Visualization & Dashboards | Provides interactive dashboards by visualizing metrics, logs, and traces in real time for better decision-making
Jaeger | Distributed Tracing | Traces requests across microservices to identify latency bottlenecks and dependency issues
Grafana Tempo | Distributed Tracing | Scalable, minimal-ops distributed tracing backend designed to integrate deeply with Grafana
Elasticsearch | Log Storage & Search | Distributed search engine that indexes logs and enables quick querying and analysis of system/application logs
Kibana | Log Visualization | UI tool for Elasticsearch that lets teams explore and visualize logs and analytics data
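
Tying the Prometheus entry above back to Kubernetes, the sketch below shows one common way to tell Prometheus which Services to scrape when the Prometheus Operator is installed (it provides the ServiceMonitor CRD). The label selector, namespace, and port name are assumptions for the example.

# WHAT: Declarative scrape target for Prometheus (via the Operator)
# WHY:  Lets Prometheus discover application metrics endpoints
# NOTE: Sketch only; assumes the Prometheus Operator is installed and
#       a Service labeled app: my-app exposes a port named "metrics"
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics      # name of the Service port to scrape
      interval: 30s
      path: /metrics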

What is the need for Probes? #


  • Probes: Help Kubernetes know whether your application is working correctly and ready to serve traffic
  • Enable Smooth Rolling Updates: Probes ensure new Pods are healthy before terminating old ones, reducing downtime during deployments

Probe Type | Purpose | Use Case
---------- | ------- | --------
Startup Probe | Checks if the container application has started up | Used to delay liveness and readiness checks until the app has fully started
Liveness Probe | Checks if the container is alive and running properly | Used to restart the container if it has failed
Readiness Probe | Checks if the container is ready to serve requests (Sometimes, applications are temporarily unable to serve traffic. In such cases, you don't want to kill the application, but you don't want to send it requests either.) | Used to control traffic, ensuring only ready containers get traffic (If a readiness probe fails repeatedly, Kubernetes will NOT restart the container. However, the pod will not receive any traffic. As soon as the probe succeeds, the pod is marked ready again, and traffic resumes.)
apiVersion: v1
kind: Pod
metadata:
  name: app-with-probes
  labels:
    app: probe-demo
spec:
  containers:
    - name: my-app
      image: nginx
      ports:
        - containerPort: 80

      # WHAT: Startup Probe
      # WHY:  Allows app more time to start
      # WHEN: Blocks liveness and readiness checks until startup passes
      startupProbe:
        httpGet:
          path: /
          port: 80
        # Allow up to 30 consecutive failures (30 x 5s = 150s)
        # before the container is restarted
        failureThreshold: 30
        # Run every 5 seconds
        periodSeconds: 5

      # WHAT: Readiness Probe
      # WHY:  Checks if app is ready to receive traffic
      # WHEN: Only sends traffic after this passes
      readinessProbe:
        httpGet:
          path: /ready
          port: 80
        # Waits 5 seconds after container starts
        # before performing the first probe
        initialDelaySeconds: 5
        # How often to execute
        periodSeconds: 5

      # WHAT: Liveness Probe
      # WHY:  Checks if the app is alive
      # WHEN: If it fails repeatedly, the container is restarted
      livenessProbe:
        httpGet:
          path: /healthy
          port: 80
        # Wait 10 seconds after the container starts 
        # before performing the first health check
        initialDelaySeconds: 10
        # How often to execute
        periodSeconds: 10
        # Number of consecutive failures 
        # before marking unhealthy
        failureThreshold: 3

      # NOTE:
      # You can use `exec` instead of `httpGet`:
      #
      # livenessProbe:
      #   exec:
      #     command:
      #       - cat
      #       - /tmp/healthy
      #

What is the need for a DaemonSet? #


  • Deploy One Pod Per Node Automatically: A DaemonSet ensures that a copy of a Pod runs on every Node (or selected Nodes) in the cluster without manual intervention
  • Collect Logs and Metrics Consistently: Commonly used to run system and application observability pods on every node in the cluster to enable centralized observability
  • Enable Node-Level Monitoring and Security: Ideal for running node-level Pods such as security agents or storage drivers
  • Control Pod Placement Using Node Selectors or Tolerations: Can be configured to run only on specific node types (e.g., GPU nodes, tainted nodes)
# WHAT: Run a Pod on every Node in the cluster
# WHY:  For node-level tasks like monitoring/logging
# WHEN: You need to ensure the same Pod runs
#       on all current and future Nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-monitor
    # WHAT: Links the DaemonSet to Pods it manages

  template:
    metadata:
      labels:
        app: node-monitor
    spec:
      containers:
        - name: monitor-agent
          image: busybox
          command:
            - /bin/sh
            - -c
            - >
              while true; do
              echo "Monitoring node...";
              sleep 60;
              done
          # WHAT: Basic infinite loop for demo
          # WHY: Represents a monitoring process

# -----------------------------------------------
# VARIATION: Run only on specific Nodes
# Add nodeSelector to limit to node types
#
# spec:
#   template:
#     spec:
#       nodeSelector:
#         disktype: ssd
#
# WHY: Useful if only some nodes need this agent
# -----------------------------------------------
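# -----------------------------------------------
# VARIATION: Also run on tainted Nodes (sketch)
# DaemonSet Pods are skipped on tainted Nodes unless
# they tolerate the taint; the control-plane taint
# below is a common default, adjust to your cluster
#
# spec:
#   template:
#     spec:
#       tolerations:
#         - key: node-role.kubernetes.io/control-plane
#           operator: Exists
#           effect: NoSchedule
# -----------------------------------------------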