Kubernetes Series - Part 7: Observability
4 min read
Series Navigation
- Part 1: Core Fundamentals
- Part 2: Workload Management
- Part 3: Networking Essentials
- Part 4: Storage and Persistence
- Part 5: Configuration and Secrets
- Part 6: Security and Access Control
- Part 7: Observability (Current)
- Part 8: Advanced Patterns
- Part 9: Production Best Practices
Introduction
After managing large-scale Kubernetes clusters, I’ve learned that proper observability is crucial for maintaining reliable systems. In this article, I’ll share practical insights from implementing monitoring, logging, and tracing solutions in production environments.
Monitoring with Prometheus
Here’s our production Prometheus configuration:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: platform
  resources:
    requests:
      memory: 400Mi
    limits:
      memory: 2Gi
  retention: 15d
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
```
Service Monitoring
How we monitor services:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  labels:
    team: platform
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
    - port: metrics
      interval: 30s
      path: /metrics/detailed
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'http_requests_total'
          action: keep
```
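For this ServiceMonitor to find anything, each application pod has to actually expose Prometheus metrics on the named metrics port. Here's a minimal sketch using the official Go client, prometheus/client_golang; the port number is an assumption standing in for whatever the Service maps its metrics port to:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default registry (Go runtime and process metrics, plus
	// anything the application registers) at the path the ServiceMonitor scrapes.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil)) // assumed "metrics" container port
}
```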
Alert Configuration
Our critical alerts setup:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: critical-alerts
spec:
  groups:
    - name: node
      rules:
        - alert: HighCPUUsage
          expr: instance:node_cpu_utilisation:rate5m > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            description: "CPU usage above 80% for 5 minutes"
        - alert: MemoryExhausted
          expr: instance:node_memory_utilisation:ratio > 0.95
          for: 5m
          labels:
            severity: critical
          annotations:
            description: "Memory usage above 95%"
```
Logging with EFK Stack
Our Fluent Bit configuration:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush             1
        Log_Level         info
        Parsers_File      parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        K8S-Logging.Parser  On

    [OUTPUT]
        Name             es
        Match            *
        Host             elasticsearch-master
        Port             9200
        Index            kubernetes_cluster
        Type             _doc
        HTTP_User        ${ES_USER}
        HTTP_Passwd      ${ES_PASSWORD}
        Logstash_Format  On
        Replace_Dots     On
        Retry_Limit      False
```
Log Aggregation Tips
- Collection Strategy
  - Use node-level DaemonSets
  - Implement proper buffering
  - Handle multiline logs
- Performance Tuning
  - Set appropriate buffer limits
  - Configure retention policies
  - Monitor resource usage
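One detail worth calling out: the Merge_Log option in the Kubernetes filter above only turns log lines into structured, searchable fields if the application writes JSON to stdout. A minimal sketch using Go's standard log/slog package; the field names are purely illustrative:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON lines on stdout are picked up by the tail input, and with
	// Merge_Log On the fields become individually searchable in Elasticsearch.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logger.Info("order processed",
		slog.String("order_id", "A-1042"), // illustrative fields
		slog.Int("items", 3),
	)
}
```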
Distributed Tracing
Our Jaeger configuration:
```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
        username: ${ES_USER}
        password: ${ES_PASSWORD}
  ingress:
    enabled: true
    hosts:
      - jaeger.example.com
  agent:
    strategy: DaemonSet
```
Application Instrumentation
Example of tracing in Go:
```go
package main

import (
	"context"
	"log"

	"github.com/opentracing/opentracing-go"
	"github.com/uber/jaeger-client-go"
	"github.com/uber/jaeger-client-go/config"
)

func main() {
	// Initialize the tracer; ServiceName is required by the client
	// (the original snippet omitted it).
	cfg := config.Configuration{
		ServiceName: "myapp",
		Sampler: &config.SamplerConfig{
			Type:  jaeger.SamplerTypeConst,
			Param: 1,
		},
		Reporter: &config.ReporterConfig{
			LogSpans:           true,
			LocalAgentHostPort: "jaeger-agent:6831",
		},
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Fatalf("failed to initialize tracer: %v", err)
	}
	defer closer.Close()
	opentracing.SetGlobalTracer(tracer)

	// Create a root span for this operation
	span := tracer.StartSpan("operation_name")
	defer span.Finish()

	ctx := opentracing.ContextWithSpan(context.Background(), span)
	_ = ctx // pass ctx into downstream calls
	// Your application code here
}
```
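To carry the trace into downstream work, each function should take the context and start a child span from it. A small sketch that would live in the same package as the snippet above; the function name and tag are illustrative:

```go
// processOrder continues the trace started in main by creating a child span
// from whatever span is carried in ctx.
func processOrder(ctx context.Context) {
	span, ctx := opentracing.StartSpanFromContext(ctx, "process_order")
	defer span.Finish()

	span.SetTag("order.id", "A-1042") // illustrative tag

	// Pass ctx on to further calls so their spans nest under this one.
	_ = ctx
}
```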
Metrics Collection
Custom metrics with the Prometheus client:
```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(requestDuration)
}
```
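Registering the metrics is only half the job; the handlers have to update them. Below is a sketch of HTTP middleware, in the same package as the variables above, that counts requests and records latency. It additionally needs the net/http, strconv, and time imports, and the status-capturing wrapper is a simplification:

```go
// instrument wraps an http.Handler and records one count and one latency
// observation per request using the metrics registered above.
func instrument(endpoint string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

		next.ServeHTTP(rec, r)

		httpRequestsTotal.
			WithLabelValues(r.Method, endpoint, strconv.Itoa(rec.status)).Inc()
		requestDuration.
			WithLabelValues(r.Method, endpoint).Observe(time.Since(start).Seconds())
	})
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}
```

Note that http_requests_total is also the series the metricRelabelings rule in the ServiceMonitor earlier filters on.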
Grafana Dashboards
Our production dashboard configuration:
```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: kubernetes-cluster
spec:
  json: |
    {
      "annotations": { "list": [] },
      "editable": true,
      "panels": [
        {
          "title": "CPU Usage",
          "type": "graph",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (pod)",
              "legendFormat": "{{pod}}"
            }
          ]
        },
        {
          "title": "Memory Usage",
          "type": "graph",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(container_memory_usage_bytes{container!=\"\"}) by (pod)",
              "legendFormat": "{{pod}}"
            }
          ]
        }
      ]
    }
```
Common Observability Issues
From my experience, here are frequent problems and solutions:
- Data Volume
  - Implement proper sampling (see the sketch after this list)
  - Set retention policies
  - Use appropriate storage
- Performance Impact
  - Monitor collector overhead
  - Optimize collection intervals
  - Use efficient exporters
- Alert Fatigue
  - Define meaningful thresholds
  - Implement proper grouping
  - Review alerts regularly
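On the sampling point above: the Go instrumentation example earlier uses a constant sampler with Param: 1, which reports every single trace and rarely scales to production traffic. A sketch of the change to probabilistic sampling; the 10% rate is an illustrative value, not a recommendation:

```go
// In the tracer configuration from the instrumentation example, swap the
// constant sampler for a probabilistic one so only a fraction of traces is reported.
Sampler: &config.SamplerConfig{
	Type:  jaeger.SamplerTypeProbabilistic,
	Param: 0.1, // keep roughly 10% of traces; tune for your traffic volume
},
```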
Production Checklist
✅ Monitoring
- Metrics collection
- Alert configuration
- Dashboard setup
- Resource monitoring
✅ Logging
- Log aggregation
- Search capabilities
- Retention policies
- Access controls
✅ Tracing
- Service instrumentation
- Sampling configuration
- Trace correlation
- Performance analysis
✅ Alerting
- Alert rules
- Notification channels
- On-call rotation
- Escalation policies
Real-world Example
Complete observability stack deployment:
```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: platform
  resources:
    requests:
      memory: 400Mi
    limits:
      memory: 2Gi
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
spec:
  selector:            # selector and matching pod labels are required
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:1.9
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config
---
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
```
Conclusion
Proper observability is essential for running reliable Kubernetes applications. Key takeaways from my experience:
- Implement comprehensive monitoring
- Collect meaningful metrics
- Configure appropriate alerts
- Use distributed tracing
- Audit the system regularly
In the next part, we’ll explore advanced patterns in Kubernetes, where I’ll share practical tips for implementing complex deployment strategies and architectural patterns.