Kubernetes Troubleshooting Guide
Understanding Kubernetes Troubleshooting
Effective Kubernetes troubleshooting requires an understanding of the cluster architecture, its components, and their common failure points. This guide covers systematic approaches to identifying and resolving issues.
Cluster Health
1. Node Status
    # Check node status
    kubectl get nodes
    kubectl describe node <node-name>

    # Check node conditions (quoted so the shell does not interpret the JSONPath filter)
    kubectl get nodes -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[?(@.type=="Ready")].status'
2. Control Plane Components
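Control plane pods that keep restarting are often the first visible symptom of a deeper problem. A minimal sketch for spotting them is shown here; the function name `restarted_pods` is hypothetical, and the parsing assumes the default column order of `kubectl get pods` (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# List pods in a namespace (default: kube-system) that have restarted at least once.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
restarted_pods() {
  ns="${1:-kube-system}"
  kubectl get pods -n "$ns" --no-headers |
    awk '$4 + 0 > 0 { print $1 " restarts=" $4 }'
}
```

Calling `restarted_pods` with no argument inspects `kube-system`; pass another namespace name to check workloads elsewhere.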
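    # Check control plane pods
    kubectl get pods -n kube-system
    kubectl describe pod <pod-name> -n kube-system

    # Check component status (componentstatuses is deprecated since v1.19; /readyz is the supported health endpoint)
    kubectl get componentstatuses
    kubectl get --raw='/readyz?verbose'
Pod Issues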
1. Pod Lifecycle
    apiVersion: v1
    kind: Pod
    metadata:
      name: debug-pod
    spec:
      containers:
      - name: main
        image: busybox
        command: ['sh', '-c', 'echo "Started"; sleep 3600']
      initContainers:
      - name: init
        image: busybox
        command: ['sh', '-c', 'echo "Init complete"']
2. Common Pod States
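The STATUS and READY columns of `kubectl get pods` are usually the quickest signal of trouble: states such as `Pending`, `CrashLoopBackOff`, or `ImagePullBackOff` stand out, and a `Running` pod with fewer ready containers than expected deserves a closer look. The sketch below filters for those cases across all namespaces; `unhealthy_pods` is a hypothetical helper name, and the parsing assumes the default column layout.

```shell
# Print pods that are neither Completed nor fully ready and Running.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  kubectl get pods --all-namespaces --no-headers |
    awk '{
      split($3, ready, "/")            # READY looks like "1/2"
      if ($4 == "Completed") next      # finished Jobs are not a problem
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2 ": " $4 " (" $3 " ready)"
    }'
}
```

Each line of output names a pod worth passing to `kubectl describe pod`.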
    # Check pod status
    kubectl get pods -o wide
    kubectl describe pod <pod-name>

    # Check pod logs
    kubectl logs <pod-name> [-c container-name]
    kubectl logs <pod-name> -p  # Previous container logs

    # Debug with ephemeral container
    kubectl debug -it <pod-name> --image=busybox --target=<container-name>
Networking Issues
1. Service Connectivity
    apiVersion: v1
    kind: Service
    metadata:
      name: debug-service
    spec:
      selector:
        app: debug
      ports:
      - protocol: TCP
        port: 80
        targetPort: 8080
2. Network Policy Debugging
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: debug-policy
    spec:
      podSelector:
        matchLabels:
          app: debug
      policyTypes:
      - Ingress
      - Egress  # no egress rules follow, so all outbound traffic from matching pods (including DNS) is denied
      ingress:
      - from:
        - podSelector:
            matchLabels:
              role: frontend
        ports:
        - protocol: TCP
          port: 80
Storage Issues
1. PersistentVolume Problems
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: debug-pv
    spec:
      capacity:
        storage: 1Gi
      accessModes:
      - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      storageClassName: standard
      hostPath:
        path: /tmp/data
2. Storage Class Issues
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: debug-storage
    provisioner: kubernetes.io/aws-ebs  # legacy in-tree provisioner; clusters running the AWS EBS CSI driver use ebs.csi.aws.com
    parameters:
      type: gp2
    reclaimPolicy: Retain
    allowVolumeExpansion: true
Resource Management
1. Resource Constraints
    apiVersion: v1
    kind: Pod
    metadata:
      name: resource-debug
    spec:
      containers:
      - name: app
        image: nginx
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
2. Resource Quotas
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota
    spec:
      hard:
        requests.cpu: "4"
        requests.memory: 4Gi
        limits.cpu: "8"
        limits.memory: 8Gi
Security Issues
1. RBAC Debugging
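Before editing roles, it can help to ask the API server directly whether a subject is allowed to perform an action. The sketch below wraps `kubectl auth can-i` with impersonation of a service account; the wrapper name `sa_can` and its argument order are assumptions made for illustration, and impersonation itself requires that your own user is permitted to impersonate.

```shell
# Ask the API server whether a service account may perform an action.
# Usage: sa_can <namespace> <serviceaccount> <verb> <resource>
# Prints "yes" or "no"; the exit status is non-zero when the answer is "no".
sa_can() {
  kubectl auth can-i "$3" "$4" \
    --namespace "$1" \
    --as "system:serviceaccount:$1:$2"
}
```

For example, `sa_can default default list pods` checks the `default` service account in the `default` namespace, which is the subject used in the RoleBinding in this section.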
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: pod-reader
    rules:
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["get", "list", "watch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: read-pods
    subjects:
    - kind: ServiceAccount
      name: default
      namespace: default  # required for ServiceAccount subjects; adjust to the account's namespace
    roleRef:
      kind: Role
      name: pod-reader
      apiGroup: rbac.authorization.k8s.io
2. Security Context
    apiVersion: v1
    kind: Pod
    metadata:
      name: security-debug
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000
      containers:
      - name: sec-ctx-demo
        image: busybox
        command: [ "sh", "-c", "sleep 1h" ]
        securityContext:
          allowPrivilegeEscalation: false
Logging and Monitoring
1. Logging Configuration
    apiVersion: v1
    kind: Pod
    metadata:
      name: logging-pod
    spec:
      containers:
      - name: counter
        image: busybox
        args: [/bin/sh, -c, 'i=0; while true; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
2. Prometheus Monitoring
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: app-monitor
    spec:
      selector:
        matchLabels:
          app: myapp
      endpoints:
      - port: metrics
Performance Issues
1. CPU Profiling
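    # Kubelet CPU profile through the API server proxy (this path is the kubelet's pprof handler, not cAdvisor)
    kubectl proxy &
    curl http://localhost:8001/api/v1/nodes/<node-name>/proxy/debug/pprof/profile > cpu.profile

    # Using custom profiling (assumes the application serves pprof on port 6060 and the image includes curl)
    kubectl exec <pod-name> -- curl http://localhost:6060/debug/pprof/profile > cpu.profile
2. Memory Analysis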
    apiVersion: v1
    kind: Pod
    metadata:
      name: memory-debug
    spec:
      containers:
      - name: memory-hog
        image: k8s.gcr.io/stress:v1  # k8s.gcr.io is frozen; current images are served from registry.k8s.io
        args:
        - -mem-total
        - "1024Mi"
        - -mem-alloc-size
        - "100Mi"
        - -mem-alloc-sleep
        - "1s"
Debugging Tools
1. Debug Container
    apiVersion: v1
    kind: Pod
    metadata:
      name: debug-tools
    spec:
      containers:
      - name: debug
        image: nicolaka/netshoot
        command:
        - sleep
        - "3600"
2. Network Debugging
    # DNS resolution test
    kubectl run test-dns --image=busybox:1.28 --rm -it -- nslookup kubernetes.default

    # Port forward for debugging
    kubectl port-forward service/my-service 8080:80

    # Network policy testing
    kubectl run test-netpol --image=busybox --rm -it -- wget -qO- http://service-name
Common Troubleshooting Patterns
1. Pod Crash Loop
    # Check pod status
    kubectl get pod <pod-name> -o yaml
    kubectl logs <pod-name> --previous
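In a crash loop, the most useful fields in the pod status are the reason and exit code of the last terminated container. The sketch below extracts them with a JSONPath query and maps a few common exit codes to likely causes. Both function names are hypothetical, and the interpretations are rules of thumb rather than a diagnosis; the one firm rule is that codes above 128 mean the process was killed by signal number (code - 128).

```shell
# Print "<container> <reason> <exit code>" (tab-separated) for each
# container's last termination in the given pod.
last_termination() {
  kubectl get pod "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Translate an exit code into a likely cause.
explain_exit_code() {
  code="$1"
  case "$code" in
    "")  echo "no previous termination recorded" ;;
    0)   echo "exited cleanly; check whether the command is expected to keep running" ;;
    137) echo "SIGKILL (9); commonly the kernel OOM killer (reason OOMKilled)" ;;
    143) echo "SIGTERM (15); the container was asked to stop" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error; inspect the previous container logs"
      fi ;;
  esac
}
```

Run `last_termination <pod-name>` first, then pass the reported exit code to `explain_exit_code` to decide whether to look at memory limits, shutdown handling, or the application's own logs.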
    # Check events
    kubectl get events --sort-by=.metadata.creationTimestamp
2. Service Discovery Issues
    # Debug DNS
    kubectl run dns-test --image=busybox:1.28 --rm -it -- nslookup kubernetes.default

    # Check endpoints
    kubectl get endpoints <service-name>
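An empty endpoints list usually means the Service selector matches no ready pods, which explains why DNS can resolve while connections still fail. A small sketch that turns the endpoints check into a pass/fail test follows; `service_endpoints` is a hypothetical helper name.

```shell
# Print the pod IPs backing a Service, or fail if there are none.
# Returns 1 when the endpoints list is empty, 2 when kubectl itself fails.
service_endpoints() {
  ips="$(kubectl get endpoints "$1" -o jsonpath='{.subsets[*].addresses[*].ip}')" || return 2
  if [ -z "$ips" ]; then
    echo "no ready endpoints for $1: compare the Service selector with pod labels" >&2
    return 1
  fi
  echo "$ips"
}
```

If the function fails, comparing `kubectl get service <service-name> -o yaml` with `kubectl get pods --show-labels` is a reasonable next step.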
    # Test service connectivity
    kubectl run curl --image=curlimages/curl --rm -it -- curl http://service-name
Best Practices
- Use Descriptive Pod Names
- Implement Proper Logging
- Set Resource Limits
- Use Liveness and Readiness Probes
- Implement Monitoring
- Audit RBAC Regularly
- Keep Documentation Updated
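Some of these practices can be checked mechanically. As one illustration for the resource limits item, the sketch below lists pods in a namespace where no container declares limits. The helper name `pods_without_limits` is hypothetical, and the check is deliberately coarse: a pod in which only some containers set limits is not flagged.

```shell
# List pods in a namespace (default: default) where no container sets resource limits.
pods_without_limits() {
  ns="${1:-default}"
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits}{"\n"}{end}' |
    awk -F '\t' '$1 != "" && $2 == "" { print $1 }'
}
```

Running it periodically, or in CI against a staging namespace, gives an early warning before unbounded pods cause node pressure.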
Conclusion
Effective troubleshooting in Kubernetes requires a systematic approach and an understanding of how the cluster's components interact. This guide provides a foundation for identifying and resolving common issues in Kubernetes clusters.