Kubernetes Backup and Recovery
kubernetes , k8s , backup , recovery , disaster-recovery , devops , cloud-native , containers , series:kubernetes:19
Understanding Kubernetes Backup and Recovery
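Kubernetes backup and recovery means protecting both cluster state (the API objects stored in etcd) and application data (the contents of persistent volumes) so that workloads can be restored after a failure, as part of business continuity and disaster recovery (BCDR).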
Cluster State Backup
1. etcd Backup
# Back up etcd
ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
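For recurring backups, the same command can be wrapped so that each snapshot gets a timestamped name and older files are pruned. The following is a minimal sketch, not a definitive implementation: the function names, the `BACKUP_DIR` default of `/var/backups/etcd`, and the `KEEP` count of 7 are assumptions made for illustration, while the endpoint and certificate paths are the ones used above.

```shell
#!/bin/sh
# Hypothetical defaults; adjust to your environment.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"
KEEP="${KEEP:-7}"

# Save a snapshot named etcd-YYYYmmdd-HHMMSS.db and print its path.
snapshot_etcd() {
  mkdir -p "$BACKUP_DIR" || return 1
  file="$BACKUP_DIR/etcd-$(date +%Y%m%d-%H%M%S).db"
  # etcdctl's own output goes to stderr so that stdout carries only the path.
  ETCDCTL_API=3 etcdctl snapshot save "$file" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key >&2 || return 1
  echo "$file"
}

# Delete the oldest snapshots, keeping the newest $KEEP files.
prune_snapshots() {
  total=$(ls -1 "$BACKUP_DIR"/etcd-*.db 2>/dev/null | wc -l)
  excess=$(( $total - $KEEP ))
  [ "$excess" -gt 0 ] || return 0
  # The timestamp format sorts chronologically, so the first lines are the oldest.
  ls -1 "$BACKUP_DIR"/etcd-*.db | sort | head -n "$excess" | while read -r old; do
    rm -f -- "$old"
  done
}
```

Calling `snapshot_etcd && prune_snapshots` from a cron entry or systemd timer on a control plane node would give a rolling window of local snapshots; copying them off the node is still needed for real disaster recovery.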
# Verify the backup
ETCDCTL_API=3 etcdctl snapshot status snapshot.db

2. Using Velero
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: daily-backup
  namespace: velero
spec:
  includedNamespaces:
    - "*"
  excludedNamespaces:
    - kube-system
  storageLocation: default
  volumeSnapshotLocations:
    - default
  # A Backup is a one-off object and has no schedule field;
  # a recurring run such as "0 1 * * *" belongs in a Schedule resource.
  ttl: 720h

Application Data Backup
1. Volume Snapshots
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snapshot
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: data-pvc

2. Custom Backup Jobs
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: bitnami/mongodb:4.4
              command:
                - /bin/sh
                - -c
                - |
                  mongodump --uri="mongodb://mongodb:27017" --out="/backup/$(date +%Y%m%d)"
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure

Disaster Recovery
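Before starting a recovery, it can help to confirm that the etcd snapshot you intend to restore is actually usable. The sketch below checks that the file exists and is non-empty, then asks `etcdctl` for its status. The function name `check_snapshot` is hypothetical, and the check is only a sanity test under those assumptions, not a guarantee that the restore will succeed.

```shell
#!/bin/sh
# Sanity-check an etcd snapshot file before attempting a restore.
# Usage: check_snapshot <path-to-snapshot>
check_snapshot() {
  snap="$1"
  # -s is true only if the file exists and has a size greater than zero.
  if [ ! -s "$snap" ]; then
    echo "snapshot $snap is missing or empty" >&2
    return 1
  fi
  # Prints hash, revision, key count and size; fails on an unreadable snapshot.
  ETCDCTL_API=3 etcdctl snapshot status "$snap" --write-out=table
}
```

For example, `check_snapshot snapshot.db || exit 1` at the top of a recovery script stops the procedure before any data directory is touched.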
1. etcd Recovery
# Stop the API server and etcd
# (assumes both run as systemd units; on kubeadm clusters they are static pods,
# so move their manifests out of /etc/kubernetes/manifests instead)
systemctl stop kube-apiserver
systemctl stop etcd

# Restore the etcd snapshot into a new data directory
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir /var/lib/etcd-restore \
  --initial-cluster="master-1=https://192.168.1.10:2380" \
  --initial-advertise-peer-urls="https://192.168.1.10:2380" \
  --name=master-1

# Swap the restored data directory into place
mv /var/lib/etcd /var/lib/etcd.bak
mv /var/lib/etcd-restore /var/lib/etcd

# Start etcd and the API server
systemctl start etcd
systemctl start kube-apiserver

2. Using Velero for Recovery
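A Velero restore runs asynchronously: applying a Restore object such as the one in this section only starts the work, and the outcome shows up later in `status.phase`. The following sketch polls that field until the restore finishes. The function name `wait_for_restore`, the `POLL_INTERVAL` variable, and the default timeout of 600 seconds are assumptions for illustration; the `velero` namespace matches the manifests in this article.

```shell
#!/bin/sh
# Poll a Velero Restore until it reaches a terminal phase.
# Usage: wait_for_restore <restore-name> [timeout-seconds]
# Returns 0 on Completed, 1 on a failed phase, 2 on timeout.
wait_for_restore() {
  name="$1"
  timeout="${2:-600}"
  interval="${POLL_INTERVAL:-10}"   # hypothetical knob, seconds between polls
  waited=0
  while [ "$waited" -le "$timeout" ]; do
    phase=$(kubectl get restores.velero.io "$name" -n velero \
      -o jsonpath='{.status.phase}' 2>/dev/null)
    case "$phase" in
      Completed)
        echo "restore $name completed"
        return 0 ;;
      PartiallyFailed|Failed|FailedValidation)
        echo "restore $name ended in phase $phase" >&2
        return 1 ;;
    esac
    sleep "$interval"
    waited=$(( $waited + $interval ))
  done
  echo "timed out waiting for restore $name" >&2
  return 2
}
```

A typical call would be `wait_for_restore restore-production 1800`, followed by `velero restore describe restore-production` if the return code is non-zero.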
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-production
  namespace: velero
spec:
  backupName: daily-backup
  includedNamespaces:
    - "*"
  excludedNamespaces:
    - kube-system
  restorePVs: true

Backup Strategies
1. Full Cluster Backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: full-cluster-backup
  namespace: velero
spec:
  schedule: "0 0 * * *"
  template:
    includedNamespaces:
      - "*"
    includedResources:
      - "*"
    includeClusterResources: true
    storageLocation: default
    volumeSnapshotLocations:
      - default
    ttl: 720h

2. Selective Backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: app-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"
  template:
    labelSelector:
      matchLabels:
        app: critical
    includedNamespaces:
      - production
    excludedResources:
      - secrets
    storageLocation: default
    ttl: 168h

Storage Providers
1. AWS S3 Configuration
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-backup
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-backup-bucket
  config:
    region: us-west-2
    profile: default

2. Azure Blob Storage
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: azure-backup
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: my-backup-container
  config:
    resourceGroup: my-resource-group
    storageAccount: my-storage-account

Monitoring and Alerts
1. Prometheus Rules
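apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts
  namespace: monitoring
spec:
  groups:
    - name: backup
      rules:
        - alert: BackupFailed
          # velero_backup_failure_total is a counter, so this condition stays
          # true after the first failure; wrapping it in increase() over a
          # time window is a common way to alert only on recent failures.
          expr: |
            velero_backup_failure_total > 0
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: Backup failed
            description: Velero backup has failed

2. Alertmanager Configuration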
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: backup-alerts
  namespace: monitoring
spec:
  route:
    receiver: 'slack'
    routes:
      - matchers:
          - name: alertname
            value: BackupFailed
        receiver: 'pagerduty'
  receivers:
    - name: 'slack'
      slackConfigs:
        - channel: '#alerts'
          # The webhook URL is read from a Secret in the same namespace.
          # The Secret name and key below are placeholders.
          apiURL:
            name: slack-webhook
            key: url
    - name: 'pagerduty'
      pagerdutyConfigs:
        # The service key (<key>) is also read from a Secret; placeholder names.
        - serviceKey:
            name: pagerduty-key
            key: serviceKey

Best Practices
1. Backup Verification
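apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-verification
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: verify
              # Assumes the image provides /bin/sh and the velero CLI,
              # and that latest-backup is the name of an existing backup.
              image: velero/velero:latest
              command:
                - /bin/sh
                - -c
                - |
                  velero backup describe --details latest-backup
                  velero restore create --from-backup latest-backup --namespace-mappings prod:verify
          # Job pods must set restartPolicy to OnFailure or Never
          restartPolicy: OnFailure

2. Retention Policy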
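Velero's `ttl` field is a Go-style duration, which has no unit for days, so a retention period of 30 days is written as `720h` and one week as `168h`. The helper below is a small sketch of that arithmetic; `ttl_from_days` is a hypothetical name, not part of the Velero CLI.

```shell
#!/bin/sh
# Convert a retention period in whole days to an hour-based ttl value.
# Usage: ttl_from_days <days>
ttl_from_days() {
  days="$1"
  # Reject empty input and anything that is not a plain non-negative integer.
  case "$days" in
    ''|*[!0-9]*)
      echo "usage: ttl_from_days <whole-days>" >&2
      return 1 ;;
  esac
  echo "$(( $days * 24 ))h"
}
```

For instance, `ttl_from_days 30` prints `720h`, which is the value used for the daily schedules in this article.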
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: retention-policy
  namespace: velero
spec:
  schedule: "0 0 * * *"
  template:
    # Velero's garbage collection deletes each backup once its ttl expires,
    # so no separate cleanup job is needed.
    ttl: 720h # 30 days

Troubleshooting
Common Issues
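A quick first triage step is to list every backup whose phase is anything other than Completed. The sketch below does this with a `kubectl` query against the Velero Backup objects; it assumes Velero is installed in the `velero` namespace, as in the manifests in this article, and the function name `list_incomplete_backups` is hypothetical.

```shell
#!/bin/sh
# Print "name<TAB>phase" for every Velero backup that is not Completed.
list_incomplete_backups() {
  kubectl get backups.velero.io -n velero \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' |
    awk -F'\t' '$2 != "Completed" { print $1 "\t" $2 }'
}
```

Backups that are still running also appear in the output with the phase InProgress, so an entry here is a prompt to look closer rather than proof of a failure.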
- Failed Backups
velero backup describe <backup-name>
velero backup logs <backup-name>
- Restore Issues
velero restore describe <restore-name>
velero restore logs <restore-name>
- Storage Issues
kubectl logs deployment/velero -n velero

Security Considerations
1. Encryption Configuration
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: encrypted-backup
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-backup-bucket
  config:
    # KMS key used for server-side encryption of objects written to the bucket
    kmsKeyId: arn:aws:kms:region:account-id:key/key-id

2. RBAC Configuration
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backup-operator
rules:
  - apiGroups:
      - velero.io
    resources:
      - backups
      - restores
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch

Conclusion
A robust backup and recovery strategy is crucial for maintaining the reliability and availability of Kubernetes clusters and applications. Regular testing of backup and restore procedures ensures that your disaster recovery plan works when needed.
Series Navigation
- Previous: Kubernetes Service Mesh (Istio)
- Next: Kubernetes Troubleshooting Guide