Platform Engineering: Multi-Cluster Fleet Management — Cluster API, ArgoCD Hub-Spoke, and Disaster Recovery
Deep guide to managing Kubernetes fleets at scale — Cluster API for declarative cluster provisioning, ArgoCD hub-spoke architecture for multi-cluster GitOps, ApplicationSets for fleet-wide deployments, Velero for backup and disaster recovery, and cluster upgrade strategies.
When You Need Multi-Cluster
A single cluster works until it doesn't. Common forcing functions:
- Regulatory isolation: PCI-DSS, HIPAA, or SOC 2 scope is far easier to bound and audit when sensitive workloads run in dedicated clusters
- Blast radius reduction: a platform bug, a rogue workload, or a K8s upgrade failure in one cluster shouldn't take down everything
- Geographic distribution: latency requirements or data residency laws force per-region clusters
- Team autonomy: large orgs give teams or business units their own cluster for true isolation
- Kubernetes version testing: run a canary cluster on the new K8s version before upgrading production
The threshold: roughly 3+ clusters is where you need a fleet management strategy. Below that, each cluster can be managed independently. Above that, you need tooling.
Cluster API: Declarative Cluster Provisioning
Cluster API (CAPI) is Kubernetes managing Kubernetes. A management cluster runs CAPI controllers that provision and lifecycle-manage workload clusters, using CRDs.
Core concepts
Management Cluster:
├── Cluster CRD — top-level cluster object
├── KubeadmControlPlane — control plane definition
├── MachineDeployment — node pool definition
└── InfrastructureCluster (provider-specific):
    ├── AWSCluster / AWSMachineTemplate (AWS)
    ├── AzureCluster / AzureMachine (Azure)
    ├── VSphereCluster / VSphereMachine (vSphere)
    └── DockerCluster / DockerMachine (local/CI)
Creating a cluster
# cluster.yaml: declare a Kubernetes cluster
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-eu-west
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-eu-west-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: production-eu-west
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  name: production-eu-west
  namespace: clusters
spec:
  region: eu-west-1
  sshKeyName: platform-key
  network:
    vpc:
      cidrBlock: "10.0.0.0/16"
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: production-eu-west-control-plane
  namespace: clusters
spec:
  replicas: 3 # HA control plane
  version: v1.30.0
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSMachineTemplate
      name: production-eu-west-control-plane
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          audit-log-path: /var/log/kubernetes/audit.log
          # the policy file must exist on the node and be mounted into the API server
          audit-policy-file: /etc/kubernetes/audit-policy.yaml
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: production-eu-west-workers
  namespace: clusters
spec:
  clusterName: production-eu-west
  replicas: 5
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: production-eu-west
  template:
    spec:
      clusterName: production-eu-west
      version: v1.30.0
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: production-eu-west-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: production-eu-west-workers
Apply this together with the AWSMachineTemplate and KubeadmConfigTemplate objects it references (not shown here), and CAPI provisions the entire cluster on AWS. Scale nodes: kubectl scale machinedeployment production-eu-west-workers -n clusters --replicas=10. Upgrade Kubernetes: change .spec.version, and CAPI replaces the nodes in a rolling upgrade.
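Scaling and upgrading touch different fields, which is easy to get wrong: a KubeadmControlPlane keeps its version at .spec.version, while a MachineDeployment keeps it at .spec.template.spec.version. As a minimal sketch, the merge patches could be built as below. The helper names are hypothetical, and the printed JSON is meant to be passed to kubectl patch --type merge.

```python
import json
from typing import Optional


def control_plane_patch(version: str) -> dict:
    """Merge patch for a KubeadmControlPlane (version lives at .spec.version)."""
    return {"spec": {"version": version}}


def machine_deployment_patch(version: Optional[str] = None,
                             replicas: Optional[int] = None) -> dict:
    """Merge patch for a MachineDeployment.

    replicas sits at .spec.replicas, the Kubernetes version at
    .spec.template.spec.version. Only the fields that are given are emitted.
    """
    spec: dict = {}
    if replicas is not None:
        spec["replicas"] = replicas
    if version is not None:
        spec["template"] = {"spec": {"version": version}}
    return {"spec": spec}


if __name__ == "__main__":
    # Hypothetical usage: print patches for the cluster declared in this guide.
    print(json.dumps(control_plane_patch("v1.31.0")))
    print(json.dumps(machine_deployment_patch(version="v1.31.0", replicas=10)))
```

In a GitOps setup the same values would be changed in the manifests in Git rather than patched by hand; the sketch only shows where each field lives.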
GitOps for clusters themselves
Because clusters are CRDs, you can ArgoCD-sync cluster definitions from Git:
gitops-repo/
└── clusters/
    ├── production-eu-west/
    │   └── cluster.yaml
    ├── production-us-east/
    │   └── cluster.yaml
    └── staging-eu-west/
        └── cluster.yaml
An ArgoCD Application in the management cluster syncs clusters/, so a new cluster YAML in Git means a new cluster gets provisioned. This is GitOps all the way down.
ArgoCD Hub-Spoke: Fleet-Wide GitOps
The hub-spoke pattern uses one ArgoCD installation (hub) to deploy to all workload clusters (spokes).
Management/Hub Cluster:
└── ArgoCD
    ├── Cluster Secret: production-eu-west (kubeconfig)
    ├── Cluster Secret: production-us-east (kubeconfig)
    ├── Cluster Secret: staging-eu-west (kubeconfig)
    └── ApplicationSets → deploys to all registered clusters
Registering spoke clusters
# Register a workload cluster with ArgoCD
# (the positional argument is the context name inside the kubeconfig)
argocd cluster add production-eu-west \
  --kubeconfig /path/to/prod-eu-kubeconfig \
  --name production-eu-west \
  --label env=production \
  --label region=eu-west
argocd cluster list
# SERVER                          NAME                STATUS
# https://prod-eu.example.com     production-eu-west  Successful
# https://prod-us.example.com     production-us-east  Successful
# https://staging-eu.example.com  staging-eu-west     Successful
ArgoCD stores cluster credentials as Secrets in the argocd namespace with the argocd.argoproj.io/secret-type: cluster label.
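Because a registered cluster is just a labeled Secret, it can also be declared in Git instead of created through the CLI. The sketch below builds such a Secret as a plain dictionary. The function name, the cluster-<name> naming scheme, and the placeholder credentials are assumptions for illustration; real tokens should come from a secret manager, not from source code.

```python
import json


def cluster_secret(name: str, server: str, bearer_token: str,
                   ca_data_b64: str, labels: dict) -> dict:
    """Build an ArgoCD cluster Secret manifest as a Python dict."""
    # Connection settings are stored as a JSON string under the "config" key.
    config = {
        "bearerToken": bearer_token,
        "tlsClientConfig": {"insecure": False, "caData": ca_data_b64},
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"cluster-{name}",  # hypothetical naming scheme
            "namespace": "argocd",
            # The secret-type label is what marks this Secret as a cluster;
            # extra labels (env, region) are what ApplicationSets select on.
            "labels": {"argocd.argoproj.io/secret-type": "cluster", **labels},
        },
        "type": "Opaque",
        "stringData": {
            "name": name,
            "server": server,
            "config": json.dumps(config),
        },
    }


if __name__ == "__main__":
    secret = cluster_secret(
        name="production-eu-west",
        server="https://prod-eu.example.com",
        bearer_token="<service-account-token>",  # placeholder
        ca_data_b64="<base64-encoded-ca-cert>",  # placeholder
        labels={"env": "production", "region": "eu-west"},
    )
    print(json.dumps(secret, indent=2))
```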
ApplicationSets: Deploy to the Fleet
Instead of creating one Application per service per cluster, ApplicationSets generate applications dynamically using generators.
Cluster generator: deploy to all production clusters
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: nginx-ingress-all-clusters
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production # deploy to all clusters labeled env=production
  template:
    metadata:
      name: "nginx-ingress-{{name}}"
    spec:
      project: platform
      source:
        repoURL: https://github.com/org/gitops-repo
        targetRevision: HEAD
        path: platform/nginx-ingress
        helm:
          valueFiles:
            - "values-{{metadata.labels.region}}.yaml" # region-specific values
      destination:
        server: "{{server}}"
        namespace: ingress-nginx
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
This single ApplicationSet installs nginx-ingress on every production cluster automatically. Add a new production cluster and nginx-ingress deploys to it without further changes, typically within a few minutes.
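To see what the cluster generator does with the template, here is a simplified model in Python. It is not ArgoCD code: the cluster list is hypothetical (it mirrors the clusters registered earlier in this guide), and the placeholder substitution is a plain regex rather than ArgoCD's own templating.

```python
import re

# Hypothetical registered clusters: name, API server URL, labels on the Secret.
CLUSTERS = [
    {"name": "production-eu-west", "server": "https://prod-eu.example.com",
     "labels": {"env": "production", "region": "eu-west"}},
    {"name": "production-us-east", "server": "https://prod-us.example.com",
     "labels": {"env": "production", "region": "us-east"}},
    {"name": "staging-eu-west", "server": "https://staging-eu.example.com",
     "labels": {"env": "staging", "region": "eu-west"}},
]


def select_clusters(clusters, match_labels):
    """Keep clusters whose labels contain every key/value in match_labels."""
    return [c for c in clusters
            if all(c["labels"].get(k) == v for k, v in match_labels.items())]


def parameters(cluster):
    """Flatten one cluster into the parameter names used by the template."""
    params = {"name": cluster["name"], "server": cluster["server"]}
    for key, value in cluster["labels"].items():
        params[f"metadata.labels.{key}"] = value
    return params


def render(template, params):
    """Replace each {{key}} placeholder with its value from params."""
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}",
                  lambda match: params[match.group(1)], template)


if __name__ == "__main__":
    for cluster in select_clusters(CLUSTERS, {"env": "production"}):
        p = parameters(cluster)
        print(render("nginx-ingress-{{name}}", p),
              render("values-{{metadata.labels.region}}.yaml", p),
              render("{{server}}", p))
```

One Application is produced per selected cluster, which is why labeling a newly registered cluster env=production is enough to have it picked up.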
Git directory generator: auto-register services
generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/org/gitops-repo
            revision: HEAD
            directories:
              - path: apps/services/* # each service directory
        - clusters:
            selector:
              matchLabels:
                env: production
Every service directory in apps/services/ × every production cluster = one Application. A developer creates apps/services/new-service/, and it deploys to all production clusters automatically.
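The matrix generator is a Cartesian product of the parameter sets of its two child generators. A minimal sketch of that combination step follows. The service directories and clusters are made-up inputs, and the Application naming at the end is an assumption, since the fragment above does not show its template.

```python
from itertools import product


def git_directories(paths):
    """Parameter sets as a git directory generator would expose them."""
    return [{"path": p, "path.basename": p.rstrip("/").rsplit("/", 1)[-1]}
            for p in paths]


def matrix(first, second):
    """Combine every parameter set from `first` with every one from `second`."""
    return [{**a, **b} for a, b in product(first, second)]


if __name__ == "__main__":
    # Hypothetical inputs for illustration.
    services = git_directories(["apps/services/cart", "apps/services/checkout"])
    clusters = [
        {"name": "production-eu-west", "server": "https://prod-eu.example.com"},
        {"name": "production-us-east", "server": "https://prod-us.example.com"},
    ]
    for params in matrix(services, clusters):
        # One Application per (service, cluster) pair.
        print(f'{params["path.basename"]}-{params["name"]}', params["path"])
```

With S service directories and C production clusters this yields S × C Applications, so the count grows quickly as the fleet does.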
Pull request generator: per-PR preview environments
generators:
  - pullRequest:
      github:
        owner: org
        repo: my-app
        tokenRef:
          secretName: github-token
          key: token
      requeueAfterSeconds: 60 # check for new PRs every minute
template:
  metadata:
    name: "pr-{{number}}-preview"
  spec:
    project: default # required field; "default" is a placeholder
    source:
      repoURL: https://github.com/org/my-app
      targetRevision: "{{head_sha}}"
      path: k8s/preview
    destination:
      server: https://staging-eu.example.com
      namespace: "preview-pr-{{number}}"
Every PR gets an ArgoCD Application that deploys that PR's code to the staging cluster in an isolated namespace. When the PR is merged or closed, the ApplicationSet deletes the Application and the resources it manages. The namespace itself is only removed if it is one of those managed resources (for example, declared in k8s/preview); a namespace created through the CreateNamespace=true sync option is left behind.
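The lifecycle behind preview environments is a reconciliation loop: compare the Applications that should exist (one per open PR) with the ones that do exist, then create, update, or delete. The sketch below models only that comparison. The function names and the sample PR data are hypothetical; the real controller does this against the GitHub API and the cluster.

```python
def desired_applications(open_prs):
    """Map each open PR (number -> head commit SHA) to an Application name."""
    return {f"pr-{number}-preview": sha for number, sha in open_prs.items()}


def reconcile(desired, existing):
    """Return which Applications to create, update, or delete.

    Both arguments map Application name -> commit SHA it targets.
    """
    create = sorted(set(desired) - set(existing))
    delete = sorted(set(existing) - set(desired))
    update = sorted(name for name in desired
                    if name in existing and desired[name] != existing[name])
    return {"create": create, "update": update, "delete": delete}


if __name__ == "__main__":
    # Hypothetical state: PR 12 was merged, PR 15 got a new commit, PR 18 is new.
    existing = {"pr-12-preview": "a1b2c3", "pr-15-preview": "d4e5f6"}
    open_prs = {15: "0a9b8c", 18: "7d6e5f"}
    print(reconcile(desired_applications(open_prs), existing))
```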
Multi-Cluster Networking
Pods in cluster A cannot natively reach pods in cluster B. Options:
Cilium Cluster Mesh
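# Enable ClusterMesh on each cluster (repeat for every cluster context)
cilium clustermesh enable --context production-eu-west --service-type LoadBalancer
cilium clustermesh enable --context production-us-east --service-type LoadBalancer
# Connect two clusters
cilium clustermesh connect \
  --context production-us-east \
  --destination-context production-eu-west
# Verify
cilium clustermesh status
After connecting, Cluster Mesh does not add a per-cluster DNS zone. Instead, a Service that exists with the same name and namespace in each cluster and carries the global-service annotation is load-balanced across the pods of all connected clusters:
# order-service Service metadata, applied identically in every cluster
metadata:
  name: order-service
  namespace: default
  annotations:
    service.cilium.io/global: "true"
# Clients keep using the normal in-cluster name
order-service.default.svc.cluster.local
Use cases: active-active databases across regions, shared platform services (Vault, Grafana), cross-cluster traffic shifting for canary deploys.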
Submariner (multi-CNI cross-cluster networking)
For clusters with different CNIs, Submariner creates cross-cluster tunnels:
subctl deploy-broker --kubeconfig hub.yaml
subctl join --kubeconfig cluster-a.yaml broker-info.subm
subctl join --kubeconfig cluster-b.yaml broker-info.subm
Disaster Recovery with Velero
Velero backs up Kubernetes objects and PVC snapshots. It's the final safety net if GitOps fails or a cluster needs full recovery.
Installation
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --backup-location-config region=eu-west-1 \
  --snapshot-location-config region=eu-west-1 \
  --secret-file ./credentials-velero
Scheduled backups
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # 2 AM daily
  template:
    includedNamespaces:
      - "*" # all namespaces
    excludedNamespaces:
      - kube-system
      - velero
    includeClusterResources: true
    storageLocation: default
    volumeSnapshotLocations:
      - default
    ttl: 720h # 30 day retention
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-app-backup
  namespace: velero
spec:
  schedule: "0 * * * *" # hourly
  template:
    includedNamespaces:
      - production
      - payments
      - orders
    labelSelector:
      matchLabels:
        backup: "hourly" # only resources carrying this label are backed up hourly
    ttl: 168h # 7 day retention
Cross-region backup
# Backup location 2: different AWS region for DR
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: dr-region
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backups-dr
  config:
    region: us-east-1
---
# Separate schedule that writes its own backups to the DR region
# (an independent backup run, not a copy of daily-full-backup).
# Native EBS snapshots stay in the source region unless they are copied separately.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: dr-backup
  namespace: velero
spec:
  schedule: "30 2 * * *"
  template:
    storageLocation: dr-region
    ttl: 720h
Recovery
# Restore selected namespaces from a specific backup
velero restore create production-restore \
  --from-backup daily-full-backup-20260611020000 \
  --include-namespaces production,payments,orders \
  --wait
# Check restore status
velero restore describe production-restore
DR Strategy: GitOps First, Velero Second
GitOps dramatically reduces recovery complexity:
RTO (Recovery Time Objective):
  Without GitOps: restore Velero backup (1-4 hours) + manual config recovery
  With GitOps: new cluster + ArgoCD sync from Git (15-30 minutes for apps)
               + Velero restore for stateful data only
RPO (Recovery Point Objective):
  Application state: Git commit = last deployed state (0 data loss for config)
  Stateful data: last Velero snapshot (1h RPO if hourly backups)
What GitOps can replace in a recovery scenario:
- All Kubernetes manifests (Deployments, Services, Ingresses, ConfigMaps, Policies)
- Platform components (ArgoCD, Cilium, cert-manager, Kyverno)
- Application configuration
What still needs Velero:
- Persistent Volume data (databases, file storage)
- Kubernetes Secrets that are not sourced from an external store through External Secrets Operator (ESO)
- Stateful workloads with data not in external systems
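To make the RPO and retention figures concrete: with a fixed backup interval, the worst case is a failure just before the next backup, so up to one full interval of data can be lost, and the number of backups kept at any time is roughly the TTL divided by the interval. A small sketch of that arithmetic, using the daily (24h interval, 720h TTL) and hourly (1h interval, 168h TTL) schedules from this guide. The function names are hypothetical and backup duration is ignored.

```python
from datetime import timedelta


def worst_case_rpo(interval: timedelta) -> timedelta:
    """Data written right after a backup is unprotected until the next one."""
    return interval


def retained_backups(ttl: timedelta, interval: timedelta) -> int:
    """Approximate number of unexpired backups held at steady state."""
    return ttl // interval


if __name__ == "__main__":
    schedules = {
        "daily-full-backup": (timedelta(hours=24), timedelta(hours=720)),
        "hourly-app-backup": (timedelta(hours=1), timedelta(hours=168)),
    }
    for name, (interval, ttl) in schedules.items():
        print(name,
              "worst-case RPO:", worst_case_rpo(interval),
              "backups retained:", retained_backups(ttl, interval))
```

Storage cost scales with the retained count, so shortening the interval to improve RPO usually means shortening the TTL as well.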
Cluster Upgrade Strategy at Fleet Scale
Upgrading 20 clusters without disruption requires a fleet-wide strategy.
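A staged rollout of this kind (staging first, then one canary production cluster, then a few production clusters per day) can be planned mechanically. The sketch below only computes the order of waves. The function name, wave labels, and cluster names are hypothetical, and it does not perform any upgrade.

```python
def plan_waves(staging, production, canary, per_day=3):
    """Return an ordered list of (wave_name, clusters) tuples.

    staging:    staging cluster names, upgraded first
    production: all production cluster names
    canary:     the production cluster upgraded on its own before the rest
    per_day:    how many remaining production clusters to upgrade per day
    """
    if canary not in production:
        raise ValueError(f"canary {canary!r} is not a production cluster")
    waves = [("staging", list(staging)), ("canary", [canary])]
    remaining = [c for c in production if c != canary]
    for start in range(0, len(remaining), per_day):
        day = start // per_day + 1
        waves.append((f"production-day-{day}", remaining[start:start + per_day]))
    return waves


if __name__ == "__main__":
    production = [f"prod-{i}" for i in range(1, 8)]  # hypothetical fleet
    for name, clusters in plan_waves(["staging-eu-west"], production,
                                     canary="prod-7", per_day=3):
        print(name, clusters)
```

In practice each wave would be gated on the monitoring results of the previous one rather than on the calendar alone.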
The upgrade pipeline
Staging cluster → Canary production cluster → All production clusters
Stage 1 (Week 1): Upgrade staging cluster to K8s 1.31
  - Run full workload tests
  - Check API deprecations (kubent scan)
  - Validate platform components (cert-manager, Cilium, ArgoCD) on 1.31
Stage 2 (Week 2): Upgrade one canary production cluster
  - Smallest / least critical production cluster
  - Monitor for 5 days: error rates, latency, unexpected behaviors
Stage 3 (Week 3+): Rolling upgrade all production clusters
  - 2-3 clusters per day
  - Monitor DORA metrics during upgrade window
  - Automated halt if SLO burns during upgrade; in-place control plane downgrades are not supported, so recover by rolling forward or failing traffic over to another cluster
Pre-flight checks before upgrading
# Check for deprecated API usage in all running manifests
kubent --target-version 1.31.0
# Example output:
# NAME         NAMESPACE   KIND     API VERSION                REPLACEMENT             DEPRECATED
# ingress-old  production  Ingress  networking.k8s.io/v1beta1  → networking.k8s.io/v1  REMOVED in 1.22
# Rewrite deprecated API versions stored in a Helm release's metadata
# (helm-mapkubeapis plugin; --dry-run previews the change)
helm mapkubeapis release-name --namespace production --dry-run
# Run Pluto: simpler deprecation scanner
pluto detect-helm --target-versions k8s=v1.31.0
CAPI rolling node upgrade
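# Update K8s version in cluster spec (GitOps: update the value in Git, ArgoCD syncs)
kubectl patch kubeadmcontrolplane production-eu-west-control-plane \
  -n clusters \
  --type merge \
  --patch '{"spec":{"version":"v1.31.0"}}'
# CAPI replaces the 3 control plane nodes one at a time.
# Workers follow once their MachineDeployment version is bumped as well
# (it lives under .spec.template.spec.version):
kubectl patch machinedeployment production-eu-west-workers \
  -n clusters \
  --type merge \
  --patch '{"spec":{"template":{"spec":{"version":"v1.31.0"}}}}'
# Watch progress
kubectl get machines -n clusters -w
# Status:
# production-eu-west-control-plane-abc  Running   v1.30.0
# production-eu-west-control-plane-xyz  Running   v1.31.0  ← new control plane node
# production-eu-west-control-plane-abc  Deleting  v1.30.0  ← old node being replaced
Fleet Observability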
Managing 20 clusters means 20 Prometheus instances. Options:
Thanos: Global Prometheus query layer
Cluster 1: Prometheus → Thanos Sidecar → S3 (long-term storage)
Cluster 2: Prometheus → Thanos Sidecar → S3
...
Cluster 20: Prometheus → Thanos Sidecar → S3
Central Thanos Querier: query across all clusters simultaneously
Central Grafana: dashboards that span the entire fleet
# Query metrics from all production clusters simultaneously
curl "http://thanos-querier.monitoring:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total[5m])) by (cluster, service)'
Grafana fleet overview dashboard
{
  "panels": [
    {
      "title": "Cluster Health Overview",
      "type": "table",
      "targets": [
        {
          "expr": "kube_node_status_condition{condition='Ready', status='true'} == 0",
          "legendFormat": "{{cluster}} - {{node}} NOT READY"
        }
      ]
    },
    {
      "title": "Workloads OutOfSync (ArgoCD)",
      "type": "stat",
      "targets": [
        {
          "expr": "sum by (dest_server) (argocd_app_info{sync_status='OutOfSync'})"
        }
      ]
    }
  ]
}
Platform Team Runbook: New Cluster Onboarding
Checklist for adding a new cluster to the fleet:
□ 1. Provision cluster (CAPI manifest in GitOps repo)
□ 2. Install CNI (Cilium) — per cluster Helm values in gitops-repo/clusters//
□ 3. Install cert-manager + ClusterIssuer
□ 4. Install External Secrets Operator + Vault ClusterSecretStore
□ 5. Install Kyverno + import platform ClusterPolicies
□ 6. Install ArgoCD Spoke registration (or ArgoCD App-of-Apps for platform components)
□ 7. Register cluster in ArgoCD hub (argocd cluster add)
□ 8. Apply default-deny CiliumClusterwideNetworkPolicy
□ 9. Install Velero + configure S3 backup location
□ 10. Configure Thanos sidecar for global metrics
□ 11. Add cluster to Grafana data sources
□ 12. Add cluster to fleet overview dashboard
□ 13. Verify: kubent scan (no deprecated APIs), kube-bench (CIS score), cilium connectivity test
□ 14. Tag cluster in ArgoCD: env, region, tier labels
Time from "git commit with cluster YAML" to "cluster receiving workloads": ~25 minutes (CAPI provisioning ~15min + bootstrap ~10min).