Cloud & DevOps · advanced

Platform Engineering: Multi-Cluster Fleet Management — Cluster API, ArgoCD Hub-Spoke, and Disaster Recovery

Deep guide to managing Kubernetes fleets at scale — Cluster API for declarative cluster provisioning, ArgoCD hub-spoke architecture for multi-cluster GitOps, ApplicationSets for fleet-wide deployments, Velero for backup and disaster recovery, and cluster upgrade strategies.

Learnixo · June 11, 2026 · 10 min read
Tags: Platform Engineering, Multi-Cluster, Cluster API, ArgoCD, Fleet Management, Disaster Recovery, Kubernetes, GitOps

When You Need Multi-Cluster

A single cluster works until it doesn't. Common forcing functions:

  • Regulatory isolation: PCI-DSS, HIPAA, or SOC2 scoping often pushes sensitive workloads onto dedicated clusters to keep the audit boundary small
  • Blast radius reduction: a platform bug, a rogue workload, or a K8s upgrade failure in one cluster shouldn't take down everything
  • Geographic distribution: latency requirements or data residency laws force per-region clusters
  • Team autonomy: large orgs give teams or business units their own cluster for true isolation
  • Kubernetes version testing: run a canary cluster on the new K8s version before upgrading production

The threshold: roughly 3+ clusters is where you need a fleet management strategy. Below that, each cluster can be managed independently. Above that, you need tooling.


Cluster API: Declarative Cluster Provisioning

Cluster API (CAPI) is Kubernetes managing Kubernetes. A management cluster runs CAPI controllers that provision and lifecycle-manage workload clusters, using CRDs.

Core concepts

Management Cluster:
  ├── Cluster CRD — top-level cluster object
  ├── KubeadmControlPlane — control plane definition
  ├── MachineDeployment — node pool definition
  └── InfrastructureCluster (provider-specific):
        ├── AWSCluster / AWSMachineTemplate  (AWS)
        ├── AzureCluster / AzureMachine       (Azure)
        ├── VSphereCluster / VSphereMachine   (vSphere)
        └── DockerCluster / DockerMachine     (local/CI)

Creating a cluster

YAML
# cluster.yaml: declare a Kubernetes cluster
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-eu-west
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-eu-west-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: production-eu-west
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  name: production-eu-west
  namespace: clusters
spec:
  region: eu-west-1
  sshKeyName: platform-key
  network:
    vpc:
      cidrBlock: "10.0.0.0/16"
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: production-eu-west-control-plane
  namespace: clusters
spec:
  replicas: 3     # HA control plane
  version: v1.30.0
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSMachineTemplate
      name: production-eu-west-control-plane
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          audit-log-path: /var/log/kubernetes/audit.log
          audit-policy-file: /etc/kubernetes/audit-policy.yaml
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: production-eu-west-workers
  namespace: clusters
spec:
  clusterName: production-eu-west
  replicas: 5
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: production-eu-west
  template:
    spec:
      clusterName: production-eu-west
      version: v1.30.0
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: production-eu-west-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: production-eu-west-workers

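Apply this together with the AWSMachineTemplate and KubeadmConfigTemplate objects it references (omitted here for brevity), and CAPI provisions the entire cluster on AWS. Scale nodes: kubectl scale machinedeployment production-eu-west-workers -n clusters --replicas=10. Upgrade Kubernetes: change .spec.version on the KubeadmControlPlane and .spec.template.spec.version on each MachineDeployment, and CAPI replaces the machines in a rolling fashion.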

GitOps for clusters themselves

Because clusters are CRDs, you can ArgoCD-sync cluster definitions from Git:

gitops-repo/
└── clusters/
    ├── production-eu-west/
    │   └── cluster.yaml
    ├── production-us-east/
    │   └── cluster.yaml
    └── staging-eu-west/
        └── cluster.yaml

An ArgoCD Application in the management cluster syncs clusters/ — new cluster YAML in Git = new cluster provisioned. This is GitOps all the way down.
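Because a new directory under clusters/ turns into a real cluster, a small CI check on the repo layout can catch a malformed entry before ArgoCD syncs it. The following is a minimal sketch in Bash, assuming the layout shown above (one cluster.yaml per cluster directory); the function name list_cluster_dirs is hypothetical and not part of any tool.

```bash
# list_cluster_dirs ROOT
# Prints the name of every directory under ROOT/clusters that contains a
# cluster.yaml. Directories without one are reported on stderr and make the
# function return a non-zero status, which fails a CI step.
list_cluster_dirs() {
  local root="$1" dir status=0
  for dir in "$root"/clusters/*/; do
    [ -d "$dir" ] || continue            # the glob matched nothing
    dir="${dir%/}"                       # strip the trailing slash
    if [ -f "$dir/cluster.yaml" ]; then
      printf '%s\n' "${dir##*/}"         # print only the directory name
    else
      printf 'missing cluster.yaml in %s\n' "$dir" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage (from the repository root):
#   list_cluster_dirs gitops-repo
```

Running it in a pull-request pipeline gives reviewers the list of clusters the commit would leave in the fleet.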


ArgoCD Hub-Spoke: Fleet-Wide GitOps

The hub-spoke pattern uses one ArgoCD installation (hub) to deploy to all workload clusters (spokes).

Management/Hub Cluster:
  └── ArgoCD
        ├── Cluster Secret: production-eu-west (kubeconfig)
        ├── Cluster Secret: production-us-east (kubeconfig)
        ├── Cluster Secret: staging-eu-west (kubeconfig)
        └── ApplicationSets → deploys to all registered clusters

Registering spoke clusters

Bash
# Register a workload cluster with ArgoCD
argocd cluster add production-eu-west \
  --kubeconfig /path/to/prod-eu-kubeconfig \
  --name production-eu-west \
  --label env=production \
  --label region=eu-west

argocd cluster list
# SERVER                          NAME                    STATUS
# https://prod-eu.example.com     production-eu-west      Successful
# https://prod-us.example.com     production-us-east      Successful
# https://staging-eu.example.com  staging-eu-west         Successful

ArgoCD stores cluster credentials as Secrets in the argocd namespace with the argocd.argoproj.io/secret-type: cluster label.
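Since the registration is just a labeled Secret, it can also be written declaratively and kept next to the rest of the fleet configuration. The sketch below renders such a Secret from Bash. The function name, the cluster- name prefix, and the placeholder credentials are assumptions for illustration; the real token and CA data should come from your own secret store and never be committed in plain text.

```bash
# render_cluster_secret NAME SERVER ENV REGION
# Prints an ArgoCD cluster Secret manifest. The bearer token and CA data are
# deliberately left as placeholders.
render_cluster_secret() {
  local name="$1" server="$2" env="$3" region="$4"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: cluster-${name}
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    env: ${env}
    region: ${region}
type: Opaque
stringData:
  name: ${name}
  server: ${server}
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": { "insecure": false, "caData": "<base64-ca-cert>" }
    }
EOF
}

# Usage:
#   render_cluster_secret production-eu-west https://prod-eu.example.com production eu-west
```

The env and region labels on the Secret are what the ApplicationSet cluster generator selects on, so they play the same role as the --label flags of argocd cluster add.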

ApplicationSets: Deploy to the Fleet

Instead of creating one Application per service per cluster, ApplicationSets generate applications dynamically using generators.

Cluster generator: deploy to all production clusters

YAML
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: nginx-ingress-all-clusters
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production   # deploy to all clusters labeled env=production
  template:
    metadata:
      name: "nginx-ingress-{{name}}"
    spec:
      project: platform
      source:
        repoURL: https://github.com/org/gitops-repo
        targetRevision: HEAD
        path: platform/nginx-ingress
        helm:
          valueFiles:
            - "values-{{metadata.labels.region}}.yaml"  # region-specific values
      destination:
        server: "{{server}}"
        namespace: ingress-nginx
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true

This single ApplicationSet installs nginx-ingress on every production cluster automatically. Add a new production cluster → the ApplicationSet controller generates an Application for it and nginx-ingress deploys, typically within a few minutes.

Git directory generator: auto-register services

YAML
generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/org/gitops-repo
            revision: HEAD
            directories:
              - path: apps/services/*     # each service directory
        - clusters:
            selector:
              matchLabels:
                env: production

Every service directory in apps/services/ × every production cluster = one Application. A developer creates apps/services/new-service/, and it deploys to all production clusters automatically.
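To make the cross product concrete, the sketch below reproduces it in Bash for a local checkout. It only illustrates which Applications would result and does not talk to ArgoCD. The function name and the "service-cluster" naming scheme are assumptions, since the template for this generator is not shown above.

```bash
# matrix_app_names SERVICES_DIR CLUSTER...
# Prints one "<service>-<cluster>" line per (service directory, cluster) pair,
# mirroring the cross product that the matrix generator computes.
matrix_app_names() {
  local services_dir="$1" dir cluster
  shift                                  # remaining arguments are cluster names
  for dir in "$services_dir"/*/; do
    [ -d "$dir" ] || continue            # the glob matched nothing
    dir="${dir%/}"
    for cluster in "$@"; do
      printf '%s-%s\n' "${dir##*/}" "$cluster"
    done
  done
}

# Usage (from the repository root):
#   matrix_app_names apps/services production-eu-west production-us-east
```

With two service directories and two production clusters this prints four names, which is a quick way to estimate how many Applications a new directory or a new cluster adds.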

Pull request generator: per-PR preview environments

YAML
generators:
  - pullRequest:
      github:
        owner: org
        repo: my-app
        tokenRef:
          secretName: github-token
          key: token
      requeueAfterSeconds: 60    # check for new PRs every minute
template:
  metadata:
    name: "pr-{{number}}-preview"
  spec:
    project: default             # required field; assumption: replace with your own project
    source:
      repoURL: https://github.com/org/my-app
      targetRevision: "{{head_sha}}"
      path: k8s/preview
    destination:
      server: https://staging-eu.example.com
      namespace: "preview-pr-{{number}}"

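Every PR gets an ArgoCD Application that deploys that PR's code to the staging cluster in an isolated namespace. Merged/closed PR → ApplicationSet deletes the Application and the resources it manages. A namespace that ArgoCD only created through CreateNamespace=true is not one of those managed resources, so include a Namespace manifest in k8s/preview if the namespace should be cleaned up as well.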


Multi-Cluster Networking

Pods in cluster A cannot natively reach pods in cluster B. Options:

Cilium Cluster Mesh

Bash
# Enable ClusterMesh on each cluster
cilium clustermesh enable --service-type LoadBalancer

# Connect two clusters
cilium clustermesh connect \
  --context production-us-east \
  --destination-context production-eu-west

# Verify
cilium clustermesh status

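After connecting, a Service is shared across clusters when it exists with the same name and namespace in each cluster and is annotated service.cilium.io/global: "true". Clients keep using the normal Service DNS name, and Cilium load-balances across backends in every connected cluster:

order-service.default.svc.cluster.local   # same name everywhere; backends may live in any connected cluster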

Use cases: active-active databases across regions, shared platform services (Vault, Grafana), cross-cluster traffic shifting for canary deploys.

Submariner (multi-CNI cross-cluster networking)

For clusters with different CNIs, Submariner creates cross-cluster tunnels:

Bash
subctl deploy-broker --kubeconfig hub.yaml
subctl join --kubeconfig cluster-a.yaml broker-info.subm
subctl join --kubeconfig cluster-b.yaml broker-info.subm

Disaster Recovery with Velero

Velero backs up Kubernetes objects and PVC snapshots. It's the final safety net if GitOps fails or a cluster needs full recovery.

Installation

Bash
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --backup-location-config region=eu-west-1 \
  --snapshot-location-config region=eu-west-1 \
  --secret-file ./credentials-velero

Scheduled backups

YAML
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"     # 2 AM daily
  template:
    includedNamespaces:
      - "*"                  # all namespaces
    excludedNamespaces:
      - kube-system
      - velero
    includeClusterResources: true
    storageLocation: default
    volumeSnapshotLocations:
      - default
    ttl: 720h                # 30 day retention
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-app-backup
  namespace: velero
spec:
  schedule: "0 * * * *"     # hourly
  template:
    includedNamespaces:
      - production
      - payments
      - orders
    labelSelector:
      matchLabels:
        backup: "hourly"     # only back up resources labeled backup=hourly
    ttl: 168h                # 7 day retention

Cross-region backup

YAML
# Backup location 2: different AWS region for DR
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: dr-region
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backups-dr
  config:
    region: us-east-1
---
# Separate schedule that writes its own backups to the DR region
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: dr-backup
  namespace: velero
spec:
  schedule: "30 2 * * *"
  template:
    storageLocation: dr-region
    ttl: 720h

Recovery

Bash
# Restore selected namespaces from a specific backup
# (drop --include-namespaces to restore everything the backup contains)
velero restore create daily-full-backup-20260611020000-restore \
  --from-backup daily-full-backup-20260611020000 \
  --include-namespaces production,payments,orders \
  --wait

# Check restore status
velero restore describe daily-full-backup-20260611020000-restore
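During an incident, nobody wants to work out the newest backup name by eye. Backups created by a Schedule are named after the schedule followed by a 14-digit timestamp, as in the example above, so the newest one sorts last. The sketch below picks it from a list of names read on stdin; the function name latest_backup is hypothetical.

```bash
# latest_backup SCHEDULE
# Reads backup names on stdin (one per line) and prints the newest backup
# created by SCHEDULE, relying on the <schedule>-<YYYYMMDDhhmmss> naming.
# Prints nothing if no name matches.
latest_backup() {
  local schedule="$1"
  grep -E "^${schedule}-[0-9]{14}\$" | sort | tail -n 1
}

# Usage, assuming Velero runs in the velero namespace:
#   kubectl get backups.velero.io -n velero \
#     -o custom-columns=NAME:.metadata.name --no-headers \
#     | latest_backup daily-full-backup
```

The result can be passed to --from-backup. Check that the backup actually completed before restoring from it.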

DR Strategy: GitOps First, Velero Second

GitOps dramatically reduces recovery complexity:

RTO (Recovery Time Objective):
  Without GitOps: restore Velero backup (1-4 hours) + manual config recovery
  With GitOps:    new cluster + ArgoCD sync from Git (15-30 minutes for apps)
                  + Velero restore for stateful data only

RPO (Recovery Point Objective):
  Application state: Git commit = last deployed state (0 data loss for config)
  Stateful data: last Velero snapshot (1h RPO if hourly backups)

What GitOps can replace in a recovery scenario:

  • All Kubernetes manifests (Deployments, Services, Ingresses, ConfigMaps, Policies)
  • Platform components (ArgoCD, Cilium, cert-manager, Kyverno)
  • Application configuration

What still needs Velero:

  • Persistent Volume data (databases, file storage)
  • Kubernetes Secrets that External Secrets Operator (ESO) does not sync from an external store
  • Stateful workloads with data not in external systems

Cluster Upgrade Strategy at Fleet Scale

Upgrading 20 clusters without disruption requires a fleet-wide strategy.

The upgrade pipeline

Staging cluster → Canary production cluster → All production clusters

Stage 1 (Week 1): Upgrade staging cluster to K8s 1.31
  - Run full workload tests
  - Check API deprecations (kubent scan)
  - Validate platform components (cert-manager, Cilium, ArgoCD) on 1.31

Stage 2 (Week 2): Upgrade one canary production cluster
  - Smallest / least critical production cluster
  - Monitor for 5 days: error rates, latency, unexpected behaviors

Stage 3 (Week 3+): Rolling upgrade all production clusters
  - 2-3 clusters per day
  - Monitor DORA metrics during upgrade window
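  - Automatically halt the rollout if the SLO burn rate spikes during the upgrade (a control plane cannot be downgraded in place, so recovery means shifting traffic to clusters that are not yet upgraded)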
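The "2-3 clusters per day" pacing in Stage 3 can be planned up front so every team knows its upgrade day. A minimal sketch in Bash, assuming the cluster names are passed in the desired upgrade order; the function name plan_waves is hypothetical.

```bash
# plan_waves PER_DAY CLUSTER...
# Prints "day N: <clusters>" lines with at most PER_DAY clusters per line,
# keeping the order in which the clusters were given.
plan_waves() {
  local per_day="$1" day=1 count=0 line="" cluster
  shift
  [ "$per_day" -ge 1 ] 2>/dev/null || { printf 'PER_DAY must be >= 1\n' >&2; return 1; }
  for cluster in "$@"; do
    line="${line:+$line }$cluster"       # append with a separating space
    count=$((count + 1))
    if [ "$count" -eq "$per_day" ]; then
      printf 'day %d: %s\n' "$day" "$line"
      day=$((day + 1)); count=0; line=""
    fi
  done
  # Flush a final, partially filled day.
  [ -z "$line" ] || printf 'day %d: %s\n' "$day" "$line"
}

# Usage:
#   plan_waves 2 production-eu-west production-us-east production-ap-south
#   # day 1: production-eu-west production-us-east
#   # day 2: production-ap-south
```

Listing the least critical clusters first keeps the riskiest upgrades for the point where the new version has the most production mileage.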

Pre-flight checks before upgrading

Bash
# Check for deprecated API usage in all running manifests
kubent --target-version 1.31

# Example output:
# NAME         NAMESPACE    KIND     API VERSION                 REPLACEMENT            DEPRECATED
# ingress-old  production   Ingress  networking.k8s.io/v1beta1   networking.k8s.io/v1   REMOVED in 1.22

# Check Helm charts for deprecated APIs
helm mapkubeapis release-name --namespace production

# Run Pluto: simpler deprecation scanner
pluto detect-helm --target-versions k8s=v1.31.0

CAPI rolling node upgrade

Bash
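# Update K8s version in cluster spec (GitOps: update the value in Git, ArgoCD syncs)
kubectl patch kubeadmcontrolplane production-eu-west-control-plane -n clusters \
  --type merge \
  --patch '{"spec":{"version":"v1.31.0"}}'

# CAPI replaces the 3 control plane nodes one at a time.
# Worker nodes follow once the MachineDeployment version is bumped as well:
kubectl patch machinedeployment production-eu-west-workers -n clusters \
  --type merge \
  --patch '{"spec":{"template":{"spec":{"version":"v1.31.0"}}}}'

# Watch progress
kubectl get machines -n clusters -w

# Status:
# production-eu-west-control-plane-abc   Running  v1.30.0
# production-eu-west-control-plane-xyz   Running  v1.31.0    <- new control plane node
# production-eu-west-control-plane-abc   Deleting v1.30.0    <- old node being replaced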
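Kubernetes upgrades have to move one minor version at a time, so a guard in front of the version bump is cheap insurance against a typo such as jumping from 1.30 to 1.32. The following sketch compares only the minor component and assumes both versions share the same major version; the function names are hypothetical.

```bash
# minor_of VERSION
# Extracts the minor number from a version such as v1.30.0 or 1.30.
minor_of() {
  local v="${1#v}"            # drop an optional leading "v"
  v="${v#*.}"                 # drop the major number and the first dot
  printf '%s\n' "${v%%.*}"    # keep the text up to the next dot
}

# can_upgrade CURRENT TARGET
# Succeeds only when TARGET has the same minor version as CURRENT or is
# exactly one minor version ahead. Patch versions are not compared.
can_upgrade() {
  local from to
  from="$(minor_of "$1")"
  to="$(minor_of "$2")"
  [ "$((to - from))" -ge 0 ] && [ "$((to - from))" -le 1 ]
}

# Usage:
#   can_upgrade v1.30.0 v1.31.0 && echo "ok to patch"
#   can_upgrade v1.30.0 v1.32.0 || echo "refusing: skips a minor version"
```

In a GitOps flow the same check fits into a CI job that compares the version in the pull request against the version on the main branch.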

Fleet Observability

Managing 20 clusters means 20 Prometheus instances. Options:

Thanos: Global Prometheus query layer

Cluster 1: Prometheus → Thanos Sidecar → S3 (long-term storage)
Cluster 2: Prometheus → Thanos Sidecar → S3
...
Cluster 20: Prometheus → Thanos Sidecar → S3

Central Thanos Querier: query across all clusters simultaneously
Central Grafana: dashboards that span the entire fleet

Bash
# Query metrics from all production clusters simultaneously
curl "http://thanos-querier.monitoring:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total[5m])) by (cluster, service)'

Grafana fleet overview dashboard

JSON
{
  "panels": [
    {
      "title": "Cluster Health Overview",
      "type": "table",
      "targets": [
        {
          "expr": "kube_node_status_condition{condition='Ready', status='true'} == 0",
          "legendFormat": "{{cluster}} - {{node}} NOT READY"
        }
      ]
    },
    {
      "title": "Workloads OutOfSync (ArgoCD)",
      "type": "stat",
      "targets": [
        {
          "expr": "sum by (cluster) (argocd_app_info{sync_status='OutOfSync'})"
        }
      ]
    }
  ]
}

Platform Team Runbook: New Cluster Onboarding

Checklist for adding a new cluster to the fleet:

□ 1. Provision cluster (CAPI manifest in GitOps repo)
□ 2. Install CNI (Cilium) — per cluster Helm values in gitops-repo/clusters//
□ 3. Install cert-manager + ClusterIssuer
□ 4. Install External Secrets Operator + Vault ClusterSecretStore
□ 5. Install Kyverno + import platform ClusterPolicies
□ 6. Install ArgoCD Spoke registration (or ArgoCD App-of-Apps for platform components)
□ 7. Register cluster in ArgoCD hub (argocd cluster add)
□ 8. Apply default-deny CiliumClusterwideNetworkPolicy
□ 9. Install Velero + configure S3 backup location
□ 10. Configure Thanos sidecar for global metrics
□ 11. Add cluster to Grafana data sources
□ 12. Add cluster to fleet overview dashboard
□ 13. Verify: kubent scan (no deprecated APIs), kube-bench (CIS score), cilium connectivity test
□ 14. Tag cluster in ArgoCD: env, region, tier labels

Time from "git commit with cluster YAML" to "cluster receiving workloads": ~25 minutes (CAPI provisioning ~15min + bootstrap ~10min).
