Platform Engineering: Multi-Cluster Fleet Management — Cluster API, ArgoCD Hub-Spoke, and Disaster Recovery

When You Need Multi-Cluster

A single cluster works until it doesn't. Common forcing functions:

Regulatory isolation: PCI-DSS, HIPAA, or SOC 2 scoping is often simplest when sensitive workloads run in dedicated clusters
Blast radius reduction: a platform bug, a rogue workload, or a K8s upgrade failure in one cluster shouldn't take down everything
Geographic distribution: latency requirements or data residency laws force per-region clusters
Team autonomy: large orgs give teams or business units their own cluster for true isolation
Kubernetes version testing: run a canary cluster on the new K8s version before upgrading production

The threshold: at roughly three or more clusters you need a fleet management strategy. Below that, each cluster can be managed independently; beyond it, per-cluster manual work stops scaling and you need tooling.

Cluster API: Declarative Cluster Provisioning

Cluster API (CAPI) is Kubernetes managing Kubernetes. A management cluster runs CAPI controllers that provision and lifecycle-manage workload clusters, using CRDs.

Core concepts

Management Cluster:
  ├── Cluster CRD — top-level cluster object
  ├── KubeadmControlPlane — control plane definition
  ├── MachineDeployment — node pool definition
  └── InfrastructureCluster (provider-specific):
        ├── AWSCluster / AWSMachineTemplate  (AWS)
        ├── AzureCluster / AzureMachine       (Azure)
        ├── VSphereCluster / VSphereMachine   (vSphere)
        └── DockerCluster / DockerMachine     (local/CI)

Creating a cluster

YAML

# cluster.yaml — declare a Kubernetes cluster
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-eu-west
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-eu-west-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: production-eu-west
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  name: production-eu-west
  namespace: clusters
spec:
  region: eu-west-1
  sshKeyName: platform-key
  network:
    vpc:
      cidrBlock: "10.0.0.0/16"
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: production-eu-west-control-plane
  namespace: clusters
spec:
  replicas: 3     # HA control plane
  version: v1.30.0
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSMachineTemplate
      name: production-eu-west-control-plane
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          audit-log-path: /var/log/kubernetes/audit.log
          audit-policy-file: /etc/kubernetes/audit-policy.yaml
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: production-eu-west-workers
  namespace: clusters
spec:
  clusterName: production-eu-west
  replicas: 5
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: production-eu-west
  template:
    spec:
      clusterName: production-eu-west
      version: v1.30.0
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: production-eu-west-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: production-eu-west-workers

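Apply this, together with the AWSMachineTemplate and KubeadmConfigTemplate objects it references, and CAPI provisions the cluster on AWS. Scale nodes: kubectl scale machinedeployment production-eu-west-workers -n clusters --replicas=10. Upgrade Kubernetes: change .spec.version on the KubeadmControlPlane and .spec.template.spec.version on the MachineDeployment; CAPI then replaces machines in a rolling fashion.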
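
To make the upgrade step concrete, here is a minimal Python sketch of the version bump a GitOps pipeline might apply before committing. It assumes the manifests are already loaded as dictionaries (the YAML parsing is left out), and the function name bump_kubernetes_version is hypothetical. The point is that the version lives in two places: on the control plane object and inside the MachineDeployment's machine template.

```python
import copy


def bump_kubernetes_version(manifests, new_version):
    """Return a copy of the manifests with the Kubernetes version updated.

    KubeadmControlPlane keeps the version at .spec.version;
    MachineDeployment keeps it at .spec.template.spec.version.
    """
    updated = copy.deepcopy(manifests)
    for doc in updated:
        kind = doc.get("kind")
        if kind == "KubeadmControlPlane":
            doc["spec"]["version"] = new_version
        elif kind == "MachineDeployment":
            doc["spec"]["template"]["spec"]["version"] = new_version
    return updated


# Trimmed-down stand-ins for the manifests shown above.
manifests = [
    {"kind": "KubeadmControlPlane", "spec": {"replicas": 3, "version": "v1.30.0"}},
    {
        "kind": "MachineDeployment",
        "spec": {"replicas": 5, "template": {"spec": {"version": "v1.30.0"}}},
    },
]

upgraded = bump_kubernetes_version(manifests, "v1.31.0")
print(upgraded[0]["spec"]["version"])
print(upgraded[1]["spec"]["template"]["spec"]["version"])
```

The input list is left untouched, so the same function can be used to render a diff for review before anything is committed.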

GitOps for clusters themselves

Because clusters are CRDs, you can ArgoCD-sync cluster definitions from Git:

gitops-repo/
└── clusters/
    ├── production-eu-west/
    │   └── cluster.yaml
    ├── production-us-east/
    │   └── cluster.yaml
    └── staging-eu-west/
        └── cluster.yaml

An ArgoCD Application in the management cluster syncs clusters/ — new cluster YAML in Git = new cluster provisioned. This is GitOps all the way down.

ArgoCD Hub-Spoke: Fleet-Wide GitOps

The hub-spoke pattern uses one ArgoCD installation (hub) to deploy to all workload clusters (spokes).

Management/Hub Cluster:
  └── ArgoCD
        ├── Cluster Secret: production-eu-west (kubeconfig)
        ├── Cluster Secret: production-us-east (kubeconfig)
        ├── Cluster Secret: staging-eu-west (kubeconfig)
        └── ApplicationSets → deploys to all registered clusters

Registering spoke clusters

Bash

# Register a workload cluster with ArgoCD
argocd cluster add production-eu-west \
  --kubeconfig /path/to/prod-eu-kubeconfig \
  --name production-eu-west \
  --label env=production \
  --label region=eu-west

argocd cluster list
# SERVER                          NAME                    STATUS
# https://prod-eu.example.com     production-eu-west      Successful
# https://prod-us.example.com     production-us-east      Successful
# https://staging-eu.example.com  staging-eu-west         Successful

ArgoCD stores cluster credentials as Secrets in the argocd namespace with the argocd.argoproj.io/secret-type: cluster label.
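
For teams that prefer to register spokes declaratively instead of through the CLI, the Secret can be built directly. The sketch below shows the general shape of such a Secret as a Python dictionary. The secret name prefix "cluster-", the helper name cluster_secret, and the placeholder token and CA values are assumptions for illustration; in practice the credentials would come from a secrets manager, never from source code.

```python
import json


def cluster_secret(name, server, labels, bearer_token, ca_data):
    """Build an ArgoCD cluster Secret manifest as a plain dictionary."""
    config = {
        "bearerToken": bearer_token,
        "tlsClientConfig": {"insecure": False, "caData": ca_data},
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"cluster-{name}",  # hypothetical naming scheme
            "namespace": "argocd",
            "labels": {
                # This label is what marks the Secret as a cluster credential.
                "argocd.argoproj.io/secret-type": "cluster",
                **labels,
            },
        },
        "type": "Opaque",
        "stringData": {
            "name": name,
            "server": server,
            "config": json.dumps(config),
        },
    }


secret = cluster_secret(
    name="production-eu-west",
    server="https://prod-eu.example.com",
    labels={"env": "production", "region": "eu-west"},
    bearer_token="<service-account-token>",  # placeholder
    ca_data="<base64-encoded-ca-cert>",      # placeholder
)
print(json.dumps(secret["metadata"], indent=2))
```

The env and region labels on the Secret are the same labels that ApplicationSet cluster generators select on.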

ApplicationSets: Deploy to the Fleet

Instead of creating one Application per service per cluster, ApplicationSets generate applications dynamically using generators.

Cluster generator: deploy to all production clusters

YAML

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: nginx-ingress-all-clusters
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production   # deploy to all clusters labeled env=production
  template:
    metadata:
      name: "nginx-ingress-{{name}}"
    spec:
      project: platform
      source:
        repoURL: https://github.com/org/gitops-repo
        targetRevision: HEAD
        path: platform/nginx-ingress
        helm:
          valueFiles:
            - "values-{{metadata.labels.region}}.yaml"  # region-specific values
      destination:
        server: "{{server}}"
        namespace: ingress-nginx
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true

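This single ApplicationSet installs nginx-ingress on every production cluster automatically. Add a new production cluster and nginx-ingress deploys to it on the next reconciliation, typically within a few minutes (ArgoCD's default reconciliation interval is 3 minutes).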
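
The expansion the cluster generator performs can be modelled in a few lines of Python. This is a simplified simulation, not ArgoCD's implementation: it filters registered clusters by label and fills in the same three template parameters used above (name, server, and the region label). The region labels for production-us-east and staging-eu-west are assumed for the example.

```python
def matches(labels, selector):
    """True when every selector key/value pair is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def generate_apps(clusters, selector):
    """Render one Application stub per cluster that matches the selector."""
    apps = []
    for cluster in clusters:
        if not matches(cluster["labels"], selector):
            continue
        apps.append(
            {
                "name": f"nginx-ingress-{cluster['name']}",
                "server": cluster["server"],
                "valueFile": f"values-{cluster['labels']['region']}.yaml",
            }
        )
    return apps


clusters = [
    {
        "name": "production-eu-west",
        "server": "https://prod-eu.example.com",
        "labels": {"env": "production", "region": "eu-west"},
    },
    {
        "name": "production-us-east",
        "server": "https://prod-us.example.com",
        "labels": {"env": "production", "region": "us-east"},  # assumed labels
    },
    {
        "name": "staging-eu-west",
        "server": "https://staging-eu.example.com",
        "labels": {"env": "staging", "region": "eu-west"},  # assumed labels
    },
]

apps = generate_apps(clusters, {"env": "production"})
for app in apps:
    print(app["name"], app["server"], app["valueFile"])
```

One practical consequence: a cluster that matches the selector but lacks a region label would break the values file lookup, so it is worth validating labels at registration time.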

Git directory generator: auto-register services

YAML

generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/org/gitops-repo
            revision: HEAD
            directories:
              - path: apps/services/*     # each service directory
        - clusters:
            selector:
              matchLabels:
                env: production

Every service directory in apps/services/ × every production cluster = one Application. A developer creates apps/services/new-service/, and it deploys to all production clusters automatically.
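
The matrix generator is a Cartesian product, which the following Python sketch illustrates. The service directory names and the "<service>-<cluster>" naming scheme are hypothetical, since the fragment above does not show a template; in a real ApplicationSet the directory's last path segment is available as a template parameter.

```python
from itertools import product


def matrix(service_dirs, clusters):
    """Combine every service directory with every cluster."""
    return [
        {
            # Last path segment stands in for the service name.
            "name": f"{path.rsplit('/', 1)[-1]}-{cluster}",
            "path": path,
            "cluster": cluster,
        }
        for path, cluster in product(service_dirs, clusters)
    ]


service_dirs = ["apps/services/orders", "apps/services/payments"]  # hypothetical
production_clusters = ["production-eu-west", "production-us-east"]

applications = matrix(service_dirs, production_clusters)
for app in applications:
    print(app["name"], "->", app["cluster"])
```

The count grows multiplicatively: 40 services across 5 production clusters already means 200 Applications, which is worth keeping in mind when sizing the ArgoCD controller.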

Pull request generator: per-PR preview environments

YAML

generators:
  - pullRequest:
      github:
        owner: org
        repo: my-app
        tokenRef:
          secretName: github-token
          key: token
      requeueAfterSeconds: 60    # check for new PRs every minute
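template:                        # sibling of generators, not a list item
  metadata:
    name: "pr-{{number}}-preview"
  spec:
    project: default
    source:
      repoURL: https://github.com/org/my-app
      targetRevision: "{{head_sha}}"
      path: k8s/preview
    destination:
      server: https://staging-eu.example.com
      namespace: "preview-pr-{{number}}"
    syncPolicy:
      automated:
        prune: true
      syncOptions:
        - CreateNamespace=true   # the per-PR namespace does not exist yet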

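Every PR gets an ArgoCD Application that deploys that PR's code to the staging cluster in an isolated namespace. When the PR is merged or closed, the ApplicationSet controller deletes the Application and, by default, the resources it manages. A namespace created through the CreateNamespace=true sync option is generally not removed with the Application, so either include the Namespace in the k8s/preview manifests or clean it up separately.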
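
Conceptually, each polling cycle of the pull request generator is a set difference between the open PRs and the Applications that already exist. The Python sketch below models only that reconciliation step; the function name reconcile and the PR numbers are made up for the example.

```python
def reconcile(open_prs, existing_apps):
    """Return (apps to create, apps to delete) for one polling cycle."""
    desired = {f"pr-{number}-preview" for number in open_prs}
    to_create = sorted(desired - existing_apps)
    to_delete = sorted(existing_apps - desired)
    return to_create, to_delete


# PR 99 was merged since the last poll; PR 104 was just opened.
create, delete = reconcile(
    open_prs=[101, 104],
    existing_apps={"pr-99-preview", "pr-101-preview"},
)
print("create:", create)
print("delete:", delete)
```

With requeueAfterSeconds set to 60, a closed PR's preview environment can linger for up to a minute before it is scheduled for deletion.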

Multi-Cluster Networking

Pods in cluster A cannot natively reach pods in cluster B. Options:

Cilium Cluster Mesh

Bash

# Enable ClusterMesh on each cluster
cilium clustermesh enable --service-type LoadBalancer

# Connect two clusters
cilium clustermesh connect \
  --destination-context production-eu-west \
  --context production-us-east

# Verify
cilium clustermesh status

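After connecting, Cluster Mesh load-balances global services across clusters. A Service that exists with the same name and namespace in each cluster and carries the annotation service.cilium.io/global: "true" stays reachable under its usual DNS name, with backends drawn from every connected cluster:

order-service.default.svc.cluster.local   # resolved locally; backends span all clusters where the Service is marked global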

Use cases: active-active databases across regions, shared platform services (Vault, Grafana), cross-cluster traffic shifting for canary deploys.

Submariner (multi-CNI cross-cluster networking)

For clusters with different CNIs, Submariner creates cross-cluster tunnels:

Bash

subctl deploy-broker --kubeconfig hub.yaml
subctl join --kubeconfig cluster-a.yaml broker-info.subm
subctl join --kubeconfig cluster-b.yaml broker-info.subm

Disaster Recovery with Velero

Velero backs up Kubernetes objects and PVC snapshots. It's the final safety net if GitOps fails or a cluster needs full recovery.

Installation

Bash

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --backup-location-config region=eu-west-1 \
  --snapshot-location-config region=eu-west-1 \
  --secret-file ./credentials-velero

Scheduled backups

YAML

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"     # 2 AM daily
  template:
    includedNamespaces:
      - "*"                  # all namespaces
    excludedNamespaces:
      - kube-system
      - velero
    includeClusterResources: true
    storageLocation: default
    volumeSnapshotLocations:
      - default
    ttl: 720h                # 30 day retention
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-app-backup
  namespace: velero
spec:
  schedule: "0 * * * *"     # hourly
  template:
    includedNamespaces:
      - production
      - payments
      - orders
    labelSelector:
      matchLabels:
        backup: "hourly"     # only resources carrying this label are included (the selector applies to resources, not namespaces)
    ttl: 168h                # 7 day retention
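
The schedule and ttl values together determine how many backups sit in the bucket and what the worst-case data loss is. A small Python sketch of that arithmetic, assuming backups are evenly spaced and complete instantly (real backups take time, so treat the numbers as a lower bound for RPO); the function names are hypothetical:

```python
def retained_backups(interval_hours, ttl_hours):
    """Approximate number of unexpired backups at steady state."""
    return ttl_hours // interval_hours


def worst_case_rpo_hours(interval_hours):
    """A failure just before the next backup loses one full interval."""
    return interval_hours


daily_count = retained_backups(interval_hours=24, ttl_hours=720)
hourly_count = retained_backups(interval_hours=1, ttl_hours=168)

print("daily-full-backup copies retained:", daily_count)
print("hourly-app-backup copies retained:", hourly_count)
print("worst-case RPO with hourly backups (h):", worst_case_rpo_hours(1))
```

Multiplying the retained count by the average backup size gives a rough storage budget for each schedule.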

Cross-region backup

YAML

# Backup location 2 — different AWS region for DR
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: dr-region
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero-backups-dr
  config:
    region: us-east-1
---
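# Second schedule writes a separate backup to the DR region.
# Volume snapshots stay in the source region; use file-system backup
# or a snapshot copy if volume data must also be restorable from the DR region.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: dr-backup
  namespace: velero
spec:
  schedule: "30 2 * * *"    # 02:30 daily, after the primary backup
  template:
    storageLocation: dr-region
    ttl: 720h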

Recovery

Bash

# Restore selected namespaces from a specific backup
# (drop --include-namespaces to restore everything in the backup)
velero restore create production-restore \
  --from-backup daily-full-backup-20260611020000 \
  --include-namespaces production,payments,orders \
  --wait

# Check restore status (naming the restore explicitly makes it easy to look up)
velero restore describe production-restore

DR Strategy: GitOps First, Velero Second

GitOps dramatically reduces recovery complexity:

RTO (Recovery Time Objective):
  Without GitOps: restore Velero backup (1-4 hours) + manual config recovery
  With GitOps:    new cluster + ArgoCD sync from Git (15-30 minutes for apps)
                  + Velero restore for stateful data only

RPO (Recovery Point Objective):
  Application state: Git commit = last deployed state (0 data loss for config)
  Stateful data: last Velero snapshot (1h RPO if hourly backups)

What GitOps can replace in a recovery scenario:

All Kubernetes manifests (Deployments, Services, Ingresses, ConfigMaps, Policies)
Platform components (ArgoCD, Cilium, cert-manager, Kyverno)
Application configuration

What still needs Velero:

Persistent Volume data (databases, file storage)
Kubernetes Secrets that are not sourced from an external store through External Secrets Operator (ESO)
Stateful workloads with data not in external systems

Cluster Upgrade Strategy at Fleet Scale

Upgrading 20 clusters without disruption requires a fleet-wide strategy.

The upgrade pipeline

Staging cluster → Canary production cluster → All production clusters

Stage 1 (Week 1): Upgrade staging cluster to K8s 1.31
  - Run full workload tests
  - Check API deprecations (kubent scan)
  - Validate platform components (cert-manager, Cilium, ArgoCD) on 1.31

Stage 2 (Week 2): Upgrade one canary production cluster
  - Smallest / least critical production cluster
  - Monitor for 5 days: error rates, latency, unexpected behaviors

Stage 3 (Week 3+): Rolling upgrade all production clusters
  - 2-3 clusters per day
  - Monitor DORA metrics during upgrade window
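  - Automatically pause the rollout if SLO error budgets burn during the upgrade (in-place Kubernetes downgrades are not a supported path, so plan to fix forward or fail over)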
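
The staged rollout above can be expressed as a small planning function. This Python sketch assumes a fixed batch size for stage 3 and uses hypothetical cluster names beyond the three that appear in this article; it produces the order of upgrades and leaves the actual triggering and SLO gating to the pipeline.

```python
def plan_waves(staging, canary, production, per_day=3):
    """Order clusters into upgrade waves: staging, canary, then daily batches."""
    waves = [
        ("week 1: staging", [staging]),
        ("week 2: canary", [canary]),
    ]
    # The canary is already upgraded, so exclude it from the rolling batches.
    remaining = [cluster for cluster in production if cluster != canary]
    for day, start in enumerate(range(0, len(remaining), per_day), start=1):
        waves.append((f"week 3+: day {day}", remaining[start:start + per_day]))
    return waves


production = [
    "production-eu-west",
    "production-us-east",
    "production-ap-south",   # hypothetical
    "production-eu-north",   # hypothetical
    "production-us-west",    # hypothetical
]

waves = plan_waves(
    staging="staging-eu-west",
    canary="production-ap-south",
    production=production,
    per_day=2,
)
for label, clusters in waves:
    print(label, clusters)
```

Keeping the plan as data makes it easy to review in a pull request and to pause between waves when monitoring shows a regression.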

Pre-flight checks before upgrading

Bash

# Check for deprecated API usage in all running manifests
kubent --target-version 1.31

# Example output:
# NAME               NAMESPACE   KIND       API VERSION  REPLACEMENT  DEPRECATED
# ingress-old        production  Ingress    networking.k8s.io/v1beta1  → networking.k8s.io/v1  REMOVED in 1.22

# Check Helm releases for deprecated APIs (mapkubeapis is a Helm plugin)
helm mapkubeapis release-name --namespace production

# Run Pluto: simpler deprecation scanner
pluto detect-helm --target-versions k8s=v1.31.0

CAPI rolling node upgrade

Bash

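# Update K8s version in cluster spec (GitOps: update the value in Git, ArgoCD syncs)
# Control plane first
kubectl patch kubeadmcontrolplane production-eu-west-control-plane -n clusters \
  --type merge \
  --patch '{"spec":{"version":"v1.31.0"}}'

# Then the workers, once the control plane reports the new version
kubectl patch machinedeployment production-eu-west-workers -n clusters \
  --type merge \
  --patch '{"spec":{"template":{"spec":{"version":"v1.31.0"}}}}'

# CAPI replaces machines in a rolling fashion; watch progress
kubectl get machines -n clusters -w

# Status (abbreviated):
# production-eu-west-control-plane-abc   Running  v1.30.0
# production-eu-west-control-plane-xyz   Running  v1.31.0   ← new control plane node
# production-eu-west-control-plane-abc   Deleting v1.30.0   ← old node being replaced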

Fleet Observability

Managing 20 clusters means 20 Prometheus instances. Options:

Thanos: Global Prometheus query layer

Cluster 1: Prometheus → Thanos Sidecar → S3 (long-term storage)
Cluster 2: Prometheus → Thanos Sidecar → S3
...
Cluster 20: Prometheus → Thanos Sidecar → S3

Central Thanos Querier: query across all clusters simultaneously
Central Grafana: dashboards that span the entire fleet

Bash

# Query metrics from all production clusters simultaneously
curl "http://thanos-querier.monitoring:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total[5m])) by (cluster, service)'
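
The same query can be issued from a script. The Python sketch below builds the request URL and aggregates a Prometheus-style instant query response per cluster. It assumes each Prometheus instance sets a cluster external label, and the sample response is invented for illustration; the HTTP call itself is left out so the snippet runs without network access.

```python
from urllib.parse import urlencode

THANOS = "http://thanos-querier.monitoring:9090"
QUERY = "sum(rate(http_requests_total[5m])) by (cluster, service)"


def instant_query_url(promql, base=THANOS):
    """Build the URL for the Prometheus-compatible instant query endpoint."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def by_cluster(response):
    """Sum sample values per cluster label from an instant-vector response."""
    totals = {}
    for series in response["data"]["result"]:
        cluster = series["metric"].get("cluster", "unknown")
        # Each sample is [timestamp, "value-as-string"].
        totals[cluster] = totals.get(cluster, 0.0) + float(series["value"][1])
    return totals


# Invented sample in the shape the query API returns.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"cluster": "production-eu-west", "service": "orders"}, "value": [1718100000, "12.5"]},
            {"metric": {"cluster": "production-eu-west", "service": "payments"}, "value": [1718100000, "7.5"]},
            {"metric": {"cluster": "production-us-east", "service": "orders"}, "value": [1718100000, "3.0"]},
        ],
    },
}

url = instant_query_url(QUERY)
totals = by_cluster(sample)
print(url)
print(totals)
```

Passing the URL to any HTTP client and feeding the decoded JSON to by_cluster gives a per-cluster request rate for the whole fleet.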

Grafana fleet overview dashboard

JSON

{
  "panels": [
    {
      "title": "Cluster Health Overview",
      "type": "table",
      "targets": [
        {
          "expr": "kube_node_status_condition{condition='Ready', status='true'} == 0",
          "legendFormat": "{{cluster}} - {{node}} NOT READY"
        }
      ]
    },
    {
      "title": "Workloads OutOfSync (ArgoCD)",
      "type": "stat",
      "targets": [
        {
          "expr": "sum by (dest_server) (argocd_app_info{sync_status='OutOfSync'})"
        }
      ]
    }
  ]
}

Platform Team Runbook: New Cluster Onboarding

Checklist for adding a new cluster to the fleet:

□ 1. Provision cluster (CAPI manifest in GitOps repo)
□ 2. Install CNI (Cilium) — per cluster Helm values in gitops-repo/clusters//
□ 3. Install cert-manager + ClusterIssuer
□ 4. Install External Secrets Operator + Vault ClusterSecretStore
□ 5. Install Kyverno + import platform ClusterPolicies
□ 6. Install ArgoCD Spoke registration (or ArgoCD App-of-Apps for platform components)
□ 7. Register cluster in ArgoCD hub (argocd cluster add)
□ 8. Apply default-deny CiliumClusterwideNetworkPolicy
□ 9. Install Velero + configure S3 backup location
□ 10. Configure Thanos sidecar for global metrics
□ 11. Add cluster to Grafana data sources
□ 12. Add cluster to fleet overview dashboard
□ 13. Verify: kubent scan (no deprecated APIs), kube-bench (CIS score), cilium connectivity test
□ 14. Tag cluster in ArgoCD: env, region, tier labels

Time from "git commit with cluster YAML" to "cluster receiving workloads": ~25 minutes (CAPI provisioning ~15min + bootstrap ~10min).
