Platform Engineering Interview Guide — 30 Questions with Deep Answers
Comprehensive platform engineering interview prep — IDP design, GitOps, Backstage, Crossplane, service mesh, policy as code, DORA metrics, and senior scenario questions. With full answers and what interviewers actually want to hear.
What Interviewers Are Really Testing
Platform engineering interviews test three things:
- Technical depth: Can you design and operate a Kubernetes-based platform at scale?
- Product thinking: Do you treat developers as customers and measure outcomes?
- Systems thinking: Do you understand the interplay between tooling, team structure, and culture?
Most candidates fail on the second and third: they know the tools but can't articulate why those tools matter or how they affect developers.
Foundations Questions
Q1: What is platform engineering and how is it different from DevOps?
What they want to hear:
Platform engineering is the discipline of building Internal Developer Platforms — self-service layers on top of infrastructure that reduce cognitive load for product teams. DevOps is a cultural practice of shared ownership; platform engineering is the product-level implementation of DevOps at scale.
The key difference: DevOps says "dev and ops work together." Platform engineering says "let's build tooling so product teams can do the ops themselves, without needing an ops expert embedded in every squad."
Sound senior by adding:
"The unit of value isn't the platform itself — it's the DORA metric improvement the platform enables. If deployment frequency goes from weekly to daily, that's the outcome. The platform is just the means."
Q2: What are the core pillars of an Internal Developer Platform?
Full answer:
- Deployment pipeline — CI/CD with opinionated defaults (GitHub Actions golden workflow, quality gates)
- Infrastructure self-service — provision databases, queues, caches via self-service forms, no tickets
- Environment management — on-demand dev/staging environments
- Secrets management — Vault + External Secrets Operator injecting credentials automatically
- Service catalog — Backstage for discovery, ownership, dependencies, API contracts
- Observability — shared Grafana/Loki/Tempo stack, zero-config for new services
- Security & policy — Kyverno/OPA in admission webhook, SBOM in CI, container scanning
Prioritization when starting from scratch: Start with deployment pipeline + secrets management. These eliminate the biggest developer friction points.
Q3: What is the "platform death spiral" and how do you avoid it?
Answer:
The death spiral: build features nobody asked for → low adoption → management questions value → team cut → platform degrades → teams return to manual chaos.
Avoid it with:
- Platform-as-a-Product mindset: product manager, quarterly roadmap, developer NPS surveys
- Measure adoption, not just availability
- OKRs tied to DORA metric improvement (deployment frequency up, lead time down)
- Monthly office hours with product teams
- Build the most requested thing, not the most technically interesting thing
Senior framing:
"The platform team's primary risk isn't a technical failure — it's being irrelevant. You can have perfect GitOps and nobody uses it. Adoption is the metric that matters."
GitOps Questions
Q4: Explain GitOps. What problem does it solve that traditional push-based CI/CD doesn't?
Answer:
GitOps: Git is the single source of truth for desired cluster state. An operator (ArgoCD, Flux) runs in the cluster and continuously reconciles actual state toward what's in Git.
Problems solved vs. push-based:
| Issue | Push (traditional) | GitOps |
|-------|------------------|--------|
| Manual kubectl edit in prod | Invisible drift | Detected, alerted, and (with self-heal enabled) auto-remediated |
| Audit trail | In CI logs | In Git (who changed what, when, and why via PR) |
| Cluster credentials in CI | Stored in CI secrets | Cluster credentials never leave cluster |
| Recovery from outage | Re-run pipeline (if pipeline is up) | Re-point cluster to Git — everything redeploys |
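
To make the drift-handling row concrete, here is a minimal sketch of an ArgoCD Application with automated sync. The repository URL, path, and names are placeholders chosen for illustration, not taken from any real setup.

```yaml
# Hypothetical example: repoURL, path, and names are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/gitops.git
    targetRevision: main
    path: apps/checkout-service
  destination:
    server: https://kubernetes.default.svc
    namespace: checkout
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual cluster changes back to the Git state
```

With `selfHeal: false`, ArgoCD still reports drift as OutOfSync but does not revert manual changes on its own; removing the `automated` block entirely makes every sync a manual action.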
Q5: You detect ArgoCD is showing OutOfSync on 15 production applications simultaneously. Walk through your incident response.
What to say:
- Don't blindly sync — 15 apps going OutOfSync at the same time means something changed at a shared level
- Check ArgoCD diff view — what changed? Is it in Git (intentional) or in the cluster (manual change)?
- If a Git commit triggered it: which commit? Check git log for the apps' source paths. Was it a shared Helm chart version bump or a change in a shared values file?
- If a manual cluster change: check K8s audit logs: who ran kubectl on what resource?
- If it was an unintended Git commit: revert the commit, then sync. Don't sync a bad state.
- If intentional (major release): coordinate syncing with teams, not all at once.
Sound senior:
"ArgoCD's OutOfSync is information, not an emergency on its own. The emergency is the root cause. I investigate before syncing. In production, I have self-heal disabled — every sync is intentional."
Q6: What is the App-of-Apps pattern in ArgoCD? When would you use ApplicationSets instead?
App-of-Apps: A root ArgoCD Application points to a directory of other Application YAMLs. ArgoCD manages them all. Adding a new application to the directory is all that's needed.
ApplicationSets: Template-based. An ApplicationSet generates Application resources dynamically using generators (Git directory, cluster list, pull request). Better for multi-cluster setups or when the number of apps grows large.
Use App-of-Apps when: You have a small number of clusters and applications, and want a simple, readable GitOps tree.
Use ApplicationSets when: You need to deploy the same app to 10 clusters, or you want services to auto-register when a developer creates a new directory in the gitops repo.
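
The two patterns can be sketched side by side. This is an illustrative example only: the repository URL, directory layout (`apps/`, `services/*`), and names are assumptions. The first document is an App-of-Apps root; the second is an ApplicationSet using the Git directory generator, so a new directory under `services/` becomes a new Application.

```yaml
# Hypothetical example: repo URL and directory layout are placeholders.
# App-of-Apps: a root Application pointing at a directory of Application YAMLs.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/gitops.git
    targetRevision: main
    path: apps            # directory containing child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
---
# ApplicationSet: one Application is generated per directory under services/.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: services
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example-org/gitops.git
        revision: main
        directories:
          - path: services/*
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/gitops.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
```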
Backstage Questions
Q7: A developer on your team says "Backstage is just a fancy wiki, nobody uses it." How do you respond?
This is testing whether you understand the adoption challenge, not just the tooling.
The criticism is valid if Backstage is only used for static docs. The value comes from:
- Scaffolder — if developers use it to create every new service, they use Backstage every week
- Kubernetes plugin — developers check deployment status without opening ArgoCD/kubectl
- Ownership data — who do I call when service X is down? (eliminates Slack messages)
- API catalog — API contracts discoverable without reading source code
The real answer:
"If Backstage is just a wiki, we didn't finish building the IDP. The scaffolder must be the default way to create services, and the catalog must be kept current. If it's stale and manual, it'll feel like a wiki."
Q8: How do you structure a Backstage software template for a golden path?
Key elements:
- Parameters: service name, owner (OwnerPicker), tech stack choice, infrastructure options (database: yes/no)
- Steps:
  - fetch:template copies skeleton files (repo template with Dockerfile, CI workflow, K8s manifests)
  - publish:github creates the repo
  - ArgoCD registration (custom action or argocd:create-resources)
  - catalog:register adds catalog-info.yaml to Backstage
- Output links: repo URL, ArgoCD link, catalog entry
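
A trimmed sketch of such a template is shown below. The template name, owner, organization, and parameter names are hypothetical, and the ArgoCD registration step is left out because its inputs depend on which custom action or plugin you install.

```yaml
# Hypothetical example: names, owner, and GitHub org are placeholders.
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: nodejs-service
  title: Node.js Service (golden path)
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name, owner]
      properties:
        name:
          type: string
          title: Service name
        owner:
          type: string
          title: Owner
          ui:field: OwnerPicker
        database:
          type: boolean
          title: Provision a database?
          default: false
  steps:
    - id: fetch
      name: Copy skeleton
      action: fetch:template
      input:
        url: ./skeleton          # Dockerfile, CI workflow, K8s manifests
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
          database: ${{ parameters.database }}
    - id: publish
      name: Create repository
      action: publish:github
      input:
        repoUrl: github.com?owner=example-org&repo=${{ parameters.name }}
    # ArgoCD registration step would go here (custom action, omitted).
    - id: register
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
  output:
    links:
      - title: Repository
        url: ${{ steps.publish.output.remoteUrl }}
      - title: Open in catalog
        icon: catalog
        entityRef: ${{ steps.register.output.entityRef }}
```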
Sound senior:
"The skeleton matters as much as the template logic. Every file in the skeleton is a best-practice decision made once for every team. The Dockerfile should be multi-stage. The GitHub Actions workflow should include SAST, image scanning, and SBOM generation. The K8s manifests should have resource limits, probes, and PodDisruptionBudget. The template encodes your platform's opinions."
Infrastructure Self-Service Questions
Q9: How does Crossplane compare to Terraform for self-service infrastructure provisioning?
| | Crossplane | Terraform |
|--|-----------|-----------|
| Paradigm | Kubernetes CRDs | HCL files + CLI |
| Drift detection | Continuous (control loop) | Manual (terraform plan) |
| State management | In etcd (Kubernetes) | State file (S3/Terraform Cloud) |
| Developer interface | kubectl apply or Backstage form | terraform apply |
| Learning curve | Lower for K8s teams | Lower for infra teams |
When to choose Crossplane: Developer self-service is the primary goal — developers already use kubectl. Platform team manages Composite Resource Definitions (XRDs) centrally. You want continuous drift detection on infra.
When to keep Terraform: Complex infra with many providers. Non-K8s teams. Already have mature Terraform state management.
Sound senior:
"Crossplane is powerful, but XRDs have a learning curve for platform engineers. For most teams, a middle path works: Terraform for platform-level infra (VPC, clusters), Crossplane for developer self-service resources (databases, queues). You don't have to choose one."
Q10: Design a self-service database provisioning system that satisfies both developer experience and security/compliance requirements.
Architecture:
Developer creates PostgreSQLClaim YAML
→ OPA/Kyverno validates (required labels, allowed sizes, backup must be true)
→ Crossplane resolves the claim through a Composition into AWS managed resources (RDS instance)
→ AWS RDS instance created with tags (team, cost-center, env)
→ Vault dynamic credentials generated (unique per app, 24h TTL, auto-rotated)
→ External Secrets Operator syncs credentials to app namespace as K8s Secret
→ Developer's app reads DB_HOST, DB_USER, DB_PASS from environment
Security requirements met:
- No static passwords (Vault dynamic secrets)
- Every resource tagged for cost attribution (enforced by Kyverno)
- Deletion protection required in production (enforced by policy)
- Audit trail: Vault audit log + CloudTrail for all credential access
Developer experience: Fill a Backstage form → database appears in 5 minutes → no tickets.
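
The first two steps of this flow might look like the sketch below. Everything in it is an assumption for illustration: the API group `platform.example.org`, the claim's `spec.parameters` fields, and the label values are defined by whatever XRD your platform team writes, not by Crossplane itself. The policy uses Kyverno's pattern-based validation.

```yaml
# Hypothetical example: API group, fields, and label values are placeholders
# that would be defined by the platform team's own XRD.
apiVersion: platform.example.org/v1alpha1
kind: PostgreSQLClaim
metadata:
  name: orders-db
  namespace: orders
  labels:
    team: orders
    cost-center: cc-1234
    env: production
spec:
  parameters:
    size: small
    backup: true
---
# Kyverno policy rejecting claims without a team label or with backups disabled.
# Newer Kyverno releases prefer a rule-level failureAction over this field.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-db-backup-and-team
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: backup-and-team-label
      match:
        any:
          - resources:
              kinds: [PostgreSQLClaim]
      validate:
        message: "PostgreSQLClaim needs a team label and spec.parameters.backup set to true."
        pattern:
          metadata:
            labels:
              team: "?*"        # any non-empty value
          spec:
            parameters:
              backup: true
```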
Service Mesh Questions
Q11: What is a service mesh? When would you choose Istio over Linkerd?
Service mesh: A layer of proxies that intercepts all network traffic between services, usually a sidecar injected into every Pod (sidecarless designs use a per-node proxy or eBPF instead). It provides mTLS between services, traffic management, and per-request observability without changing application code.
Linkerd: Lightweight (Rust proxy), low overhead, easy to operate, covers mTLS + L7 metrics + basic retries. Right for most teams.
Istio: Feature-rich (Envoy proxy), complex, higher resource overhead. Right when you need: canary releases with precise traffic weighting, fault injection for chaos testing, JWT-based end-user authorization, multi-cluster service mesh.
Interview answer:
"I default to Linkerd. The value of mTLS and observability is immediate. Istio's advanced features are real, but the operational complexity is also real — you need dedicated platform engineers to operate it well. Most teams don't need VirtualService-level traffic control. When a team needs canary releases specifically, I add Argo Rollouts; it's more targeted than adopting full Istio."
Q12: How do you implement zero-trust networking in a Kubernetes platform?
Layers:
- mTLS (service-to-service): Istio or Linkerd injects sidecar — every connection is mutually authenticated and encrypted
- NetworkPolicy: default-deny all ingress and egress per namespace, whitelist explicitly what's needed
- Workload identity (SPIFFE/SPIRE): cryptographic identity per workload rather than per namespace (in Istio it is derived from the Pod's service account), which enables fine-grained AuthorizationPolicy
- Istio AuthorizationPolicy: payments-service can only call /api/v2 on checkout-service (path-level authorization)
- Egress gateway: all outbound traffic exits through a controlled node; allowlist external endpoints
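
Two of these layers can be sketched as manifests. The namespaces, labels, and service account name are assumptions made for the example; the AuthorizationPolicy is attached to checkout-service and admits only the payments-service identity on the `/api/v2` paths.

```yaml
# Hypothetical example: namespaces, labels, and service accounts are placeholders.
# Layer 1: default-deny for every Pod in the namespace.
# Note: denying all egress also blocks DNS until you add an explicit allow rule.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: checkout
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
# Layer 2: path-level authorization based on workload identity.
# Older Istio releases serve this kind as security.istio.io/v1beta1.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-payments-to-checkout
  namespace: checkout
spec:
  selector:
    matchLabels:
      app: checkout-service
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/payments/sa/payments-service"]
      to:
        - operation:
            paths: ["/api/v2/*"]
```

Because an ALLOW policy now selects checkout-service, requests to it that match no rule are denied.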
Sound senior:
"Most teams achieve 'network segmentation' and call it zero trust — they have a NetworkPolicy per namespace that allows broad inter-namespace traffic. Real zero trust is per-service, per-path. You need workload identity for that, not just namespace identity. Start with Linkerd for mTLS + NetworkPolicies, then layer Istio AuthorizationPolicy as your security posture matures."
Policy as Code Questions
Q13: Compare Kyverno and OPA/Gatekeeper. When would you use each?
Kyverno:
- YAML-native policies (no separate language)
- Three core rule types: Validate (block invalid resources), Mutate (auto-modify resources), Generate (create related resources)
- Easier for platform engineers already familiar with Kubernetes
- Can mutate: auto-inject labels, add resource limits, inject sidecar containers
OPA/Gatekeeper:
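- Rego language: declarative and Datalog-inspired (deliberately not Turing-complete, so policies always terminate), expressive, steeper learning curve
- OPA, the engine behind Gatekeeper, works beyond Kubernetes: Terraform, APIs, CI pipelines (unified policy across all)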
- Better for complex cross-resource validation
Decision:
- Use Kyverno for Kubernetes-only policies, especially if you need mutation
- Use OPA when you need policy across Kubernetes + Terraform + other systems
Example Kyverno mutation policy (auto-add team label if missing):
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-team-label
spec:
  rules:
    - name: add-team-label
      match:
        any:
          - resources:
              kinds: [Deployment, StatefulSet]
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
              +(team): "unknown"  # + = add only if missing
```
Observability Questions
Q14: How would you build an observability platform that requires zero configuration from development teams?
Automatic (zero-config):
- Infrastructure metrics: kube-state-metrics + node-exporter + kubelet cAdvisor → Prometheus, giving CPU/memory/network per Pod without any app changes
- Access logs: Istio captures all L7 traffic logs (request rate, error rate, latency per service)
- Distributed tracing: OTel auto-instrumentation sidecar injected by admission webhook — traces without code changes (Java agent, .NET agent)
- Application logs: Fluent Bit DaemonSet on every node ships container stdout to Loki
- Pre-built dashboards: Grafana dashboards provisioned via ConfigMap — every service gets RED metrics (Rate, Errors, Duration) automatically
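
As one possible way to get tracing without code changes, the sketch below assumes the OpenTelemetry Operator is installed (the text above does not name a specific tool). The collector endpoint, namespaces, and sampling ratio are placeholders. Annotating the namespace asks the operator's webhook to inject the Java agent into Pods created there.

```yaml
# Hypothetical example: assumes the OpenTelemetry Operator is installed;
# endpoint, namespaces, and sampling ratio are placeholders.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default
  namespace: orders
spec:
  exporter:
    # The right port depends on the protocol your agent version expects.
    endpoint: http://otel-collector.observability:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
---
# Namespace-level opt-in: Pods created here get the Java agent injected.
apiVersion: v1
kind: Namespace
metadata:
  name: orders
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"
```

If the platform creates team namespaces from a template, this annotation can be part of that template, which keeps the setup invisible to product teams.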
Opt-in:
- Custom business metrics (revenue events, user actions)
- SLO alerting with error budgets
- Continuous profiling (Pyroscope)
- Synthetic monitoring
Sound senior:
"The zero-config part is only valuable if the automatic metrics answer the questions developers actually ask: 'Is my service up? What's my error rate? How long do requests take?' If those are answered without setup, the platform has real value. Custom business metrics are opt-in because they require domain knowledge — only the team knows what 'success' looks like for their business logic."
DORA & DevEx Questions
Q15: How do you justify the cost of a platform engineering team to a CFO?
Structure the business case as:
- Developer time saved: Platform eliminates N hours/week of toil per team. With 50 teams × 3 hours/week × $150/hour × ~48 working weeks ≈ $1.1M/year in recovered engineering capacity.
- Deployment frequency increase: From 2 deploys/week to 15 → features ship faster → revenue opportunity
- Incident reduction: MTTR from 4 hours to 45 minutes → less customer-facing downtime
- Infrastructure cost reduction: Right-sizing, autoscaling, decommissioning manual VM fleets
- Security incident prevention: Policy-as-code caught 12 misconfigurations before they reached production
Sound senior:
"The argument a CFO cares about is: 8 platform engineers enabling 200 product engineers to work 15% more efficiently is equivalent to 30 additional product engineers. Platform engineering is a force multiplier. Frame it in headcount equivalents, not in tools."
System Design Questions
Q16: Design an Internal Developer Platform for a company with 100 developers across 25 product teams. Walk through your architecture decisions.
This is a senior-level scenario question. Structure your answer:
Phase 1: Foundation (Month 1-3)
- GitOps with ArgoCD (deploy infra, then apps)
- Secrets management: External Secrets Operator + Vault
- CI: GitHub Actions with golden workflow template
- Basic Kubernetes multi-tenancy: namespace per team, NetworkPolicy default-deny, LimitRanges
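
The per-team tenancy baseline from Phase 1 could start from something like the sketch below. The namespace name, labels, and CPU/memory values are illustrative assumptions, not recommendations.

```yaml
# Hypothetical example: names and resource values are placeholders.
apiVersion: v1
kind: Namespace
metadata:
  name: team-orders
  labels:
    team: orders
---
# Containers that omit requests/limits inherit these defaults.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-orders
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 512Mi
---
# Default-deny: traffic must be allowed explicitly by additional policies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-orders
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
```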
Phase 2: Self-Service (Month 3-6)
- Backstage: seed catalog with top 25 services
- 2-3 Software Templates (Node.js, .NET, Python)
- Crossplane for PostgreSQL self-service
- Kyverno policies: resource limits required, labels required
Phase 3: Observability (Month 4-6)
- OpenTelemetry Collector cluster-wide
- Grafana + Loki + Tempo
- Pre-built RED dashboards via Grafana provisioning
- Kubecost for team-level cost visibility
Phase 4: Advanced (Month 6-12)
- Linkerd service mesh (mTLS + L7 metrics)
- Argo Rollouts for canary deployments
- Progressive delivery for high-risk services
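
For the canary item in Phase 4, a minimal Argo Rollouts sketch might look like this. The image, names, weights, and pause durations are assumptions. Without a traffic router configured, Argo Rollouts approximates the weights through the ratio of canary to stable replicas.

```yaml
# Hypothetical example: image, names, and step values are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: checkout
spec:
  replicas: 5
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: app
          image: registry.example.org/checkout-service:1.2.3
  strategy:
    canary:
      steps:
        - setWeight: 20          # roughly 1 of 5 replicas runs the new version
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {}              # wait indefinitely for manual promotion
```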
Team structure: 5-6 platform engineers (2 infra, 2 developer experience, 1 security, 1 product manager)
Adoption strategy:
"Don't mandate the platform. Make it better than the alternative. Embed a platform engineer in the first team to use the scaffolder — that team becomes an internal reference customer. Let success spread organically."
Q17: How do you handle the tension between platform standardization and team autonomy?
This is a culture/product question disguised as a technical one.
The tension: Platform team wants consistency (one way to deploy, one tech stack). Product teams want freedom (my team knows Node.js, don't force us to use .NET).
Resolution framework:
- Standardize the infrastructure layer, not the application layer
  - Standard: K8s manifests, CI/CD pipeline structure, secrets management, observability
  - Flexible: programming language, framework, database choice (within reason)
- Make the golden path easy, not mandatory
  - A Backstage template for Node.js, one for .NET, one for Python
  - Any other tech: use the "bring your own Dockerfile" escape hatch
  - Document the escape hatch: make it deliberate, not forbidden
- Measure consistency separately from adoption
  - You want 100% of services using the secure secrets management system (non-negotiable)
  - You want 80% of services using the golden path deployment pipeline (target, not mandate)
  - You want teams to choose their tech stack (fully flexible)
Sound senior:
"The mistake is enforcing standards everywhere or nowhere. Non-negotiables: security (no secrets in code), observability (must emit traces), cost tagging. Everything else: strong default, clear escape hatch. Teams that feel micromanaged don't contribute back to the platform."
Quick-Fire Questions with Model Answers
Q: What's the difference between ArgoCD and Flux?
Both are GitOps operators. ArgoCD has a better UI and is more feature-rich for complex sync strategies. Flux is more lightweight and opinionated for pure GitOps. Choose ArgoCD when developer-facing UI matters; Flux when you want minimal footprint.
Q: What is SPIFFE/SPIRE?
SPIFFE (Secure Production Identity Framework For Everyone) is a standard for assigning cryptographic identities to workloads independent of IP or namespace. SPIRE is its reference implementation. Used with Istio for per-workload identity rather than per-namespace trust, enabling fine-grained AuthorizationPolicy.
Q: What is Kubecost?
Kubernetes cost visibility tool — shows CPU/memory cost per namespace, workload, and team. Essential for FinOps and chargeback in multi-tenant clusters. Integrates with Backstage to show each team their cloud spend.
Q: What is External Secrets Operator (ESO)?
K8s operator that syncs secrets from external stores (Vault, AWS Secrets Manager, Azure Key Vault) into Kubernetes Secrets. Eliminates secrets in YAML files committed to Git. A configured refresh interval re-syncs periodically, so rotated secrets propagate to the cluster without manual steps.
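
A minimal ExternalSecret sketch is shown below. The store name, Vault path, and key names are hypothetical, and it assumes a ClusterSecretStore for Vault already exists; newer ESO releases also serve this kind under `external-secrets.io/v1`.

```yaml
# Hypothetical example: store name, remote key, and secret names are placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: orders-db
  namespace: orders
spec:
  refreshInterval: 1h            # how often ESO re-reads the external store
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: orders-db-credentials  # Kubernetes Secret that ESO creates and updates
  data:
    - secretKey: DB_PASS
      remoteRef:
        key: orders/db
        property: password
```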
Q: What is a Composite Resource Definition (XRD) in Crossplane?
An XRD lets platform engineers define the schema of a higher-level abstraction (e.g., PostgreSQLInstance); a matching Composition maps it to multiple lower-level cloud resources (RDS instance + subnet group + parameter group + password in Vault). Developers claim a PostgreSQLInstance without knowing about the underlying AWS objects.
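
As a rough sketch in the Crossplane v1 style, an XRD for this abstraction could look like the following. The API group and the `storageGB` parameter are invented for the example, and the Composition that maps it to AWS resources is not shown.

```yaml
# Hypothetical example: API group and parameter names are placeholders.
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xpostgresqlinstances.platform.example.org
spec:
  group: platform.example.org
  names:
    kind: XPostgreSQLInstance
    plural: xpostgresqlinstances
  claimNames:                     # namespaced kind that developers create
    kind: PostgreSQLInstance
    plural: postgresqlinstances
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                parameters:
                  type: object
                  properties:
                    storageGB:
                      type: integer
                  required: [storageGB]
              required: [parameters]
```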
Q: What DORA level should a platform team target after 1 year?
At minimum, move the org from Low to Medium on all four metrics. Target for High: deployment frequency at least weekly, lead time < 1 week, change failure rate < 10%, MTTR < 1 day. Elite is possible but requires both technical and cultural maturity across product teams, not just platform tooling.
How to Stand Out in a Platform Engineering Interview
Frame everything as outcomes, not tools:
- Weak: "I deployed Istio and it gave us mTLS"
- Strong: "We deployed Istio and reduced lateral movement risk — during our last security audit, we demonstrated that a compromised service couldn't reach unrelated services. That's why we chose Istio over just NetworkPolicies."
Show Platform-as-a-Product thinking:
- Weak: "We built a golden path template"
- Strong: "We built a golden path template, tracked adoption across 25 teams, ran monthly retrospectives, and updated the template 6 times in 6 months based on feedback. Adoption went from 40% to 90%."
Be honest about trade-offs:
- "Crossplane was the right call for us because our developers already used kubectl. If I were at a company with a strong Terraform culture and no Kubernetes expertise, I'd have kept Terraform."
Platform engineering interviews reward engineers who think like product managers as much as engineers who know Kubernetes.