Senior System Design Interview: Identity Verification System
Full senior/staff-level interview walkthrough: design an identity verification system with high read traffic, auditability, Redis caching, a queue architecture, a CI/CD pipeline, and a rollback strategy. Real answers you can use.
This is a real senior-level system design interview walkthrough. Each section is a follow-up question an interviewer might ask, answered at staff/lead level. Read it like a conversation.
The Opening Question
"Design a system for identity verification with high read traffic. How would you approach this?"
What the interviewer is really asking
They want to see if you understand:
- System design: service separation, API design
- Data modeling: schemas, consistency
- Scalability: caching, indexing, read replicas
- Compliance awareness: auditability, traceability
Strong answer
I would start by separating the system into a dedicated Verification Service responsible for handling identity validation, storing verification results, and exposing APIs for other parts of the system.
For the data layer, I would use a relational database (PostgreSQL) to store structured verification data: user identity attributes, verification states, timestamps, and audit logs. Given the compliance-heavy domain, data integrity and traceability are non-negotiable.
Since the system will have high read traffic (checking if a user is verified), I would optimize for reads by:
- Introducing Redis caching for frequently accessed verification status and profile summaries
- Structuring the data model to separate write-heavy and read-heavy workloads: a normalized schema for writes, denormalized views for reads
- Adding read replicas for larger scale
I would expose the verification service via a well-defined API layer, and introduce asynchronous processing (message queues) for document verification workflows or external API calls.
Bonus line: "I would also consider event-driven architecture for tracking verification changes and ensuring consistency across services."
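To make the API layer concrete, here is a minimal sketch of the hot read path, assuming a FastAPI service. The route, field names, and in-memory stand-in for the store are illustrative assumptions, not part of the original answer:

```python
from typing import Optional

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Placeholder store standing in for the Redis + PostgreSQL read path
# described in the caching section below.
_SUMMARIES = {
    "user-123": {"status": "VERIFIED", "last_verified_at": "2024-01-15T12:00:00Z"},
}

def get_summary(user_id: str) -> Optional[dict]:
    """Stub for the cache-aside lookup (Redis first, read model on a miss)."""
    return _SUMMARIES.get(user_id)

@app.get("/users/{user_id}/verification")
def read_verification(user_id: str):
    """Hot read path: is this user verified?"""
    summary = get_summary(user_id)
    if summary is None:
        raise HTTPException(status_code=404, detail="unknown user")
    return {"user_id": user_id, **summary}
```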
Follow-up: Database Schema for Auditability
"What table structures and indexing strategies would you consider essential?"
Strong answer
I would design the schema to separate operational data from audit data.
Primary table (verification_records), holding current state only:

```sql
CREATE TABLE verification_records (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id text NOT NULL,
    status text NOT NULL, -- PENDING | IN_PROGRESS | VERIFIED | REJECTED
    version int NOT NULL DEFAULT 1,
    updated_at timestamptz DEFAULT now()
);
```

Append-only audit table (verification_audit_log):
```sql
CREATE TABLE verification_audit_log (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    entity_id text NOT NULL,
    action text NOT NULL, -- created | updated | verified | rejected
    performed_by text NOT NULL,
    previous_val jsonb,
    new_val jsonb,
    created_at timestamptz DEFAULT now()
);
```

Read model (user_verification_summary), for fast lookups:
```sql
CREATE TABLE user_verification_summary (
    user_id text PRIMARY KEY,
    verification_status text NOT NULL,
    last_verified_at timestamptz
);
```

Indexing strategy:
```sql
-- Operational queries
CREATE INDEX idx_verifications_user ON verification_records(user_id);
CREATE INDEX idx_verifications_status ON verification_records(user_id, status);

-- Audit history
CREATE INDEX idx_audit_entity_time ON verification_audit_log(entity_id, created_at DESC);
```

For very large audit logs, use table partitioning by date to keep queries fast.
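As a minimal sketch of that partitioning, assuming PostgreSQL 11+ declarative range partitioning (the table and partition names are illustrative, and note that the partition key must be part of the primary key, so the key becomes (id, created_at)):

```python
import psycopg2

# Placeholder DSN; point at the real database in practice.
conn = psycopg2.connect("dbname=verification")

with conn, conn.cursor() as cur:
    # Parent table partitioned by created_at. In a partitioned table the
    # partition key must appear in the primary key, hence (id, created_at).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS verification_audit_log_part (
            id uuid NOT NULL DEFAULT gen_random_uuid(),
            entity_id text NOT NULL,
            action text NOT NULL,
            performed_by text NOT NULL,
            previous_val jsonb,
            new_val jsonb,
            created_at timestamptz NOT NULL DEFAULT now(),
            PRIMARY KEY (id, created_at)
        ) PARTITION BY RANGE (created_at);
    """)
    # One partition per month; a scheduled job would create these ahead of time.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS verification_audit_log_2024_01
            PARTITION OF verification_audit_log_part
            FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    """)
```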
Follow-up: Redis Caching Strategy
"How would you integrate Redis to balance low-latency reads with auditable, consistent data?"
Strong answer
Redis acts as a read-optimization layer only; the relational database remains the single source of truth for all writes and audit data.
Write flow (consistency first):
- Request → Backend
- Write to PostgreSQL (source of truth)
- Persist audit log (append-only)
- Invalidate or update cache
Cache strategy: cache-aside (lazy loading).
- On read: check Redis → cache miss → fetch from DB → store in Redis
- On write: update DB → invalidate the relevant cache key
Cache invalidation:
Option A: invalidate on write (most common). After any update, delete the affected key; the next read repopulates from the DB.
Option B: update the cache directly. For simple fields (e.g., status), update Redis immediately after the DB write.
TTL: always set a TTL (5–15 min) as a safety net against missed invalidations, but never rely on TTL alone for correctness.
Namespaced keys:

```
user:123:verification_status
user:123:profile_summary
```

Critical rule: audit logs are never cached as a source of truth. Redis only caches derived or read models.
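Putting the read and write paths together, here is a minimal cache-aside sketch using redis-py and psycopg2. The key names follow the namespacing above; the connection details and function names are illustrative assumptions:

```python
import json
from typing import Optional

import psycopg2
import redis

r = redis.Redis()                               # placeholder connection details
conn = psycopg2.connect("dbname=verification")  # placeholder DSN

TTL_SECONDS = 600  # 10-min safety net; explicit invalidation is the primary mechanism

def get_verification_status(user_id: str) -> Optional[str]:
    """Cache-aside read: Redis first, fall back to the read model on a miss."""
    key = f"user:{user_id}:verification_status"
    cached = r.get(key)
    if cached is not None:
        return cached.decode()
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT verification_status FROM user_verification_summary"
            " WHERE user_id = %s",
            (user_id,),
        )
        row = cur.fetchone()
    if row is None:
        return None
    r.setex(key, TTL_SECONDS, row[0])           # repopulate cache, with TTL
    return row[0]

def set_verification_status(user_id: str, status: str, actor: str) -> None:
    """Write path: PostgreSQL + audit log first, then invalidate the cache key."""
    with conn, conn.cursor() as cur:            # one transaction: state + audit
        cur.execute(
            "UPDATE user_verification_summary SET verification_status = %s"
            " WHERE user_id = %s",
            (status, user_id),
        )
        cur.execute(
            "INSERT INTO verification_audit_log"
            " (entity_id, action, performed_by, new_val)"
            " VALUES (%s, %s, %s, %s)",
            (user_id, "updated", actor, json.dumps({"status": status})),
        )
    r.delete(f"user:{user_id}:verification_status")  # Option A: invalidate on write
```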
Follow-up: Maintaining Consistency at High Volume
"When audit-related data changes frequently, how do you avoid stale cache?"
Strong answer
I would combine write-first consistency, targeted invalidation, and event-driven updates:
1. Write-first: all changes go to the database first, including the audit log. Only after commit do I touch the cache.
2. Targeted key invalidation: invalidate only the affected key (e.g., user:123:verification), not a global cache flush.
3. Event-driven invalidation (for scale): after a DB write, publish a UserVerificationUpdated event. Consumers listen and invalidate/update the cache. This decouples write logic from cache management.
4. TTL as fallback: 5–10 minutes ensures eventual consistency if an invalidation is missed.
5. Versioning: include updated_at or version in cached objects, and only overwrite the cache if the new data is newer (see the sketch after this list).
6. Never cache raw audit data: cache only aggregated views like verification_status.
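For point 5, a sketch of a version-guarded cache write with redis-py. It assumes cached values are JSON blobs carrying a version field (an updated_at timestamp works the same way); WATCH/MULTI keeps the check-then-set atomic:

```python
import json

import redis

r = redis.Redis()  # placeholder connection details

def update_cache_if_newer(key: str, new_value: dict, ttl: int = 600) -> bool:
    """Overwrite the cached object only if new_value carries a higher version,
    so a slow, stale writer can never clobber fresher data."""
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)                  # fail the write if key changes mid-flight
                cached = pipe.get(key)           # immediate-mode read while watching
                if cached is not None and \
                        json.loads(cached)["version"] >= new_value["version"]:
                    pipe.unwatch()
                    return False                 # cache already holds newer data
                pipe.multi()                     # start the transactional part
                pipe.setex(key, ttl, json.dumps(new_value))
                pipe.execute()
                return True
            except redis.WatchError:
                continue                         # key changed under us; retry
```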
Follow-up: Queue Architecture for Background Jobs
"How would you structure the queue system for high throughput with strict security?"
Strong answer
I would use an event-driven pipeline with a secure queue layer and isolated worker services.
Queue topology (separate workloads):
- high_priority_verification: real-time identity checks
- standard_verification: normal flow
- low_priority_tasks: re-validation, background enrichment

With Kafka: separate topics with partitions. With RabbitMQ: multiple queues with routing keys.
Message structure (idempotent):
```json
{
  "job_id": "uuid",
  "verification_id": "ref-only",
  "type": "IDENTITY_CHECK",
  "version": 1
}
```

Pass only references (IDs): workers fetch sensitive data securely from the DB.
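A producer-side sketch of this using kafka-python; the broker address is a placeholder (production would use TLS and authenticated credentials, per the security notes below). The payload carries only references, and keying by verification_id gives the per-entity ordering discussed under throughput:

```python
import json
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # placeholder; TLS + credentials in production
    value_serializer=lambda v: json.dumps(v).encode(),
)

def enqueue_identity_check(verification_id: str) -> None:
    """Publish a reference-only job: IDs travel on the queue, PII stays in the DB."""
    message = {
        "job_id": str(uuid.uuid4()),             # idempotency key for the worker
        "verification_id": verification_id,
        "type": "IDENTITY_CHECK",
        "version": 1,
    }
    # Keying by verification_id routes every job for one entity to the same
    # partition, which gives ordered processing per entity.
    producer.send(
        "high_priority_verification",
        key=verification_id.encode(),
        value=message,
    )
    producer.flush()
```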
Retry strategy:
fail → retry queue (1s) → retry queue (5s) → retry queue (30s) → Dead Letter Queue

Security at queue level:
- TLS + credentials for all connections
- ACLs / IAM roles: restrict which services can publish/consume
- Service tokens: producers sign messages, workers validate origin
Throughput:
- Partition by verification_id: same entity → same partition → ordered processing
- Scale consumers horizontally
- Stateless workers (see the worker sketch below)
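A matching worker sketch with kafka-python; process_job is a hypothetical application handler, and scaling out simply means starting more instances in the same consumer group:

```python
import json

from kafka import KafkaConsumer

def process_job(verification_id: str, job_id: str) -> None:
    """Hypothetical handler: fetch the entity by ID, run the check,
    persist the result. Idempotent on job_id, so redelivery is safe."""
    ...

# All instances share one group_id, so Kafka spreads partitions across them;
# adding throughput is just starting more worker processes.
consumer = KafkaConsumer(
    "high_priority_verification",
    bootstrap_servers="localhost:9092",   # placeholder; TLS + ACLs in production
    group_id="verification-workers",
    value_deserializer=lambda v: json.loads(v.decode()),
)

for record in consumer:
    job = record.value
    process_job(job["verification_id"], job["job_id"])
```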
Follow-up: Preventing Race Conditions
"How do you ensure workers don't process conflicting jobs simultaneously?"
Strong answer
1. Unique constraints + partial index (prevent duplicate active jobs):

```sql
CREATE UNIQUE INDEX unique_active_job_per_verification
ON verification_jobs (verification_id)
WHERE status IN ('PENDING', 'IN_PROGRESS');
```

2. Row-level locking:
```sql
SELECT * FROM verifications
WHERE id = $1
FOR UPDATE;
```

3. Optimistic concurrency control:
```sql
UPDATE verifications
SET status = $new_status, version = version + 1
WHERE id = $id AND version = $expected_version;
-- 0 rows affected = another worker got there first → safe retry
```

4. Valid state transitions only:

```
PENDING → IN_PROGRESS → VERIFIED / REJECTED
```

Enforced via transaction-level validation or DB constraints.
5. Atomic processing: each job runs in one transaction:
- Lock the row (FOR UPDATE)
- Validate the current state
- Update the verification status
- Insert the audit log entry (append-only)
- Commit
Workers are idempotent (keyed by job_id). At-least-once delivery is safe because DB constraints prevent duplicates.
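A worker-side sketch combining points 3 and 5, using psycopg2 (connection details are illustrative): the status update and the audit insert share one transaction, and cur.rowcount tells the worker whether it won the version check:

```python
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=verification")   # placeholder DSN

def try_transition(verification_id: str, expected_version: int,
                   new_status: str, job_id: str) -> bool:
    """Optimistic concurrency: the UPDATE only wins if the version is unchanged.
    0 rows affected means another worker got there first - back off and retry."""
    with conn, conn.cursor() as cur:             # one transaction: state + audit
        cur.execute(
            """
            UPDATE verifications
            SET status = %s, version = version + 1
            WHERE id = %s AND version = %s
            """,
            (new_status, verification_id, expected_version),
        )
        if cur.rowcount == 0:
            return False                         # lost the race; safe to retry
        cur.execute(
            """
            INSERT INTO verification_audit_log
                (entity_id, action, performed_by, new_val)
            VALUES (%s, %s, %s, %s)
            """,
            (verification_id, "updated", f"job:{job_id}",
             Json({"status": new_status})),
        )
    return True
```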
Follow-up: CI/CD Pipeline Design
"How would you design a pipeline that supports gated releases, incident readiness, and rollback while keeping iteration fast?"
Strong answer
Environment strategy:
- dev: rapid iteration, feature branches
- staging: production-like validation
- production: controlled releases
CI pipeline (every PR):
- Linting + static analysis
- Unit + integration tests
- Docker image build
- Security scan (dependencies, SAST)
- All must pass before merge
CD pipeline (after merge to main):
- Auto-deploy to staging
- Run integration + smoke tests
- Manual approval (or auto checks) before production
- Progressive delivery: canary at 5–10% traffic → promote to 100%
Rollback options:
- Blue-green: instant traffic switch back to the previous environment
- Redeploy a versioned Docker image (app:1.2.3)
- Feature flags: disable the feature without redeploying
Keeping iteration fast:
- Parallelize CI jobs
- Cache dependencies
- Auto-deploy to dev/staging
Follow-up: Automated Rollback During Canary
"What metrics would you monitor and how would you automate the rollback decision?"
Strong answer
Core principle: Compare canary vs stable baseline, not absolute values.
Metrics to monitor:
| Priority | Metric | Rollback trigger |
|----------|--------|------------------|
| Highest | Error rate (5xx) | >2–3× baseline |
| High | P95/P99 latency | >30–50% increase |
| High | Business KPIs (failed logins, conversions) | Significant drop |
| Medium | Throughput / success rate | Sustained drop |
| Lower | CPU/memory, queue backlog | Spike over a window |
Automated flow:
- Deploy the canary (5–10% traffic)
- Monitor all metrics continuously
- Compare against baseline
- If thresholds are breached for a sustained window (2–5 min): auto-rollback, or pause + alert
- If healthy: promote to 100%
Safety mechanisms:
- Require multi-metric confirmation (not just one spike)
- Time window prevents reacting to momentary noise
- Combine system + business signals
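An automation sketch of that decision loop; error_rate and p99_latency_ms are hypothetical stand-ins for real metric queries (Prometheus, Datadog, etc.), and the thresholds mirror the table above:

```python
import time

def error_rate(deployment: str) -> float:
    """Stub: replace with a real metrics query (e.g., a Prometheus range query)."""
    return 0.01 if deployment == "stable" else 0.012

def p99_latency_ms(deployment: str) -> float:
    """Stub: replace with a real metrics query."""
    return 180.0 if deployment == "stable" else 195.0

def canary_is_healthy(window_s: int = 180, interval_s: int = 15) -> bool:
    """Compare canary vs stable over a sustained window. A single noisy sample
    never decides; rollback needs multi-metric, multi-sample confirmation."""
    checks = window_s // interval_s
    breaches = 0
    for _ in range(checks):
        errors_bad = error_rate("canary") > 3 * error_rate("stable")
        latency_bad = p99_latency_ms("canary") > 1.5 * p99_latency_ms("stable")
        if errors_bad and latency_bad:        # multi-metric confirmation
            breaches += 1
        time.sleep(interval_s)
    return breaches < checks // 2             # sustained breach = unhealthy

if not canary_is_healthy():
    print("rollback: shift traffic back to stable")   # or pause + page on-call
else:
    print("promote: canary to 100%")
```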
Follow-up: Rollback Safety
"How do you validate that rollback itself doesn't introduce inconsistencies?"
Strong answer
Design rollback as a first-class path:
1. ACID transactions: no partial writes. If something fails mid-way, the DB rolls back.
2. Backward-compatible database migrations, using the expand → migrate → contract pattern (a dual-write sketch follows this list):
- Expand: add new columns (safe for both app versions)
- Migrate: write to both old + new
- Contract: remove old columns in a later release
If rollback happens, the old app version still works with the current schema.
3. Idempotent background jobs: safe to replay. Record state transitions so partially completed work is visible and recoverable.
4. Saga pattern for multi-step workflows: if a step fails, emit compensating events to undo prior steps.
5. Test rollback in staging: don't discover rollback failures during an incident.
6. DLQ inspection: failed jobs go to a dead-letter queue for manual inspection before replay.
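As a concrete illustration of the migrate phase in point 2, a hypothetical dual-write: suppose a full_name column is being split into first_name/last_name (the table and columns are invented for this example). Both old and new columns are written, so either app version, including a rolled-back one, reads consistent data:

```python
import psycopg2

conn = psycopg2.connect("dbname=verification")  # placeholder DSN

def update_name(user_id: str, first: str, last: str) -> None:
    """Migrate phase of expand -> migrate -> contract: write BOTH the old
    column and the new ones, so a rollback needs no schema change."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE user_profiles
            SET full_name = %s,     -- old column: dropped only in the contract phase
                first_name = %s,    -- new columns: added in the expand phase
                last_name = %s
            WHERE user_id = %s
            """,
            (f"{first} {last}", first, last, user_id),
        )
```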
Summary: What This Shows the Interviewer
| Dimension | What you demonstrated |
|-----------|----------------------|
| System thinking | Service separation, API design |
| Backend depth | PostgreSQL, indexing, read replicas, transactions |
| Scalability | Redis cache-aside, queue partitioning, horizontal scaling |
| Compliance | Append-only audit logs, atomic writes |
| Operational maturity | CI/CD gates, canary deployments, rollback-safe schema evolution |
| Concurrency | Row locking, optimistic versioning, idempotency |
This is how a Senior Backend Lead or Staff Engineer answers system design questions.
Enjoyed this article?
Explore the System Design learning path for more.