
Senior System Design Interview: Identity Verification System

Full senior/staff-level interview walkthrough — design an identity verification system with high read traffic, auditability, Redis caching, queue architecture, CI/CD pipeline, and rollback strategy. Real answers you can use.

Learnixo · April 18, 2026 · 9 min read
System Design · Interview Prep · Architecture · Databases · Redis · Queues · CI/CD · Senior

This is a real senior-level system design interview walkthrough. Each section is a follow-up question an interviewer might ask — answered at staff/lead level. Read it like a conversation.


The Opening Question

"Design a system for identity verification with high read traffic. How would you approach this?"

What the interviewer is really asking

They want to see if you understand:

  • System design — service separation, API design
  • Data modeling — schemas, consistency
  • Scalability — caching, indexing, read replicas
  • Compliance awareness — auditability, traceability

Strong answer

I would start by separating out a dedicated Verification Service responsible for handling identity validation, storing verification results, and exposing APIs to the rest of the system.

For the data layer, I would use a relational database (PostgreSQL) to store structured verification data — user identity attributes, verification states, timestamps, and audit logs. Given the compliance-heavy domain, data integrity and traceability are non-negotiable.

Since the system will have high read traffic (checking if a user is verified), I would optimize for reads by:

  • Introducing Redis caching for frequently accessed verification status and profile summaries
  • Structuring the data model to separate write-heavy and read-heavy workloads — normalized schema for writes, denormalized views for reads
  • Adding read replicas for larger scale

I would expose the verification service via a well-defined API layer, and introduce asynchronous processing (message queues) for document verification workflows or external API calls.

Bonus line: "I would also consider event-driven architecture for tracking verification changes and ensuring consistency across services."
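
To make that concrete, here is a minimal sketch of the read endpoint. FastAPI and the in-memory store are illustrative assumptions — the store stands in for the Redis/PostgreSQL read path described above.

Python
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Placeholder store for illustration only; the real read path would check
# Redis first, then fall back to the relational read model.
_STATUS_STORE = {"user-123": "VERIFIED"}

@app.get("/users/{user_id}/verification")
async def read_verification(user_id: str):
    status = _STATUS_STORE.get(user_id)
    if status is None:
        raise HTTPException(status_code=404, detail="user not found")
    return {"user_id": user_id, "status": status}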


Follow-up: Database Schema for Auditability

"What table structures and indexing strategies would you consider essential?"

Strong answer

I would design the schema to separate operational data from audit data.

Primary table (verification_records) — current state only:

SQL
CREATE TABLE verification_records (
  id               uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id          text NOT NULL,
  status           text NOT NULL, -- PENDING | IN_PROGRESS | VERIFIED | REJECTED
  version          int NOT NULL DEFAULT 1,
  updated_at       timestamptz DEFAULT now()
);

Append-only audit table (verification_audit_log):

SQL
CREATE TABLE verification_audit_log (
  id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  entity_id    text NOT NULL,
  action       text NOT NULL, -- created | updated | verified | rejected
  performed_by text NOT NULL,
  previous_val jsonb,
  new_val      jsonb,
  created_at   timestamptz DEFAULT now()
);

Read model (user_verification_summary) — for fast lookups:

SQL
CREATE TABLE user_verification_summary (
  user_id            text PRIMARY KEY,
  verification_status text NOT NULL,
  last_verified_at   timestamptz
);

Indexing strategy:

SQL
-- Operational queries
CREATE INDEX idx_verifications_user ON verification_records(user_id);
CREATE INDEX idx_verifications_status ON verification_records(user_id, status);
-- The composite (user_id, status) index also covers user_id-only lookups,
-- so the single-column index can be dropped if write overhead matters.

-- Audit history
CREATE INDEX idx_audit_entity_time ON verification_audit_log(entity_id, created_at DESC);

For very large audit logs, use table partitioning by date to keep queries fast.


Follow-up: Redis Caching Strategy

"How would you integrate Redis to balance low-latency reads with auditable, consistent data?"

Strong answer

Redis acts as a read-optimization layer only — the relational database remains the single source of truth for all writes and audit data.

Write flow (consistency first):

  1. Request → Backend
  2. Write to PostgreSQL (source of truth)
  3. Persist audit log (append-only)
  4. Invalidate or update cache

Cache strategy — cache-aside (lazy loading):

  • On read: check Redis → cache miss → fetch from DB → store in Redis
  • On write: update DB → invalidate relevant cache key (see the sketch after this list)
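
A minimal cache-aside sketch, assuming redis-py and a DB-API cursor over the user_verification_summary read model (the helper names are illustrative):

Python
import redis

r = redis.Redis()        # assumes a local Redis; configure host/port in practice
TTL_SECONDS = 600        # safety-net TTL, per the rule below

def get_verification_status(user_id: str, cur):
    key = f"user:{user_id}:verification_status"
    cached = r.get(key)
    if cached is not None:
        return cached.decode()           # cache hit
    # Cache miss: fall back to the source of truth.
    cur.execute(
        "SELECT verification_status FROM user_verification_summary WHERE user_id = %s",
        (user_id,),
    )
    row = cur.fetchone()
    if row is None:
        return None
    r.setex(key, TTL_SECONDS, row[0])    # repopulate with TTL
    return row[0]

def invalidate_on_write(user_id: str) -> None:
    # Option A below: delete the key; the next read repopulates from the DB.
    r.delete(f"user:{user_id}:verification_status")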

Cache invalidation:

Option A — invalidate on write (most common): After any update, delete the affected key. Next read repopulates from DB.

Option B — update cache directly: For simple fields (e.g., status), update Redis immediately after DB write.

TTL: Always set TTL (5–15 min) as a safety net against missed invalidations — but never rely on TTL alone for correctness.

Namespaced keys:

user:123:verification_status
user:123:profile_summary

Critical rule: Audit logs are never cached as source of truth. Redis only caches derived or read models.


Follow-up: Maintaining Consistency at High Volume

"When audit-related data changes frequently, how do you avoid stale cache?"

Strong answer

I would combine write-first consistency, targeted invalidation, and event-driven updates:

  1. Write-first — all changes go to the database first, including the audit log. Only after commit do I touch the cache.

  2. Targeted key invalidation — invalidate only the affected key (e.g., user:123:verification), not a global cache flush.

  3. Event-driven invalidation (for scale) — after a DB write, publish a UserVerificationUpdated event. Consumers listen and invalidate or update the cache, which decouples write logic from cache management. (A consumer sketch follows this list.)

  4. TTL as fallback — 5–10 minutes ensures eventual consistency if an invalidation is missed.

  5. Versioning — include updated_at or version in cached objects. Only overwrite cache if the new data is newer.

  6. Never cache raw audit data — cache only aggregated views like verification_status.
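
Sketching items 3 and 5 together — a consumer that refreshes the cache only when the event's version is newer. The event shape is an assumption, and the cached value is assumed to carry its version:

Python
import json
import redis

r = redis.Redis()

def on_user_verification_updated(event: dict) -> None:
    # Assumed event shape: {"user_id": ..., "verification_status": ..., "version": ...}
    key = f"user:{event['user_id']}:verification_status"
    cached = r.get(key)
    if cached is not None and json.loads(cached)["version"] >= event["version"]:
        return  # cached copy is at least as new; don't overwrite
    r.setex(key, 600, json.dumps({
        "status": event["verification_status"],
        "version": event["version"],
    }))
    # Note: the read-then-write above isn't atomic; a Lua script or
    # WATCH/MULTI would close that gap in production.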


Follow-up: Queue Architecture for Background Jobs

"How would you structure the queue system for high throughput with strict security?"

Strong answer

I would use an event-driven pipeline with a secure queue layer and isolated worker services.

Queue topology (separate workloads):

  • high_priority_verification — real-time identity checks
  • standard_verification — normal flow
  • low_priority_tasks — re-validation, background enrichment

Kafka → separate topics with partitions. RabbitMQ → multiple queues with routing keys.

Message structure (idempotent):

JSON
{
  "job_id": "uuid",
  "verification_id": "ref-only",
  "type": "IDENTITY_CHECK",
  "version": 1
}

Pass only references (IDs) — workers fetch sensitive data securely from DB.

Retry strategy:

fail → retry queue (1s) → retry queue (5s) → retry queue (30s) → Dead Letter Queue
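
A minimal sketch of that routing ladder; the queue names and publish callable are illustrative assumptions:

Python
RETRY_QUEUES = ["retry_1s", "retry_5s", "retry_30s"]  # matches the ladder above

def route_failure(job: dict, publish) -> None:
    """Send a failed job to the next backoff queue, or the DLQ when exhausted."""
    attempt = job.get("attempt", 0)
    if attempt < len(RETRY_QUEUES):
        job["attempt"] = attempt + 1
        publish(RETRY_QUEUES[attempt], job)   # delayed redelivery
    else:
        publish("dead_letter_queue", job)     # exhausted → manual inspection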

Security at queue level:

  • TLS + credentials for all connections
  • ACLs / IAM roles — restrict which services can publish/consume
  • Service tokens — producers sign messages, workers validate origin

Throughput:

  • Partition by verification_id — same entity → same partition → ordered processing (see the producer sketch below)
  • Scale consumers horizontally
  • Stateless workers
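
A producer sketch, assuming kafka-python, the standard_verification topic from the topology above, and the reference-only message shape shown earlier:

Python
import json
import uuid
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def enqueue_identity_check(verification_id: str) -> None:
    message = {
        "job_id": str(uuid.uuid4()),         # idempotency key
        "verification_id": verification_id,  # reference only — no PII on the wire
        "type": "IDENTITY_CHECK",
        "version": 1,
    }
    # Keying by verification_id maps the same entity to the same partition,
    # giving per-entity ordering while consumers scale horizontally.
    producer.send("standard_verification",
                  key=verification_id.encode(),
                  value=message)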

Follow-up: Preventing Race Conditions

"How do you ensure workers don't process conflicting jobs simultaneously?"

Strong answer

1. Unique constraints + partial index (prevent duplicate active jobs):

SQL
CREATE UNIQUE INDEX unique_active_job_per_verification
ON verification_jobs (verification_id)
WHERE status IN ('PENDING', 'IN_PROGRESS');

2. Row-level locking:

SQL
SELECT * FROM verification_records
WHERE id = $1
FOR UPDATE;

3. Optimistic concurrency control:

SQL
UPDATE verification_records
SET status = $new_status, version = version + 1
WHERE id = $id AND version = $expected_version;
-- 0 rows affected = another worker got there first → safe retry

4. Valid state transitions only:

PENDING → IN_PROGRESS → VERIFIED / REJECTED

Enforced via transaction-level validation or DB constraints.

5. Atomic processing — each job in one transaction:

  1. Lock row (FOR UPDATE)
  2. Validate current state
  3. Update verification status
  4. Insert audit log (append-only)
  5. Commit

Workers are idempotent (keyed by job_id). At-least-once delivery is safe because DB constraints prevent duplicates.
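
A sketch of that five-step transaction in worker code, assuming psycopg2 and the schema above. The allowed-transition map encodes the state machine; idempotency bookkeeping by job_id is omitted for brevity:

Python
import json
# `conn` is assumed to be an open psycopg2 connection.

ALLOWED = {"PENDING": {"IN_PROGRESS"}, "IN_PROGRESS": {"VERIFIED", "REJECTED"}}

def process_job(conn, job: dict, new_status: str, worker_id: str) -> bool:
    with conn:                                  # commit on success, rollback on error
        with conn.cursor() as cur:
            # 1. Lock the row
            cur.execute(
                "SELECT status FROM verification_records WHERE id = %s FOR UPDATE",
                (job["verification_id"],),
            )
            row = cur.fetchone()
            # 2. Validate the current state
            if row is None or new_status not in ALLOWED.get(row[0], set()):
                return False                    # invalid transition; drop safely
            # 3. Update verification status
            cur.execute(
                "UPDATE verification_records SET status = %s, version = version + 1, "
                "updated_at = now() WHERE id = %s",
                (new_status, job["verification_id"]),
            )
            # 4. Insert audit log (append-only)
            cur.execute(
                "INSERT INTO verification_audit_log "
                "(entity_id, action, performed_by, previous_val, new_val) "
                "VALUES (%s, %s, %s, %s, %s)",
                (job["verification_id"], new_status.lower(), worker_id,
                 json.dumps({"status": row[0]}), json.dumps({"status": new_status})),
            )
    # 5. Commit happened on exiting the `with conn` block
    return True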


Follow-up: CI/CD Pipeline Design

"How would you design a pipeline that supports gated releases, incident readiness, and rollback while keeping iteration fast?"

Strong answer

Environment strategy:

  • dev → rapid iteration, feature branches
  • staging → production-like validation
  • production → controlled releases

CI pipeline (every PR):

  • Linting + static analysis
  • Unit + integration tests
  • Docker image build
  • Security scan (dependencies, SAST)
  • All must pass before merge

CD pipeline (after merge to main):

  1. Auto-deploy to staging
  2. Run integration + smoke tests
  3. Manual approval (or auto checks) before production
  4. Progressive delivery: canary at 5–10% traffic → promote to 100%

Rollback options:

  • Blue-green: instant traffic switch back to previous environment
  • Redeploy versioned Docker image (app:1.2.3)
  • Feature flags: disable without redeploying

Keeping iteration fast:

  • Parallelize CI jobs
  • Cache dependencies
  • Auto-deploy to dev/staging

Follow-up: Automated Rollback During Canary

"What metrics would you monitor and how would you automate the rollback decision?"

Strong answer

Core principle: Compare canary vs stable baseline, not absolute values.

Metrics to monitor:

| Priority | Metric | Rollback trigger |
|----------|--------|------------------|
| Highest | Error rate (5xx) | >2–3× baseline |
| High | P95/P99 latency | >30–50% increase |
| High | Business KPIs (failed logins, conversions) | Significant drop |
| Medium | Throughput / success rate | Sustained drop |
| Lower | CPU/memory, queue backlog | Spike over window |

Automated flow:

  1. Deploy canary (5–10% traffic)
  2. Monitor all metrics continuously
  3. Compare against baseline
  4. If thresholds breached for sustained window (2–5 min): auto-rollback or pause + alert
  5. If healthy: promote to 100%

Safety mechanisms:

  • Require multi-metric confirmation (not just one spike) — see the decision sketch below
  • Time window prevents reacting to momentary noise
  • Combine system + business signals
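
A decision-function sketch that combines the sustained window with multi-metric confirmation; the metric names and thresholds are assumptions:

Python
from collections import deque

WINDOW_TICKS = 5  # e.g., one evaluation every 30–60s → a 2–5 min window

def breaches(canary: dict, baseline: dict) -> set:
    """Which guardrails does the canary violate relative to baseline?"""
    bad = set()
    if canary["error_rate"] > 2.5 * max(baseline["error_rate"], 1e-6):
        bad.add("error_rate")                    # >2–3× baseline
    if canary["p99_latency_ms"] > 1.4 * baseline["p99_latency_ms"]:
        bad.add("p99_latency")                   # >30–50% increase
    if canary["success_rate"] < 0.95 * baseline["success_rate"]:
        bad.add("business_kpi")                  # significant drop
    return bad

history = deque(maxlen=WINDOW_TICKS)

def should_rollback(canary: dict, baseline: dict) -> bool:
    history.append(breaches(canary, baseline))
    # Roll back only if at least two guardrails are breached on every tick
    # of a full window — a single momentary spike never triggers.
    return len(history) == WINDOW_TICKS and all(len(b) >= 2 for b in history)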

Follow-up: Rollback Safety

"How do you validate that rollback itself doesn't introduce inconsistencies?"

Strong answer

Design rollback as a first-class path:

1. ACID transactions — no partial writes. Failed mid-way = DB rollback.

2. Backward-compatible database migrations — expand → migrate → contract pattern:

  • Expand: add new columns (safe for both app versions)
  • Migrate: write to both old + new
  • Contract: remove old columns in a later release

If rollback happens, the old app version still works with the current schema.

3. Idempotent background jobs — safe to replay. Record state transitions so partially completed work is visible and recoverable.

  4. Saga pattern for multi-step workflows — if a step fails, emit compensating events to undo prior steps (a minimal sketch follows this list).

5. Test rollback in staging — don't discover rollback failures during an incident.

6. DLQ inspection — failed jobs go to dead-letter queue for manual inspection before replay.
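
For the saga pattern (item 4), a minimal sketch — each step pairs an action with a compensating action, and a failure unwinds completed steps in reverse:

Python
def run_saga(steps) -> None:
    """steps: list of (do, undo) callable pairs executed in order."""
    completed = []
    try:
        for do, undo in steps:
            do()
            completed.append(undo)
    except Exception:
        # Compensate already-completed steps in reverse order, then surface
        # the failure (e.g., to a DLQ for inspection).
        for undo in reversed(completed):
            undo()
        raise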


Summary: What This Shows the Interviewer

| Dimension | What you demonstrated |
|-----------|-----------------------|
| System thinking | Service separation, API design |
| Backend depth | PostgreSQL, indexing, read replicas, transactions |
| Scalability | Redis cache-aside, queue partitioning, horizontal scaling |
| Compliance | Append-only audit logs, atomic writes |
| Operational maturity | CI/CD gates, canary deployments, rollback-safe schema evolution |
| Concurrency | Row locking, optimistic versioning, idempotency |

This is how a Senior Backend Lead or Staff Engineer answers system design questions.

Enjoyed this article?

Explore the System Design learning path for more.
