Senior System Design Interview: Identity Verification System
Full senior/staff-level interview walkthrough: design an identity verification system with high read traffic, auditability, Redis caching, a queue architecture, a CI/CD pipeline, and a rollback strategy. Real answers you can use.
This is a real senior-level system design interview walkthrough. Each section is a follow-up question an interviewer might ask, answered at staff/lead level. Read it like a conversation.
The Opening Question
"Design a system for identity verification with high read traffic. How would you approach this?"
What the interviewer is really asking
They want to see if you understand:
- System design: service separation, API design
- Data modeling: schemas, consistency
- Scalability: caching, indexing, read replicas
- Compliance awareness: auditability, traceability
Strong answer
I would start by separating the system into a dedicated Verification Service responsible for handling identity validation, storing verification results, and exposing APIs for other parts of the system.
For the data layer, I would use a relational database (PostgreSQL) to store structured verification data: user identity attributes, verification states, timestamps, and audit logs. Given the compliance-heavy domain, data integrity and traceability are non-negotiable.
Since the system will have high read traffic (checking if a user is verified), I would optimize for reads by:
- Introducing Redis caching for frequently accessed verification status and profile summaries
- Structuring the data model to separate write-heavy and read-heavy workloads: a normalized schema for writes, denormalized views for reads
- Adding read replicas for larger scale
I would expose the verification service via a well-defined API layer, and introduce asynchronous processing (message queues) for document verification workflows or external API calls.
Bonus line: "I would also consider event-driven architecture for tracking verification changes and ensuring consistency across services."
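To make the API layer concrete, here is a minimal sketch of the hot read path, assuming a FastAPI service. The route, field names, and in-memory stand-in for the store are illustrative assumptions, not part of the original answer:

```python
from typing import Optional

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Placeholder store standing in for the Redis + PostgreSQL read path
# described in the caching section below.
_SUMMARIES = {
    "user-123": {"status": "VERIFIED", "last_verified_at": "2024-01-15T12:00:00Z"},
}

def get_summary(user_id: str) -> Optional[dict]:
    """Stub for the cache-aside lookup (Redis first, read model on a miss)."""
    return _SUMMARIES.get(user_id)

@app.get("/users/{user_id}/verification")
def read_verification(user_id: str):
    """Hot read path: is this user verified?"""
    summary = get_summary(user_id)
    if summary is None:
        raise HTTPException(status_code=404, detail="unknown user")
    return {"user_id": user_id, **summary}
```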
Follow-up: Database Schema for Auditability
"What table structures and indexing strategies would you consider essential?"
Strong answer
I would design the schema to separate operational data from audit data.
Primary table (verification_records), holding current state only:

```sql
CREATE TABLE verification_records (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id text NOT NULL,
    status text NOT NULL, -- PENDING | IN_PROGRESS | VERIFIED | REJECTED
    version int NOT NULL DEFAULT 1,
    updated_at timestamptz DEFAULT now()
);
```

Append-only audit table (verification_audit_log):
```sql
CREATE TABLE verification_audit_log (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    entity_id text NOT NULL,
    action text NOT NULL, -- created | updated | verified | rejected
    performed_by text NOT NULL,
    previous_val jsonb,
    new_val jsonb,
    created_at timestamptz DEFAULT now()
);
```

Read model (user_verification_summary), for fast lookups:
```sql
CREATE TABLE user_verification_summary (
    user_id text PRIMARY KEY,
    verification_status text NOT NULL,
    last_verified_at timestamptz
);
```

Indexing strategy:
```sql
-- Operational queries
CREATE INDEX idx_verifications_user ON verification_records(user_id);
CREATE INDEX idx_verifications_status ON verification_records(user_id, status);

-- Audit history
CREATE INDEX idx_audit_entity_time ON verification_audit_log(entity_id, created_at DESC);
```

For very large audit logs, use table partitioning by date to keep queries fast.
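As a minimal sketch of that partitioning, assuming PostgreSQL 11+ declarative range partitioning (the table and partition names are illustrative, and note that the partition key must be part of the primary key, so the key becomes (id, created_at)):

```python
import psycopg2

# Placeholder DSN; point at the real database in practice.
conn = psycopg2.connect("dbname=verification")

with conn, conn.cursor() as cur:
    # Parent table partitioned by created_at. In a partitioned table the
    # partition key must appear in the primary key, hence (id, created_at).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS verification_audit_log_part (
            id uuid NOT NULL DEFAULT gen_random_uuid(),
            entity_id text NOT NULL,
            action text NOT NULL,
            performed_by text NOT NULL,
            previous_val jsonb,
            new_val jsonb,
            created_at timestamptz NOT NULL DEFAULT now(),
            PRIMARY KEY (id, created_at)
        ) PARTITION BY RANGE (created_at);
    """)
    # One partition per month; a scheduled job would create these ahead of time.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS verification_audit_log_2024_01
            PARTITION OF verification_audit_log_part
            FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    """)
```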
Follow-up: Redis Caching Strategy
"How would you integrate Redis to balance low-latency reads with auditable, consistent data?"
Strong answer
Redis acts as a read-optimization layer only; the relational database remains the single source of truth for all writes and audit data.
Write flow (consistency first):
- Request → Backend
- Write to PostgreSQL (source of truth)
- Persist audit log (append-only)
- Invalidate or update cache
Cache strategy: cache-aside (lazy loading).
- On read: check Redis → cache miss → fetch from DB → store in Redis
- On write: update DB → invalidate the relevant cache key
Cache invalidation:
Option A: invalidate on write (most common). After any update, delete the affected key; the next read repopulates from the DB.
Option B: update the cache directly. For simple fields (e.g., status), update Redis immediately after the DB write.
TTL: always set a TTL (5–15 min) as a safety net against missed invalidations, but never rely on TTL alone for correctness.
Namespaced keys:

```
user:123:verification_status
user:123:profile_summary
```

Critical rule: audit logs are never cached as a source of truth. Redis only caches derived or read models.
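Putting the read and write paths together, here is a minimal cache-aside sketch using redis-py and psycopg2. The key names follow the namespacing above; the connection details and function names are illustrative assumptions:

```python
import json
from typing import Optional

import psycopg2
import redis

r = redis.Redis()                               # placeholder connection details
conn = psycopg2.connect("dbname=verification")  # placeholder DSN

TTL_SECONDS = 600  # 10-min safety net; explicit invalidation is the primary mechanism

def get_verification_status(user_id: str) -> Optional[str]:
    """Cache-aside read: Redis first, fall back to the read model on a miss."""
    key = f"user:{user_id}:verification_status"
    cached = r.get(key)
    if cached is not None:
        return cached.decode()
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT verification_status FROM user_verification_summary"
            " WHERE user_id = %s",
            (user_id,),
        )
        row = cur.fetchone()
    if row is None:
        return None
    r.setex(key, TTL_SECONDS, row[0])           # repopulate cache, with TTL
    return row[0]

def set_verification_status(user_id: str, status: str, actor: str) -> None:
    """Write path: PostgreSQL + audit log first, then invalidate the cache key."""
    with conn, conn.cursor() as cur:            # one transaction: state + audit
        cur.execute(
            "UPDATE user_verification_summary SET verification_status = %s"
            " WHERE user_id = %s",
            (status, user_id),
        )
        cur.execute(
            "INSERT INTO verification_audit_log"
            " (entity_id, action, performed_by, new_val)"
            " VALUES (%s, %s, %s, %s)",
            (user_id, "updated", actor, json.dumps({"status": status})),
        )
    r.delete(f"user:{user_id}:verification_status")  # Option A: invalidate on write
```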
Follow-up: Maintaining Consistency at High Volume
"When audit-related data changes frequently, how do you avoid stale cache?"
Strong answer
I would combine write-first consistency, targeted invalidation, and event-driven updates:
1. Write-first: all changes go to the database first, including the audit log. Only after commit do I touch the cache.
2. Targeted key invalidation: invalidate only the affected key (e.g., user:123:verification), not a global cache flush.
3. Event-driven invalidation (for scale): after a DB write, publish a UserVerificationUpdated event. Consumers listen and invalidate/update the cache. This decouples write logic from cache management.
4. TTL as fallback: 5–10 minutes ensures eventual consistency if an invalidation is missed.
5. Versioning: include updated_at or version in cached objects, and only overwrite the cache if the new data is newer (see the sketch after this list).
6. Never cache raw audit data: cache only aggregated views like verification_status.
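For point 5, a sketch of a version-guarded cache write with redis-py. It assumes cached values are JSON blobs carrying a version field (an updated_at timestamp works the same way); WATCH/MULTI keeps the check-then-set atomic:

```python
import json

import redis

r = redis.Redis()  # placeholder connection details

def update_cache_if_newer(key: str, new_value: dict, ttl: int = 600) -> bool:
    """Overwrite the cached object only if new_value carries a higher version,
    so a slow, stale writer can never clobber fresher data."""
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)                  # fail the write if key changes mid-flight
                cached = pipe.get(key)           # immediate-mode read while watching
                if cached is not None and \
                        json.loads(cached)["version"] >= new_value["version"]:
                    pipe.unwatch()
                    return False                 # cache already holds newer data
                pipe.multi()                     # start the transactional part
                pipe.setex(key, ttl, json.dumps(new_value))
                pipe.execute()
                return True
            except redis.WatchError:
                continue                         # key changed under us; retry
```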
Follow-up: Queue Architecture for Background Jobs
"How would you structure the queue system for high throughput with strict security?"
Strong answer
I would use an event-driven pipeline with a secure queue layer and isolated worker services.
Queue topology (separate workloads):
- high_priority_verification: real-time identity checks
- standard_verification: normal flow
- low_priority_tasks: re-validation, background enrichment

With Kafka: separate topics with partitions. With RabbitMQ: multiple queues with routing keys.
Message structure (idempotent):
```json
{
  "job_id": "uuid",
  "verification_id": "ref-only",
  "type": "IDENTITY_CHECK",
  "version": 1
}
```

Pass only references (IDs): workers fetch sensitive data securely from the DB.
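A producer-side sketch of this using kafka-python; the broker address is a placeholder (production would use TLS and authenticated credentials, per the security notes below). The payload carries only references, and keying by verification_id gives the per-entity ordering discussed under throughput:

```python
import json
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # placeholder; TLS + credentials in production
    value_serializer=lambda v: json.dumps(v).encode(),
)

def enqueue_identity_check(verification_id: str) -> None:
    """Publish a reference-only job: IDs travel on the queue, PII stays in the DB."""
    message = {
        "job_id": str(uuid.uuid4()),             # idempotency key for the worker
        "verification_id": verification_id,
        "type": "IDENTITY_CHECK",
        "version": 1,
    }
    # Keying by verification_id routes every job for one entity to the same
    # partition, which gives ordered processing per entity.
    producer.send(
        "high_priority_verification",
        key=verification_id.encode(),
        value=message,
    )
    producer.flush()
```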
Retry strategy:
fail → retry queue (1s) → retry queue (5s) → retry queue (30s) → Dead Letter Queue

Security at queue level:
- TLS + credentials for all connections
- ACLs / IAM roles: restrict which services can publish/consume
- Service tokens: producers sign messages, workers validate origin
Throughput:
- Partition by verification_id: same entity → same partition → ordered processing
- Scale consumers horizontally
- Stateless workers (see the worker sketch below)
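A matching worker sketch with kafka-python; process_job is a hypothetical application handler, and scaling out simply means starting more instances in the same consumer group:

```python
import json

from kafka import KafkaConsumer

def process_job(verification_id: str, job_id: str) -> None:
    """Hypothetical handler: fetch the entity by ID, run the check,
    persist the result. Idempotent on job_id, so redelivery is safe."""
    ...

# All instances share one group_id, so Kafka spreads partitions across them;
# adding throughput is just starting more worker processes.
consumer = KafkaConsumer(
    "high_priority_verification",
    bootstrap_servers="localhost:9092",   # placeholder; TLS + ACLs in production
    group_id="verification-workers",
    value_deserializer=lambda v: json.loads(v.decode()),
)

for record in consumer:
    job = record.value
    process_job(job["verification_id"], job["job_id"])
```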
Follow-up: Preventing Race Conditions
"How do you ensure workers don't process conflicting jobs simultaneously?"
Strong answer
1. Unique constraints + partial index (prevent duplicate active jobs):

```sql
CREATE UNIQUE INDEX unique_active_job_per_verification
ON verification_jobs (verification_id)
WHERE status IN ('PENDING', 'IN_PROGRESS');
```

2. Row-level locking:
```sql
SELECT * FROM verifications
WHERE id = $1
FOR UPDATE;
```

3. Optimistic concurrency control:
```sql
UPDATE verifications
SET status = $new_status, version = version + 1
WHERE id = $id AND version = $expected_version;
-- 0 rows affected = another worker got there first → safe retry
```

4. Valid state transitions only:

```
PENDING → IN_PROGRESS → VERIFIED / REJECTED
```

Enforced via transaction-level validation or DB constraints.
5. Atomic processing: each job runs in one transaction:
- Lock the row (FOR UPDATE)
- Validate the current state
- Update the verification status
- Insert the audit log entry (append-only)
- Commit
Workers are idempotent (keyed by job_id). At-least-once delivery is safe because DB constraints prevent duplicates.
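A worker-side sketch combining points 3 and 5, using psycopg2 (connection details are illustrative): the status update and the audit insert share one transaction, and cur.rowcount tells the worker whether it won the version check:

```python
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=verification")   # placeholder DSN

def try_transition(verification_id: str, expected_version: int,
                   new_status: str, job_id: str) -> bool:
    """Optimistic concurrency: the UPDATE only wins if the version is unchanged.
    0 rows affected means another worker got there first - back off and retry."""
    with conn, conn.cursor() as cur:             # one transaction: state + audit
        cur.execute(
            """
            UPDATE verifications
            SET status = %s, version = version + 1
            WHERE id = %s AND version = %s
            """,
            (new_status, verification_id, expected_version),
        )
        if cur.rowcount == 0:
            return False                         # lost the race; safe to retry
        cur.execute(
            """
            INSERT INTO verification_audit_log
                (entity_id, action, performed_by, new_val)
            VALUES (%s, %s, %s, %s)
            """,
            (verification_id, "updated", f"job:{job_id}",
             Json({"status": new_status})),
        )
    return True
```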
Follow-up: CI/CD Pipeline Design
"How would you design a pipeline that supports gated releases, incident readiness, and rollback while keeping iteration fast?"
Strong answer
Environment strategy:
- dev: rapid iteration, feature branches
- staging: production-like validation
- production: controlled releases
CI pipeline (every PR):
- Linting + static analysis
- Unit + integration tests
- Docker image build
- Security scan (dependencies, SAST)
- All must pass before merge
CD pipeline (after merge to main):
- Auto-deploy to staging
- Run integration + smoke tests
- Manual approval (or auto checks) before production
- Progressive delivery: canary at 5–10% traffic → promote to 100%
Rollback options:
- Blue-green: instant traffic switch back to the previous environment
- Redeploy a versioned Docker image (app:1.2.3)
- Feature flags: disable the feature without redeploying
Keeping iteration fast:
- Parallelize CI jobs
- Cache dependencies
- Auto-deploy to dev/staging
Follow-up: Automated Rollback During Canary
"What metrics would you monitor and how would you automate the rollback decision?"
Strong answer
Core principle: Compare canary vs stable baseline, not absolute values.
Metrics to monitor:
| Priority | Metric | Rollback trigger |
|----------|--------|------------------|
| Highest | Error rate (5xx) | >2–3× baseline |
| High | P95/P99 latency | >30–50% increase |
| High | Business KPIs (failed logins, conversions) | Significant drop |
| Medium | Throughput / success rate | Sustained drop |
| Lower | CPU/memory, queue backlog | Spike over a window |
Automated flow:
- Deploy the canary (5–10% traffic)
- Monitor all metrics continuously
- Compare against baseline
- If thresholds are breached for a sustained window (2–5 min): auto-rollback, or pause + alert
- If healthy: promote to 100%
Safety mechanisms:
- Require multi-metric confirmation (not just one spike)
- Time window prevents reacting to momentary noise
- Combine system + business signals
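An automation sketch of that decision loop; error_rate and p99_latency_ms are hypothetical stand-ins for real metric queries (Prometheus, Datadog, etc.), and the thresholds mirror the table above:

```python
import time

def error_rate(deployment: str) -> float:
    """Stub: replace with a real metrics query (e.g., a Prometheus range query)."""
    return 0.01 if deployment == "stable" else 0.012

def p99_latency_ms(deployment: str) -> float:
    """Stub: replace with a real metrics query."""
    return 180.0 if deployment == "stable" else 195.0

def canary_is_healthy(window_s: int = 180, interval_s: int = 15) -> bool:
    """Compare canary vs stable over a sustained window. A single noisy sample
    never decides; rollback needs multi-metric, multi-sample confirmation."""
    checks = window_s // interval_s
    breaches = 0
    for _ in range(checks):
        errors_bad = error_rate("canary") > 3 * error_rate("stable")
        latency_bad = p99_latency_ms("canary") > 1.5 * p99_latency_ms("stable")
        if errors_bad and latency_bad:        # multi-metric confirmation
            breaches += 1
        time.sleep(interval_s)
    return breaches < checks // 2             # sustained breach = unhealthy

if not canary_is_healthy():
    print("rollback: shift traffic back to stable")   # or pause + page on-call
else:
    print("promote: canary to 100%")
```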
Follow-up: Rollback Safety
"How do you validate that rollback itself doesn't introduce inconsistencies?"
Strong answer
Design rollback as a first-class path:
1. ACID transactions: no partial writes. If something fails mid-way, the DB rolls back.
2. Backward-compatible database migrations, using the expand → migrate → contract pattern (a dual-write sketch follows this list):
- Expand: add new columns (safe for both app versions)
- Migrate: write to both old + new
- Contract: remove old columns in a later release
If rollback happens, the old app version still works with the current schema.
3. Idempotent background jobs: safe to replay. Record state transitions so partially completed work is visible and recoverable.
4. Saga pattern for multi-step workflows: if a step fails, emit compensating events to undo prior steps.
5. Test rollback in staging: don't discover rollback failures during an incident.
6. DLQ inspection: failed jobs go to a dead-letter queue for manual inspection before replay.
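As a concrete illustration of the migrate phase in point 2, a hypothetical dual-write: suppose a full_name column is being split into first_name/last_name (the table and columns are invented for this example). Both old and new columns are written, so either app version, including a rolled-back one, reads consistent data:

```python
import psycopg2

conn = psycopg2.connect("dbname=verification")  # placeholder DSN

def update_name(user_id: str, first: str, last: str) -> None:
    """Migrate phase of expand -> migrate -> contract: write BOTH the old
    column and the new ones, so a rollback needs no schema change."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE user_profiles
            SET full_name = %s,     -- old column: dropped only in the contract phase
                first_name = %s,    -- new columns: added in the expand phase
                last_name = %s
            WHERE user_id = %s
            """,
            (f"{first} {last}", first, last, user_id),
        )
```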
Summary: What This Shows the Interviewer
| Dimension | What you demonstrated |
|-----------|----------------------|
| System thinking | Service separation, API design |
| Backend depth | PostgreSQL, indexing, read replicas, transactions |
| Scalability | Redis cache-aside, queue partitioning, horizontal scaling |
| Compliance | Append-only audit logs, atomic writes |
| Operational maturity | CI/CD gates, canary deployments, rollback-safe schema evolution |
| Concurrency | Row locking, optimistic versioning, idempotency |
This is how a Senior Backend Lead or Staff Engineer answers system design questions.
Enjoyed this article?
Explore the System Design learning path for more.