Interview: LLM Providers & Model Selection (OpenAI, Azure, Claude, Gemini)
Senior interview Q&A on OpenAI and Azure OpenAI, Claude and Gemini basics, model selection, cost, latency, compliance, and when to use hosted vs local models.
Q1: Compare OpenAI API vs Azure OpenAI for an enterprise .NET product. When is Azure the right choice?
Answer: Both expose similar chat/completions and embeddings APIs, but Azure OpenAI is the enterprise path:
| Dimension | OpenAI API | Azure OpenAI |
|-----------|------------|--------------|
| Data residency | OpenAI-operated regions | Your chosen Azure region (EU, US, etc.) |
| Identity | API keys | Microsoft Entra ID, managed identity, Key Vault |
| Compliance | Standard OpenAI terms | Microsoft enterprise agreements, HIPAA BAA options |
| Networking | Public internet | Private endpoints, VNet integration |
| Models | Latest first | Slight lag; stable named deployments |
When Azure wins: regulated healthcare/finance, existing Azure spend, need private endpoints, centralised secret rotation via Key Vault, unified monitoring in Application Insights.
When direct OpenAI wins: fastest access to newest models, smallest team without Azure footprint, prototypes.
Senior sound bite: "I treat Azure deployments as capacity units: gpt-4o-prod-eu is a deployment name, not a model string. I version deployments and route traffic with fallbacks."
Q2: How do Claude and Gemini fit into a multi-provider strategy?
Answer: Most production systems standardise on one primary provider and add others for specific strengths or failover.
Claude (Anthropic): Strong long-context reasoning, careful refusals, good for document analysis and policy-heavy assistants. API similar to OpenAI chat format (with provider-specific SDK).
Gemini (Google): Tight integration with Google Cloud, multimodal (image/video) native, competitive pricing on mid-tier models. Good if you already run on GCP (Vertex AI).
Multi-provider pattern:
- Router classifies task (complexity, modality, language)
- Primary handles 90% of traffic (e.g. Azure GPT-4o)
- Fallback on 429/5xx (cheaper model or alternate region)
- Specialist for niche tasks (Claude for 200k-token doc review)
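The pattern above can be sketched as a small route table plus a selection function. This is a minimal illustration, not a production router: the route labels, the 100,000-token cut-off, and the `primaryHealthy` flag are all assumptions made for the example, and a real router would use a classifier and live health signals.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical route table: one primary plus explicit, documented exceptions.
var routes = new Dictionary<string, string>
{
    ["default"] = "azure/gpt-4o",            // primary, handles most traffic
    ["long_document"] = "anthropic/claude",  // specialist for very long inputs
    ["fallback"] = "azure/gpt-4o-mini",      // used when the primary returns 429/5xx
};

// Picks a route from two crude signals. The token cut-off is an assumed value.
string SelectRoute(int promptTokens, bool primaryHealthy)
{
    if (promptTokens > 100_000) return routes["long_document"];
    return primaryHealthy ? routes["default"] : routes["fallback"];
}

Console.WriteLine(SelectRoute(1_200, primaryHealthy: true));
Console.WriteLine(SelectRoute(1_200, primaryHealthy: false));
Console.WriteLine(SelectRoute(180_000, primaryHealthy: true));
```

Keeping the exceptions in one table makes the "one primary, explicit exceptions" rule reviewable in code review rather than scattered across call sites.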
Interview trap: "We use every model" sounds immature. Strong answer: one primary, explicit exceptions documented.
Q3: Walk through model selection for these tasks: classification, RAG Q&A, code generation, and agent orchestration.
Answer:
| Task | Model tier | Why |
|------|------------|-----|
| Intent classification / routing | Small (GPT-4o-mini, Haiku) | Low latency, cheap, temperature 0 |
| RAG customer Q&A | Mid to large (GPT-4o) | Needs grounding + nuanced refusal |
| Code generation | Large reasoning model | Tool use + fewer logic errors |
| Agent planner / critic | Large | Multi-step reasoning; cost justified per session |
Decision checklist:
- Accuracy floor: wrong answer cost (medical > marketing copy)
- Latency SLA: TTFT under 500 ms for chat?
- Context size: full policy PDF in one shot?
- Structured output: JSON mode / tool calling support?
- Cost per 1k sessions: back-of-envelope with avg tokens
Example (pharmacy assistant): Route "is drug X OTC?" to RAG + GPT-4o. Route "summarise this 80-page formulary" to a long-context model. Route "hello" to a template response with no LLM call.
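The pharmacy example can be sketched as a two-step dispatch: classify the intent, then map it to a handling path. The keyword rules below are only a stand-in for the small routing model, and the intent and path names are hypothetical labels chosen for this illustration.

```csharp
using System;

// Keyword stand-in for the small routing model; the rules are illustrative only.
string ClassifyIntent(string message)
{
    var text = message.Trim().ToLowerInvariant();
    if (text is "hello" or "hi" or "hey") return "greeting";
    if (text.StartsWith("summarise", StringComparison.Ordinal) ||
        text.StartsWith("summarize", StringComparison.Ordinal)) return "summarise_document";
    return "drug_question";
}

// Maps each intent to a handling path; "template" means no LLM call at all.
string SelectPath(string intent) => intent switch
{
    "greeting" => "template",
    "summarise_document" => "long-context-model",
    _ => "rag+gpt-4o",
};

foreach (var msg in new[] { "hello", "Summarise this 80-page formulary", "Is drug X OTC?" })
    Console.WriteLine($"{msg} -> {SelectPath(ClassifyIntent(msg))}");
```

Separating classification from path selection means the classifier can later be swapped for a model call without touching the routing table.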
Q4: What is a deployment in Azure OpenAI and how do you design for high availability?
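Answer: A deployment is a named instance of a model (e.g. gpt-4o at a pinned version) inside an Azure OpenAI resource, with its own throughput quota (TPM/RPM) in that resource's region. Application code calls the deployment name, not the model name.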
HA patterns:
- Multiple deployments in one region (scale TPM)
- Cross-region failover: secondary deployment + circuit breaker in app code
- Retry with backoff on 429 (rate limit) and 503
- Queue bursty traffic (Service Bus) so spikes don't drop requests
- Cache identical prompts (Redis) to cut duplicate spend
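The retry-with-backoff item can be sketched independently of any SDK. The example below assumes the call is wrapped in a delegate that returns an HTTP-style status code; `SendWithBackoffAsync`, `FakeSend`, and the delay values are hypothetical names and numbers chosen for illustration. It uses `Random.Shared`, so it assumes .NET 6 or later.

```csharp
using System;
using System.Threading.Tasks;

// Retries a call on 429/503, doubling the wait each time and adding random jitter.
async Task<int> SendWithBackoffAsync(Func<Task<int>> send, int maxAttempts = 4, int baseDelayMs = 200)
{
    var status = 0;
    for (var attempt = 0; attempt < maxAttempts; attempt++)
    {
        status = await send();
        if (status != 429 && status != 503) return status; // success or non-retryable error
        if (attempt == maxAttempts - 1) break;             // out of attempts, give up
        // Exponential backoff: base * 2^attempt, plus jitter to avoid synchronised retries.
        var delayMs = baseDelayMs * (1 << attempt) + Random.Shared.Next(0, baseDelayMs);
        await Task.Delay(delayMs);
    }
    return status;
}

// Fake endpoint for the demo: rate-limited twice, then succeeds.
var calls = 0;
Task<int> FakeSend() => Task.FromResult(++calls < 3 ? 429 : 200);

var result = await SendWithBackoffAsync(FakeSend, baseDelayMs: 10);
Console.WriteLine($"status={result} after {calls} calls");
```

If the service returns a Retry-After header, honouring it is usually preferable to a computed delay; the sketch leaves that out to stay short.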
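```csharp
// Pseudocode: fallback chain over Azure deployment names (not model IDs).
// The client call is kept from the original sketch; adapt it to your SDK version.
var deployments = new[] { "gpt-4o-primary", "gpt-4o-secondary", "gpt-4o-mini-fallback" };
foreach (var deployment in deployments)
{
    try { return await client.GetChatCompletionsAsync(deployment, options); }
    catch (RequestFailedException ex) when (ex.Status == 429 || ex.Status >= 500)
    {
        // Rate limited or server error: try the next deployment.
    }
}
throw new AllModelsUnavailableException(); // hypothetical custom exception type
```
Q5: How do you estimate and control LLM API cost in production?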
Answer: Cost = (input tokens × input price) + (output tokens × output price) + embeddings + rerankers.
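A back-of-envelope version of the formula, covering only the chat-completion terms, might look like the sketch below. The prices and the session shape are placeholder assumptions, not real vendor pricing; embeddings and rerankers would be added as further terms.

```csharp
using System;

// Prices are per one million tokens and are placeholders, not real vendor pricing.
decimal EstimateCost(long inputTokens, long outputTokens, decimal inputPricePerM, decimal outputPricePerM)
    => inputTokens / 1_000_000m * inputPricePerM + outputTokens / 1_000_000m * outputPricePerM;

// Assumed session shape: 6 turns, 1,500 input and 300 output tokens per turn.
var perSession = EstimateCost(6 * 1_500, 6 * 300, inputPricePerM: 2.50m, outputPricePerM: 10.00m);
Console.WriteLine($"per session: ${perSession:F4}, per 1k sessions: ${perSession * 1000:F2}");
```

With these assumed numbers the session costs 0.0405 and a thousand sessions cost 40.50, which is the kind of figure worth having ready before choosing a model tier.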
Controls:
- Model routing: mini for cheap steps, large only when needed
- Prompt compression: summarise history, don't resend full thread
- Max output tokens: cap runaway generations
- Caching: exact or semantic cache for FAQs
- Batch embeddings: offline ingestion, not per-query embed of corpus
- Observability: cost per session, per feature flag, per tenant
Interview metric: "We track $ / successful resolution, not just $ / request. A cheap wrong answer is expensive."
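Of the controls above, exact-match caching is the simplest to illustrate. The sketch below uses an in-memory dictionary as a stand-in for Redis and a hypothetical `FakeModel` in place of a real API call; the normalisation rule (lower-case, collapse whitespace) is an assumption and may be too aggressive for prompts where case matters.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

var cache = new Dictionary<string, string>(); // stand-in for Redis
var llmCalls = 0;

// Builds a key from the model name and a whitespace/case-normalised prompt.
string CacheKey(string model, string prompt)
{
    var words = prompt.ToLowerInvariant()
        .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    var bytes = Encoding.UTF8.GetBytes($"{model}\n{string.Join(' ', words)}");
    return Convert.ToHexString(SHA256.HashData(bytes));
}

// Returns the cached answer when present; otherwise calls the model and stores the result.
string Answer(string model, string prompt, Func<string, string> callModel)
{
    var key = CacheKey(model, prompt);
    if (cache.TryGetValue(key, out var cached)) return cached;
    var fresh = callModel(prompt);
    cache[key] = fresh;
    return fresh;
}

// Hypothetical model call that counts invocations.
string FakeModel(string prompt) { llmCalls++; return $"answer to: {prompt.Trim()}"; }

Answer("gpt-4o-mini", "What are your opening hours?", FakeModel);
Answer("gpt-4o-mini", "  what are your   opening hours? ", FakeModel);
Console.WriteLine($"model calls: {llmCalls}");
```

Including the model name in the key prevents an answer generated by one tier from being served for another. A semantic cache would replace the hash with an embedding similarity lookup.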
Q6: What compliance questions should you answer before sending healthcare data to a hosted LLM?
Answer:
- Does PHI leave our boundary? If yes: BAA with provider, or de-identify first
- What is logged? Prompts/responses in vendor logs? Disable training on customer data
- Retention: conversation TTL, right to erasure (GDPR Art. 17)
- Region: data processed only in approved geography
- Audit trail: who asked what, what model answered, retrieved sources
Strong architecture: Local ASR for voice → anonymise text → hosted LLM for structuring only → map tokens back for clinician UI. Never send raw audio externally.
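The anonymise and map-back steps of that pipeline can be sketched as a pair of functions around the hosted call. This is a simplified illustration: it assumes the identifiers to mask are already known (real de-identification would need entity recognition and review), the `[ID_n]` placeholder format is hypothetical, and the hosted model's reply is a hard-coded stub.

```csharp
using System;
using System.Collections.Generic;

// Replaces each known identifier with a placeholder and records the mapping locally.
string Anonymise(string text, IEnumerable<string> identifiers, Dictionary<string, string> map)
{
    foreach (var id in identifiers)
    {
        if (!text.Contains(id, StringComparison.Ordinal)) continue;
        var token = $"[ID_{map.Count + 1}]";
        map[token] = id;
        text = text.Replace(id, token, StringComparison.Ordinal);
    }
    return text;
}

// Swaps placeholders back to the original values for the clinician-facing view.
string Restore(string text, Dictionary<string, string> map)
{
    foreach (var (token, original) in map)
        text = text.Replace(token, original, StringComparison.Ordinal);
    return text;
}

var map = new Dictionary<string, string>(); // never leaves the local boundary
var transcript = "Jane Doe reports chest pain. Jane Doe takes aspirin daily.";
var safe = Anonymise(transcript, new[] { "Jane Doe" }, map);
Console.WriteLine(safe);

// Only `safe` would be sent to the hosted model; this string stands in for its reply.
var structured = "Patient: [ID_1]; Complaint: chest pain";
Console.WriteLine(Restore(structured, map));
```

The mapping table is the sensitive artefact here, so it should live only inside the trusted boundary and follow the same retention rules as the source data.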
Q7: When would you run a local model (Ollama, vLLM) instead of a hosted API?
Answer:
| Use local | Use hosted |
|-----------|------------|
| Strict air-gapped / no egress | Best quality (GPT-4 class) |
| High volume, predictable workload | Low ops burden |
| Fine-tuned proprietary small model | Rapid iteration on prompts |
| Latency-sensitive edge | Multimodal at scale without GPU ops |
Tradeoff: Local Llama-class models need GPU ops, quantisation, and an eval harness, and often more post-processing to match GPT-4 quality.
Hybrid: Local classifier/router + hosted reasoning for hard queries.
Q8: Design model selection for an internal copilot used by 2,000 employees across Slack and a web portal.
Answer:
Requirements: SSO auth, tenant-aware RAG over SharePoint/Confluence, no training on company data, under 3s TTFT.
Architecture:
- Entra ID → user context + document ACL filter at retrieval time
- Router (mini model): search_docs | draft_email | summarize_thread | general_chat
- RAG with hybrid search + ACL metadata on chunks
- Generator GPT-4o for synthesis; mini for routing and query rewrite
- Guardrails: no PII in logs, DLP scan on outbound, citation required for factual claims
- Cost cap per department via rate limits and monthly budget alerts
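The ACL filter at retrieval time is the part of this architecture most often skipped, so a sketch may help. It assumes each chunk carries the groups allowed to read its source document and that the user's groups come from the identity provider; the chunk shape, group names, and scores are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Each chunk carries the groups allowed to read its source document (hypothetical shape).
var chunks = new List<(string Text, string[] AllowedGroups, double Score)>
{
    ("Q3 salary bands", new[] { "hr" }, 0.92),
    ("VPN setup guide", new[] { "all-staff" }, 0.81),
    ("Incident runbook", new[] { "sre", "security" }, 0.77),
};

// Filters by ACL before ranking, so restricted text never reaches the prompt.
List<string> Retrieve(IEnumerable<string> userGroups, int topK)
{
    var groups = new HashSet<string>(userGroups, StringComparer.OrdinalIgnoreCase);
    return chunks
        .Where(c => c.AllowedGroups.Any(g => groups.Contains(g)))
        .OrderByDescending(c => c.Score)
        .Take(topK)
        .Select(c => c.Text)
        .ToList();
}

Console.WriteLine(string.Join(", ", Retrieve(new[] { "all-staff", "sre" }, topK: 3)));
```

Filtering before the top-k cut matters: filtering afterwards can return fewer results than requested and, worse, invites bugs where restricted chunks are ranked and logged before being dropped.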
Why this scores in interviews: You connected identity, retrieval security, routing, and FinOps, not just "we'd use GPT-4."