Interview: LLM Providers & Model Selection (OpenAI, Azure, Claude, Gemini)

Q1: Compare OpenAI API vs Azure OpenAI for an enterprise .NET product. When is Azure the right choice?

Answer: Both expose similar chat/completions and embeddings APIs, but Azure OpenAI is the enterprise path:

| Dimension | OpenAI API | Azure OpenAI |
|-----------|------------|--------------|
| Data residency | OpenAI-operated regions | Your chosen Azure region (EU, US, etc.) |
| Identity | API keys | Microsoft Entra ID, managed identity, Key Vault |
| Compliance | Standard OpenAI terms | Microsoft enterprise agreements, HIPAA BAA options |
| Networking | Public internet | Private endpoints, VNet integration |
| Models | Latest first | Slight lag; stable named deployments |

When Azure wins: regulated healthcare/finance, existing Azure spend, need private endpoints, centralised secret rotation via Key Vault, unified monitoring in Application Insights.

When direct OpenAI wins: fastest access to newest models, smallest team without Azure footprint, prototypes.

Senior sound bite: "I treat Azure deployments as capacity units — gpt-4o-prod-eu is a deployment name, not a model string. I version deployments and route traffic with fallbacks."

Q2: How do Claude and Gemini fit into a multi-provider strategy?

Answer: Most production systems standardise on one primary provider and add others for specific strengths or failover.

Claude (Anthropic): Strong long-context reasoning, careful refusals, good for document analysis and policy-heavy assistants. API similar to OpenAI chat format (with provider-specific SDK).

Gemini (Google): Tight integration with Google Cloud, multimodal (image/video) native, competitive pricing on mid-tier models. Good if you already run on GCP (Vertex AI).

Multi-provider pattern:

- Router classifies task (complexity, modality, language)
- Primary handles 90% of traffic (e.g. Azure GPT-4o)
- Fallback on 429/5xx (cheaper model or alternate region)
- Specialist for niche tasks (Claude for 200k-token doc review)

Interview trap: "We use every model" — sounds immature. Strong answer: one primary, explicit exceptions documented.

Q3: Walk through model selection for these tasks: classification, RAG Q&A, code generation, and agent orchestration.

Answer:

| Task | Model tier | Why |
|------|------------|-----|
| Intent classification / routing | Small (GPT-4o-mini, Haiku) | Low latency, cheap, temperature 0 |
| RAG customer Q&A | Mid–large (GPT-4o) | Needs grounding + nuanced refusal |
| Code generation | Large reasoning model | Tool use + fewer logic errors |
| Agent planner / critic | Large | Multi-step reasoning; cost justified per session |

Decision checklist:

- Accuracy floor: what does a wrong answer cost (medical > marketing copy)?
- Latency SLA: is time to first token (TTFT) under 500 ms required for chat?
- Context size: does the full policy PDF need to fit in one request?
- Structured output: is JSON mode or tool calling supported?
- Cost per 1k sessions: back-of-envelope estimate from average tokens per session

Example (pharmacy assistant): Route "is drug X OTC?" to RAG + GPT-4o. Route "summarise this 80-page formulary" to long-context model. Route "hello" to template response — no LLM.
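The pharmacy routing above can be sketched as a small rule-based router. This is only an illustration: the `Route` enum, the greeting list, and the 20-page threshold are hypothetical, and a production router would more likely be a small classifier model with rules like these as a cheap first pass.

```csharp
using System;
using System.Linq;

// Hypothetical route labels for the pharmacy assistant example.
public enum Route { Template, RagGpt4o, LongContext }

public static class PharmacyRouter
{
    // Illustrative values only; tune both against real traffic.
    private static readonly string[] Greetings = { "hello", "hi", "hey", "thanks" };
    private const int LongDocumentPages = 20;

    public static Route Choose(string userText, int attachedPages = 0)
    {
        var normalized = userText.Trim().TrimEnd('!', '.', '?').ToLowerInvariant();

        // Small talk never reaches a model: answer from a canned template.
        if (Greetings.Contains(normalized)) return Route.Template;

        // Large attachments go to a long-context model.
        if (attachedPages >= LongDocumentPages) return Route.LongContext;

        // Everything else is treated as a grounded question: retrieve, then generate.
        return Route.RagGpt4o;
    }
}
```

For instance, `PharmacyRouter.Choose("hello")` yields `Route.Template`, while `PharmacyRouter.Choose("summarise this 80-page formulary", attachedPages: 80)` yields `Route.LongContext`.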

Q4: What is a deployment in Azure OpenAI and how do you design for high availability?

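Answer: A deployment is a named instance of a model and version (e.g. gpt-4o) inside an Azure OpenAI resource, with its own capacity allocation (TPM/RPM quota, or provisioned throughput) in that resource's region. Application code calls the deployment name, not the model name.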

HA patterns:

- Multiple deployments in one region (scale TPM)
- Cross-region failover: secondary deployment + circuit breaker in app code
- Retry with backoff on 429 (rate limit) and 503
- Queue bursty traffic (Service Bus) so spikes don't drop requests
- Cache identical prompts (Redis) to cut duplicate spend

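// Sketch: try each deployment in order, falling through on 429 or 5xx.
// Assumes the older Azure.AI.OpenAI beta client, where the deployment name is the first argument.
// A cross-region secondary is a separate resource, so it would need its own client.
var deployments = new[] { "gpt-4o-primary", "gpt-4o-secondary", "gpt-4o-mini-fallback" };
var failures = new List<Exception>();
foreach (var deployment in deployments) {
    try { return await client.GetChatCompletionsAsync(deployment, options); }
    catch (RequestFailedException ex) when (ex.Status == 429 || ex.Status >= 500) {
        failures.Add(ex); // keep the cause, then try the next deployment
    }
}
throw new AggregateException("All deployments unavailable.", failures);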
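The "retry with backoff" pattern from the list above might look like the following. It is a minimal sketch, not tied to any SDK: `Backoff`, `DelayFor`, and `RetryAsync` are hypothetical names, the caller supplies the `isTransient` check (for example, status 429 or 503), and the 0.5 s base, 30 s cap, and 20% jitter are illustrative defaults.

```csharp
using System;
using System.Threading.Tasks;

public static class Backoff
{
    // Delay before retry number `attempt` (0-based): base * 2^attempt, capped, plus up to 20% jitter.
    public static TimeSpan DelayFor(int attempt, Random rng, double baseSeconds = 0.5, double capSeconds = 30)
    {
        var exponential = Math.Min(capSeconds, baseSeconds * Math.Pow(2, attempt));
        var jitter = exponential * 0.2 * rng.NextDouble();
        return TimeSpan.FromSeconds(exponential + jitter);
    }

    // Runs `call`, retrying only while `isTransient` accepts the failure and attempts remain.
    public static async Task<T> RetryAsync<T>(
        Func<Task<T>> call,
        Func<Exception, bool> isTransient,
        int maxAttempts = 4,
        double baseSeconds = 0.5)
    {
        var rng = new Random();
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await call();
            }
            catch (Exception ex) when (attempt < maxAttempts - 1 && isTransient(ex))
            {
                // Transient failure with attempts left: wait, then loop again.
                await Task.Delay(DelayFor(attempt, rng, baseSeconds));
            }
        }
    }
}
```

On the final attempt the exception filter is false, so the last failure propagates to the caller, which can then move on to the next deployment in the fallback chain. The jitter spreads retries out so that many throttled clients do not all retry at the same instant.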

Q5: How do you estimate and control LLM API cost in production?

Answer: Cost = (input tokens × input price) + (output tokens × output price) + embeddings + rerankers.
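A back-of-envelope version of the formula, covering only the chat-token terms (embeddings and rerankers would be added the same way). `CostModel` and its method names are hypothetical, and prices are passed in as parameters because they change; the numbers in the usage note are placeholders, not quoted prices.

```csharp
using System;

public static class CostModel
{
    // Prices are per 1 million tokens; take current values from the provider's price sheet.
    public static decimal PerRequest(
        int inputTokens, int outputTokens,
        decimal inputPricePerM, decimal outputPricePerM)
        => inputTokens / 1_000_000m * inputPricePerM
         + outputTokens / 1_000_000m * outputPricePerM;

    // Scales a per-request cost to 1,000 sessions, given average requests per session.
    public static decimal Per1kSessions(decimal costPerRequest, decimal requestsPerSession)
        => costPerRequest * requestsPerSession * 1000m;
}
```

With placeholder prices of 2.50 per million input tokens and 10.00 per million output tokens, a request with 1,500 input and 300 output tokens costs 0.00375 + 0.003 = 0.00675. At four requests per session, that is 27.00 per 1,000 sessions.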

Controls:

- Model routing: mini for cheap steps, large only when needed
- Prompt compression: summarise history, don't resend full thread
- Max output tokens: cap runaway generations
- Caching: exact or semantic cache for FAQs
- Batch embeddings: offline ingestion, not per-query embed of corpus
- Observability: cost per session, per feature flag, per tenant
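For the exact-match variant of the caching control, one possible shape is a key derived from the deployment name plus the whitespace-normalised prompt. This sketch uses an in-memory dictionary as a stand-in for a shared store such as Redis; `ExactPromptCache` and its members are hypothetical names, and it assumes .NET 5 or later for `SHA256.HashData` and `Convert.ToHexString`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

public sealed class ExactPromptCache
{
    // In-memory stand-in; a shared cache would replace this in a multi-instance service.
    private readonly ConcurrentDictionary<string, string> _store = new();

    // The key covers what changes the answer here: the deployment name and the full prompt.
    public static string KeyFor(string deployment, string prompt)
    {
        // Collapse runs of whitespace so trivially different prompts share one entry.
        var normalized = Regex.Replace(prompt.Trim(), @"\s+", " ");
        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(deployment + "\n" + normalized));
        return Convert.ToHexString(bytes);
    }

    public bool TryGet(string deployment, string prompt, out string answer)
        => _store.TryGetValue(KeyFor(deployment, prompt), out answer);

    public void Put(string deployment, string prompt, string answer)
        => _store[KeyFor(deployment, prompt)] = answer;
}
```

If sampling settings or the system prompt vary between calls, they would need to be part of the key as well. Entries also need an expiry so that cached answers do not outlive the documents they were grounded on.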

Interview metric: "We track $ / successful resolution not just $ / request — a cheap wrong answer is expensive."

Q6: What compliance questions should you answer before sending healthcare data to a hosted LLM?

Answer:

- Does PHI leave our boundary? If yes → BAA with provider or de-identify first
- What is logged? Prompts/responses in vendor logs? Disable training on customer data
- Retention: conversation TTL, right to erasure (GDPR Art. 17)
- Region: data processed only in approved geography
- Audit trail: who asked what, what model answered, retrieved sources

Strong architecture: Local ASR for voice → anonymise text → hosted LLM for structuring only → map placeholders back to the original identifiers for the clinician UI. Never send raw audio externally.
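The anonymise and map-back steps can be sketched as reversible placeholder substitution. This is a simplified illustration, not a de-identification method: `Pseudonymizer` is a hypothetical class, the caller is assumed to supply the identifiers to mask (detecting them is a separate problem), and identifiers that are substrings of one another are not handled.

```csharp
using System;
using System.Collections.Generic;

public sealed class Pseudonymizer
{
    // The mapping stays inside our boundary; only masked text is sent to the hosted model.
    private readonly Dictionary<string, string> _tokenToOriginal = new();
    private readonly Dictionary<string, string> _originalToToken = new();

    // Replaces each listed identifier with a stable placeholder such as [ID_1].
    public string Mask(string text, IEnumerable<string> identifiers)
    {
        foreach (var id in identifiers)
        {
            if (!_originalToToken.TryGetValue(id, out var token))
            {
                token = $"[ID_{_originalToToken.Count + 1}]";
                _originalToToken[id] = token;
                _tokenToOriginal[token] = id;
            }
            text = text.Replace(id, token, StringComparison.Ordinal);
        }
        return text;
    }

    // Restores the original identifiers in the model's output before it reaches the clinician UI.
    public string Unmask(string text)
    {
        foreach (var (token, original) in _tokenToOriginal)
            text = text.Replace(token, original, StringComparison.Ordinal);
        return text;
    }
}
```

For example, masking "Jane Doe reports dizziness" with the identifier "Jane Doe" produces "[ID_1] reports dizziness", and unmasking a model response that mentions "[ID_1]" restores the name locally.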

Q7: When would you run a local model (Ollama, vLLM) instead of a hosted API?

Answer:

| Use local | Use hosted |
|-----------|------------|
| Strict air-gapped / no egress | Best quality (GPT-4 class) |
| High volume, predictable workload | Low ops burden |
| Fine-tuned proprietary small model | Rapid iteration on prompts |
| Latency-sensitive edge | Multimodal at scale without GPU ops |

Tradeoff: Local Llama-class models need GPU ops, quantisation, eval harness — and often more post-processing to match GPT-4 quality.

Hybrid: Local classifier/router + hosted reasoning for hard queries.

Q8: Design model selection for an internal copilot used by 2,000 employees across Slack and a web portal.

Answer:

Requirements: SSO auth, tenant-aware RAG over SharePoint/Confluence, no training on company data, under 3s TTFT.

Architecture:

- Entra ID → user context + document ACL filter at retrieval time
- Router (mini model): search_docs | draft_email | summarize_thread | general_chat
- RAG with hybrid search + ACL metadata on chunks
- Generator GPT-4o for synthesis; mini for routing and query rewrite
- Guardrails: no PII in logs, DLP scan on outbound, citation required for factual claims
- Cost cap per department via rate limits and monthly budget alerts
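The ACL filter at retrieval time is the part of this design most worth making concrete. A minimal sketch, assuming each chunk carries the set of groups allowed to read it and the user's group memberships come from the identity provider; `Chunk`, `AclRetrieval`, and `TopKVisible` are hypothetical names, and the code assumes .NET 5 or later for `IReadOnlySet<T>`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A retrieved chunk with ACL metadata and a relevance score from the search step.
public sealed record Chunk(string Id, string Text, IReadOnlySet<string> AllowedGroups, double Score);

public static class AclRetrieval
{
    // Drops chunks the user may not read BEFORE ranking and truncation,
    // so restricted text never reaches the prompt.
    public static IReadOnlyList<Chunk> TopKVisible(
        IEnumerable<Chunk> candidates, IReadOnlySet<string> userGroups, int k)
        => candidates
            .Where(c => c.AllowedGroups.Overlaps(userGroups))
            .OrderByDescending(c => c.Score)
            .Take(k)
            .ToList();
}
```

In practice the same condition would usually be pushed into the search query as a metadata filter rather than applied in memory. The property that matters is the ordering: filter first, then take the top k, so a user never loses results to documents they cannot see and the generator never sees text the user is not entitled to.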

Why this scores in interviews: You connected identity, retrieval security, routing, and FinOps — not just "we'd use GPT-4."
