Learnixo

GenAI & LLM Interviews · Lesson 2 of 30

Interview: OpenAI, Azure OpenAI, Claude, Gemini & Model Selection

Q1: Compare OpenAI API vs Azure OpenAI for an enterprise .NET product. When is Azure the right choice?

Answer: Both expose similar chat completions and embeddings APIs, but Azure OpenAI is the enterprise path:

| Dimension | OpenAI API | Azure OpenAI |
|-----------|------------|--------------|
| Data residency | OpenAI-operated regions | Your chosen Azure region (EU, US, etc.) |
| Identity | API keys | Microsoft Entra ID, managed identity, Key Vault |
| Compliance | Standard OpenAI terms | Microsoft enterprise agreements, HIPAA BAA options |
| Networking | Public internet | Private endpoints, VNet integration |
| Models | Latest first | Slight lag; stable named deployments |

When Azure wins: regulated healthcare/finance, existing Azure spend, need private endpoints, centralised secret rotation via Key Vault, unified monitoring in Application Insights.

When direct OpenAI wins: fastest access to newest models, smallest team without Azure footprint, prototypes.

Senior sound bite: "I treat Azure deployments as capacity units — gpt-4o-prod-eu is a deployment name, not a model string. I version deployments and route traffic with fallbacks."


Q2: How do Claude and Gemini fit into a multi-provider strategy?

Answer: Most production systems standardise on one primary provider and add others for specific strengths or failover.

Claude (Anthropic): Strong long-context reasoning, careful refusals, good for document analysis and policy-heavy assistants. API similar to OpenAI chat format (with provider-specific SDK).

Gemini (Google): Tight integration with Google Cloud, multimodal (image/video) native, competitive pricing on mid-tier models. Good if you already run on GCP (Vertex AI).

Multi-provider pattern:

  1. Router classifies task (complexity, modality, language)
  2. Primary handles 90% of traffic (e.g. Azure GPT-4o)
  3. Fallback on 429/5xx (cheaper model or alternate region)
  4. Specialist for niche tasks (Claude for 200k-token doc review)

Interview trap: "We use every model" — sounds immature. Strong answer: one primary, explicit exceptions documented.
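
The four-step pattern above can be sketched as a small routing function. This is a minimal illustration under stated assumptions, not a production router: the `SelectRoute` function, the route labels, and the 100,000-token threshold are all hypothetical, and a real system would map each label to a provider client.

```csharp
using System;

// Returns a hypothetical route label for one request.
// lastStatus: HTTP status of a failed attempt on the primary, or null on the first try.
string SelectRoute(int promptTokens, int? lastStatus)
{
    // Step 4: documented exception, very long inputs go to the long-context specialist.
    // The 100,000-token threshold is an assumption for this sketch.
    if (promptTokens > 100_000) return "claude-long-context";

    // Step 3: the primary was throttled (429) or failed server-side (5xx).
    if (lastStatus == 429 || lastStatus >= 500) return "gpt-4o-mini-fallback";

    // Step 2: everything else stays on the primary.
    return "azure-gpt-4o-primary";
}

Console.WriteLine(SelectRoute(1_200, null));    // azure-gpt-4o-primary
Console.WriteLine(SelectRoute(1_200, 429));     // gpt-4o-mini-fallback
Console.WriteLine(SelectRoute(150_000, null));  // claude-long-context
```

Keeping the exceptions in one function makes the "one primary, explicit exceptions" claim easy to show in a review: every non-primary route is a visible branch.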


Q3: Walk through model selection for these tasks: classification, RAG Q&A, code generation, and agent orchestration.

Answer:

| Task | Model tier | Why |
|------|------------|-----|
| Intent classification / routing | Small (GPT-4o-mini, Haiku) | Low latency, cheap, temperature 0 |
| RAG customer Q&A | Mid–large (GPT-4o) | Needs grounding + nuanced refusal |
| Code generation | Large reasoning model | Tool use + fewer logic errors |
| Agent planner / critic | Large | Multi-step reasoning; cost justified per session |

Decision checklist:

  1. Accuracy floor — wrong answer cost (medical > marketing copy)
  2. Latency SLA — TTFT under 500ms for chat?
  3. Context size — full policy PDF in one shot?
  4. Structured output — JSON mode / tool calling support?
  5. Cost per 1k sessions — back-of-envelope with avg tokens

Example (pharmacy assistant): Route "is drug X OTC?" to RAG + GPT-4o. Route "summarise this 80-page formulary" to long-context model. Route "hello" to template response — no LLM.
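
The pharmacy example can be written down as a routing function. This is only a sketch: `RouteMessage`, the route labels, the greeting list, and the 50-page threshold are hypothetical, and a real router would more likely use a small classifier model than string matching.

```csharp
using System;

// Returns a hypothetical route label for a pharmacy-assistant message.
// attachedPages is the page count of any uploaded document (0 if none).
string RouteMessage(string message, int attachedPages)
{
    string text = message.Trim().ToLowerInvariant();

    // Greetings get a canned reply: no LLM call, no token cost.
    if (text is "hello" or "hi" or "thanks") return "template";

    // Large documents go to a long-context model (page threshold is an assumption).
    if (attachedPages >= 50) return "long-context";

    // Everything else is treated as a factual question: retrieve, then generate.
    return "rag+gpt-4o";
}

Console.WriteLine(RouteMessage("hello", 0));                      // template
Console.WriteLine(RouteMessage("Summarise this formulary", 80));  // long-context
Console.WriteLine(RouteMessage("Is drug X OTC?", 0));             // rag+gpt-4o
```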


Q4: What is a deployment in Azure OpenAI and how do you design for high availability?

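Answer: A deployment is a named instance of a model and version (e.g. gpt-4o) inside an Azure OpenAI resource in a specific region. Each deployment has its own capacity: a tokens-per-minute (TPM) quota with a matching requests-per-minute (RPM) limit for standard deployments, or provisioned throughput units (PTUs) for provisioned ones. Application code calls the deployment name, not the model name.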

HA patterns:

  • Multiple deployments in one region (scale TPM)
  • Cross-region failover — secondary deployment + circuit breaker in app code
  • Retry with backoff on 429 (rate limit) and 503
  • Queue bursty traffic (Service Bus) so spikes don't drop requests
  • Cache identical prompts (Redis) to cut duplicate spend
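```csharp
// Pseudocode: fallback chain over deployment names (not model names).
// Assumes every deployment lives in the resource behind `client`; cross-region
// failover needs one client per regional endpoint. The call shape follows early
// Azure.AI.OpenAI 1.0 betas and differs in newer SDK versions.
var deployments = new[] { "gpt-4o-primary", "gpt-4o-secondary", "gpt-4o-mini-fallback" };
foreach (var deployment in deployments)
{
    try { return await client.GetChatCompletionsAsync(deployment, options); }
    catch (RequestFailedException ex) when (ex.Status == 429 || ex.Status >= 500)
    {
        // Throttled or server-side failure: try the next deployment.
    }
}
throw new AllModelsUnavailableException(); // hypothetical custom exception
```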
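
The "retry with backoff" bullet is a separate concern from the fallback chain, so here is a minimal sketch of it on its own. The names `BackoffDelay` and `SendWithRetryAsync`, the 30-second cap, and the `send` delegate (a stand-in that returns an HTTP status code instead of calling a real SDK) are assumptions made for the example. Production code would usually also add random jitter and honour a `Retry-After` header when the service sends one.

```csharp
using System;
using System.Threading.Tasks;

// Delay before retry number `attempt` (0-based): baseDelay * 2^attempt, capped at maxDelay.
TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
{
    double ms = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
    return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
}

// Calls `send` until it returns a non-retryable status or attempts run out.
// `send` is a stand-in for the real API call and returns an HTTP status code.
async Task<int> SendWithRetryAsync(Func<Task<int>> send, int maxAttempts, TimeSpan baseDelay)
{
    int status = 0;
    for (int attempt = 0; attempt < maxAttempts; attempt++)
    {
        status = await send();
        bool retryable = status == 429 || status == 503;
        if (!retryable) return status;

        // Wait before the next attempt, but not after the last one.
        if (attempt < maxAttempts - 1)
            await Task.Delay(BackoffDelay(attempt, baseDelay, TimeSpan.FromSeconds(30)));
    }
    return status; // still failing: the caller can move to the next deployment
}

// Usage: a fake endpoint that is throttled twice, then succeeds.
int calls = 0;
int result = await SendWithRetryAsync(
    () => Task.FromResult(++calls < 3 ? 429 : 200), 5, TimeSpan.FromMilliseconds(10));
Console.WriteLine($"{result} after {calls} calls"); // 200 after 3 calls
```

Retrying inside one deployment first and only then falling back keeps short throttling spikes from pushing traffic onto a weaker model.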

Q5: How do you estimate and control LLM API cost in production?

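Answer: Cost = (input tokens × input price) + (output tokens × output price) + embeddings + rerankers. Prices are per token; providers usually quote them per 1M tokens, and output tokens typically cost more than input tokens.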
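
A back-of-envelope version of the formula, as a sketch. The prices ($2.50 per 1M input tokens, $10.00 per 1M output tokens), the token counts, and the four requests per session are made-up numbers for illustration, not any provider's actual rates. The function `RequestCostUsd` is hypothetical and covers only the generation terms; embeddings and rerankers would be added the same way.

```csharp
using System;

// Cost of one request in USD. Prices are quoted per 1M tokens.
decimal RequestCostUsd(int inputTokens, int outputTokens,
                       decimal inputPricePerMillion, decimal outputPricePerMillion)
{
    return inputTokens / 1_000_000m * inputPricePerMillion
         + outputTokens / 1_000_000m * outputPricePerMillion;
}

// Hypothetical workload: 3,000 input and 500 output tokens per request.
decimal perRequest = RequestCostUsd(3_000, 500, 2.50m, 10.00m);

// Assumes 4 requests per session on average.
decimal perThousandSessions = perRequest * 4 * 1_000;

Console.WriteLine($"Per request: {perRequest}");
Console.WriteLine($"Per 1,000 sessions: {perThousandSessions}");
```

With these assumed numbers the request costs $0.0125 (0.0075 for input plus 0.005 for output), which comes to $50 per 1,000 sessions.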

Controls:

  1. Model routing — mini for cheap steps, large only when needed
  2. Prompt compression — summarise history, don't resend full thread
  3. Max output tokens — cap runaway generations
  4. Caching — exact or semantic cache for FAQs
  5. Batch embeddings — offline ingestion, not per-query embed of corpus
  6. Observability — cost per session, per feature flag, per tenant
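
Control 4 in its simplest form, an exact-match cache, might look like the sketch below. A `Dictionary` stands in for Redis, and `CacheKey`, `GetOrGenerate`, and the trim-and-lowercase normalisation are assumptions for the example. Lowercasing suits FAQ-style prompts but would be wrong where case carries meaning.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// In-memory stand-in for Redis: cache key -> cached completion.
var cache = new Dictionary<string, string>();

// The key covers what changes the answer here: deployment name and normalised prompt.
string CacheKey(string deployment, string prompt)
{
    string normalised = prompt.Trim().ToLowerInvariant();
    byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{deployment}\n{normalised}"));
    return Convert.ToHexString(hash);
}

// Returns the cached answer, or calls the model once and stores the result.
string GetOrGenerate(string deployment, string prompt, Func<string, string> generate)
{
    string key = CacheKey(deployment, prompt);
    if (cache.TryGetValue(key, out var hit)) return hit;
    string answer = generate(prompt);
    cache[key] = answer;
    return answer;
}

// Usage: a fake model that counts how often it is called.
int modelCalls = 0;
string FakeModel(string p) { modelCalls++; return "Opening hours are 9-17."; }

GetOrGenerate("gpt-4o-mini", "What are your opening hours?", FakeModel);
GetOrGenerate("gpt-4o-mini", "  what are your opening hours? ", FakeModel);
Console.WriteLine(modelCalls); // 1: the second call was served from the cache
```

A semantic cache replaces the hash lookup with a nearest-neighbour search over prompt embeddings, which catches paraphrases at the cost of occasional false hits.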

Interview metric: "We track $ / successful resolution not just $ / request — a cheap wrong answer is expensive."


Q6: What compliance questions should you answer before sending healthcare data to a hosted LLM?

Answer:

  1. Does PHI leave our boundary? If yes → BAA with provider or de-identify first
  2. What is logged? Prompts/responses in vendor logs? Disable training on customer data
  3. Retention — conversation TTL, right to erasure (GDPR Art. 17)
  4. Region — data processed only in approved geography
  5. Audit trail — who asked what, what model answered, retrieved sources

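Strong architecture: local speech recognition (ASR) for voice → replace identifiers in the transcript with placeholder tokens → hosted LLM for structuring only → map the placeholders back to the original identifiers for the clinician UI. Never send raw audio externally. Because the mapping is reversible inside your boundary, this is strictly pseudonymisation rather than full anonymisation.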
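
The replace-then-restore step in that pipeline can be sketched as follows. This is a simplified illustration, not a de-identification tool: `Pseudonymise` and `Restore` are hypothetical helpers, the list of known names is assumed to come from the patient record or a named-entity recognition step, and the "LLM output" is a hard-coded stand-in.

```csharp
using System;
using System.Collections.Generic;

// Replaces each known identifier with a placeholder and records the mapping.
// knownNames would come from the patient record or an NER step (assumption).
(string Text, Dictionary<string, string> Map) Pseudonymise(string text, IEnumerable<string> knownNames)
{
    var map = new Dictionary<string, string>();
    int n = 0;
    foreach (string name in knownNames)
    {
        if (!text.Contains(name, StringComparison.OrdinalIgnoreCase)) continue;
        string placeholder = $"[PERSON_{++n}]";
        map[placeholder] = name;
        text = text.Replace(name, placeholder, StringComparison.OrdinalIgnoreCase);
    }
    return (text, map);
}

// Puts the original identifiers back into the model's output, inside our boundary.
string Restore(string text, Dictionary<string, string> map)
{
    foreach (var (placeholder, original) in map)
        text = text.Replace(placeholder, original);
    return text;
}

var (safeText, nameMap) = Pseudonymise(
    "Anna Berg reports chest pain since Monday.", new[] { "Anna Berg" });
Console.WriteLine(safeText); // [PERSON_1] reports chest pain since Monday.

// Stand-in for the hosted LLM's structured output; it only ever saw the placeholder.
string llmOutput = "Patient: [PERSON_1]; Symptom: chest pain";
Console.WriteLine(Restore(llmOutput, nameMap)); // Patient: Anna Berg; Symptom: chest pain
```

The mapping never leaves the local process, so the hosted provider sees only placeholders. Dates, addresses, and free-text identifiers need the same treatment before this counts as de-identified.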


Q7: When would you run a local model (Ollama, vLLM) instead of a hosted API?

Answer:

| Use local | Use hosted |
|-----------|------------|
| Strict air-gapped / no egress | Best quality (GPT-4 class) |
| High volume, predictable workload | Low ops burden |
| Fine-tuned proprietary small model | Rapid iteration on prompts |
| Latency-sensitive edge | Multimodal at scale without GPU ops |

Tradeoff: Local Llama-class models need GPU ops, quantisation, eval harness — and often more post-processing to match GPT-4 quality.

Hybrid: Local classifier/router + hosted reasoning for hard queries.


Q8: Design model selection for an internal copilot used by 2,000 employees across Slack and a web portal.

Answer:

Requirements: SSO auth, tenant-aware RAG over SharePoint/Confluence, no training on company data, time to first token (TTFT) under 3s.

Architecture:

  1. Entra ID → user context + document ACL filter at retrieval time
  2. Router (mini model): search_docs | draft_email | summarize_thread | general_chat
  3. RAG with hybrid search + ACL metadata on chunks
  4. Generator GPT-4o for synthesis; mini for routing and query rewrite
  5. Guardrails — no PII in logs, DLP scan on outbound, citation required for factual claims
  6. Cost cap per department via rate limits and monthly budget alerts

Why this scores in interviews: You connected identity, retrieval security, routing, and FinOps — not just "we'd use GPT-4."
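
The ACL filter from steps 1 and 3 is the piece interviewers most often probe, so here is a minimal sketch of it. The tuple shape for chunks, the group names, and `RetrieveForUser` are assumptions for the example. In a real system the filter would normally be pushed into the search index query rather than applied in application memory.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Keeps only chunks the user may read, then returns the top-k by score.
// Each chunk carries the groups allowed to see its source document.
List<string> RetrieveForUser(
    IEnumerable<(string Text, double Score, string[] AllowedGroups)> candidates,
    ISet<string> userGroups, int k)
{
    return candidates
        .Where(c => c.AllowedGroups.Any(g => userGroups.Contains(g))) // ACL check before ranking
        .OrderByDescending(c => c.Score)
        .Take(k)
        .Select(c => c.Text)
        .ToList();
}

var chunks = new (string Text, double Score, string[] AllowedGroups)[]
{
    ("Salary bands 2024", 0.95, new[] { "hr" }),
    ("VPN setup guide", 0.80, new[] { "all-staff" }),
    ("Expense policy", 0.70, new[] { "all-staff", "finance" }),
};
var groups = new HashSet<string> { "all-staff", "engineering" };

Console.WriteLine(string.Join(" | ", RetrieveForUser(chunks, groups, 2)));
// VPN setup guide | Expense policy
```

Filtering before the top-k cut matters: the highest-scoring chunk here is one the user cannot see, and it must never reach the prompt, even as context the model is told to ignore.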