Azure OpenAI — GPT-4o, Embeddings & Production Deployment
Complete Azure OpenAI guide — deploying models, chat completions, streaming, function calling, embeddings for RAG, content filtering, token management, .NET and Python SDK examples, and cost control.
Azure OpenAI Service provides the same GPT-4o, GPT-4, and embedding models as the OpenAI API — but with Azure's security posture: no data sent to OpenAI for training, private network access via Private Endpoints, Managed Identity authentication, Azure Monitor integration, and compliance certifications (HIPAA, SOC 2, ISO 27001).
Azure OpenAI vs OpenAI API
| | Azure OpenAI | OpenAI API |
|--|-------------|-----------|
| Data privacy | Your data stays in your tenant | Sent to OpenAI |
| Auth | Managed Identity + Entra ID | API key only |
| Network | Private Endpoints (no public internet) | Public internet only |
| Compliance | HIPAA, SOC 2, ISO 27001, GDPR | Limited |
| Content filtering | Configurable Azure Content Safety | Fixed |
| Availability | SLA-backed | Best effort |
| Model access | Requires approval; not all models available | Immediate access |
| Pricing | Same token prices + Azure consumption | Direct billing |
Use Azure OpenAI when: you have data privacy requirements, need compliance certifications, or are building on Azure infrastructure.
Setup: Deploying a Model
Azure OpenAI requires a deployment — you deploy a specific model version with a deployment name, throughput (TPM — tokens per minute), and quota.
```text
Azure OpenAI Resource
└── Deployments
    ├── gpt-4o (deployment name: "gpt-4o-prod")
    │   model: gpt-4o (2024-11-20)
    │   TPM: 100,000
    │
    ├── gpt-4o-mini (deployment name: "gpt-4o-mini")
    │   model: gpt-4o-mini
    │   TPM: 500,000
    │
    └── text-embedding-3-large (deployment name: "embedding-prod")
        model: text-embedding-3-large
        TPM: 350,000
```
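Deployments can be created in the Azure portal or scripted. A minimal sketch with the Azure CLI (the resource name and resource group are placeholders; capacity is measured in units of 1,000 TPM):

```bash
# Sketch: "my-resource" and "my-rg" are hypothetical names
az cognitiveservices account deployment create \
  --name my-resource \
  --resource-group my-rg \
  --deployment-name gpt-4o-prod \
  --model-name gpt-4o \
  --model-version "2024-11-20" \
  --model-format OpenAI \
  --sku-name Standard \
  --sku-capacity 100   # 100 units = 100,000 TPM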
```python
# Python — using the OpenAI SDK pointed at Azure
import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Option 1: Managed Identity (recommended for production)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21"
)

# Option 2: API key (development only)
client = AzureOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)
```

Chat Completions
```python
response = client.chat.completions.create(
    model="gpt-4o-prod",  # your deployment name, not the model name
    messages=[
        {"role": "system", "content": "You are a senior .NET architect. Be concise and precise."},
        {"role": "user", "content": "What is the Outbox pattern and when should I use it?"}
    ],
    temperature=0.3,        # lower = more deterministic
    max_tokens=500,
    top_p=0.95,
    frequency_penalty=0.0,  # reduce repetition (-2.0 to 2.0)
    presence_penalty=0.0,   # encourage new topics (-2.0 to 2.0)
)

answer = response.choices[0].message.content
usage = response.usage
print(f"Answer: {answer}")
print(f"Tokens: {usage.prompt_tokens} prompt + {usage.completion_tokens} completion = {usage.total_tokens} total")
```

Streaming Responses
Stream tokens as they are generated — essential for chat UIs:
```python
# Python streaming
stream = client.chat.completions.create(
    model="gpt-4o-prod",
    messages=[{"role": "user", "content": "Write a haiku about distributed systems."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

```csharp
// .NET streaming
using Azure.AI.OpenAI;
using Azure.Identity;
using OpenAI.Chat;

var client = new AzureOpenAIClient(
    new Uri("https://my-resource.openai.azure.com"),
    new DefaultAzureCredential()  // Managed Identity
);
var chatClient = client.GetChatClient("gpt-4o-prod");

await foreach (var update in chatClient.CompleteChatStreamingAsync(
    new ChatMessage[]  // explicit element type: the two message types share only the base class
    {
        new SystemChatMessage("You are a helpful assistant."),
        new UserChatMessage("Explain the CAP theorem briefly.")
    }))
{
    foreach (var contentPart in update.ContentUpdate)
        Console.Write(contentPart.Text);
}
```

Function Calling (Tool Use)
Function calling lets the model call your application functions to fetch data or perform actions — the foundation of AI agents.
```python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Get the current status and estimated delivery date for an order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order ID, e.g. ORD-12345"
                    }
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "cancel_order",
            "description": "Cancel an order that has not yet shipped",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"},
                    "reason": {"type": "string"}
                },
                "required": ["order_id"]
            }
        }
    }
]
messages = [{"role": "user", "content": "What's the status of order ORD-99123? Can I still cancel it?"}]

response = client.chat.completions.create(
    model="gpt-4o-prod",
    messages=messages,
    tools=tools,
    tool_choice="auto"  # model decides whether to call a tool
)

# Loop while the model wants to call functions
while response.choices[0].finish_reason == "tool_calls":
    tool_calls = response.choices[0].message.tool_calls
    messages.append(response.choices[0].message)

    for call in tool_calls:
        args = json.loads(call.function.arguments)
        if call.function.name == "get_order_status":
            result = get_order_status(args["order_id"])  # your function
        elif call.function.name == "cancel_order":
            result = cancel_order(args["order_id"], args.get("reason", ""))
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result)
        })

    # Continue the conversation with tool results
    response = client.chat.completions.create(
        model="gpt-4o-prod",
        messages=messages,
        tools=tools
    )

print(response.choices[0].message.content)
```

Structured Output (JSON Mode)
Force the model to output valid JSON matching a schema — reliable for APIs that process the response programmatically:
```python
from pydantic import BaseModel
from typing import List

class TicketAnalysis(BaseModel):
    category: str                    # "billing" | "technical" | "general"
    priority: str                    # "low" | "medium" | "high" | "critical"
    summary: str
    suggested_actions: List[str]
    estimated_resolution_hours: int

# Example ticket
ticket_text = "Our invoices page returns a 500 error and month-end billing is blocked."

# Using structured output with Pydantic (Python SDK 1.40+)
response = client.beta.chat.completions.parse(
    model="gpt-4o-prod",
    messages=[
        {"role": "system", "content": "You are a customer support classifier."},
        {"role": "user", "content": f"Classify this ticket: {ticket_text}"}
    ],
    response_format=TicketAnalysis
)

analysis = response.choices[0].message.parsed
print(f"Category: {analysis.category}, Priority: {analysis.priority}")
print(f"Actions: {analysis.suggested_actions}")
```

Embeddings for RAG
```python
import numpy as np

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model="embedding-prod"  # your text-embedding-3-large deployment
    )
    return response.data[0].embedding

# Batch embedding (more efficient)
def get_embeddings_batch(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        input=texts,  # up to 2048 texts per call
        model="embedding-prod"
    )
    return [item.embedding for item in sorted(response.data, key=lambda x: x.index)]

# Cosine similarity
def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr, b_arr = np.array(a), np.array(b)
    return np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr))
```

Complete RAG Pattern with pgvector
```python
import psycopg2

def rag_answer(question: str, top_k: int = 5) -> str:
    # 1. Embed the question
    query_embedding = get_embedding(question)
    # pgvector accepts the "[...]" text representation of the embedding
    embedding_param = str(query_embedding)

    # 2. Retrieve relevant chunks from pgvector
    with psycopg2.connect(DB_CONNECTION_STRING) as conn:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT content, source_url,
                       1 - (embedding <=> %s::vector) AS similarity
                FROM documents
                WHERE tenant_id = %s
                ORDER BY embedding <=> %s::vector
                LIMIT %s
            """, (embedding_param, TENANT_ID, embedding_param, top_k))
            chunks = cur.fetchall()

    # 3. Build context from retrieved chunks
    context = "\n\n".join([
        f"[Source: {url}]\n{content}"
        for content, url, sim in chunks
        if sim > 0.75  # filter low-relevance chunks
    ])

    # 4. Generate answer grounded in context
    response = client.chat.completions.create(
        model="gpt-4o-prod",
        messages=[
            {"role": "system", "content": """Answer questions using ONLY the provided context.
If the context doesn't contain the answer, say "I don't have information about that."
Always cite the source URL."""},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0.1,  # low temperature for factual responses
        max_tokens=800
    )
    return response.choices[0].message.content
```

Content Filtering and Safety
Azure OpenAI has built-in content filtering via Azure AI Content Safety, configured per deployment. In code, handle blocked inputs and filtered outputs:
```python
from openai import BadRequestError

def safe_chat(user_input: str) -> str:
    # Content filtering is applied automatically; a blocked prompt raises
    # BadRequestError with code "content_filter"
    try:
        response = client.chat.completions.create(
            model="gpt-4o-prod",
            messages=[{"role": "user", "content": user_input}]
        )
    except BadRequestError as e:
        if "content_filter" in str(e):
            # Input was blocked by content policy
            return "I cannot help with that request."
        raise

    # Check content filter results on successful responses
    if response.choices[0].finish_reason == "content_filter":
        # Response was partially filtered
        return "Part of the response was filtered. Please rephrase your request."

    return response.choices[0].message.content
```

Token Management and Cost Control
Token Counting Before Sending
```python
import tiktoken

def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    encoding = tiktoken.encoding_for_model(model)
    tokens = 0
    for message in messages:
        tokens += 4  # message overhead
        for key, value in message.items():
            tokens += len(encoding.encode(str(value)))
    tokens += 2  # reply priming
    return tokens

# Check cost before expensive calls
token_count = count_tokens(messages)
cost_estimate = (token_count / 1_000_000) * 2.50  # $2.50 per 1M input tokens (gpt-4o)
print(f"Estimated cost: ${cost_estimate:.4f}")
```

Cost by Model (April 2026, approximate)
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|-----------------------|------------------------|
| gpt-4o | $2.50 | $10.00 |
| gpt-4o-mini | $0.15 | $0.60 |
| text-embedding-3-large | $0.13 | — |
| text-embedding-3-small | $0.02 | — |
Cost patterns:
- Use `gpt-4o-mini` for classification, routing, and simple extraction (95% cheaper); see the router sketch below
- Use `gpt-4o` only for complex reasoning and generation
- Cache identical prompts (same system prompt + no user-variable content)
- Set `max_tokens` appropriately: you pay for the completion tokens actually generated, not just the ones allowed
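To make the first two patterns concrete, here is a minimal sketch of a cost-aware router. The task labels are hypothetical; the deployment names match the deployments created earlier:

```python
# Route cheap tasks to gpt-4o-mini and reserve gpt-4o for complex work.
# "classify" / "route" / "extract" are hypothetical labels for this sketch.
CHEAP_TASKS = {"classify", "route", "extract"}

def complete(task: str, prompt: str, max_tokens: int = 300) -> str:
    deployment = "gpt-4o-mini" if task in CHEAP_TASKS else "gpt-4o-prod"
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

# Classification runs on gpt-4o-mini (~95% cheaper per token)
category = complete("classify", "Classify this ticket: 'I was double-charged.'")
```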
Prompt Caching (Automatic)
Azure OpenAI automatically caches prompt prefixes of 1,024+ tokens. In high-volume scenarios, repeated system prompts get a 50% discount on cached tokens. No code change is needed.
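You can confirm that caching is kicking in by inspecting the usage details on a response (a sketch; `prompt_tokens_details` is only populated on recent API versions and may be `None`):

```python
response = client.chat.completions.create(
    model="gpt-4o-prod",
    messages=messages  # a long, repeated system prompt benefits most
)

# cached_tokens counts the prompt tokens served from the cache
details = response.usage.prompt_tokens_details
if details and details.cached_tokens:
    print(f"Cache hit: {details.cached_tokens} of {response.usage.prompt_tokens} prompt tokens")
```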
Prompt Management at Scale
For production systems, prompts are versioned assets — not hardcoded strings:
```python
# prompts/ticket_classifier.py
SYSTEM_PROMPT_V2 = """You are a customer support ticket classifier for {company_name}.
Classify tickets into one of: billing, technical, account, general.
Assign priority: critical (data loss, security), high (blocking work), medium, low.
Respond in JSON: {{"category": "...", "priority": "...", "summary": "..."}}
"""

# Load from Azure App Configuration or Key Vault for dynamic updates
from azure.appconfiguration import AzureAppConfigurationClient

app_config = AzureAppConfigurationClient.from_connection_string(conn_str)
prompt_template = app_config.get_configuration_setting("prompt:ticket-classifier:v2").value
```

.NET Integration Pattern
```csharp
// Program.cs
builder.Services.AddAzureClients(clients =>
{
    clients.AddOpenAIClient(new Uri(builder.Configuration["AzureOpenAI:Endpoint"]))
        .WithCredential(new DefaultAzureCredential());
});
builder.Services.AddScoped<AiService>();

// AiService.cs
using Azure.AI.OpenAI;
using OpenAI.Chat;
using OpenAI.Embeddings;

public class AiService(AzureOpenAIClient client)
{
    private readonly ChatClient _chat = client.GetChatClient("gpt-4o-prod");
    private readonly EmbeddingClient _embed = client.GetEmbeddingClient("embedding-prod");

    public async Task<string> AnswerAsync(string question, string context, CancellationToken ct)
    {
        var completion = await _chat.CompleteChatAsync(
            new ChatMessage[]
            {
                new SystemChatMessage(
                    "Answer using only the provided context. Cite sources."),
                new UserChatMessage($"Context:\n{context}\n\nQuestion: {question}")
            },
            new ChatCompletionOptions { Temperature = 0.1f, MaxOutputTokenCount = 600 },
            ct);
        return completion.Value.Content[0].Text;
    }

    public async Task<float[]> EmbedAsync(string text, CancellationToken ct)
    {
        var result = await _embed.GenerateEmbeddingAsync(text, cancellationToken: ct);
        return result.Value.ToFloats().ToArray();
    }
}
```

Related: Hugging Face Transformers — open-source models
Related: Building a Production RAG Pipeline
Related: Azure Cloud Integration — Azure services architecture