
Azure OpenAI — GPT-4o, Embeddings & Production Deployment

Complete Azure OpenAI guide — deploying models, chat completions, streaming, function calling, embeddings for RAG, content filtering, token management, .NET and Python SDK examples, and cost control.

SystemForge · April 18, 2026 · 8 min read
Azure OpenAI · GPT-4o · LLM · Embeddings · RAG · Function Calling · Azure · .NET · Python

Azure OpenAI Service provides the same GPT-4o, GPT-4, and embedding models as the OpenAI API — but with Azure's security posture: no data sent to OpenAI for training, private network access via Private Endpoints, Managed Identity authentication, Azure Monitor integration, and compliance certifications (HIPAA, SOC 2, ISO 27001).


Azure OpenAI vs OpenAI API

| | Azure OpenAI | OpenAI API |
|--|-------------|-----------|
| Data privacy | Your data stays in your tenant | Sent to OpenAI |
| Auth | Managed Identity + Entra ID | API key only |
| Network | Private Endpoints (no public internet) | Public internet only |
| Compliance | HIPAA, SOC 2, ISO 27001, GDPR | Limited |
| Content filtering | Configurable Azure Content Safety | Fixed |
| Availability | SLA-backed | Best effort |
| Model access | Requires approval; not all models available | Immediate access |
| Pricing | Same token prices + Azure consumption | Direct billing |

Use Azure OpenAI when: you have data privacy requirements, need compliance certifications, or are building on Azure infrastructure.


Setup: Deploying a Model

Azure OpenAI requires a deployment — you deploy a specific model version with a deployment name, throughput (TPM — tokens per minute), and quota.

Azure OpenAI Resource
  └── Deployments
        ├── gpt-4o                  (deployment name: "gpt-4o-prod")
        │     model: gpt-4o (2024-11-20)
        │     TPM: 100,000
        │
        ├── gpt-4o-mini             (deployment name: "gpt-4o-mini")
        │     model: gpt-4o-mini
        │     TPM: 500,000
        │
        └── text-embedding-3-large  (deployment name: "embedding-prod")
              model: text-embedding-3-large
              TPM: 350,000
Python
# Python: using the OpenAI SDK pointed at Azure
import os

from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Option 1: Managed Identity (recommended for production)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)
client = AzureOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21"
)

# Option 2: API key (development only)
client = AzureOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

Chat Completions

Python
response = client.chat.completions.create(
    model="gpt-4o-prod",          # your deployment name, not the model name
    messages=[
        {"role": "system", "content": "You are a senior .NET architect. Be concise and precise."},
        {"role": "user",   "content": "What is the Outbox pattern and when should I use it?"}
    ],
    temperature=0.3,              # lower = more deterministic
    max_tokens=500,
    top_p=0.95,
    frequency_penalty=0.0,        # reduce repetition (0.0–2.0)
    presence_penalty=0.0,         # encourage new topics (0.0–2.0)
)

answer = response.choices[0].message.content
usage = response.usage
print(f"Answer: {answer}")
print(f"Tokens: {usage.prompt_tokens} prompt + {usage.completion_tokens} completion = {usage.total_tokens} total")

Streaming Responses

Stream tokens as they are generated — essential for chat UIs:

Python
# Python streaming
stream = client.chat.completions.create(
    model="gpt-4o-prod",
    messages=[{"role": "user", "content": "Write a haiku about distributed systems."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
C#
// .NET streaming
using Azure.AI.OpenAI;
using Azure.Identity;

var client = new AzureOpenAIClient(
    new Uri("https://my-resource.openai.azure.com"),
    new DefaultAzureCredential()   // Managed Identity
);
var chatClient = client.GetChatClient("gpt-4o-prod");

await foreach (var update in chatClient.CompleteChatStreamingAsync(
    new[]
    {
        new SystemChatMessage("You are a helpful assistant."),
        new UserChatMessage("Explain the CAP theorem briefly.")
    }))
{
    foreach (var contentPart in update.ContentUpdate)
        Console.Write(contentPart.Text);
}

Function Calling (Tool Use)

Function calling lets the model call your application functions to fetch data or perform actions — the foundation of AI agents.

Python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Get the current status and estimated delivery date for an order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order ID, e.g. ORD-12345"
                    }
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "cancel_order",
            "description": "Cancel an order that has not yet shipped",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"},
                    "reason":   {"type": "string"}
                },
                "required": ["order_id"]
            }
        }
    }
]

messages = [{"role": "user", "content": "What's the status of order ORD-99123? Can I still cancel it?"}]

response = client.chat.completions.create(
    model="gpt-4o-prod",
    messages=messages,
    tools=tools,
    tool_choice="auto"   # model decides whether to call a tool
)

# Loop while the model keeps requesting tool calls
while response.choices[0].finish_reason == "tool_calls":
    tool_calls = response.choices[0].message.tool_calls
    messages.append(response.choices[0].message)

    for call in tool_calls:
        args = json.loads(call.function.arguments)

        if call.function.name == "get_order_status":
            result = get_order_status(args["order_id"])      # your function
        elif call.function.name == "cancel_order":
            result = cancel_order(args["order_id"], args.get("reason", ""))

        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result)
        })

    # Continue the conversation with tool results
    response = client.chat.completions.create(
        model="gpt-4o-prod",
        messages=messages,
        tools=tools
    )

print(response.choices[0].message.content)
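
The loop above assumes get_order_status and cancel_order already exist in your application. A minimal sketch of what those handlers might look like — the in-memory order table here is hypothetical, standing in for a real order service or database:

```python
from datetime import date, timedelta

# Hypothetical order store; a real system would query a database or API.
_FAKE_ORDERS = {
    "ORD-99123": {"status": "processing", "shipped": False},
}

def get_order_status(order_id: str) -> dict:
    order = _FAKE_ORDERS.get(order_id)
    if order is None:
        return {"error": f"Order {order_id} not found"}
    return {
        "order_id": order_id,
        "status": order["status"],
        "estimated_delivery": str(date.today() + timedelta(days=3)),
    }

def cancel_order(order_id: str, reason: str = "") -> dict:
    order = _FAKE_ORDERS.get(order_id)
    if order is None:
        return {"error": f"Order {order_id} not found"}
    if order["shipped"]:
        return {"cancelled": False, "reason": "Order already shipped"}
    order["status"] = "cancelled"
    return {"cancelled": True, "order_id": order_id}
```

Returning structured dicts — including error objects instead of raising — matters here: the model reads the JSON-serialized result and decides what to tell the user, so "not found" should be data, not an exception.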

Structured Output (JSON Mode)

Force the model to output valid JSON matching a schema — reliable for APIs that process the response programmatically:

Python
from pydantic import BaseModel
from typing import List

class TicketAnalysis(BaseModel):
    category: str          # "billing" | "technical" | "general"
    priority: str          # "low" | "medium" | "high" | "critical"
    summary: str
    suggested_actions: List[str]
    estimated_resolution_hours: int

# Using structured output with Pydantic (Python SDK 1.40+)
response = client.beta.chat.completions.parse(
    model="gpt-4o-prod",
    messages=[
        {"role": "system", "content": "You are a customer support classifier."},
        {"role": "user",   "content": f"Classify this ticket: {ticket_text}"}
    ],
    response_format=TicketAnalysis
)

analysis = response.choices[0].message.parsed
print(f"Category: {analysis.category}, Priority: {analysis.priority}")
print(f"Actions: {analysis.suggested_actions}")

Embeddings for RAG

Python
from openai import AzureOpenAI
import numpy as np

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model="embedding-prod"     # your text-embedding-3-large deployment
    )
    return response.data[0].embedding

# Batch embedding (more efficient)
def get_embeddings_batch(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        input=texts,               # up to 2048 texts per call
        model="embedding-prod"
    )
    return [item.embedding for item in sorted(response.data, key=lambda x: x.index)]

# Cosine similarity
def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr, b_arr = np.array(a), np.array(b)
    return np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr))
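
Before any of this, documents have to be split into chunks small enough to embed and retrieve individually. A minimal sketch of fixed-size chunking with overlap — sizes are in characters here for simplicity; production pipelines usually count tokens and split on sentence or paragraph boundaries:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so context isn't lost at boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

# chunks = chunk_text(document)
# vectors = get_embeddings_batch(chunks)   # batch helper defined above
```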

Complete RAG Pattern with pgvector

Python
import psycopg2
from openai import AzureOpenAI

def rag_answer(question: str, top_k: int = 5) -> str:
    # 1. Embed the question
    query_embedding = get_embedding(question)

    # 2. Retrieve relevant chunks from pgvector
    with psycopg2.connect(DB_CONNECTION_STRING) as conn:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT content, source_url,
                       1 - (embedding <=> %s::vector) AS similarity
                FROM documents
                WHERE tenant_id = %s
                ORDER BY embedding <=> %s::vector
                LIMIT %s
            """, (query_embedding, TENANT_ID, query_embedding, top_k))
            chunks = cur.fetchall()

    # 3. Build context from retrieved chunks
    context = "\n\n".join([
        f"[Source: {url}]\n{content}"
        for content, url, sim in chunks
        if sim > 0.75   # filter low-relevance chunks
    ])

    # 4. Generate answer grounded in context
    response = client.chat.completions.create(
        model="gpt-4o-prod",
        messages=[
            {"role": "system", "content": """Answer questions using ONLY the provided context.
If the context doesn't contain the answer, say "I don't have information about that."
Always cite the source URL."""},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0.1,     # low temperature for factual responses
        max_tokens=800
    )

    return response.choices[0].message.content

Content Filtering and Safety

Azure OpenAI has built-in content filtering via Azure AI Content Safety. Configure per-deployment:

Python
# Content filtering is applied automatically.
# A blocked prompt raises BadRequestError with a content_filter error code.
from openai import BadRequestError

try:
    response = client.chat.completions.create(
        model="gpt-4o-prod",
        messages=[{"role": "user", "content": user_input}]
    )
except BadRequestError as e:
    if "content_filter" in str(e):
        # Input or output was blocked by content policy
        return "I cannot help with that request."
    raise

# Check content filter results on successful responses
if response.choices[0].finish_reason == "content_filter":
    # Response was partially filtered
    pass

Token Management and Cost Control

Token Counting Before Sending

Python
import tiktoken

def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    encoding = tiktoken.encoding_for_model(model)
    tokens = 0
    for message in messages:
        tokens += 4  # message overhead
        for key, value in message.items():
            tokens += len(encoding.encode(str(value)))
    tokens += 2  # reply priming
    return tokens

# Check cost before expensive calls
token_count = count_tokens(messages)
cost_estimate = (token_count / 1_000_000) * 2.50  # $2.50 per 1M input tokens (gpt-4o)
print(f"Estimated cost: ${cost_estimate:.4f}")
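
Counting tokens also lets you trim long conversation histories before a call. A sketch that drops the oldest non-system messages until the conversation fits a budget — the counter is pluggable, and a rough four-characters-per-token heuristic stands in for tiktoken here:

```python
def trim_to_budget(messages: list[dict], budget: int, count=None) -> list[dict]:
    """Drop the oldest non-system messages until the estimated token
    count fits within budget. The system prompt is always kept."""
    if count is None:
        # crude approximation: ~4 characters per token
        count = lambda msgs: sum(len(str(m.get("content", ""))) for m in msgs) // 4
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and count(system + rest) > budget:
        rest.pop(0)  # drop the oldest turn first
    return system + rest
```

Passing count_tokens from above (wrapped to take only the message list) gives exact counts instead of the heuristic.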

Cost by Model (April 2026 approximate)

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|-----------------------|-----------------------|
| gpt-4o | $2.50 | $10.00 |
| gpt-4o-mini | $0.15 | $0.60 |
| text-embedding-3-large | $0.13 | — |
| text-embedding-3-small | $0.02 | — |

Cost patterns:

  • Use gpt-4o-mini for classification, routing, and simple extraction (95% cheaper)
  • Use gpt-4o only for complex reasoning and generation
  • Cache identical prompts (same system prompt + no user-variable content)
  • Set max_tokens appropriately — you pay only for completion tokens actually generated, but a tight cap bounds worst-case cost and latency
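
A common way to apply the first two bullets is a small routing layer: cheap task types go to the mini deployment, and only complex reasoning reaches gpt-4o. A sketch — the deployment names match those used earlier in this article, and the routing table itself is illustrative:

```python
# Illustrative cost-based routing: simple task types use the cheap model.
ROUTES = {
    "classify":  "gpt-4o-mini",
    "extract":   "gpt-4o-mini",
    "summarize": "gpt-4o-mini",
    "reason":    "gpt-4o-prod",
    "generate":  "gpt-4o-prod",
}

def pick_deployment(task_type: str) -> str:
    # Default to the cheap model; unknown tasks rarely need frontier reasoning.
    return ROUTES.get(task_type, "gpt-4o-mini")

# response = client.chat.completions.create(
#     model=pick_deployment("classify"),
#     messages=messages,
# )
```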

Prompt Caching (Automatic)

Azure OpenAI automatically caches prompt prefixes of 1024+ tokens. Repeated system prompts in high-volume scenarios get a 50% discount on cached input tokens. No code change is needed.


Prompt Management at Scale

For production systems, prompts are versioned assets — not hardcoded strings:

Python
# prompts/ticket_classifier.py
SYSTEM_PROMPT_V2 = """You are a customer support ticket classifier for {company_name}.

Classify tickets into one of: billing, technical, account, general.
Assign priority: critical (data loss, security), high (blocking work), medium, low.

Respond in JSON: {{"category": "...", "priority": "...", "summary": "..."}}
"""

# Load from Azure App Configuration or Key Vault for dynamic updates
from azure.appconfiguration import AzureAppConfigurationClient
app_config = AzureAppConfigurationClient.from_connection_string(conn_str)
prompt_template = app_config.get_configuration_setting("prompt:ticket-classifier:v2").value

.NET Integration Pattern

C#
// Program.cs
builder.Services.AddAzureClients(clients =>
{
    clients.AddOpenAIClient(new Uri(builder.Configuration["AzureOpenAI:Endpoint"]))
           .WithCredential(new DefaultAzureCredential());
});
builder.Services.AddScoped<AiService>();

// AiService.cs
public class AiService(AzureOpenAIClient client)
{
    private readonly ChatClient _chat = client.GetChatClient("gpt-4o-prod");
    private readonly EmbeddingClient _embed = client.GetEmbeddingClient("embedding-prod");

    public async Task<string> AnswerAsync(string question, string context, CancellationToken ct)
    {
        var completion = await _chat.CompleteChatAsync(
            new[]
            {
                new SystemChatMessage(
                    "Answer using only the provided context. Cite sources."),
                new UserChatMessage($"Context:\n{context}\n\nQuestion: {question}")
            },
            new ChatCompletionOptions { Temperature = 0.1f, MaxOutputTokenCount = 600 },
            ct);

        return completion.Value.Content[0].Text;
    }

    public async Task<float[]> EmbedAsync(string text, CancellationToken ct)
    {
        var result = await _embed.GenerateEmbeddingAsync(text, cancellationToken: ct);
        return result.Value.ToFloats().ToArray();
    }
}

Related: Hugging Face Transformers — open-source models
Related: Building a Production RAG Pipeline
Related: Azure Cloud Integration — Azure services architecture
