AI Systems Ā· Beginner

How AI & LLMs Actually Work: A Developer's Guide

Understand what's really happening inside ChatGPT and LLMs — tokens, embeddings, attention, the transformer architecture, and how to use the OpenAI API in .NET and Python with real code examples.

Learnixo Ā· April 14, 2026 Ā· 8 min read
AI Ā· LLM Ā· OpenAI Ā· ChatGPT Ā· Machine Learning Ā· .NET Ā· Python

You Don't Need a PhD to Use AI

Most developers treat AI APIs as a black box — you send text in, magic comes out. That works until it doesn't: your outputs are inconsistent, your costs are higher than expected, your app hallucinates facts.

Understanding how LLMs work — even at a high level — makes you dramatically better at using them. You'll write better prompts, make smarter architectural decisions, and debug failures faster.


What Is a Large Language Model?

An LLM (Large Language Model) is a neural network trained on enormous amounts of text. Its one job: predict the next token given all previous tokens.

That's it. Everything else — code generation, reasoning, summarisation, translation — is an emergent behaviour of doing next-token prediction extremely well, on a huge dataset, with a very large model.

Input:  "The capital of France is"
Output: "Paris"   ← the most probable next token

The model doesn't "know" facts in the way a database does. It learned statistical patterns: given this sequence of tokens, these tokens are likely to follow.
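To make that concrete, here is a toy illustration of "pick the most probable next token". The numbers are made up — a real model scores every token in a vocabulary of 100k+ entries:

Python
# Illustrative only: a made-up distribution over next tokens
# for the prompt "The capital of France is".
next_token_probs = {
    " Paris": 0.92,
    " Lyon": 0.03,
    " the": 0.02,
    " located": 0.01,
}

# Greedy decoding picks the single most probable token.
prediction = max(next_token_probs, key=next_token_probs.get)
print(prediction)  # " Paris"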


Tokens: The Unit of Everything

LLMs don't process characters or words — they process tokens. A token is roughly 3–4 characters, or about ¾ of a word in English.

"Hello, world!"      → ["Hello", ",", " world", "!"]           — 4 tokens
"Unbelievable!"      → ["Un", "bel", "iev", "able", "!"]       — 5 tokens
"def calculate():"   → ["def", " calculate", "():", ]          — 3 tokens

Why this matters:

  • Cost is measured in tokens — input tokens + output tokens
  • Context window = maximum tokens the model can process at once
  • Long documents may need chunking before being sent to the API
  • Some words cost more tokens than others (technical jargon, non-English text)
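You don't have to guess token counts — OpenAI's tiktoken library counts them locally. A minimal sketch, assuming a recent tiktoken release that ships the o200k_base encoding used by the gpt-4o family:

Python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by gpt-4o / gpt-4o-mini

text = "Unbelievable!"
tokens = enc.encode(text)
print(len(tokens))                         # how many tokens you'll be billed for
print([enc.decode([t]) for t in tokens])   # the individual token strings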

GPT-4o token pricing (approximate, 2026)

| Model       | Input             | Output            |
|-------------|-------------------|-------------------|
| gpt-4o      | $2.50 / 1M tokens | $10 / 1M tokens   |
| gpt-4o-mini | $0.15 / 1M tokens | $0.60 / 1M tokens |
| o1          | $15 / 1M tokens   | $60 / 1M tokens   |

A 1,000-word document is roughly 1,300 tokens. 1,000 API calls with a 1,000-token prompt cost ~$2.50 in input tokens on gpt-4o. gpt-4o-mini is roughly 16Ɨ cheaper and handles most tasks well.
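The arithmetic is worth making explicit. A quick sketch, extended with output tokens (prices per the table above; the 200-token replies are an assumption):

Python
# Back-of-envelope cost estimate at gpt-4o prices (USD per 1M tokens).
INPUT_PRICE, OUTPUT_PRICE = 2.50, 10.00

calls, input_tokens, output_tokens = 1_000, 1_000, 200   # assume 200-token replies
cost = (calls * input_tokens * INPUT_PRICE +
        calls * output_tokens * OUTPUT_PRICE) / 1_000_000
print(f"${cost:.2f}")   # $4.50 — $2.50 input + $2.00 output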


The Context Window

The context window is the maximum number of tokens the model can "see" at once — both your input and its output combined.

| Model             | Context Window              |
|-------------------|-----------------------------|
| gpt-4o            | 128k tokens (~96,000 words) |
| gpt-4o-mini       | 128k tokens                 |
| Claude 3.7 Sonnet | 200k tokens                 |
| Gemini 1.5 Pro    | 2M tokens                   |

Within the context window, the model can reference anything. Beyond it, the model forgets. This is why long conversations eventually lose earlier context — the oldest messages get dropped to stay within the window.
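Chat applications usually handle this explicitly. A rough sketch of one common strategy — keep the system message and drop the oldest turns until the conversation fits a token budget. count_tokens and trim_history are hypothetical helpers, not part of any SDK:

Python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def count_tokens(message: dict) -> int:
    # Content tokens plus a few tokens of per-message formatting overhead.
    return len(enc.encode(message["content"])) + 4

def trim_history(messages: list[dict], budget: int = 100_000) -> list[dict]:
    system, rest = messages[:1], messages[1:]   # assume messages[0] is the system prompt
    while rest and sum(map(count_tokens, system + rest)) > budget:
        rest.pop(0)                             # drop the oldest user/assistant turn
    return system + rest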


Embeddings: Meaning as Numbers

Embeddings convert text into a vector of numbers (e.g., 1,536 numbers for text-embedding-3-small). Similar text produces similar vectors — vectors that are "close" in mathematical space.

"King"  → [0.23, -0.11, 0.87, ...]
"Queen" → [0.24, -0.10, 0.88, ...]   ← very similar vector
"Apple" → [-0.45, 0.67, -0.12, ...]  ← different direction entirely

Embeddings power:

  • Semantic search — find documents by meaning, not just keywords
  • RAG (Retrieval-Augmented Generation) — find relevant context to inject before the LLM answers
  • Recommendation systems — "similar items"
  • Clustering — group content by topic automatically
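A minimal sketch of measuring similarity with the OpenAI embeddings endpoint plus cosine similarity (the API key is a placeholder):

Python
# pip install openai numpy
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="sk-...")

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["king", "queen", "apple"],
)
king, queen, apple = (np.array(item.embedding) for item in resp.data)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(king, queen))  # high — related meanings
print(cosine(king, apple))  # lower — unrelated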

Temperature and Sampling

Temperature controls how random the output is:

Temperature = 0.0  → always picks the most probable token — deterministic, repetitive
Temperature = 0.7  → some randomness — good for most tasks  
Temperature = 1.0  → higher creativity — more varied, may drift
Temperature = 2.0  → very random — often incoherent
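Numerically, temperature rescales the model's logits before the softmax that turns them into probabilities — low temperature sharpens the distribution, high temperature flattens it. A quick numpy illustration with made-up logits:

Python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])   # made-up scores for three candidate tokens

def probs(logits, temperature):
    z = logits / temperature
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

for t in (0.1, 0.7, 2.0):
    print(t, probs(logits, t).round(3))
# t=0.1 → almost all mass on the top token; t=2.0 → nearly uniform
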
C#
// Low temperature for factual/code tasks — near-deterministic
var options = new ChatCompletionOptions { Temperature = 0.1f };
var response = await client
    .GetChatClient("gpt-4o-mini")
    .CompleteChatAsync(messages, options);   // messages: your List<ChatMessage>

// Higher temperature for creative writing
var creative = new ChatCompletionOptions { Temperature = 0.9f };
var story = await client
    .GetChatClient("gpt-4o")
    .CompleteChatAsync(messages, creative);

The Transformer Architecture (Simplified)

The "transformer" is the neural network architecture behind every major LLM. You don't need to implement one, but understanding the key ideas helps:

Self-attention: Every token looks at every other token in the context and learns which ones are relevant to it. This is how "it" in "The trophy didn't fit in the suitcase because it was too big" correctly resolves to "trophy" — the model learns to attend to the right antecedent.
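The mechanics are surprisingly compact. A toy numpy sketch of scaled dot-product attention — the core of self-attention — for five tokens, with random vectors standing in for learned projections and the causal mask a real decoder applies omitted:

Python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 5, 8
Q = rng.standard_normal((n_tokens, d))   # a query vector per token
K = rng.standard_normal((n_tokens, d))   # a key vector per token
V = rng.standard_normal((n_tokens, d))   # a value vector per token

scores = Q @ K.T / np.sqrt(d)            # relevance of every token to every other token
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax → attention weights
output = weights @ V                     # each token's output mixes all value vectors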

Layers: Transformers stack many attention layers. Each layer learns increasingly abstract representations — early layers learn syntax, later layers learn semantics and reasoning.

Pre-training + Fine-tuning: The base model is pre-trained on trillions of tokens of internet text (predicting the next token). Then it's fine-tuned using human feedback (RLHF) to be helpful, harmless, and honest.

Inference: At generation time, the model runs a forward pass through all layers for each token it generates. This is why generation is sequential (one token at a time) and why longer outputs cost more.
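In code, generation is just a loop — one forward pass, one new token, append, repeat. A toy sketch with greedy decoding; toy_forward is a hypothetical stand-in for a real model, which would return logits over its full vocabulary:

Python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EOS = 100, 0

def toy_forward(tokens):
    # Stand-in for a transformer forward pass over the whole sequence.
    return rng.standard_normal(VOCAB_SIZE)

def generate(prompt, max_new_tokens=20):
    tokens = list(prompt)
    for _ in range(max_new_tokens):            # one forward pass per new token
        logits = toy_forward(tokens)
        next_token = int(np.argmax(logits))    # greedy: pick the most probable token
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

print(generate([42, 7]))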


The AI Landscape

OpenAI         → GPT-4o, GPT-4o-mini, o1, o3 — market leader, best tooling
Anthropic      → Claude 3.7 Sonnet, Claude Opus — strongest for reasoning/code
Google         → Gemini 1.5 Pro, Gemini Flash — 2M context window
Meta           → Llama 3.x — open weights, run locally or on your own infra
Mistral        → Mistral Large, Mixtral — European, efficient models
Microsoft      → Azure OpenAI — OpenAI models hosted on Azure, compliance friendly

For most developers building on Azure:

  • Use Azure OpenAI (gpt-4o / gpt-4o-mini) — same models as OpenAI, Azure compliance, SLAs
  • Use gpt-4o-mini for high-volume tasks (it's fast and cheap)
  • Use gpt-4o or o1 for reasoning-heavy tasks

Your First API Call — .NET

Bash
dotnet add package Azure.AI.OpenAI
# or for standard OpenAI:
dotnet add package OpenAI
JSON
// appsettings.json
{
  "OpenAI": {
    "ApiKey": "sk-...",
    "Model":  "gpt-4o-mini"
  }
}
C#
// Program.cs — register the client
builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IConfiguration>();
    return new OpenAIClient(config["OpenAI:ApiKey"]);
});
C#
// Simple completion
public class AiService
{
    private readonly OpenAIClient _client;
    private readonly string _model = "gpt-4o-mini";

    public AiService(OpenAIClient client) => _client = client;

    public async Task<string> AskAsync(string question, CancellationToken ct = default)
    {
        var options = new ChatCompletionOptions
        {
            Temperature = 0.3f,
            MaxOutputTokenCount = 1024,
        };

        var messages = new List<ChatMessage>
        {
            ChatMessage.CreateSystemMessage("You are a helpful assistant for developers."),
            ChatMessage.CreateUserMessage(question),
        };

        var response = await _client
            .GetChatClient(_model)
            .CompleteChatAsync(messages, options, ct);

        return response.Value.Content[0].Text;
    }
}
C#
// Controller
[ApiController]
[Route("api/ai")]
public class AiController : ControllerBase
{
    private readonly AiService _ai;
    public AiController(AiService ai) => _ai = ai;

    [HttpPost("ask")]
    public async Task<IActionResult> Ask([FromBody] AskRequest req, CancellationToken ct)
    {
        var answer = await _ai.AskAsync(req.Question, ct);
        return Ok(new { answer });
    }
}

Streaming Responses

For chat UIs, stream the response token by token instead of waiting for the full reply:

C#
// Requires: using System.Runtime.CompilerServices; (for [EnumeratorCancellation])
public async IAsyncEnumerable<string> StreamAsync(
    string question,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    var messages = new List<ChatMessage>
    {
        ChatMessage.CreateSystemMessage("You are a helpful assistant."),
        ChatMessage.CreateUserMessage(question),
    };

    await foreach (var chunk in _client
        .GetChatClient(_model)
        .CompleteChatStreamingAsync(messages, cancellationToken: ct))
    {
        foreach (var part in chunk.ContentUpdate)
            yield return part.Text;
    }
}
C#
// Stream from a Minimal API endpoint
app.MapGet("/api/ai/stream", async (string q, AiService ai, HttpResponse response, CancellationToken ct) =>
{
    response.ContentType = "text/event-stream";
    await foreach (var token in ai.StreamAsync(q, ct))
    {
        await response.WriteAsync($"data: {token}\n\n", ct);
        await response.Body.FlushAsync(ct);
    }
});

Your First API Call — Python

Python
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.3,
    messages=[
        {"role": "system", "content": "You are a helpful developer assistant."},
        {"role": "user",   "content": "Explain async/await in one paragraph."},
    ]
)

print(response.choices[0].message.content)
Python
# Streaming
for chunk in client.chat.completions.create(
    model="gpt-4o-mini",
    stream=True,
    messages=[{"role": "user", "content": "Write a haiku about Python."}]
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Token Usage and Cost Tracking

Always track your token usage in production:

C#
var response = await chatClient.CompleteChatAsync(messages, options, ct);

var usage = response.Value.Usage;
_logger.LogInformation(
    "OpenAI call — model: {Model}, input: {Input} tokens, output: {Output} tokens, total: {Total}",
    _model,
    usage.InputTokenCount,
    usage.OutputTokenCount,
    usage.TotalTokenCount);

// Store in metrics
_metrics.RecordTokenUsage(usage.InputTokenCount, usage.OutputTokenCount);
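The Python SDK exposes the same numbers on every non-streaming response (reusing the client from the Python section above):

Python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)

u = response.usage
print(f"input={u.prompt_tokens} output={u.completion_tokens} total={u.total_tokens}")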

Choosing the Right Model

Task                                → Best Model
─────────────────────────────────────────────────
Simple Q&A, summarisation           → gpt-4o-mini (cheap, fast)
Complex reasoning, multi-step       → gpt-4o or o1
Code generation                     → gpt-4o or claude-3.7-sonnet
Long document analysis (100k+ words) → Gemini 1.5 Pro (2M context)
Running locally / no API cost       → Llama 3 or Mistral via Ollama
Production on Azure (compliance)    → Azure OpenAI (gpt-4o-mini)

Rule: default to gpt-4o-mini. Only upgrade when it fails on your task.


Key Takeaways

  • LLMs predict the next token — every other capability emerges from doing that extremely well, at scale, on massive data
  • Tokens are the unit of cost and context — 1,000 words ā‰ˆ 1,300 tokens
  • Temperature controls randomness — use low values for code/facts, higher for creativity
  • Embeddings convert text to vectors — the foundation for semantic search and RAG
  • Context window is the model's working memory — beyond it, the model forgets
  • gpt-4o-mini handles 90% of tasks at roughly 1/16th the cost of gpt-4o — start there
  • Stream responses in any user-facing app — nobody wants to wait 10 seconds for a reply
  • Understand your token usage — a poorly designed prompt loop can cost 100Ɨ more than it should

Enjoyed this article?

Explore the AI Systems learning path for more.
