AI Systems Ā· Beginner

How AI & LLMs Actually Work: A Developer's Guide

Understand what's really happening inside ChatGPT and LLMs — tokens, embeddings, attention, the transformer architecture, and how to use the OpenAI API in .NET and Python with real code examples.

Learnixo Ā· April 14, 2026 Ā· 8 min read
AI Ā· LLM Ā· OpenAI Ā· ChatGPT Ā· Machine Learning Ā· .NET Ā· Python

You Don't Need a PhD to Use AI

Most developers treat AI APIs as a black box — you send text in, magic comes out. That works until it doesn't: your outputs are inconsistent, your costs are higher than expected, your app hallucinates facts.

Understanding how LLMs work — even at a high level — makes you dramatically better at using them. You'll write better prompts, make smarter architectural decisions, and debug failures faster.


What Is a Large Language Model?

An LLM (Large Language Model) is a neural network trained on enormous amounts of text. Its one job: predict the next token given all previous tokens.

That's it. Everything else — code generation, reasoning, summarisation, translation — is an emergent behaviour of doing next-token prediction extremely well, on a huge dataset, with a very large model.

Input:  "The capital of France is"
Output: "Paris"   ← the most probable next token

The model doesn't "know" facts in the way a database does. It learned statistical patterns: given this sequence of tokens, these tokens are likely to follow.
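To make that concrete, here is a toy illustration of "pick the most probable next token". The numbers are made up — a real model scores every token in a vocabulary of 100k+ entries:

Python
# Illustrative only: a made-up distribution over next tokens
# for the prompt "The capital of France is".
next_token_probs = {
    " Paris": 0.92,
    " Lyon": 0.03,
    " the": 0.02,
    " located": 0.01,
}

# Greedy decoding picks the single most probable token.
prediction = max(next_token_probs, key=next_token_probs.get)
print(prediction)  # " Paris"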


Tokens: The Unit of Everything

LLMs don't process characters or words — they process tokens. A token is roughly 3–4 characters, or about ¾ of a word in English.

"Hello, world!"      → ["Hello", ",", " world", "!"]           — 4 tokens
"Unbelievable!"      → ["Un", "bel", "iev", "able", "!"]       — 5 tokens
"def calculate():"   → ["def", " calculate", "():", ]          — 3 tokens

Why this matters:

  • Cost is measured in tokens — input tokens + output tokens
  • Context window = maximum tokens the model can process at once
  • Long documents may need chunking before being sent to the API
  • Some words cost more tokens than others (technical jargon, non-English text)
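You don't have to guess token counts — OpenAI's tiktoken library counts them locally. A minimal sketch, assuming a recent tiktoken release that ships the o200k_base encoding used by the gpt-4o family:

Python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by gpt-4o / gpt-4o-mini

text = "Unbelievable!"
tokens = enc.encode(text)
print(len(tokens))                         # how many tokens you'll be billed for
print([enc.decode([t]) for t in tokens])   # the individual token strings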

GPT-4o token pricing (approximate, 2026)

| Model       | Input             | Output            |
|-------------|-------------------|-------------------|
| gpt-4o      | $2.50 / 1M tokens | $10 / 1M tokens   |
| gpt-4o-mini | $0.15 / 1M tokens | $0.60 / 1M tokens |
| o1          | $15 / 1M tokens   | $60 / 1M tokens   |

A 1,000-word document is roughly 1,300 tokens. 1,000 API calls with a 1,000-token prompt cost ~$2.50 in input tokens on gpt-4o. gpt-4o-mini is roughly 16Ɨ cheaper and handles most tasks well.
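The arithmetic is worth making explicit. A quick sketch, extended with output tokens (prices per the table above; the 200-token replies are an assumption):

Python
# Back-of-envelope cost estimate at gpt-4o prices (USD per 1M tokens).
INPUT_PRICE, OUTPUT_PRICE = 2.50, 10.00

calls, input_tokens, output_tokens = 1_000, 1_000, 200   # assume 200-token replies
cost = (calls * input_tokens * INPUT_PRICE +
        calls * output_tokens * OUTPUT_PRICE) / 1_000_000
print(f"${cost:.2f}")   # $4.50 — $2.50 input + $2.00 output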


The Context Window

The context window is the maximum number of tokens the model can "see" at once — both your input and its output combined.

| Model             | Context Window              |
|-------------------|-----------------------------|
| gpt-4o            | 128k tokens (~96,000 words) |
| gpt-4o-mini       | 128k tokens                 |
| Claude 3.7 Sonnet | 200k tokens                 |
| Gemini 1.5 Pro    | 2M tokens                   |

Within the context window, the model can reference anything. Beyond it, the model forgets. This is why long conversations eventually lose earlier context — the oldest messages get dropped to stay within the window.
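Chat applications usually handle this explicitly. A rough sketch of one common strategy — keep the system message and drop the oldest turns until the conversation fits a token budget. count_tokens and trim_history are hypothetical helpers, not part of any SDK:

Python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def count_tokens(message: dict) -> int:
    # Content tokens plus a few tokens of per-message formatting overhead.
    return len(enc.encode(message["content"])) + 4

def trim_history(messages: list[dict], budget: int = 100_000) -> list[dict]:
    system, rest = messages[:1], messages[1:]   # assume messages[0] is the system prompt
    while rest and sum(map(count_tokens, system + rest)) > budget:
        rest.pop(0)                             # drop the oldest user/assistant turn
    return system + rest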


Embeddings: Meaning as Numbers

Embeddings convert text into a vector of numbers (e.g., 1,536 numbers for text-embedding-3-small). Similar text produces similar vectors — vectors that are "close" in mathematical space.

"King"  → [0.23, -0.11, 0.87, ...]
"Queen" → [0.24, -0.10, 0.88, ...]   ← very similar vector
"Apple" → [-0.45, 0.67, -0.12, ...]  ← different direction entirely

Embeddings power:

  • Semantic search — find documents by meaning, not just keywords
  • RAG (Retrieval-Augmented Generation) — find relevant context to inject before the LLM answers
  • Recommendation systems — "similar items"
  • Clustering — group content by topic automatically
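A minimal sketch of measuring similarity with the OpenAI embeddings endpoint plus cosine similarity (the API key is a placeholder):

Python
# pip install openai numpy
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="sk-...")

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["king", "queen", "apple"],
)
king, queen, apple = (np.array(item.embedding) for item in resp.data)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(king, queen))  # high — related meanings
print(cosine(king, apple))  # lower — unrelated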

Temperature and Sampling

Temperature controls how random the output is:

Temperature = 0.0  → always picks the most probable token — deterministic, repetitive
Temperature = 0.7  → some randomness — good for most tasks  
Temperature = 1.0  → higher creativity — more varied, may drift
Temperature = 2.0  → very random — often incoherent
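Numerically, temperature rescales the model's logits before the softmax that turns them into probabilities — low temperature sharpens the distribution, high temperature flattens it. A quick numpy illustration with made-up logits:

Python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])   # made-up scores for three candidate tokens

def probs(logits, temperature):
    z = logits / temperature
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

for t in (0.1, 0.7, 2.0):
    print(t, probs(logits, t).round(3))
# t=0.1 → almost all mass on the top token; t=2.0 → nearly uniform
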
C#
// Low temperature for factual/code tasks — near-deterministic
var options = new ChatCompletionOptions { Temperature = 0.1f };
var response = await client
    .GetChatClient("gpt-4o-mini")
    .CompleteChatAsync(messages, options);   // messages: your List<ChatMessage>

// Higher temperature for creative writing
var creative = new ChatCompletionOptions { Temperature = 0.9f };
var story = await client
    .GetChatClient("gpt-4o")
    .CompleteChatAsync(messages, creative);

The Transformer Architecture (Simplified)

The "transformer" is the neural network architecture behind every major LLM. You don't need to implement one, but understanding the key ideas helps:

Self-attention: Every token looks at every other token in the context and learns which ones are relevant to it. This is how "it" in "The trophy didn't fit in the suitcase because it was too big" correctly resolves to "trophy" — the model learns to attend to the right antecedent.
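The mechanics are surprisingly compact. A toy numpy sketch of scaled dot-product attention — the core of self-attention — for five tokens, with random vectors standing in for learned projections and the causal mask a real decoder applies omitted:

Python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 5, 8
Q = rng.standard_normal((n_tokens, d))   # a query vector per token
K = rng.standard_normal((n_tokens, d))   # a key vector per token
V = rng.standard_normal((n_tokens, d))   # a value vector per token

scores = Q @ K.T / np.sqrt(d)            # relevance of every token to every other token
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax → attention weights
output = weights @ V                     # each token's output mixes all value vectors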

Layers: Transformers stack many attention layers. Each layer learns increasingly abstract representations — early layers learn syntax, later layers learn semantics and reasoning.

Pre-training + Fine-tuning: The base model is pre-trained on trillions of tokens of internet text (predicting the next token). Then it's fine-tuned using human feedback (RLHF) to be helpful, harmless, and honest.

Inference: At generation time, the model runs a forward pass through all layers for each token it generates. This is why generation is sequential (one token at a time) and why longer outputs cost more.
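In code, generation is just a loop — one forward pass, one new token, append, repeat. A toy sketch with greedy decoding; toy_forward is a hypothetical stand-in for a real model, which would return logits over its full vocabulary:

Python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EOS = 100, 0

def toy_forward(tokens):
    # Stand-in for a transformer forward pass over the whole sequence.
    return rng.standard_normal(VOCAB_SIZE)

def generate(prompt, max_new_tokens=20):
    tokens = list(prompt)
    for _ in range(max_new_tokens):            # one forward pass per new token
        logits = toy_forward(tokens)
        next_token = int(np.argmax(logits))    # greedy: pick the most probable token
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

print(generate([42, 7]))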


The AI Landscape

OpenAI         → GPT-4o, GPT-4o-mini, o1, o3 — market leader, best tooling
Anthropic      → Claude 3.7 Sonnet, Claude Opus — strongest for reasoning/code
Google         → Gemini 1.5 Pro, Gemini Flash — 2M context window
Meta           → Llama 3.x — open weights, run locally or on your own infra
Mistral        → Mistral Large, Mixtral — European, efficient models
Microsoft      → Azure OpenAI — OpenAI models hosted on Azure, compliance friendly

For most developers building on Azure:

  • Use Azure OpenAI (gpt-4o / gpt-4o-mini) — same models as OpenAI, Azure compliance, SLAs
  • Use gpt-4o-mini for high-volume tasks (it's fast and cheap)
  • Use gpt-4o or o1 for reasoning-heavy tasks

Your First API Call — .NET

Bash
dotnet add package Azure.AI.OpenAI
# or for standard OpenAI:
dotnet add package OpenAI
JSON
// appsettings.json
{
  "OpenAI": {
    "ApiKey": "sk-...",
    "Model":  "gpt-4o-mini"
  }
}
C#
// Program.cs — register the client
builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IConfiguration>();
    return new OpenAIClient(config["OpenAI:ApiKey"]);
});
C#
// Simple completion
public class AiService
{
    private readonly OpenAIClient _client;
    private readonly string _model = "gpt-4o-mini";

    public AiService(OpenAIClient client) => _client = client;

    public async Task<string> AskAsync(string question, CancellationToken ct = default)
    {
        var options = new ChatCompletionOptions
        {
            Temperature = 0.3f,
            MaxOutputTokenCount = 1024,
        };

        var messages = new List<ChatMessage>
        {
            ChatMessage.CreateSystemMessage("You are a helpful assistant for developers."),
            ChatMessage.CreateUserMessage(question),
        };

        var response = await _client
            .GetChatClient(_model)
            .CompleteChatAsync(messages, options, ct);

        return response.Value.Content[0].Text;
    }
}
C#
// Controller
[ApiController]
[Route("api/ai")]
public class AiController : ControllerBase
{
    private readonly AiService _ai;
    public AiController(AiService ai) => _ai = ai;

    [HttpPost("ask")]
    public async Task<IActionResult> Ask([FromBody] AskRequest req, CancellationToken ct)
    {
        var answer = await _ai.AskAsync(req.Question, ct);
        return Ok(new { answer });
    }
}

Streaming Responses

For chat UIs, stream the response token by token instead of waiting for the full reply:

C#
// Requires: using System.Runtime.CompilerServices; (for [EnumeratorCancellation])
public async IAsyncEnumerable<string> StreamAsync(
    string question,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    var messages = new List<ChatMessage>
    {
        ChatMessage.CreateSystemMessage("You are a helpful assistant."),
        ChatMessage.CreateUserMessage(question),
    };

    await foreach (var chunk in _client
        .GetChatClient(_model)
        .CompleteChatStreamingAsync(messages, cancellationToken: ct))
    {
        foreach (var part in chunk.ContentUpdate)
            yield return part.Text;
    }
}
C#
// Stream from a Minimal API endpoint
app.MapGet("/api/ai/stream", async (string q, AiService ai, HttpResponse response, CancellationToken ct) =>
{
    response.ContentType = "text/event-stream";
    await foreach (var token in ai.StreamAsync(q, ct))
    {
        await response.WriteAsync($"data: {token}\n\n", ct);
        await response.Body.FlushAsync(ct);
    }
});

Your First API Call — Python

Python
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.3,
    messages=[
        {"role": "system", "content": "You are a helpful developer assistant."},
        {"role": "user",   "content": "Explain async/await in one paragraph."},
    ]
)

print(response.choices[0].message.content)
Python
# Streaming
for chunk in client.chat.completions.create(
    model="gpt-4o-mini",
    stream=True,
    messages=[{"role": "user", "content": "Write a haiku about Python."}]
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Token Usage and Cost Tracking

Always track your token usage in production:

C#
var response = await chatClient.CompleteChatAsync(messages, options, ct);

var usage = response.Value.Usage;
_logger.LogInformation(
    "OpenAI call — model: {Model}, input: {Input} tokens, output: {Output} tokens, total: {Total}",
    _model,
    usage.InputTokenCount,
    usage.OutputTokenCount,
    usage.TotalTokenCount);

// Store in metrics
_metrics.RecordTokenUsage(usage.InputTokenCount, usage.OutputTokenCount);
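The Python SDK exposes the same numbers on every non-streaming response (reusing the client from the Python section above):

Python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)

u = response.usage
print(f"input={u.prompt_tokens} output={u.completion_tokens} total={u.total_tokens}")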

Choosing the Right Model

Task                                → Best Model
─────────────────────────────────────────────────
Simple Q&A, summarisation           → gpt-4o-mini (cheap, fast)
Complex reasoning, multi-step       → gpt-4o or o1
Code generation                     → gpt-4o or claude-3.7-sonnet
Long document analysis (100k+ words) → Gemini 1.5 Pro (2M context)
Running locally / no API cost       → Llama 3 or Mistral via Ollama
Production on Azure (compliance)    → Azure OpenAI (gpt-4o-mini)

Rule: default to gpt-4o-mini. Only upgrade when it fails on your task.


Key Takeaways

  • LLMs predict the next token — every other capability emerges from doing that extremely well, at scale, on massive data
  • Tokens are the unit of cost and context — 1,000 words ā‰ˆ 1,300 tokens
  • Temperature controls randomness — use low values for code/facts, higher for creativity
  • Embeddings convert text to vectors — the foundation for semantic search and RAG
  • Context window is the model's working memory — beyond it, the model forgets
  • gpt-4o-mini handles 90% of tasks at roughly 1/16th the cost of gpt-4o — start there
  • Stream responses in any user-facing app — nobody wants to wait 10 seconds for a reply
  • Understand your token usage — a poorly designed prompt loop can cost 100Ɨ more than it should

Enjoyed this article?

Explore the AI Systems learning path for more.
