Ollama: Run Powerful AI Models Locally — No API Keys, No Cost
The complete developer guide to Ollama — install and run Llama 3, Mistral, Gemma, and Phi-4 locally, build .NET and Python apps against local models, and understand when local AI beats cloud AI.
Why Local AI Is Having a Moment
For two years, "using AI" meant calling an API, paying per token, and sending your data to a cloud provider. That's still the right choice for many production use cases — but a serious alternative has quietly matured.
Ollama makes running large language models locally as simple as ollama run llama3. One command, no API keys, no billing, no data leaving your machine.
For developers building internal tools, prototyping AI features, or working with sensitive data, local models have crossed the threshold from "interesting experiment" to "genuinely viable."
What Ollama Is
Ollama is an open-source runtime that manages downloading, running, and serving local LLMs. It handles:
- Model downloads and caching from the Ollama model library
- GPU/CPU inference with automatic hardware detection
- An OpenAI-compatible REST API (so your existing code often works unchanged)
- Model versioning and multiple concurrent models
Under the hood it builds on llama.cpp, with Metal acceleration on Apple silicon and CUDA acceleration on NVIDIA GPUs.
Installation
# macOS / Linux — single command
curl -fsSL https://ollama.com/install.sh | sh
# Windows — download from https://ollama.com/download
# Or via winget:
winget install Ollama.Ollama
# Verify
ollama --version
Running Your First Model
# Pull and run Llama 3.2 (3B — fast, low RAM)
ollama run llama3.2
# Pull without running
ollama pull llama3.2
# Run in the terminal
>>> What is the difference between a process and a thread?
That's it. The model downloads once (~2GB) and runs locally from that point forward.
The Model Library
Ollama's library covers every major open-source model family. Here's a practical guide:
| Model | Size | RAM Needed | Best For |
|---|---|---|---|
| llama3.2:3b | 2GB | 4GB | Fast, everyday tasks |
| llama3.1:8b | 5GB | 8GB | Balanced quality/speed |
| llama3.1:70b | 40GB | 64GB | Near GPT-4 quality |
| mistral:7b | 4GB | 8GB | Instruction following |
| gemma3:4b | 3GB | 6GB | Google's small model |
| phi4:14b | 8GB | 16GB | Microsoft, great at reasoning |
| codellama:13b | 8GB | 16GB | Code generation |
| deepseek-r1:8b | 5GB | 8GB | Reasoning (shows thinking) |
| qwen2.5-coder:7b | 4GB | 8GB | Code-specific tasks |
| nomic-embed-text | 274MB | 1GB | Embeddings/RAG |
# List all available models
ollama list
# Pull specific version
ollama pull mistral:7b-instruct-v0.3
# Remove a model
ollama rm codellama:13b
The API
Once Ollama is running, it exposes a local REST API on http://localhost:11434.
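Before wiring anything up, you can sanity-check that the server is reachable and see which models are installed via the /api/tags endpoint. A minimal sketch in Python, assuming the requests package is available:
import requests
# /api/tags lists the models installed locally (name, size, last modified)
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"], "-", model["size"], "bytes")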
OpenAI-Compatible Endpoint
# Drop-in replacement for OpenAI — change the base URL, keep your code
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{ "role": "system", "content": "You are a senior .NET engineer." },
{ "role": "user", "content": "Explain the difference between IEnumerable and IQueryable." }
],
"stream": false
}'
Native Ollama API
# Generate
curl http://localhost:11434/api/generate \
-d '{"model":"llama3.2","prompt":"Write a SQL query that...","stream":false}'
# Chat
curl http://localhost:11434/api/chat \
-d '{
"model": "mistral",
"messages": [{"role":"user","content":"Review this code: ..."}]
}'
# Embeddings
curl http://localhost:11434/api/embed \
-d '{"model":"nomic-embed-text","input":"Customer churned after 3 months"}'Using Ollama in .NET
The OpenAI-compatible endpoint means you can use the official OpenAI .NET SDK with a base URL swap — or use the OllamaSharp library for native Ollama features.
// Option 1: OpenAI SDK pointing at local Ollama
using System.ClientModel;
using OpenAI;
using OpenAI.Chat;
var client = new ChatClient(
model: "llama3.2",
credential: new ApiKeyCredential("ollama"), // any string works
options: new OpenAIClientOptions {
Endpoint = new Uri("http://localhost:11434/v1")
}
);
var response = await client.CompleteChatAsync(
new UserChatMessage("Explain async/await in C# in 3 sentences.")
);
Console.WriteLine(response.Value.Content[0].Text);
// Option 2: OllamaSharp (native, full feature set)
using OllamaSharp;
var ollama = new OllamaApiClient("http://localhost:11434");
ollama.SelectedModel = "llama3.2";
// Streaming response
await foreach (var token in ollama.GenerateAsync("Explain CQRS pattern"))
    Console.Write(token?.Response);
// Interactive chat: the Chat helper keeps the conversation history internally
var chat = new Chat(ollama, "You are a .NET expert.");
while (true)
{
    Console.Write("You: ");
    var input = Console.ReadLine() ?? "";
    Console.Write("Assistant: ");
    await foreach (var token in chat.SendAsync(input))
        Console.Write(token);
    Console.WriteLine();
}
Using Ollama in Python
# pip install ollama
import ollama
# Simple generation
response = ollama.generate(model='llama3.2', prompt='Explain Redis sorted sets')
print(response['response'])
# Chat
response = ollama.chat(
model='mistral',
messages=[
{'role': 'system', 'content': 'You are a database expert.'},
{'role': 'user', 'content': 'When should I use MongoDB over PostgreSQL?'},
]
)
print(response['message']['content'])
# Streaming
for chunk in ollama.generate(model='llama3.2', prompt='List 5 Python tips', stream=True):
print(chunk['response'], end='', flush=True)
# Embeddings — for RAG pipelines
embed = ollama.embed(model='nomic-embed-text', input='Your document text here')
vector = embed['embeddings'][0]  # list of 768 floats
Building a Local RAG Pipeline
import ollama
import numpy as np
from pathlib import Path
# Simple in-memory vector store (use ChromaDB or pgvector in production)
class LocalVectorStore:
def __init__(self):
self.docs: list[str] = []
self.embeddings: list[list[float]] = []
def add(self, text: str):
result = ollama.embed(model='nomic-embed-text', input=text)
self.embeddings.append(result['embeddings'][0])
self.docs.append(text)
def search(self, query: str, top_k: int = 3) -> list[str]:
q_embed = ollama.embed(model='nomic-embed-text', input=query)['embeddings'][0]
scores = [
np.dot(q_embed, doc_e) / (np.linalg.norm(q_embed) * np.linalg.norm(doc_e))
for doc_e in self.embeddings
]
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
return [self.docs[i] for i in top]
# Index your documents
store = LocalVectorStore()
for path in Path("docs/").glob("*.txt"):
store.add(path.read_text())
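# Note: this indexes each file whole. For longer documents you would normally split
# the text into chunks before calling store.add(); a minimal, hypothetical fixed-size
# chunker (the size/overlap values are illustrative, not tuned):
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
# for piece in chunk(path.read_text()):
#     store.add(piece)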
# Query with context
def ask(question: str) -> str:
context = "\n\n".join(store.search(question))
prompt = f"""Answer the question using only the context below.
If the answer isn't in the context, say "I don't know."
Context:
{context}
Question: {question}"""
response = ollama.generate(model='llama3.2', prompt=prompt)
return response['response']
print(ask("What is our refund policy?"))
Modelfiles: Customise Any Model
# Create a specialized assistant
# Save as Modelfile
FROM llama3.2
SYSTEM """
You are a senior .NET architect with 15 years of experience.
You answer questions concisely, with code examples when relevant.
You prefer Clean Architecture and always consider testability.
When reviewing code, you identify performance issues first.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
# Build your custom model
ollama create dotnet-expert -f Modelfile
# Use it
ollama run dotnet-expert
Local vs Cloud: When to Use Each
| Scenario | Local (Ollama) | Cloud (OpenAI/Claude) |
|---|---|---|
| Sensitive data | ✅ Data never leaves | ❌ Data sent to provider |
| High volume | ✅ No per-token cost | ❌ Can get expensive |
| Best quality | ❌ Smaller models | ✅ GPT-4o, Claude 3.5 |
| No internet | ✅ Works offline | ❌ Requires connection |
| Code completion (large) | ❌ Slower | ✅ Fast, accurate |
| RAG on private docs | ✅ Ideal | ⚠ Possible but pricier |
| Prototyping / dev | ✅ Free, fast iteration | ⚠ Costs add up |
| Long documents | ⚠ Context limits vary | ✅ 128k+ context windows |
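Because Ollama speaks the OpenAI wire protocol, switching between local and cloud can be a configuration change rather than a code change. A minimal sketch with the openai Python package; the LLM_BASE_URL, LLM_API_KEY and LLM_MODEL variable names are made up for this example:
import os
from openai import OpenAI
# Defaults target local Ollama; override the env vars to point at a cloud provider.
client = OpenAI(
    base_url=os.getenv("LLM_BASE_URL", "http://localhost:11434/v1"),
    api_key=os.getenv("LLM_API_KEY", "ollama"),  # Ollama ignores the key value
)
response = client.chat.completions.create(
    model=os.getenv("LLM_MODEL", "llama3.2"),
    messages=[{"role": "user", "content": "Summarise the CAP theorem in two sentences."}],
)
print(response.choices[0].message.content)
Development and tests run against the free local model; production flips three environment variables.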
Running Ollama as a Server
# Start as a background service
ollama serve &
# Or set environment variables for network access
OLLAMA_HOST=0.0.0.0:11434 ollama serve
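Once the server is listening on the network, other machines can use it by pointing the client at that host. A small sketch with the ollama Python package; the IP address below is a placeholder for whatever machine runs ollama serve:
from ollama import Client
# 192.168.1.50 is a placeholder for the host running `ollama serve`
remote = Client(host="http://192.168.1.50:11434")
reply = remote.chat(model="llama3.2", messages=[{"role": "user", "content": "ping"}])
print(reply["message"]["content"])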
# Docker
docker run -d \
--gpus=all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
# With Open WebUI (browser UI like ChatGPT)
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
ghcr.io/open-webui/open-webui:main
Key Takeaways
- Ollama is production-ready for internal tools, RAG on private documents, and development workflows where data privacy matters.
- 7B–14B models hit a sweet spot: fast enough on a modern laptop, capable enough for most developer tasks.
- The OpenAI-compatible API means migrating between local and cloud is a one-line change — useful for testing and cost management.
- Embeddings + nomic-embed-text make building local RAG pipelines trivial and completely free to run.
- Modelfiles let you bake in system prompts and parameters — create specialised assistants that behave consistently without prompt engineering in every call.
- For production customer-facing AI, cloud models still win on quality and context window size. For everything else, local is increasingly the smarter default.