Ollama: Run Powerful AI Models Locally — No API Keys, No Cost
The complete developer guide to Ollama — install and run Llama 3, Mistral, Gemma, and Phi-4 locally, build .NET and Python apps against local models, and understand when local AI beats cloud AI.
Why Local AI Is Having a Moment
For two years, "using AI" meant calling an API, paying per token, and sending your data to a cloud provider. That's still the right choice for many production use cases — but a serious alternative has quietly matured.
Ollama makes running large language models locally as simple as ollama run llama3. One command, no API keys, no billing, no data leaving your machine.
For developers building internal tools, prototyping AI features, or working with sensitive data, local models have crossed the threshold from "interesting experiment" to "genuinely viable."
What Ollama Is
Ollama is an open-source runtime that manages downloading, running, and serving local LLMs. It handles:
- Model downloads and caching from the Ollama model library
- GPU/CPU inference with automatic hardware detection
- An OpenAI-compatible REST API (so your existing code often works unchanged)
- Model versioning and multiple concurrent models
Under the hood it builds on llama.cpp, with Metal acceleration on Apple silicon and CUDA acceleration on NVIDIA GPUs.
Installation
# macOS / Linux — single command
curl -fsSL https://ollama.com/install.sh | sh
# Windows — download from https://ollama.com/download
# Or via winget:
winget install Ollama.Ollama
# Verify
ollama --version
Running Your First Model
# Pull and run Llama 3.2 (3B — fast, low RAM)
ollama run llama3.2
# Pull without running
ollama pull llama3.2
# Run in the terminal
>>> What is the difference between a process and a thread?
That's it. The model downloads once (~2GB) and runs locally from that point forward.
The Model Library
Ollama's library covers every major open-source model family. Here's a practical guide:
| Model | Size | RAM Needed | Best For |
|---|---|---|---|
| llama3.2:3b | 2GB | 4GB | Fast, everyday tasks |
| llama3.1:8b | 5GB | 8GB | Balanced quality/speed |
| llama3.1:70b | 40GB | 64GB | Near GPT-4 quality |
| mistral:7b | 4GB | 8GB | Instruction following |
| gemma3:4b | 3GB | 6GB | Google's small model |
| phi4:14b | 8GB | 16GB | Microsoft, great at reasoning |
| codellama:13b | 8GB | 16GB | Code generation |
| deepseek-r1:8b | 5GB | 8GB | Reasoning (shows thinking) |
| qwen2.5-coder:7b | 4GB | 8GB | Code-specific tasks |
| nomic-embed-text | 274MB | 1GB | Embeddings/RAG |
# List all available models
ollama list
# Pull specific version
ollama pull mistral:7b-instruct-v0.3
# Remove a model
ollama rm codellama:13b
The API
Once Ollama is running, it exposes a local REST API on http://localhost:11434.
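Before wiring anything up, you can sanity-check that the server is reachable and see which models are installed via the /api/tags endpoint. A minimal sketch in Python, assuming the requests package is available:
import requests
# /api/tags lists the models installed locally (name, size, last modified)
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"], "-", model["size"], "bytes")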
OpenAI-Compatible Endpoint
# Drop-in replacement for OpenAI — change the base URL, keep your code
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{ "role": "system", "content": "You are a senior .NET engineer." },
{ "role": "user", "content": "Explain the difference between IEnumerable and IQueryable." }
],
"stream": false
}'
Native Ollama API
# Generate
curl http://localhost:11434/api/generate \
-d '{"model":"llama3.2","prompt":"Write a SQL query that...","stream":false}'
# Chat
curl http://localhost:11434/api/chat \
-d '{
"model": "mistral",
"messages": [{"role":"user","content":"Review this code: ..."}]
}'
# Embeddings
curl http://localhost:11434/api/embed \
-d '{"model":"nomic-embed-text","input":"Customer churned after 3 months"}'Using Ollama in .NET
The OpenAI-compatible endpoint means you can use the official OpenAI .NET SDK with a base URL swap — or use the OllamaSharp library for native Ollama features.
// Option 1: OpenAI SDK pointing at local Ollama
using System.ClientModel;
using OpenAI;
using OpenAI.Chat;
var client = new ChatClient(
model: "llama3.2",
credential: new ApiKeyCredential("ollama"), // any string works
options: new OpenAIClientOptions {
Endpoint = new Uri("http://localhost:11434/v1")
}
);
var response = await client.CompleteChatAsync(
new UserChatMessage("Explain async/await in C# in 3 sentences.")
);
Console.WriteLine(response.Value.Content[0].Text);
// Option 2: OllamaSharp (native, full feature set)
using OllamaSharp;
var ollama = new OllamaApiClient("http://localhost:11434");
ollama.SelectedModel = "llama3.2";
// Streaming response
await foreach (var token in ollama.GenerateAsync("Explain CQRS pattern"))
    Console.Write(token?.Response);
// Interactive chat: the Chat helper keeps the conversation history internally
var chat = new Chat(ollama, "You are a .NET expert.");
while (true)
{
    Console.Write("You: ");
    var input = Console.ReadLine() ?? "";
    Console.Write("Assistant: ");
    await foreach (var token in chat.SendAsync(input))
        Console.Write(token);
    Console.WriteLine();
}
Using Ollama in Python
# pip install ollama
import ollama
# Simple generation
response = ollama.generate(model='llama3.2', prompt='Explain Redis sorted sets')
print(response['response'])
# Chat
response = ollama.chat(
model='mistral',
messages=[
{'role': 'system', 'content': 'You are a database expert.'},
{'role': 'user', 'content': 'When should I use MongoDB over PostgreSQL?'},
]
)
print(response['message']['content'])
# Streaming
for chunk in ollama.generate(model='llama3.2', prompt='List 5 Python tips', stream=True):
print(chunk['response'], end='', flush=True)
# Embeddings — for RAG pipelines
embed = ollama.embed(model='nomic-embed-text', input='Your document text here')
vector = embed['embeddings'][0]  # list of 768 floats
Building a Local RAG Pipeline
import ollama
import numpy as np
from pathlib import Path
# Simple in-memory vector store (use ChromaDB or pgvector in production)
class LocalVectorStore:
def __init__(self):
self.docs: list[str] = []
self.embeddings: list[list[float]] = []
def add(self, text: str):
result = ollama.embed(model='nomic-embed-text', input=text)
self.embeddings.append(result['embeddings'][0])
self.docs.append(text)
def search(self, query: str, top_k: int = 3) -> list[str]:
q_embed = ollama.embed(model='nomic-embed-text', input=query)['embeddings'][0]
scores = [
np.dot(q_embed, doc_e) / (np.linalg.norm(q_embed) * np.linalg.norm(doc_e))
for doc_e in self.embeddings
]
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
return [self.docs[i] for i in top]
# Index your documents
store = LocalVectorStore()
for path in Path("docs/").glob("*.txt"):
store.add(path.read_text())
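# Note: this indexes each file whole. For longer documents you would normally split
# the text into chunks before calling store.add(); a minimal, hypothetical fixed-size
# chunker (the size/overlap values are illustrative, not tuned):
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
# for piece in chunk(path.read_text()):
#     store.add(piece)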
# Query with context
def ask(question: str) -> str:
context = "\n\n".join(store.search(question))
prompt = f"""Answer the question using only the context below.
If the answer isn't in the context, say "I don't know."
Context:
{context}
Question: {question}"""
response = ollama.generate(model='llama3.2', prompt=prompt)
return response['response']
print(ask("What is our refund policy?"))
Modelfiles: Customise Any Model
# Create a specialized assistant
# Save as Modelfile
FROM llama3.2
SYSTEM """
You are a senior .NET architect with 15 years of experience.
You answer questions concisely, with code examples when relevant.
You prefer Clean Architecture and always consider testability.
When reviewing code, you identify performance issues first.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
# Build your custom model
ollama create dotnet-expert -f Modelfile
# Use it
ollama run dotnet-expert
Local vs Cloud: When to Use Each
| Scenario | Local (Ollama) | Cloud (OpenAI/Claude) |
|---|---|---|
| Sensitive data | ✅ Data never leaves | ❌ Data sent to provider |
| High volume | ✅ No per-token cost | ❌ Can get expensive |
| Best quality | ❌ Smaller models | ✅ GPT-4o, Claude 3.5 |
| No internet | ✅ Works offline | ❌ Requires connection |
| Code completion (large) | ❌ Slower | ✅ Fast, accurate |
| RAG on private docs | ✅ Ideal | ⚠ Possible but pricier |
| Prototyping / dev | ✅ Free, fast iteration | ⚠ Costs add up |
| Long documents | ⚠ Context limits vary | ✅ 128k+ context windows |
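Because Ollama speaks the OpenAI wire protocol, switching between local and cloud can be a configuration change rather than a code change. A minimal sketch with the openai Python package; the LLM_BASE_URL, LLM_API_KEY and LLM_MODEL variable names are made up for this example:
import os
from openai import OpenAI
# Defaults target local Ollama; override the env vars to point at a cloud provider.
client = OpenAI(
    base_url=os.getenv("LLM_BASE_URL", "http://localhost:11434/v1"),
    api_key=os.getenv("LLM_API_KEY", "ollama"),  # Ollama ignores the key value
)
response = client.chat.completions.create(
    model=os.getenv("LLM_MODEL", "llama3.2"),
    messages=[{"role": "user", "content": "Summarise the CAP theorem in two sentences."}],
)
print(response.choices[0].message.content)
Development and tests run against the free local model; production flips three environment variables.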
Running Ollama as a Server
# Start as a background service
ollama serve &
# Or set environment variables for network access
OLLAMA_HOST=0.0.0.0:11434 ollama serve
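Once the server is listening on the network, other machines can use it by pointing the client at that host. A small sketch with the ollama Python package; the IP address below is a placeholder for whatever machine runs ollama serve:
from ollama import Client
# 192.168.1.50 is a placeholder for the host running `ollama serve`
remote = Client(host="http://192.168.1.50:11434")
reply = remote.chat(model="llama3.2", messages=[{"role": "user", "content": "ping"}])
print(reply["message"]["content"])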
# Docker
docker run -d \
--gpus=all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
# With Open WebUI (browser UI like ChatGPT)
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
ghcr.io/open-webui/open-webui:main
Key Takeaways
- Ollama is production-ready for internal tools, RAG on private documents, and development workflows where data privacy matters.
- 7B–14B models hit a sweet spot: fast enough on a modern laptop, capable enough for most developer tasks.
- The OpenAI-compatible API means migrating between local and cloud is a one-line change — useful for testing and cost management.
- Embeddings + nomic-embed-text make building local RAG pipelines trivial and completely free to run.
- Modelfiles let you bake in system prompts and parameters — create specialised assistants that behave consistently without prompt engineering in every call.
- For production customer-facing AI, cloud models still win on quality and context window size. For everything else, local is increasingly the smarter default.