AI Systems · Beginner

Ollama: Run Powerful AI Models Locally — No API Keys, No Cost

The complete developer guide to Ollama — install and run Llama 3, Mistral, Gemma, and Phi-4 locally, build .NET and Python apps against local models, and understand when local AI beats cloud AI.

Learnixo · April 17, 2026 · 7 min read
Ollama · Local AI · Llama · Mistral · LLM · .NET · Python · Privacy · Open Source

Why Local AI Is Having a Moment

For two years, "using AI" meant calling an API, paying per token, and sending your data to a cloud provider. That's still the right choice for many production use cases — but a serious alternative has quietly matured.

Ollama makes running large language models locally as simple as ollama run llama3. One command, no API keys, no billing, no data leaving your machine.

For developers building internal tools, prototyping AI features, or working with sensitive data, local models have crossed the threshold from "interesting experiment" to "genuinely viable."


What Ollama Is

Ollama is an open-source runtime that manages downloading, running, and serving local LLMs. It handles:

  • Model downloads and caching from the Ollama model library
  • GPU/CPU inference with automatic hardware detection
  • An OpenAI-compatible REST API (so your existing code often works unchanged)
  • Model versioning and multiple concurrent models

Under the hood it uses llama.cpp for inference, with Metal acceleration on Apple Silicon and CUDA acceleration on NVIDIA GPUs.


Installation

Bash
# macOS / Linux: single command
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download from https://ollama.com/download
# Or via winget:
winget install Ollama.Ollama

# Verify
ollama --version

Running Your First Model

Bash
# Pull and run Llama 3.2 (3B: fast, low RAM)
ollama run llama3.2

# Pull without running
ollama pull llama3.2

# Run in the terminal
>>> What is the difference between a process and a thread?

That's it. The model downloads once (~2GB) and runs locally from that point forward.


The Model Library

Ollama's library covers every major open-source model family. Here's a practical guide:

| Model | Size | RAM Needed | Best For |
|---|---|---|---|
| llama3.2:3b | 2GB | 4GB | Fast, everyday tasks |
| llama3.2:11b | 7GB | 12GB | Balanced quality/speed |
| llama3.1:70b | 40GB | 64GB | Near GPT-4 quality |
| mistral:7b | 4GB | 8GB | Instruction following |
| gemma3:4b | 3GB | 6GB | Google's small model |
| phi4:14b | 8GB | 16GB | Microsoft, great at reasoning |
| codellama:13b | 8GB | 16GB | Code generation |
| deepseek-r1:8b | 5GB | 8GB | Reasoning (shows thinking) |
| qwen2.5-coder:7b | 4GB | 8GB | Code-specific tasks |
| nomic-embed-text | 274MB | 1GB | Embeddings/RAG |
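If a model you want isn't in the table, a rough rule of thumb covers sizing: a 4-bit-quantized model needs roughly 0.6GB of disk per billion parameters, and about double that in RAM once the KV cache and runtime overhead are counted. A quick sketch (the constants are approximations for back-of-envelope planning, not Ollama's actual accounting):

```python
def estimate_q4_footprint(params_billions: float) -> tuple[float, float]:
    """Back-of-envelope sizing for a 4-bit-quantized model:
    ~0.6 GB of disk per billion parameters, and roughly double
    that in RAM with KV cache and runtime overhead included."""
    disk_gb = params_billions * 0.6
    ram_gb = disk_gb * 2
    return round(disk_gb, 1), round(ram_gb, 1)

# A 7B model lands near the mistral:7b row above: ~4GB disk, ~8GB RAM.
print(estimate_q4_footprint(7))
```

Larger context windows (num_ctx) push RAM above these estimates, so treat them as a floor rather than a ceiling.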

Bash
# List all available models
ollama list

# Pull specific version
ollama pull mistral:7b-instruct-v0.3

# Remove a model
ollama rm codellama:13b

The API

Once Ollama is running, it exposes a local REST API on http://localhost:11434.

OpenAI-Compatible Endpoint

Bash
# Drop-in replacement for OpenAI: change the base URL, keep your code
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      { "role": "system",  "content": "You are a senior .NET engineer." },
      { "role": "user",    "content": "Explain the difference between IEnumerable and IQueryable." }
    ],
    "stream": false
  }'

Native Ollama API

Bash
# Generate
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"Write a SQL query that...","stream":false}'

# Chat
curl http://localhost:11434/api/chat \
  -d '{
    "model": "mistral",
    "messages": [{"role":"user","content":"Review this code: ..."}]
  }'

# Embeddings
curl http://localhost:11434/api/embed \
  -d '{"model":"nomic-embed-text","input":"Customer churned after 3 months"}'

Using Ollama in .NET

The OpenAI-compatible endpoint means you can use the official OpenAI .NET SDK with nothing more than a base URL swap, or reach for the OllamaSharp library when you want native Ollama features.

C#
// Option 1: OpenAI SDK pointing at local Ollama
using System.ClientModel;   // ApiKeyCredential lives here
using OpenAI;
using OpenAI.Chat;

var client = new ChatClient(
    model: "llama3.2",
    credential: new ApiKeyCredential("ollama"),  // any string works
    options: new OpenAIClientOptions {
        Endpoint = new Uri("http://localhost:11434/v1")
    }
);

var response = await client.CompleteChatAsync(
    new UserChatMessage("Explain async/await in C# in 3 sentences.")
);
Console.WriteLine(response.Value.Content[0].Text);
C#
// Option 2: OllamaSharp (native, full feature set)
using OllamaSharp;
using OllamaSharp.Models.Chat;

var ollama = new OllamaApiClient("http://localhost:11434");
ollama.SelectedModel = "llama3.2";

// Streaming response
await foreach (var token in ollama.GenerateAsync("Explain CQRS pattern"))
    Console.Write(token?.Response);

// Chat with history (the Chat helper tracks the conversation internally)
var chat = new Chat(ollama, "You are a .NET expert.");

while (true)
{
    Console.Write("You: ");
    var input = Console.ReadLine() ?? "";

    Console.Write("Assistant: ");
    await foreach (var token in chat.SendAsync(input))
        Console.Write(token);
    Console.WriteLine();
}

Using Ollama in Python

Python
# pip install ollama
import ollama

# Simple generation
response = ollama.generate(model='llama3.2', prompt='Explain Redis sorted sets')
print(response['response'])

# Chat
response = ollama.chat(
    model='mistral',
    messages=[
        {'role': 'system',  'content': 'You are a database expert.'},
        {'role': 'user',    'content': 'When should I use MongoDB over PostgreSQL?'},
    ]
)
print(response['message']['content'])

# Streaming
for chunk in ollama.generate(model='llama3.2', prompt='List 5 Python tips', stream=True):
    print(chunk['response'], end='', flush=True)

# Embeddings: for RAG pipelines
embed = ollama.embed(model='nomic-embed-text', input='Your document text here')
vector = embed['embeddings'][0]   # list of 768 floats

Building a Local RAG Pipeline

Python
import ollama
import numpy as np
from pathlib import Path

# Simple in-memory vector store (use ChromaDB or pgvector in production)
class LocalVectorStore:
    def __init__(self):
        self.docs: list[str] = []
        self.embeddings: list[list[float]] = []

    def add(self, text: str):
        result = ollama.embed(model='nomic-embed-text', input=text)
        self.embeddings.append(result['embeddings'][0])
        self.docs.append(text)

    def search(self, query: str, top_k: int = 3) -> list[str]:
        q_embed = ollama.embed(model='nomic-embed-text', input=query)['embeddings'][0]
        scores = [
            np.dot(q_embed, doc_e) / (np.linalg.norm(q_embed) * np.linalg.norm(doc_e))
            for doc_e in self.embeddings
        ]
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [self.docs[i] for i in top]


# Index your documents
store = LocalVectorStore()
for path in Path("docs/").glob("*.txt"):
    store.add(path.read_text())

# Query with context
def ask(question: str) -> str:
    context = "\n\n".join(store.search(question))
    prompt = f"""Answer the question using only the context below.
If the answer isn't in the context, say "I don't know."

Context:
{context}

Question: {question}"""

    response = ollama.generate(model='llama3.2', prompt=prompt)
    return response['response']

print(ask("What is our refund policy?"))
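One gap in the pipeline above: real documents are often too long to embed in one piece. A minimal chunker (a hypothetical helper; max_chars and overlap are tuning knobs for this sketch, not Ollama settings) splits each file before indexing:

```python
def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Split a document into overlapping fixed-size chunks so each
    one fits comfortably in the embedding model's context. Overlap
    keeps sentences that straddle a boundary retrievable from
    either side."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

Indexing then becomes `for chunk in chunk_text(path.read_text()): store.add(chunk)`, and retrieval returns the relevant passages rather than whole files. Splitting on paragraph or sentence boundaries gives better chunks than raw character counts, but the fixed-size version keeps the sketch short.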

Modelfiles: Customise Any Model

Modelfile
# Create a specialized assistant
# Save as Modelfile

FROM llama3.2

SYSTEM """
You are a senior .NET architect with 15 years of experience.
You answer questions concisely, with code examples when relevant.
You prefer Clean Architecture and always consider testability.
When reviewing code, you identify performance issues first.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192
Bash
# Build your custom model
ollama create dotnet-expert -f Modelfile

# Use it
ollama run dotnet-expert
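Once built, the custom model is addressable like any other, from the CLI or the API. A small sketch calling it through the native endpoint (assumes the dotnet-expert model created above exists and the server is running):

```python
import json
import urllib.request

def ask_expert(prompt: str, model: str = "dotnet-expert") -> str:
    """Call the custom model via /api/generate. The SYSTEM prompt and
    parameters baked into the Modelfile apply on every request, so no
    per-call prompt engineering is needed."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

This is the practical payoff of Modelfiles: the persona lives in the model, not in every caller.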

Local vs Cloud: When to Use Each

| Scenario | Local (Ollama) | Cloud (OpenAI/Claude) |
|---|---|---|
| Sensitive data | ✅ Data never leaves | ❌ Data sent to provider |
| High volume | ✅ No per-token cost | ❌ Can get expensive |
| Best quality | ❌ Smaller models | ✅ GPT-4o, Claude 3.5 |
| No internet | ✅ Works offline | ❌ Requires connection |
| Code completion (large) | ❌ Slower | ✅ Fast, accurate |
| RAG on private docs | ✅ Ideal | ⚠ Possible but pricier |
| Prototyping / dev | ✅ Free, fast iteration | ⚠ Costs add up |
| Long documents | ⚠ Context limits vary | ✅ 128k+ context windows |
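These tradeoffs suggest treating the provider as configuration rather than code. One way to do that: resolve base URL, key, and model from a single toggle (USE_LOCAL_LLM is a name invented for this sketch) and feed the result to any OpenAI-compatible client:

```python
import os

def llm_config() -> dict:
    """Resolve provider settings from one environment toggle.
    Both branches speak the same OpenAI wire format, so the
    calling code never changes."""
    if os.getenv("USE_LOCAL_LLM") == "1":
        return {"base_url": "http://localhost:11434/v1",
                "api_key": "ollama",          # ignored by Ollama
                "model": "llama3.2"}
    return {"base_url": "https://api.openai.com/v1",
            "api_key": os.environ.get("OPENAI_API_KEY", ""),
            "model": "gpt-4o-mini"}
```

Run the local branch in development and CI, flip to cloud for the quality-sensitive production path, and the diff is one environment variable.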


Running Ollama as a Server

Bash
# Start as a background service
ollama serve &

# Or set environment variables for network access
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Docker
docker run -d \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# With Open WebUI (browser UI like ChatGPT)
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main
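Before wiring an app to the server, it helps to confirm it's up and see what's installed. The /api/tags endpoint returns the local model list; a small probe (the helper names are ours, not part of the Ollama API):

```python
import json
import urllib.request

def model_names(tags_response: dict) -> list[str]:
    """Extract the 'name' field from each entry in a /api/tags payload."""
    return [m["name"] for m in tags_response.get("models", [])]

def installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Query /api/tags and return installed model names; doubles as a
    readiness probe before sending real requests."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return model_names(json.load(resp))
```

If the call raises a connection error, the server isn't running (or is bound to a different host/port via OLLAMA_HOST); if it succeeds but the list is empty, you still need to pull a model.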

Key Takeaways

  • Ollama is production-ready for internal tools, RAG on private documents, and development workflows where data privacy matters.
  • 7B–14B models hit a sweet spot: fast enough on a modern laptop, capable enough for most developer tasks.
  • The OpenAI-compatible API means migrating between local and cloud is a one-line change — useful for testing and cost management.
  • Embeddings + nomic-embed-text make building local RAG pipelines trivial and completely free to run.
  • Modelfiles let you bake in system prompts and parameters — create specialised assistants that behave consistently without prompt engineering in every call.
  • For production customer-facing AI, cloud models still win on quality and context window size. For everything else, local is increasingly the smarter default.

Enjoyed this article?

Explore the AI Systems learning path for more.
