Ollama — Running Local LLMs in .NET Development

Use Ollama to run local large language models in .NET development: setup, integration with Semantic Kernel, model selection for clinical tasks, and when local models are appropriate vs. cloud APIs.

Asma Hafeez Khan · May 16, 2026 · 5 min read
Tags: AI, Ollama, Local LLM, .NET, Semantic Kernel

Why Ollama for .NET Development

Ollama runs LLMs locally on your machine: no API key, and no network access once the model has been downloaded.

Use Ollama when:
  → Developing AI features without incurring API costs
  → Data privacy: patient data must not leave the local machine during development
  → Testing AI features without requiring cloud credentials
  → Evaluating model behaviour with local data before committing to a cloud API

Ollama limitations:
  → Requires a decent GPU or CPU (llama3.2:3b runs acceptably on CPU; larger models are slow without a GPU)
  → Response quality is lower than GPT-4o for complex reasoning tasks
  → Not suitable for production clinical systems that require high accuracy
  → Cold start: the first request is slow while the model loads into memory

Models commonly used with Ollama:
  llama3.2:3b    → Small, fast, good for structured outputs, code generation
  llama3.1:8b    → Better reasoning, slower on CPU
  mistral:7b     → Good for summarisation
  phi3:mini      → Very small, surprisingly capable for simple tasks

Ollama Setup

Bash
# Install Ollama (Windows/macOS/Linux)
# Download from: https://ollama.com

# Pull a model:
ollama pull llama3.2

# Start the server if it is not already running (the installer normally starts it in the background):
ollama serve
# API available at: http://localhost:11434

# List available models:
ollama list

# Test the API directly:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What is the therapeutic INR range for Warfarin?",
  "stream": false
}'
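
The same request can be made from .NET using only the base class library. The sketch below is a minimal, non-streaming call to /api/generate, assuming Ollama is listening on the default port and llama3.2 has been pulled. The helper names BuildGenerateRequest and GenerateAsync are illustrative and not part of any SDK.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Build the JSON body for POST /api/generate (same fields as the curl call above).
static string BuildGenerateRequest(string model, string prompt) =>
    JsonSerializer.Serialize(new { model, prompt, stream = false });

// Send the request and return the generated text from the "response" field.
static async Task<string> GenerateAsync(HttpClient http, string model, string prompt)
{
    using var content = new StringContent(
        BuildGenerateRequest(model, prompt), Encoding.UTF8, "application/json");
    using var reply = await http.PostAsync("/api/generate", content);
    reply.EnsureSuccessStatusCode();

    using var doc = JsonDocument.Parse(await reply.Content.ReadAsStringAsync());
    return doc.RootElement.GetProperty("response").GetString() ?? "";
}

// Runs without a server: prints the request body that would be sent.
Console.WriteLine(BuildGenerateRequest("llama3.2", "Hello"));

// With a local Ollama server running (CPU inference can be slow, hence the long timeout):
// using var http = new HttpClient
// {
//     BaseAddress = new Uri("http://localhost:11434"),
//     Timeout = TimeSpan.FromMinutes(5)
// };
// Console.WriteLine(await GenerateAsync(http, "llama3.2",
//     "What is the therapeutic INR range for Warfarin?"));
```

Keeping request construction separate from the HTTP call makes the payload easy to inspect or unit test without a running model.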

Ollama in Docker Compose (Development)

YAML
# docker-compose.dev.yml
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama-data:/root/.ollama  # persist downloaded models
    ports:
      - "11434:11434"
    # GPU support (if available):
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

  # Pull the model on startup:
  ollama-init:
    image: ollama/ollama:latest
    depends_on:
      - ollama
    entrypoint: /bin/sh -c "sleep 5 && ollama pull llama3.2 && ollama pull phi3:mini"
    environment:
      - OLLAMA_HOST=http://ollama:11434

volumes:
  ollama-data:
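
The fixed "sleep 5" in ollama-init is a guess at how long the server takes to start, and the pull itself can take minutes. On the application side, a small check against GET /api/tags (which lists the locally available models) can confirm that the model is actually present before the first prompt is sent. The following is a sketch only: HasModel is a hypothetical helper, and the rule that an untagged name means ":latest" is an assumption about how you name models in configuration.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Returns true if an /api/tags payload lists the requested model.
// Ollama reports names with a tag (e.g. "llama3.2:latest"), so a request
// without a tag is matched against "<name>:latest".
static bool HasModel(string tagsJson, string model)
{
    var wanted = model.Contains(':') ? model : model + ":latest";
    using var doc = JsonDocument.Parse(tagsJson);

    if (!doc.RootElement.TryGetProperty("models", out var models) ||
        models.ValueKind != JsonValueKind.Array)
    {
        return false;
    }

    return models.EnumerateArray().Any(m =>
        m.TryGetProperty("name", out var name) &&
        string.Equals(name.GetString(), wanted, StringComparison.OrdinalIgnoreCase));
}

// Example payload shaped like an /api/tags response (trimmed to the field used here).
var sample = """{"models":[{"name":"llama3.2:latest"},{"name":"phi3:mini"}]}""";
Console.WriteLine(HasModel(sample, "llama3.2"));   // True
Console.WriteLine(HasModel(sample, "phi3:mini"));  // True
Console.WriteLine(HasModel(sample, "mistral:7b")); // False
```

In a real application the JSON would come from an HttpClient call to http://localhost:11434/api/tags, retried until the model appears or a timeout is reached.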

Semantic Kernel with Ollama

C#
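// NuGet: Microsoft.SemanticKernel.Connectors.Ollama (preview)
// The connector is experimental, so its diagnostic must be suppressed to compile.
#pragma warning disable SKEXP0070
using Microsoft.SemanticKernel;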

var kernel = Kernel.CreateBuilder()
    .AddOllamaChatCompletion(
        modelId:  "llama3.2",
        endpoint: new Uri("http://localhost:11434"))
    .Build();

// For text embeddings (for RAG/vector search):
var embeddingKernel = Kernel.CreateBuilder()
    .AddOllamaTextEmbeddingGeneration(
        modelId:  "nomic-embed-text",
        endpoint: new Uri("http://localhost:11434"))
    .Build();

// Environment-aware kernel: Ollama locally, Azure OpenAI in production
private static IKernelBuilder ConfigureAI(
    IConfiguration config, IWebHostEnvironment env)
{
    var kernelBuilder = Kernel.CreateBuilder();

    if (env.IsDevelopment())
    {
        kernelBuilder.AddOllamaChatCompletion(
            modelId:  config["AI:OllamaModel"] ?? "llama3.2",
            endpoint: new Uri(config["AI:OllamaEndpoint"] ?? "http://localhost:11434"));
    }
    else
    {
        kernelBuilder.AddAzureOpenAIChatCompletion(
            deploymentName: config["AzureOpenAI:DeploymentName"]!,
            endpoint:       config["AzureOpenAI:Endpoint"]!,
            apiKey:         config["AzureOpenAI:ApiKey"]!);
    }

    return kernelBuilder;
}

Prompt Engineering for Smaller Models

C#
// Smaller local models need more explicit, structured prompts
// They don't handle vague or complex multi-step prompts as well as GPT-4o

// LESS EFFECTIVE with small models:
var vaguePrompt = "Tell me about this patient's prescription.";

// MORE EFFECTIVE with small models — explicit structure:
var structuredPrompt = """
    Task: Summarise the prescription below.
    Format: Respond with exactly 2 sentences.
    Include: medication name, current dose, and any warnings.
    Do not include any other information.

    Prescription:
    Medication: Warfarin
    Dose: 5mg daily
    Status: Approved
    INR at approval: 2.3
    Last INR: 2.1 (2 days ago)
    """;

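// For JSON output, request Ollama's JSON mode through the "format" option.
// Assumption: the connector forwards this key to the Ollama API; verify against your connector version.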
var executionSettings = new PromptExecutionSettings
{
    ExtensionData = new Dictionary<string, object>
    {
        ["format"] = "json"  // request JSON output from the model
    }
};

// Or use a schema prompt:
var jsonPrompt = """
    Extract the medication information from the text below.
    Respond ONLY with valid JSON matching this schema:
    { "medication": string, "dose_mg": number, "frequency": string }

    Text: "The patient is on Warfarin 5mg taken once daily."
    """;

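Even with a schema prompt, small models sometimes wrap the JSON in explanatory prose, so it is worth validating the reply before using it. The following is a minimal sketch using System.Text.Json. TryParseMedication is a hypothetical helper, and taking the text between the first '{' and the last '}' is a simplifying assumption that will not rescue every malformed reply.

```csharp
using System;
using System.Text.Json;

// Tries to read { "medication", "dose_mg", "frequency" } from a raw model reply.
// Returns false instead of throwing so the caller can retry or fall back.
static bool TryParseMedication(
    string reply, out (string Medication, double DoseMg, string Frequency) result)
{
    result = default;

    // Keep only the text between the first '{' and the last '}'.
    int start = reply.IndexOf('{');
    int end = reply.LastIndexOf('}');
    if (start < 0 || end <= start) return false;

    try
    {
        using var doc = JsonDocument.Parse(reply[start..(end + 1)]);
        var root = doc.RootElement;

        // Every field must be present and have the type the schema asks for.
        if (!root.TryGetProperty("medication", out var med) ||
            med.ValueKind != JsonValueKind.String) return false;
        if (!root.TryGetProperty("dose_mg", out var dose) ||
            dose.ValueKind != JsonValueKind.Number) return false;
        if (!root.TryGetProperty("frequency", out var freq) ||
            freq.ValueKind != JsonValueKind.String) return false;

        result = (med.GetString()!, dose.GetDouble(), freq.GetString()!);
        return true;
    }
    catch (JsonException)
    {
        return false; // the extracted text was not valid JSON
    }
}

// A reply of the kind a small model might produce: valid JSON surrounded by chatter.
var modelReply =
    "Sure! Here is the JSON: " +
    "{ \"medication\": \"Warfarin\", \"dose_mg\": 5, \"frequency\": \"once daily\" }" +
    " Let me know if you need anything else.";

if (TryParseMedication(modelReply, out var rx))
    Console.WriteLine($"{rx.Medication} / {rx.DoseMg} mg / {rx.Frequency}");
else
    Console.WriteLine("Reply did not match the schema.");
```

Returning false for a type mismatch (for example, "dose_mg": "5mg") keeps malformed values out of downstream code, which matters more with small models than with larger ones.
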
Comparing Local vs Cloud for Clinical Tasks

Task: summarise a prescription for a handover note
  Llama3.2 (local):    adequate — factual summary, consistent formatting
  GPT-4o (Azure):      excellent — nuanced, better clinical language
  Decision: local for development, cloud for production

Task: detect potentially dangerous drug interactions from a free-text note
  Llama3.2 (local):    poor — misses subtle interactions, hallucinates references
  GPT-4o (Azure):      much better — still requires clinical validation
  Decision: never rely on LLM alone for drug interaction checking
            use a validated drug interaction database (BNF API, Lexicomp)

Task: extract structured data from a discharge letter PDF
  Llama3.2 (local):    workable for simple extractions
  GPT-4o with vision:  much better for complex PDF layouts with tables
  Decision: local for prototyping, cloud for production extraction

Task: code generation and summarisation in development
  Llama3.2 (local):    excellent — fast, free, no data leaves the machine
  Decision: Ollama only — no need for cloud API for development tasks

Production issue I've seen: A team used Ollama with llama3.1:8b for a clinical document summarisation feature in production (not development). The model ran on CPU — each summarisation took 45-90 seconds. Nurses stopped using the feature after 2 days. The model also occasionally produced "hallucinated" medication names that were close to real names but wrong (e.g., "Warfarinol" instead of "Warfarin"). In production, use cloud APIs for quality and latency guarantees. Reserve Ollama for local development where a 45-second response is acceptable and data privacy during development is the requirement.
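
One inexpensive guard against near-miss drug names like the one described above is to check any medication name in the model's output against a known list before showing it to a user. The sketch below is illustrative only: the three-entry set stands in for a real, validated formulary, and IsKnownMedication is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Placeholder list; a real system would load this from a validated formulary.
var formulary = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "Warfarin", "Apixaban", "Heparin"
};

// Exact (case-insensitive) match only, so "Warfarinol" does not pass as "Warfarin".
bool IsKnownMedication(string name) => formulary.Contains(name.Trim());

Console.WriteLine(IsKnownMedication("warfarin"));   // True
Console.WriteLine(IsKnownMedication("Warfarinol")); // False
```

A check like this only catches invented names. It does not verify that a real name is the correct one for the patient, so it complements clinical validation rather than replacing it.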


Key Takeaway

Ollama runs LLMs locally, which is useful for development without API keys and for keeping patient data on the local machine. Use it with Semantic Kernel via AddOllamaChatCompletion. Structure prompts explicitly for smaller models, because they perform better with clear format instructions. Switch to Azure OpenAI in production for better quality, much lower latency than CPU inference, and SLA guarantees. Never rely on an LLM alone for safety-critical clinical decisions: neither local nor cloud models replace validated clinical databases for drug interaction checking or dosage decisions.
