# Ollama on mokou

Ollama runs on mokou (GTX 1080, Pascal architecture) and serves the entire network as a shared inference endpoint over Tailscale.
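Clients don't need their own daemon: anything that honours the `OLLAMA_HOST` environment variable can target mokou's endpoint directly. A minimal sketch for a client host, assuming `mokou` resolves via Tailscale MagicDNS:

```nix
# On a client host: point the ollama CLI (and any tool that honours
# OLLAMA_HOST) at mokou's shared endpoint instead of a local daemon.
# Assumes "mokou" resolves via Tailscale MagicDNS.
environment.variables.OLLAMA_HOST = "http://mokou:11434";
```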

## Hardware

| Component | Spec |
| --- | --- |
| GPU | NVIDIA GeForce GTX 1080 |
| Compute capability | SM 6.1 (Pascal) |
| VRAM | 8 GB GDDR5X |
| CPU | Intel i7-4790K |

## The CUDA SM 6.1 problem

nixpkgs builds Ollama with CUDA support targeting SM 7.5+ (Turing and newer) by default. Pascal (GTX 10-series, SM 6.1) is excluded, so a stock `services.ollama.acceleration = "cuda"` silently falls back to CPU inference: you get correct output, but at roughly a tenth of the speed.

The fix is to override `cudaArches` at the package level:

```nix
services.ollama = {
  enable = true;
  acceleration = "cuda";
  package = pkgs.ollama.override {
    acceleration = "cuda";
    cudaArches = ["61"];   # SM 6.1 = GTX 1080 / Pascal
  };
  # ...
};
```

This rebuilds Ollama with a PTX/SASS target for SM 6.1. Because the override changes the derivation, the package no longer comes from the binary cache and compiles locally, so budget 30–60 minutes for the build; after that, inference is GPU-accelerated, with sub-second responses from 3b models.

> **Warning**
>
> If `ollama run qwen2.5:3b` produces correct output but takes several seconds per response and the process sits at 100% CPU in `htop`, the CUDA override likely didn't take. Check with `nvidia-smi` during inference: GPU utilisation should be nonzero. A quick check is sketched below.
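A minimal verification, assuming the `qwen2.5:3b` model from this config is loaded:

```sh
# Run a prompt in one terminal...
ollama run qwen2.5:3b "ping"
# ...and watch the GPU in another; utilisation should spike well above 0%.
nvidia-smi
# Ollama itself also reports where the model is resident:
ollama ps   # PROCESSOR column should read "100% GPU", not "100% CPU"
```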

## NixOS configuration

```nix
# hosts/x86_64-nixos/mokou/default.nix
services.ollama = {
  enable = true;
  acceleration = "cuda";
  package = pkgs.ollama.override {
    acceleration = "cuda";
    cudaArches = ["61"];
  };
  host = "0.0.0.0";   # listen on all interfaces, not just localhost
  port = 11434;
  loadModels = [
    "qwen2.5-vl:7b"      # VLM for vision OCR (paperless-gpt)
    "qwen2.5:3b"         # text model for tagging / titling / agents
    "nomic-embed-text"   # embeddings for RAG / semantic search
  ];
  home = "/var/lib/ollama";
  environmentVariables = {
    OLLAMA_MAX_QUEUE = "4";   # cap the number of queued requests
  };
};
```

host = "0.0.0.0" is required so other hosts on Tailscale can reach the endpoint. The firewall allows port 11434 inbound on the Tailscale interface only.

## Models

| Model | VRAM | Use |
| --- | --- | --- |
| `qwen2.5-vl:7b` | ~5 GB | Vision OCR: paperless-gpt image processing |
| `qwen2.5:3b` | ~2 GB | Text: tagging, titling, agents |
| `nomic-embed-text` | ~274 MB | Embeddings: RAG, semantic routing |

`loadModels` pre-pulls these on activation, so the first request doesn't stall waiting for a download.
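From any machine on the tailnet, the endpoint can be exercised directly with Ollama's standard `/api/generate` route; a sketch, again assuming the `mokou` MagicDNS name:

```sh
# One-off generation request against the shared endpoint.
curl http://mokou:11434/api/generate -d '{
  "model": "qwen2.5:3b",
  "prompt": "Summarise: Ollama serves the whole tailnet from mokou.",
  "stream": false
}'
```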

## Compute capability reference

| Architecture | SM version | Example GPUs |
| --- | --- | --- |
| Pascal | 6.0, 6.1 | GTX 10-series, Titan X (Pascal) |
| Volta | 7.0 | Titan V, Tesla V100 |
| Turing | 7.5 | GTX 16-/RTX 20-series ← nixpkgs default floor |
| Ampere | 8.0, 8.6 | RTX 30-series |
| Ada Lovelace | 8.9 | RTX 40-series |

If you have a Turing or newer GPU, the nixpkgs default works without the override.
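If you're unsure what your card reports, recent NVIDIA drivers (roughly 510+) expose compute capability as an `nvidia-smi` query field:

```sh
# Prints e.g. "NVIDIA GeForce GTX 1080, 6.1"
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```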