Ollama on mokou#

Ollama runs on mokou (GTX 1080 Ti, Pascal architecture) and serves the entire network as a shared inference endpoint over Tailscale.

Hardware#


GPU	NVIDIA GeForce GTX 1080
Compute capability	SM 6.1 (Pascal)
VRAM	8 GB GDDR5X
CPU	Intel i7-4790K

The CUDA SM 6.1 problem#

nixpkgs builds Ollama with CUDA support targeting SM 7.5+ (Turing and newer) by default. Pascal (GTX 10-series, SM 6.1) is excluded, so a stock services.ollama.acceleration = "cuda" silently falls back to CPU inference — you get correct output but at 1/10th the speed.

The fix is to override cudaArches at the package level:

services.ollama = {
  enable = true;
  acceleration = "cuda";
  package = pkgs.ollama.override {
    acceleration = "cuda";
    cudaArches = ["61"];   # SM 6.1 = GTX 1080 / Pascal
  };
  ...
};

This rebuilds Ollama with a PTX/SASS target for SM 6.1. Build time is significant (30–60 min); after that inference is GPU-accelerated and sub-second for 3b models.

Warning

If you see ollama run qwen2.5:3b responding in seconds but the process shows 100% CPU in htop, the CUDA override likely didn’t take. Check with nvidia-smi during inference — GPU utilisation should be nonzero.

NixOS configuration#

# hosts/x86_64-nixos/mokou/default.nix
services.ollama = {
  enable = true;
  acceleration = "cuda";
  package = pkgs.ollama.override {
    acceleration = "cuda";
    cudaArches = ["61"];
  };
  host = "0.0.0.0";   # listen on all interfaces, not just localhost
  port = 11434;
  loadModels = [
    "qwen2.5-vl:7b"      # VLM for vision OCR (paperless-gpt)
    "qwen2.5:3b"         # Text model for tagging / titling / agents
    "nomic-embed-text"   # Embeddings for RAG / semantic search
  ];
  home = "/var/lib/ollama";
  environmentVariables = {
    OLLAMA_MAX_QUEUE = "4";
  };
};

host = "0.0.0.0" is required so other hosts on Tailscale can reach the endpoint. The firewall allows port 11434 inbound on the Tailscale interface only.

Models#

Model	Size	Use
`qwen2.5-vl:7b`	~5 GB VRAM	Vision OCR — paperless-gpt image processing
`qwen2.5:3b`	~2 GB VRAM	Text — tagging, titling, agents
`nomic-embed-text`	~274 MB VRAM	Embeddings — RAG, semantic routing

loadModels pre-pulls these on activation so the first request doesn’t stall waiting for a download.

Compute capability reference#

Architecture	SM version	Example GPUs
Pascal	6.0, 6.1	GTX 10-series, Titan X (Pascal)
Volta	7.0	Titan V, Tesla V100
Turing	7.5	GTX 16/RTX 20-series ← nixpkgs default floor
Ampere	8.0, 8.6	RTX 30-series
Ada Lovelace	8.9	RTX 40-series

If you have a Turing or newer GPU, the nixpkgs default works without the override.