Ollama on mokou

Ollama runs on mokou (GTX 1080, Pascal architecture) and serves the entire network as a shared inference endpoint over Tailscale. ereshkigal also runs a local Ollama instance that loads only nomic-embed-text (for Open-WebUI RAG embeddings) and delegates vision and text inference to mokou; see the AI overview for the full topology.

Hardware

| Component | Details |
| --- | --- |
| GPU | NVIDIA GeForce GTX 1080 |
| Compute capability | SM 6.1 (Pascal) |
| VRAM | 8 GB GDDR5X |
| CPU | Intel i7-4790K |

The CUDA SM 6.1 problem

nixpkgs builds Ollama with CUDA support targeting SM 7.5+ (Turing and newer) by default. Pascal (GTX 10-series, SM 6.1) is excluded, so a stock services.ollama.acceleration = "cuda" silently falls back to CPU inference: the output is still correct, but generation runs at roughly a tenth of the speed.

The fix is to override cudaArches at the package level:

services.ollama = {
  enable = true;
  acceleration = "cuda";
  package = pkgs.ollama.override {
    acceleration = "cuda";
    cudaArches = ["61"];   # SM 6.1 = GTX 1080 / Pascal
  };
  # ... (remaining options elided)
};

This rebuilds Ollama with a PTX/SASS target for SM 6.1. Build time is significant (~15–30 min on first switch); after that inference is GPU-accelerated and sub-second for 3b models.

Warning

If ollama run qwen2.5:3b takes several seconds to respond and the ollama process sits at 100% CPU in htop, the CUDA override likely didn’t take. Check with nvidia-smi during inference: GPU utilisation should be nonzero. ollama ps also reports whether a loaded model is running on the GPU or the CPU.

NixOS configuration

# hosts/x86_64-nixos/mokou/default.nix
services.ollama = {
  enable = true;
  acceleration = "cuda";
  package = pkgs.ollama.override {
    acceleration = "cuda";
    cudaArches = ["61"];
  };
  host = "0.0.0.0";   # listen on all interfaces, not just localhost
  port = 11434;
  loadModels = [
    "qwen2.5vl:7b"       # VLM for vision OCR (paperless-gpt)
    "qwen2.5:3b"         # Text model for tagging / titling / agents
    "qwen2.5-coder:7b"   # Code model for offline flake maintenance (llm CLI)
    "qwen2.5-coder:3b"   # Smaller coder fallback under VRAM pressure
    "nomic-embed-text"   # Embeddings for RAG / semantic search
  ];
  home = "/data/ollama";
  environmentVariables = {
    OLLAMA_MAX_QUEUE = "4";
  };
};

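Models live on an external LUKS-encrypted ext4 disk (cryptdata) mounted at /data, so mokou can’t use the module’s default DynamicUser: a dynamic user gets a transient UID and is meant to own only systemd-managed state directories, not a fixed path on another disk. Instead, mokou defines a static users.users.ollama system user (with home = "/data/ollama") and forces the service off DynamicUser: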

# hosts/x86_64-nixos/mokou/default.nix
systemd.services.ollama.serviceConfig = {
  DynamicUser = lib.mkForce false;
  PrivateUsers = lib.mkForce false;
  User = lib.mkForce "ollama";
  Group = lib.mkForce "ollama";
};
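The static user definition is not reproduced on this page. A minimal sketch of what it might look like follows; only the user name and home directory come from the text above, while the dedicated group and the mount dependency are assumptions added for illustration:

```nix
# Hypothetical sketch; not copied from mokou's actual configuration.
{
  # Static system user so files on /data keep a stable owner across restarts.
  users.users.ollama = {
    isSystemUser = true;
    group = "ollama";      # system users need an explicit group
    home = "/data/ollama";
  };
  users.groups.ollama = { };

  # Assumption: hold the service back until the encrypted disk is mounted.
  systemd.services.ollama.unitConfig.RequiresMountsFor = "/data/ollama";
}
```

The RequiresMountsFor line is optional; it only matters if the disk can be unlocked after boot, in which case the service would otherwise start against an empty mount point.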

(ereshkigal’s local ollama keeps the default home = "/var/lib/ollama".)
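For contrast, an embeddings-only instance like the one on ereshkigal could be as small as the sketch below. It assumes only what this page states (a single embedding model, default home); acceleration and network settings on ereshkigal are not described here, so they are left at the module defaults.

```nix
# Hypothetical sketch of an embeddings-only Ollama; not ereshkigal's actual configuration.
{
  services.ollama = {
    enable = true;
    # home is left unset, so the module default (/var/lib/ollama) applies.
    loadModels = [ "nomic-embed-text" ];
  };
}
```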

host = "0.0.0.0" is required so other hosts on Tailscale can reach the endpoint. The firewall opens port 11434 (networking.firewall.allowedTCPPorts = [11434]), which exposes it on both the LAN and Tailscale. The Ollama API has no built-in authentication, so keep it off the internet.
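If LAN access is not needed, the port can instead be opened on the Tailscale interface alone. This is an alternative sketch rather than mokou's configuration, and it assumes the interface carries Tailscale's default name, tailscale0:

```nix
# Hypothetical alternative; mokou itself opens the port globally.
{
  # Use this instead of the global networking.firewall.allowedTCPPorts entry for 11434.
  networking.firewall.interfaces.tailscale0.allowedTCPPorts = [ 11434 ];
}
```

With this variant the service still listens on 0.0.0.0, but the firewall only admits connections that arrive over the tailnet.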

Models

| Model | Size | Use |
| --- | --- | --- |
| qwen2.5vl:7b | ~5 GB VRAM | Vision OCR: paperless-gpt image processing |
| qwen2.5:3b | ~2 GB VRAM | Text: tagging, titling, agents |
| qwen2.5-coder:7b | ~4.7 GB VRAM | Code: offline flake maintenance via the llm CLI |
| qwen2.5-coder:3b | ~2 GB VRAM | Smaller coder fallback under VRAM pressure |
| nomic-embed-text | ~274 MB VRAM | Embeddings: RAG, semantic routing |

loadModels pre-pulls these on activation so the first request doesn’t stall waiting for a download, and so the coder models are present offline, which is the whole point of the llm CLI outage workflow. The two qwen2.5-coder models fit the 8 GB card (Ollama unloads idle models, so the vision and coder models swap in and out rather than staying co-resident).
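How aggressively models swap can be tuned through the server's environment variables. The sketch below extends the environmentVariables block from the configuration above; OLLAMA_MAX_LOADED_MODELS and OLLAMA_KEEP_ALIVE are standard Ollama server variables, but the values shown are illustrative assumptions, not mokou's settings:

```nix
# Hypothetical tuning sketch; only OLLAMA_MAX_QUEUE is taken from mokou's configuration.
{
  services.ollama.environmentVariables = {
    OLLAMA_MAX_QUEUE = "4";
    OLLAMA_MAX_LOADED_MODELS = "1"; # assumption: never hold two models in VRAM at once
    OLLAMA_KEEP_ALIVE = "5m";       # assumption: unload a model after 5 idle minutes
  };
}
```

Capping loaded models at one trades extra load latency for predictable VRAM headroom; a higher cap would let the small embedding model stay resident alongside a text model.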

Compute capability reference

| Architecture | SM version | Example GPUs |
| --- | --- | --- |
| Pascal | 6.0, 6.1 | GTX 10-series, Titan X (Pascal) |
| Volta | 7.0 | Titan V, Tesla V100 |
| Turing | 7.5 | GTX 16/RTX 20-series ← nixpkgs default floor |
| Ampere | 8.0, 8.6 | RTX 30-series |
| Ada Lovelace | 8.9 | RTX 40-series |

If you have a Turing or newer GPU, the nixpkgs default works without the override.