# Ollama on mokou
Ollama runs on mokou (GTX 1080, Pascal architecture) and serves the entire network as a shared inference endpoint over Tailscale.
## Hardware

| Component | Spec |
|---|---|
| GPU | NVIDIA GeForce GTX 1080 |
| Compute capability | SM 6.1 (Pascal) |
| VRAM | 8 GB GDDR5X |
| CPU | Intel i7-4790K |
## The CUDA SM 6.1 problem
nixpkgs builds Ollama with CUDA support targeting SM 7.5+ (Turing and newer)
by default. Pascal (GTX 10-series, SM 6.1) is excluded, so a stock
`services.ollama.acceleration = "cuda"` silently falls back to CPU inference:
you get correct output, but at roughly a tenth of the speed.
The fix is to override `cudaArches` at the package level:

```nix
services.ollama = {
  enable = true;
  acceleration = "cuda";
  package = pkgs.ollama.override {
    acceleration = "cuda";
    cudaArches = ["61"]; # SM 6.1 = GTX 1080 / Pascal
  };
  # ...
};
```
This rebuilds Ollama with a PTX/SASS target for SM 6.1. Build time is significant (30–60 min); after that inference is GPU-accelerated and sub-second for 3b models.
> **Warning**
>
> If you see `ollama run qwen2.5:3b` responding in whole seconds rather than
> sub-second, and the process shows 100% CPU in `htop`, the CUDA override
> likely didn't take. Check with `nvidia-smi` during inference; GPU
> utilisation should be nonzero.
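The checks above can be scripted. This is a diagnostic sketch: `ollama ps` reports where each loaded model is resident, and `nvidia-smi` can poll utilisation while a prompt is running.

```shell
# Show loaded models and whether they are resident on GPU or CPU.
# A working setup reports "100% GPU" in the PROCESSOR column.
ollama ps

# Sample GPU utilisation and memory once per second while inference runs;
# sustained nonzero utilisation confirms the CUDA backend is in use.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```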
## NixOS configuration
```nix
# hosts/x86_64-nixos/mokou/default.nix
services.ollama = {
  enable = true;
  acceleration = "cuda";
  package = pkgs.ollama.override {
    acceleration = "cuda";
    cudaArches = ["61"];
  };
  host = "0.0.0.0"; # listen on all interfaces, not just localhost
  port = 11434;
  loadModels = [
    "qwen2.5-vl:7b"    # VLM for vision OCR (paperless-gpt)
    "qwen2.5:3b"       # Text model for tagging / titling / agents
    "nomic-embed-text" # Embeddings for RAG / semantic search
  ];
  home = "/var/lib/ollama";
  environmentVariables = {
    OLLAMA_MAX_QUEUE = "4";
  };
};
```
`host = "0.0.0.0"` is required so other hosts on Tailscale can reach the endpoint.
The firewall allows port 11434 inbound on the Tailscale interface only.
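That interface-scoped firewall rule can be written as follows — a sketch assuming the Tailscale interface is named `tailscale0` (the NixOS default):

```nix
# Open the Ollama port only on the Tailscale interface, never globally.
networking.firewall.interfaces."tailscale0".allowedTCPPorts = [ 11434 ];
```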
## Models
| Model | Size | Use |
|---|---|---|
| qwen2.5-vl:7b | ~5 GB VRAM | Vision OCR: paperless-gpt image processing |
| qwen2.5:3b | ~2 GB VRAM | Text: tagging, titling, agents |
| nomic-embed-text | ~274 MB VRAM | Embeddings: RAG, semantic routing |
`loadModels` pre-pulls these on activation so the first request doesn't stall
waiting for a download.
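From any other machine on the tailnet, the endpoint is a plain HTTP API. A usage sketch, assuming the host resolves as `mokou` (e.g. via Tailscale MagicDNS); the prompts are illustrative:

```shell
# Text generation with the small tagging model (non-streaming).
curl -s http://mokou:11434/api/generate \
  -d '{"model": "qwen2.5:3b", "prompt": "Suggest a title for an invoice.", "stream": false}'

# Embeddings for RAG / semantic search.
curl -s http://mokou:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "semantic routing"}'
```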
## Compute capability reference
| Architecture | SM version | Example GPUs |
|---|---|---|
| Pascal | 6.0, 6.1 | GTX 10-series, Titan X (Pascal) |
| Volta | 7.0 | Titan V, Tesla V100 |
| Turing | 7.5 | GTX 16/RTX 20-series ← nixpkgs default floor |
| Ampere | 8.0, 8.6 | RTX 30-series |
| Ada Lovelace | 8.9 | RTX 40-series |
If you have a Turing or newer GPU, the nixpkgs default works without the override.
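To check which row applies to your card, recent drivers expose the compute capability directly (a sketch; the `compute_cap` query field requires a reasonably new `nvidia-smi`):

```shell
# Prints the card name and SM version, e.g. "NVIDIA GeForce GTX 1080, 6.1".
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```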