Ollama on mokou
Ollama runs on mokou (GTX 1080, Pascal architecture) and serves the entire
network as a shared inference endpoint over Tailscale. ereshkigal also runs a
local Ollama instance that loads only nomic-embed-text (for Open-WebUI RAG
embeddings) and delegates vision/text inference to mokou; see the AI overview
for the full topology.
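For orientation, ereshkigal's embeddings-only instance could look roughly like the fragment below. This is a minimal sketch inferred from the description above, not a copy of the real host file; it assumes the module's default host (localhost only) is left untouched and leaves out acceleration because the text does not say how ereshkigal is set up.

```nix
{
  # Hypothetical sketch of ereshkigal's local instance (not the actual config).
  services.ollama = {
    enable = true;
    # Only the embedding model is pulled locally; vision/text requests
    # go to mokou instead.
    loadModels = [ "nomic-embed-text" ];
    # host is not set, so the service keeps the module default and is
    # reachable only from ereshkigal itself (assumption).
  };
}
```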
Hardware
| Component | Spec |
|---|---|
| GPU | NVIDIA GeForce GTX 1080 |
| Compute capability | SM 6.1 (Pascal) |
| VRAM | 8 GB GDDR5X |
| CPU | Intel i7-4790K |
The CUDA SM 6.1 problem
nixpkgs builds Ollama with CUDA support targeting SM 7.5+ (Turing and newer)
by default. Pascal (GTX 10-series, SM 6.1) is excluded, so a stock
services.ollama.acceleration = "cuda" silently falls back to CPU inference:
the output is still correct, but generation runs at roughly a tenth of the speed.
The fix is to override cudaArches at the package level:
services.ollama = {
  enable = true;
  acceleration = "cuda";
  package = pkgs.ollama.override {
    acceleration = "cuda";
    cudaArches = ["61"]; # SM 6.1 = GTX 1080 / Pascal
  };
  ...
};
This rebuilds Ollama with a PTX/SASS target for SM 6.1. Build time is significant (~15–30 min on first switch); after that inference is GPU-accelerated and sub-second for 3b models.
Warning
If ollama run qwen2.5:3b takes several seconds to respond and the process shows
100% CPU in htop, the CUDA override likely didn’t take effect. Run nvidia-smi
during inference to check: GPU utilisation should be nonzero.
NixOS configuration
# hosts/x86_64-nixos/mokou/default.nix
services.ollama = {
  enable = true;
  acceleration = "cuda";
  package = pkgs.ollama.override {
    acceleration = "cuda";
    cudaArches = ["61"];
  };
  host = "0.0.0.0"; # listen on all interfaces, not just localhost
  port = 11434;
  loadModels = [
    "qwen2.5vl:7b"     # VLM for vision OCR (paperless-gpt)
    "qwen2.5:3b"       # Text model for tagging / titling / agents
    "qwen2.5-coder:7b" # Code model for offline flake maintenance (llm CLI)
    "qwen2.5-coder:3b" # Smaller coder fallback under VRAM pressure
    "nomic-embed-text" # Embeddings for RAG / semantic search
  ];
  home = "/data/ollama";
  environmentVariables = {
    OLLAMA_MAX_QUEUE = "4";
  };
};
Models live on an external LUKS-encrypted ext4 disk (cryptdata) mounted at
/data, so mokou can’t use the module’s default DynamicUser. Instead it
defines a static users.users.ollama system user (with home = "/data/ollama")
and forces the service off DynamicUser:
# hosts/x86_64-nixos/mokou/default.nix
systemd.services.ollama.serviceConfig = {
  DynamicUser = lib.mkForce false;
  PrivateUsers = lib.mkForce false;
  User = lib.mkForce "ollama";
  Group = lib.mkForce "ollama";
};
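The static user that the override refers to is described above but not shown. A minimal sketch of what such a definition might look like follows; only the user name and home path come from the text, while the separate ollama group and the exact attribute layout are assumptions.

```nix
{
  # Sketch only: the real definition in mokou's config may differ.
  # Group referenced by Group= in the serviceConfig override (assumed to
  # be declared alongside the user).
  users.groups.ollama = { };

  users.users.ollama = {
    isSystemUser = true;   # system account; NixOS requires a group for these
    group = "ollama";
    home = "/data/ollama"; # same path as services.ollama.home
  };
}
```

A fixed user keeps a stable owner for the files under /data/ollama on the external disk, which is what the move away from DynamicUser is for.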
(ereshkigal’s local ollama keeps the default home = "/var/lib/ollama".)
host = "0.0.0.0" is required so other hosts on Tailscale can reach the endpoint.
The firewall opens port 11434 (networking.firewall.allowedTCPPorts = [11434]),
which exposes it on both the LAN and Tailscale. The Ollama API has no built-in
authentication, so keep it off the public internet.
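The firewall rule can be written out as below. The first line is the setting quoted in the text; the commented-out variant is a hypothetical tighter alternative, not the author's configuration, and it assumes the Tailscale interface is named tailscale0.

```nix
{
  # As described in the text: open the Ollama port on every interface
  # (LAN and Tailscale alike).
  networking.firewall.allowedTCPPorts = [ 11434 ];

  # Hypothetical alternative: open the port on the Tailscale interface only.
  # "tailscale0" is the usual interface name but is an assumption here.
  # networking.firewall.interfaces."tailscale0".allowedTCPPorts = [ 11434 ];
}
```

The per-interface form would drop LAN access, so it only fits if every client reaches mokou over Tailscale.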
Models
| Model | Size | Use |
|---|---|---|
| qwen2.5vl:7b | ~5 GB VRAM | Vision OCR: paperless-gpt image processing |
| qwen2.5:3b | ~2 GB VRAM | Text: tagging, titling, agents |
| qwen2.5-coder:7b | ~4.7 GB VRAM | Code: offline flake maintenance via the llm CLI |
| qwen2.5-coder:3b | ~2 GB VRAM | Smaller coder fallback under VRAM pressure |
| nomic-embed-text | ~274 MB VRAM | Embeddings: RAG, semantic routing |
loadModels pre-pulls these on activation so the first request doesn’t stall
waiting for a download — and so the coder models are present offline, which is
the whole point of the llm CLI outage workflow. The two qwen2.5-coder
models each fit the 8 GB card. Ollama unloads idle models, so the vision and
coder models swap in and out of VRAM rather than staying co-resident.
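The swapping relies on Ollama's defaults. If it ever needs to be pinned explicitly, Ollama's OLLAMA_KEEP_ALIVE and OLLAMA_MAX_LOADED_MODELS environment variables are the usual knobs. The fragment below is only a sketch: neither variable is set in mokou's config, and the values are illustrative assumptions rather than tuned settings.

```nix
{
  services.ollama.environmentVariables = {
    OLLAMA_MAX_QUEUE = "4";         # already present in mokou's config

    # Assumption: how long an idle model stays in VRAM before it is
    # unloaded. "5m" matches Ollama's documented default.
    OLLAMA_KEEP_ALIVE = "5m";

    # Assumption: upper bound on simultaneously loaded models, so a 7b
    # model plus the embedding model can stay resident together.
    OLLAMA_MAX_LOADED_MODELS = "2";
  };
}
```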
Compute capability reference
| Architecture | SM version | Example GPUs |
|---|---|---|
| Pascal | 6.0, 6.1 | GTX 10-series, Titan X (Pascal) |
| Volta | 7.0 | Titan V, Tesla V100 |
| Turing | 7.5 | GTX 16/RTX 20-series ← nixpkgs default floor |
| Ampere | 8.0, 8.6 | RTX 30-series |
| Ada Lovelace | 8.9 | RTX 40-series |
If you have a Turing or newer GPU, the nixpkgs default works without the override.