AI / Local Inference#
This flake runs a fully local AI stack — no cloud APIs, no data leaving the network. All inference happens on mokou (GTX 1080) via Ollama, and consumers talk to it over Tailscale.
Stack overview#
```mermaid flowchart TD subgraph mokou[“mokou — GTX 1080 (SM 6.1)”] ollama[“Ollama :11434”] ollama — vlm[“qwen2.5vl:7b\n(vision OCR)”] ollama — txt[“qwen2.5:3b\n(text / tagging)”] ollama — coder[“qwen2.5-coder:7b/:3b\n(code / flake maintenance)”] ollama — emb[“nomic-embed-text\n(embeddings)”] end
subgraph ereshkigal["ereshkigal"]
pgpt["paperless-gpt :8013\n(auto-tag sidecar)"]
owui["Open-WebUI :3000\n(LLM chat + RAG)"]
allm["AnythingLLM :13001\n(RAG chat sidecar)"]
end
subgraph dev["dev shell / nix run"]
agents["agents package\nsummarize / triage / classify …"]
llmcli["llm CLI\n-t nix-maint"]
kiwixask["kiwix-ask\n(grounded retrieval)"]
end
subgraph voile["voile — NAS"]
kiwix["Kiwix :8092\n(offline Wikipedia / DevDocs / SE)"]
end
pgpt -->|"LLM_MODEL qwen2.5:3b\nVISION_LLM_MODEL qwen2.5vl:7b"| ollama
owui -->|"OLLAMA_BASE_URL\nchat + nomic-embed-text RAG"| ollama
allm -->|"LLM_PROVIDER=ollama\nnomic-embed-text embeddings"| ollama
agents -->|"OLLAMA_HOST"| ollama
llmcli -->|"OLLAMA_HOST → mokou\nqwen2.5-coder:7b"| ollama
kiwixask -->|"retrieve articles"| kiwix
kiwixask -->|"ground answer"| llmcli
```
Components#
| Component | Host | Purpose |
|---|---|---|
| Ollama | mokou | Inference server, GPU-accelerated |
| paperless-gpt | ereshkigal | LLM auto-tagging sidecar for Paperless-NGX |
| Open-WebUI | ereshkigal | LLM chat + document RAG over mokou’s Ollama (HTTP 3000 / HTTPS 3001) |
| AnythingLLM | ereshkigal | RAG chat sidecar over the document library (HTTP 13001 / HTTPS 13002) |
| agents | mokou (CLI) | Personal data archeology tools |
| llm CLI | dev shell | Offline flake-maintenance assistant (llm -t nix-maint) |
| llm-maint | dev shell | llm -t nix-maint with live repo facts (nixpkgs lock, modules, hosts) injected |
| kiwix-ask | dev shell | Retrieval-grounded answers from voile’s Kiwix |
Key design decisions#
Everything runs locally. Documents contain personal financial, medical, and legal data — sending them to an external API is not acceptable.
mokou provides GPU inference for the whole network. Rather than replicating GPU setup per host, every consumer reaches mokou over Tailscale. The Tailscale FQDN keeps addressing stable regardless of DHCP assignment.
Small models with constrained output. At 3b parameters, reliability comes from forcing the model into a tight output format (single word, JSON schema, CSV) rather than asking for open-ended prose. See the individual pages for prompt patterns.
Offline maintenance, not just offline inference. The same Ollama also backs the
llm CLI and kiwix-ask, so the flake itself stays maintainable during a
cloud-AI outage: qwen2.5-coder:7b stands in for ChatGPT/Claude, and voile’s offline
Kiwix archive stands in for the open web.