# AI / Local Inference
This flake runs a fully local AI stack — no cloud APIs, no data leaving the network. All inference happens on mokou (GTX 1080) via Ollama, and consumers talk to it over Tailscale.
## Stack overview
```mermaid
flowchart TD
    subgraph mokou["mokou — GTX 1080 (SM 6.1)"]
        ollama["Ollama :11434"]
        ollama --> vlm["qwen2.5-vl:7b\n(vision OCR)"]
        ollama --> llm["qwen2.5:3b\n(text / tagging)"]
        ollama --> emb["nomic-embed-text\n(embeddings)"]
    end

    subgraph ereshkigal["ereshkigal"]
        pgpt["paperless-gpt :8013\n(auto-tag sidecar)"]
    end

    subgraph desktop["mokou (desktop)"]
        agents["agents package\nsummarize / triage / classify …"]
    end

    pgpt -->|"LLM_MODEL qwen2.5:3b\nVISION_LLM_MODEL qwen2.5-vl:7b"| ollama
    agents -->|"OLLAMA_HOST=http://localhost:11434"| ollama
```
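The stack can be sanity-checked from any consumer by listing the models Ollama has pulled. A minimal sketch, assuming Python with `requests`; the default URL uses the bare hostname `mokou` as a placeholder for the real Tailscale FQDN:

```python
"""List the models pulled on mokou's Ollama and flag any the stack expects
but does not have. The default endpoint is a placeholder for the Tailscale FQDN."""
import os

import requests

OLLAMA = os.environ.get("OLLAMA_HOST", "http://mokou:11434")  # placeholder host
EXPECTED = {"qwen2.5-vl:7b", "qwen2.5:3b", "nomic-embed-text"}

resp = requests.get(f"{OLLAMA}/api/tags", timeout=10)
resp.raise_for_status()
pulled = {m["name"] for m in resp.json().get("models", [])}

# Tags may carry an implicit ":latest" suffix, so compare by prefix.
missing = {m for m in EXPECTED if not any(p.startswith(m) for p in pulled)}
print("missing models:", ", ".join(sorted(missing)) or "none")
```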
## Components
| Component | Host | Purpose |
|---|---|---|
| Ollama | mokou | Inference server, GPU-accelerated |
| paperless-gpt | ereshkigal | LLM auto-tagging sidecar for Paperless-NGX |
| agents | mokou (CLI) | Personal data archeology tools |
## Key design decisions
Everything runs locally. Documents contain personal financial, medical, and legal data — sending them to an external API is not acceptable.
mokou provides GPU inference for the whole network. Rather than replicating GPU setup per host, every consumer reaches mokou over Tailscale. The Tailscale FQDN keeps addressing stable regardless of DHCP assignment.
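A minimal sketch of the consumer side, assuming Python with `requests`; the fallback FQDN below is a placeholder, not the tailnet's real name. The base URL comes from `OLLAMA_HOST`, as the diagram shows, and the request hits Ollama's `/api/embeddings` endpoint with `nomic-embed-text`:

```python
"""Embed a snippet of text against mokou's Ollama from another host.
The fallback FQDN is a placeholder; substitute the tailnet's real name."""
import os

import requests

# On mokou itself the agents use http://localhost:11434; remote consumers
# point OLLAMA_HOST at the Tailscale name instead.
OLLAMA = os.environ.get("OLLAMA_HOST", "http://mokou.example-tailnet.ts.net:11434")

resp = requests.post(
    f"{OLLAMA}/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "quarterly insurance statement"},
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["embedding"]
print(f"got a {len(embedding)}-dimensional embedding")
```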
Small models with constrained output. At 3b parameters, reliability comes from forcing the model into a tight output format (single word, JSON schema, CSV) rather than asking for open-ended prose. See the individual pages for prompt patterns.
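A minimal sketch of that pattern against Ollama's `/api/generate`, assuming Python with `requests`; the label set, prompt, and document text are illustrative, not the flake's actual prompts:

```python
"""Single-label classification with qwen2.5:3b: force JSON output, then
validate the label against a fixed set. Labels and prompt are illustrative."""
import json
import os

import requests

OLLAMA = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
LABELS = {"invoice", "medical", "legal", "other"}  # hypothetical tag set

prompt = (
    "Classify the document below. Respond with JSON only, exactly "
    '{"category": "<label>"} where <label> is one of: '
    + ", ".join(sorted(LABELS))
    + ".\n\nDocument:\nStatement of account for policy 4711, premium due 2024-07-01."
)

resp = requests.post(
    f"{OLLAMA}/api/generate",
    json={
        "model": "qwen2.5:3b",
        "prompt": prompt,
        "format": "json",               # Ollama constrains output to valid JSON
        "stream": False,
        "options": {"temperature": 0},  # deterministic, no creative drift
    },
    timeout=120,
)
resp.raise_for_status()
category = json.loads(resp.json()["response"]).get("category")

# A 3b model still drifts occasionally, so reject anything outside the set.
if category not in LABELS:
    raise ValueError(f"unexpected label from model: {category!r}")
print(category)
```

The validation step matters as much as the prompt: Ollama enforces the JSON shape, but the caller enforces the label set, so an off-list answer fails loudly instead of creating a stray tag.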