AI / Local Inference#

This flake runs a fully local AI stack — no cloud APIs, no data leaving the network. All inference happens on mokou (GTX 1080) via Ollama, and consumers talk to it over Tailscale.

Stack overview#

```mermaid flowchart TD subgraph mokou[“mokou — GTX 1080 (SM 6.1)”] ollama[“Ollama :11434”] ollama — vlm[“qwen2.5-vl:7b\n(vision OCR)”] ollama — llm[“qwen2.5:3b\n(text / tagging)”] ollama — emb[“nomic-embed-text\n(embeddings)”] end

subgraph ereshkigal["ereshkigal"]
    pgpt["paperless-gpt :8013\n(auto-tag sidecar)"]
end

subgraph desktop["mokou (desktop)"]
    agents["agents package\nsummarize / triage / classify …"]
end

pgpt -->|"LLM_MODEL qwen2.5:3b\nVISION_LLM_MODEL qwen2.5-vl:7b"| ollama
agents -->|"OLLAMA_HOST=http://localhost:11434"| ollama

```

Components#

Component	Host	Purpose
Ollama	mokou	Inference server, GPU-accelerated
paperless-gpt	ereshkigal	LLM auto-tagging sidecar for Paperless-NGX
agents	mokou (CLI)	Personal data archeology tools

Key design decisions#

Everything runs locally. Documents contain personal financial, medical, and legal data — sending them to an external API is not acceptable.

mokou provides GPU inference for the whole network. Rather than replicating GPU setup per host, every consumer reaches mokou over Tailscale. The Tailscale FQDN keeps addressing stable regardless of DHCP assignment.

Small models with constrained output. At 3b parameters, reliability comes from forcing the model into a tight output format (single word, JSON schema, CSV) rather than asking for open-ended prose. See the individual pages for prompt patterns.