AI / Local Inference#

This flake runs a fully local AI stack — no cloud APIs, no data leaving the network. All inference happens on mokou (GTX 1080) via Ollama, and consumers talk to it over Tailscale.

Stack overview#

```mermaid flowchart TD subgraph mokou[“mokou — GTX 1080 (SM 6.1)”] ollama[“Ollama :11434”] ollama — vlm[“qwen2.5vl:7b\n(vision OCR)”] ollama — txt[“qwen2.5:3b\n(text / tagging)”] ollama — coder[“qwen2.5-coder:7b/:3b\n(code / flake maintenance)”] ollama — emb[“nomic-embed-text\n(embeddings)”] end

subgraph ereshkigal["ereshkigal"]
    pgpt["paperless-gpt :8013\n(auto-tag sidecar)"]
    owui["Open-WebUI :3000\n(LLM chat + RAG)"]
    allm["AnythingLLM :13001\n(RAG chat sidecar)"]
end

subgraph dev["dev shell / nix run"]
    agents["agents package\nsummarize / triage / classify …"]
    llmcli["llm CLI\n-t nix-maint"]
    kiwixask["kiwix-ask\n(grounded retrieval)"]
end

subgraph voile["voile — NAS"]
    kiwix["Kiwix :8092\n(offline Wikipedia / DevDocs / SE)"]
end

pgpt -->|"LLM_MODEL qwen2.5:3b\nVISION_LLM_MODEL qwen2.5vl:7b"| ollama
owui -->|"OLLAMA_BASE_URL\nchat + nomic-embed-text RAG"| ollama
allm -->|"LLM_PROVIDER=ollama\nnomic-embed-text embeddings"| ollama
agents -->|"OLLAMA_HOST"| ollama
llmcli -->|"OLLAMA_HOST → mokou\nqwen2.5-coder:7b"| ollama
kiwixask -->|"retrieve articles"| kiwix
kiwixask -->|"ground answer"| llmcli

```

Components#

Component Host Purpose
Ollama mokou Inference server, GPU-accelerated
paperless-gpt ereshkigal LLM auto-tagging sidecar for Paperless-NGX
Open-WebUI ereshkigal LLM chat + document RAG over mokou’s Ollama (HTTP 3000 / HTTPS 3001)
AnythingLLM ereshkigal RAG chat sidecar over the document library (HTTP 13001 / HTTPS 13002)
agents mokou (CLI) Personal data archeology tools
llm CLI dev shell Offline flake-maintenance assistant (llm -t nix-maint)
llm-maint dev shell llm -t nix-maint with live repo facts (nixpkgs lock, modules, hosts) injected
kiwix-ask dev shell Retrieval-grounded answers from voile’s Kiwix

Key design decisions#

Everything runs locally. Documents contain personal financial, medical, and legal data — sending them to an external API is not acceptable.

mokou provides GPU inference for the whole network. Rather than replicating GPU setup per host, every consumer reaches mokou over Tailscale. The Tailscale FQDN keeps addressing stable regardless of DHCP assignment.

Small models with constrained output. At 3b parameters, reliability comes from forcing the model into a tight output format (single word, JSON schema, CSV) rather than asking for open-ended prose. See the individual pages for prompt patterns.

Offline maintenance, not just offline inference. The same Ollama also backs the llm CLI and kiwix-ask, so the flake itself stays maintainable during a cloud-AI outage: qwen2.5-coder:7b stands in for ChatGPT/Claude, and voile’s offline Kiwix archive stands in for the open web.