The why vs the how: this post is the hands-on architecture walkthrough — the four layers of a self-hosted AI stack, the GPU SKUs, and the OPEX math. The conceptual case for why data sovereignty matters in the first place — the compliance drivers, the GDPR/SOC 2 framing — lives in Self-Hosted AI Infrastructure: Data Sovereignty & Compliance. This piece assumes you've already decided you need it and asks: what exactly are you building?
"Self-hosted AI on our own infrastructure" is a phrase agencies say on every other slide in 2026. Underneath it sit four different architectural layers, and each one carries a choice that determines cost, speed, and how self-hosted you actually are. This piece is about what you're really buying when you say "self-hosted," and what it gets you versus calling a frontier API straight from n8n.
What "self-hosted" actually means
In conversations with agencies, "self-hosted" means four different things depending on who you're talking to. Before discussing architecture, it's worth pinning down definitions — otherwise the argument about cost and timelines talks past itself.
Fully self-hosted on your own GPUs. You rent GPU instances (GCP, AWS, or your own co-located hardware), deploy an open-weights model on them (Qwen, Llama, Mistral), and manage the weights, scaling, and updates yourself. Maximum control, maximum CAPEX/OPEX.
Managed self-hosted. Your cloud provider gives you GPU instances with pre-installed models and an SDK for deploying your own weights. You pick a model from a catalogue or upload your own; the provider takes on the inference infrastructure. It's still "self-hosted" in the sense that data is processed inside your boundary, but without the operational load on your own DevOps team.
A managed API (Gemini, Claude). This isn't self-hosted — it's a managed service with the vendor's own model. Data is processed under the vendor's terms inside the region you select. Under most compliance regimes this covers the majority of cases, except special categories of personal data.
Hybrid. Embedding and retrieval self-hosted inside the boundary, generation through a managed API. The most common pattern in our products: the index holding client PII never leaves the boundary, while the final answer generation goes through a managed service. Fits most production loads without heavyweight DevOps.
From here on, "self-hosted" means the first two. Hybrid and API are separate scenarios with different economics.
The four layers of the stack
A self-hosted AI stack is not "one big box" but four independent layers, each configured separately.
1. Inference layer. A GPU instance or cluster running the deployed LLM. It loads weights, accepts requests, returns tokens. In practice this is either vLLM/Ollama/TGI you run yourself on a GPU instance, or a managed deployment via your cloud's GenAI tooling.
2. Embeddings and retrieval layer. A vector index over the client's document base. This can be pgvector in a managed PostgreSQL, Qdrant in a container service, or a self-managed Milvus install. Vectorization is a separate model (e5, bge-m3, a Qwen embedding model), run on the same GPU instance as inference, or on CPU under light load.
3. Orchestration layer. The logic of "user asks a question → find relevant documents → assemble the prompt → send to the LLM → format the answer." In our products this is n8n inside the boundary (a container service or your own compute). The alternative is custom code on FastAPI/Express, used when you need non-standard integrations or LangChain-style programmatic chains.
4. Data layer. Object storage for source documents, a managed database for metadata, a secrets manager for credentials. This is the part you genuinely need to keep inside your boundary — everything else can sit in managed services.
When someone says "we're building self-hosted AI," the right question is "which of the four layers are self-hosted, and which are managed." Most "sovereign" deployments are in fact hybrids: layers 2 and 4 self-hosted (because that's where client PII lives), layer 1 managed through an API, layer 3 managed through a serverless function runtime.
Choosing GPU instances for inference
GPU is the most expensive component of the stack. The relevant instance classes and what makes sense to run on each:
| Instance class | GPU | VRAM | What fits | Cost order |
|---|---|---|---|---|
| T4-class | NVIDIA T4 | 16 GB | Embeddings (bge-m3, e5), small models (0.5B–3B) | low |
| A10G-class | NVIDIA A10G | 24 GB | Qwen 7B (FP16), Mistral 7B, embeddings + retrieval ranker | medium |
| A100 40GB | NVIDIA A100 | 40 GB | Qwen 14B (FP16) or 32B (Q4), Llama 3 8B (FP16) | high |
| A100 80GB | NVIDIA A100 | 80 GB | Qwen 32B (FP16), Llama 3 70B (Q4) | very high |
These figures are for inference load. For fine-tuning or RAG with a large context, VRAM requirements roughly double.
The working minimum for an agency. For most PR/marketing tasks (comment generation, classification, summarization), a 7B-class open model on an A10G delivers quality indistinguishable from a small frontier model on non-creative tasks. CAPEX is around $1,000–1,500/month for a single instance available 24/7. It's worth comparing against per-token API billing at the same load — self-hosted usually breaks even around 10–15M tokens a month.
When you need an A100. Large context (over 32K tokens), code generation, multimodal (vision) models, complex reasoning. For a typical agency these are rare — an A10G covers 85% of needs.
When you don't need a GPU at all. Embeddings at small volume (under 10K documents in the index) live comfortably on a CPU instance with 16 GB RAM. The reranker, likewise. Only generation and processing very large datasets require a GPU.
Managed API vs self-hosted open weights
The core architectural decision is which model to use. A comparison across the six parameters that matter for agencies:
| Parameter | Managed API (Gemini / Claude) | Qwen / Mistral self-hosted | Llama 3 self-hosted |
|---|---|---|---|
| Quality | Top-tier (frontier models) | High (7B-class ≈ small frontier) | Medium (weaker at the same size) |
| Data residency | Vendor region (you choose) | Your boundary (self-hosted) | Your boundary (self-hosted) |
| Commercial licence | Vendor terms | Apache 2.0 / Apache 2.0 | Llama Community License¹ |
| Cost at typical load² | Per-token | GPU-hours (fixed) | GPU-hours (fixed) |
| Customization (fine-tune) | Limited | Full | Full |
| Vendor lock-in | High | None | None |
¹ The Llama Community License permits commercial use up to a certain DAU threshold — usually not a problem for an agency, but worth re-checking as you scale.
² "Typical load" — around 5M tokens a month for an agency. Self-hosted is fixed at $1,000–1,500/month (one A10G 24/7); a managed API runs around $300–800/month at the same volume. Self-hosted pays off as load grows, or when you need to guarantee zero cross-border data transfer.
The practical choice. For production loads we recommend a hybrid path — a managed frontier API as the "default" inference, and a self-hosted open model on a separate instance for tasks with especially sensitive data or when you need to fine-tune. Pure self-hosted-only makes sense when a regulator is already on the horizon (finance, public-sector contracts) or when DAU is high enough that per-token billing loses even to renting an A100.
n8n as the orchestrator inside the boundary
The orchestration layer is the most underrated component of the stack. Without it you have an LLM and an index, but no product.
We use n8n for every Kelva workflow product — Tender Docs App, Media Comment Generator, Case Study Generator. Why:
- Self-hostable. n8n runs in a container service or on the same instance as the inference model. Workflows never leave the boundary.
- Visual editor. A non-engineer on the client's team (a marketer, a PR director) can read the pipeline and understand what's happening. The "four months after rollout nobody remembers how this works" scenario gets closed by the workflow layout itself.
- HTTP-agnostic. Every step is an HTTP call — to the LLM, to the vector index, to the Google Docs API, to the CRM. Swapping a managed API for a self-hosted open model is editing a URL and headers in one node, not rewriting the orchestrator.
- State + retry without code. Complex scenarios with retry logic, branching, and async waits are described as nodes, not as Python nobody will maintain a year from now.
Alternatives. LangChain/LangGraph, Temporal, custom FastAPI. LangGraph is more often needed for agentic scenarios with multi-step reasoning — for linear pipelines (document → LLM → formatting) it's overkill. Temporal is for production loads with millions of jobs a day, not our case. Custom FastAPI makes sense when n8n can't meet a performance requirement or you need non-standard integrations.
What changes in the economics
The main financial difference between self-hosted and a cloud API is the shift from per-token billing to GPU-hours billing. That changes everything about the operating model.
Per-token (managed APIs). Cost grows linearly with use. Cheap at low DAU, explodes at high DAU. Forecasting cost is a function of unpredictable user input. The upside — zero CAPEX, instant start.
GPU-hours (self-hosted). Cost is fixed — you pay for the instance 24/7 regardless of load. Expensive at low DAU (you pay for an idle GPU), cheap at high DAU (one instance handles any volume within its throughput). Forecasting is precise.
The break-even point. Empirically, for an agency running a 7B-class model on an A10G (~$1,200/month) versus a small frontier API ($0.15/1M input + $0.60/1M output): self-hosted comes out cheaper from roughly 6–8M output tokens a month. That's about 20–30K long requests a month — typical load for a large PR agency with 5–10 active clients.
Hidden self-hosted costs. DevOps maintenance (model updates, GPU monitoring, downtime on upgrades) is realistically 8–12 hours of engineering time a month. Early on this cost gets underestimated.
Hidden API costs. No clean way to budget. One prompt bug that starts dumping huge contexts can burn a month's limit in a week. The growth rate of the bill is disproportionate to the growth rate of the product.
When self-hosted is not the answer
Architectural decisions should open options, not close them. Cases where self-hosted is over-optimization:
- A 2–3-month pilot. CAPEX won't pay back, and DevOps overhead slows iteration. Use a managed API directly, migrate later if the product takes off.
- DAU below 1,000 in the first year. Per-token economics won't catch GPU rental. Self-hosted makes sense as load grows.
- No ML-ops on the team. Maintaining an inference stack is a distinct engineering competence. Without it, self-hosted turns into "the GPU sits idle half the time and nobody knows why."
- No client PII in the flow. Public press releases, open reports, general marketing material — processing through an external API breaks no residency rule. The flexibility and cost of an API win here.
When self-hosted is the only path
And the reverse — cases where self-hosted isn't a choice but a requirement:
- Contracts with regulated-sector or public-sector clients. Most such procurements require localized processing on certified infrastructure. A managed API passes many of these checks, but the most sensitive sectors (defence, healthcare, critical infrastructure) require your own self-hosted instance.
- Special categories of personal data. Health data, biometrics, political views — managed services without specific certification don't meet the requirements.
- Fine-tuned models with proprietary data. If you've trained a model on your own case studies or client data, the weights can't be uploaded to someone else's infrastructure.
- Vendor independence as a strategic requirement. If your client contract includes a clause that "infrastructure must not depend on a single vendor," you're obliged to keep an open-weights fallback. A managed API doesn't satisfy that — you need a self-hosted Qwen or Llama at minimum as a hot standby.
Our Tender Docs App is an example of a product built from the start on a fully self-hosted architecture: documents stored in object storage inside the boundary, the index on self-hosted pgvector inside a managed PostgreSQL, inference as a hybrid of a managed API plus a self-hosted open model, orchestration on n8n in a container service. A prompt carrying client PII never leaves the boundary — because there's physically nowhere for it to go.
Settle the architecture question once
If you're evaluating an AI stack for an agency and want to move from "sovereign AI" slides to a concrete architecture for your scenario — let's talk. We'll assemble the four-layer stack with a clear OPEX, describe the break-even point, and propose concrete instance classes.
