One Key, One IP: A Self-Hosted AI Gateway for a Fleet of Agents

I have a small fleet of things that talk to AI providers. Coding agents, a couple of
migration assistants, some marketing automation, a handful of throwaway scripts that turned
out not to be throwaway. They run on different machines — a Mac that does the heavy lifting, a
Linux box for automation, a server or two — and over time each of them grew its own little pile
of API keys: an OpenAI key here, an OpenRouter key there, an Azure endpoint and key pasted into
yet another .env.

Two problems showed up, and they got worse the more agents I added.

Key sprawl. Every new agent meant copying the same secrets into another config file on
another machine. Rotating a key meant hunting down every place it lived. There was no single
answer to "what is actually allowed to spend my money, and how much."

Scattered egress. Each machine called the providers from its own IP address. From the
provider's point of view, one "account" was suddenly making requests from a Mac on one network,
a laptop on another, and a server in a datacenter — bouncing between addresses and regions. That
is exactly the pattern that gets you rate-limited, challenged, or quietly flagged. Consistency
matters to the people on the other end of the API, and I had none.

Both problems have the same shape, and therefore the same fix: stop letting every agent talk to
the providers directly. Put one box in the middle. It holds the keys, and every call goes out
through it — so there's one place to manage secrets and one IP the providers ever see.

This builds directly on the self-hosted SSH mesh I wrote about earlier: because all my machines
already share a private overlay network, the gateway
can live on that network and never touch the public internet at all. If you haven't set up a mesh,
that post is the prerequisite; this one assumes every machine can already reach a private address
like 100.x.y.z.

The shape of the solution

There are two kinds of traffic to handle, and they want two different tools.

LLM calls: chat and embeddings against OpenAI, OpenRouter, Azure OpenAI, and so on. For
these I want real consolidation: the keys live on the gateway, agents authenticate with one
key, and I get budgets and usage tracking for free. The tool for this is LiteLLM, an
OpenAI-compatible proxy.

Everything else: text-to-speech, image generation, cloud vision/speech services, any
random HTTPS API. Here I don't need to centralize the key (there are few callers and they
already have their own); I just need the request to leave from the gateway's IP. The tool for
this is a plain forward proxy; I used tinyproxy.

Both bind to the gateway's private mesh address only. Nothing listens on the public internet.
That single decision removes an entire category of "someone found my open proxy" problems.
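
One way to check the "nothing listens publicly" claim is a plain TCP connect test run from a machine that is off the mesh. The helper below is a hypothetical sketch, not part of LiteLLM or tinyproxy; the ports to probe are whichever ones you choose (4000 and 8888 in the two parts that follow).

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Pointed at the gateway's public address from outside the mesh, this should come back False for both ports; pointed at the mesh address from a mesh machine, it should come back True once the services are running.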

Part 1: LiteLLM, the unified LLM endpoint

LiteLLM speaks the OpenAI API and forwards to whatever backend you configure. Install it in a
virtualenv (no Docker needed) and point a systemd unit at it.

python3 -m venv /opt/litellm/venv
/opt/litellm/venv/bin/pip install 'litellm[proxy]'

The config is a YAML file describing the models you want to expose. The nicest trick is the
OpenRouter passthrough — a single wildcard entry that lets agents reach any model OpenRouter
offers (OpenAI, Anthropic, Google, Mistral…) through one key:

model_list:
  # One key, every vendor OpenRouter carries.
  - model_name: "openrouter/*"
    litellm_params:
      model: "openrouter/*"
      api_key: os.environ/OPENROUTER_API_KEY

  # A named Azure OpenAI deployment.
  - model_name: gpt-4.1-mini
    litellm_params:
      model: azure/gpt-4.1-mini
      api_base: https://YOUR-RESOURCE.openai.azure.com
      api_key: os.environ/AZURE_OPENAI_KEY
      api_version: "2025-01-01-preview"

  # An Azure AI serverless model (DeepSeek), see the gotcha below.
  - model_name: deepseek-v31
    litellm_params:
      model: azure_ai/deepseek-v31
      api_base: https://YOUR-REGION.api.cognitive.microsoft.com/models
      api_version: "2024-05-01-preview"
      api_key: os.environ/AZURE_FOUNDRY_KEY

litellm_settings:
  drop_params: true
  request_timeout: 600

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

Secrets go in a separate .env (referenced by os.environ/..., never inline), and the systemd
unit binds the proxy to the mesh interface, not 0.0.0.0:

[Service]
EnvironmentFile=/etc/litellm/.env
ExecStart=/opt/litellm/venv/bin/litellm --config /etc/litellm/config.yaml \
          --host 100.x.y.z --port 4000
Restart=always

That --host 100.x.y.z carries most of the security model in one line: the service is only reachable
from machines on the mesh, with the master key as the second check. From any of those machines, an
agent now does this and nothing else:

export OPENAI_BASE_URL=http://100.x.y.z:4000/v1
export OPENAI_API_KEY=<the gateway key>

Any client that speaks the OpenAI API — the official SDKs, LangChain, your own curl — just works.
No provider keys on the agent's machine at all.
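
For an agent that does not use an SDK, the same two variables are enough to build the request by hand. This is a minimal standard-library sketch, assuming the gateway serves the usual OpenAI-style /chat/completions route under the base URL; the function names and fallback values are placeholders introduced here, not anything LiteLLM defines.

```python
import json
import os
import urllib.request


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request aimed at the gateway."""
    # OPENAI_BASE_URL and OPENAI_API_KEY are the two variables exported above.
    # The fallbacks are placeholders, not a real address or key.
    base_url = os.environ.get("OPENAI_BASE_URL", "http://100.x.y.z:4000/v1")
    api_key = os.environ.get("OPENAI_API_KEY", "sk-placeholder")
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request through the gateway and return the first reply's text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With the config above, model would be one of the configured names such as gpt-4.1-mini. For the wildcard entry, the client presumably sends the full openrouter/<vendor>/<model> string; that is worth confirming against your LiteLLM version.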

Part 2: tinyproxy, the fixed-IP forward proxy

For the non-LLM traffic, the forward proxy is almost embarrassingly simple. Install tinyproxy,
bind it to the mesh address, and allow only the mesh subnet:

Port 8888
Listen 100.x.y.z
Allow 100.64.0.0/10
# Permit HTTPS CONNECT tunnels to port 443 only.
ConnectPort 443
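
The Allow line is a CIDR range: 100.64.0.0/10 covers 100.64.0.0 through 100.127.255.255, the shared address space (RFC 6598) that some overlay networks assign from. A quick sketch for confirming that a given client address falls inside it, using a hypothetical helper name; if your mesh hands out addresses from a different range, the Allow line has to change to match.

```python
import ipaddress

# The range from the Allow directive above.
MESH_RANGE = ipaddress.ip_network("100.64.0.0/10")


def allowed_by_mesh_rule(client_ip: str) -> bool:
    """Return True if client_ip falls inside the Allow range."""
    return ipaddress.ip_address(client_ip) in MESH_RANGE
```

Running this against each machine's mesh address before relying on the proxy is a cheap way to catch a machine that would be refused.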

Now any tool on any mesh machine can route its traffic out the gateway with two environment
variables:

export HTTPS_PROXY=http://100.x.y.z:8888
export HTTP_PROXY=http://100.x.y.z:8888

A TTS call, an image-generation request, a cloud vision API — all of them now leave from the
gateway's single IP, while keeping their own keys client-side. To prove it, check your address
before and after:

curl https://api.ipify.org                          # your machine's IP
curl -x http://100.x.y.z:8888 https://api.ipify.org # the gateway's IP
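
The same before-and-after check can be run from inside a Python agent, which is useful when the agent configures its proxy in code rather than through the environment. A minimal standard-library sketch; opener_via and egress_ip are names made up for this example, and the proxy URL is the placeholder address used throughout.

```python
import urllib.request
from typing import Optional


def opener_via(proxy_url: Optional[str]) -> urllib.request.OpenerDirector:
    """Return an opener that uses proxy_url, or connects directly when None."""
    # An explicit empty dict makes ProxyHandler ignore any HTTP(S)_PROXY
    # variables in the environment, so the "direct" case really is direct.
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else {}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def egress_ip(proxy_url: Optional[str] = None, timeout: float = 10.0) -> str:
    """Ask api.ipify.org which address the request appeared to come from."""
    with opener_via(proxy_url).open("https://api.ipify.org", timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


if __name__ == "__main__":
    print("direct: ", egress_ip())
    print("proxied:", egress_ip("http://100.x.y.z:8888"))
```

If the proxy is working, the two printed addresses should differ, and the second should be the gateway's public IP.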

One important rule: never expose a forward proxy on the public internet. An open proxy is an
open relay, and bots will find it in hours. Mesh-only is not a limitation here — it's the point.

The two gotchas that cost me time

A how-to is only useful if it includes the parts that didn't work on the first try. There were two.

Azure AI serverless models don't live where you'd guess

Wiring Azure OpenAI (gpt-4.1-mini) was painless. The serverless Azure AI model — DeepSeek, in
my case — was not. The obvious per-resource hostname, your-resource.services.ai.azure.com,
doesn't resolve unless the resource has a custom subdomain configured, and mine didn't:

Cannot connect to host your-resource.services.ai.azure.com:443 [Name or service not known]
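
That error is a DNS failure rather than an authentication or routing problem, so it can be reproduced without any keys. A small sketch, with a hypothetical helper name, for checking whether a candidate hostname resolves at all before putting it in the config:

```python
import socket


def resolves(hostname: str, port: int = 443) -> bool:
    """Return True if the system resolver can find an address for hostname."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        # The same class of failure as "Name or service not known".
        return False
```

A False result for the per-resource hostname and a True result for the regional one points at the endpoint choice rather than at LiteLLM.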

The fix is to use the regional model-inference endpoint with a /models path and an explicit
API version. You can confirm the right shape with a direct call before you ever touch LiteLLM:

curl "https://YOUR-REGION.api.cognitive.microsoft.com/models/chat/completions?api-version=2024-05-01-preview" \
  -H "api-key: $AZURE_FOUNDRY_KEY" -H "Content-Type: application/json" \
  -d '{"model":"deepseek-v31","messages":[{"role":"user","content":"hi"}]}'

Once that returns 200, point LiteLLM's azure_ai/ provider at the same .../models base (as in
the config above) and it works.

The admin UI needs a database — and a package that isn't bundled

LiteLLM ships a genuinely nice admin UI at /ui: create per-agent keys, set budgets, watch spend,
read request logs. But none of that works without a Postgres database, because that's where it
stores keys and usage. With no DB, the page loads but you can't log in.

So: install Postgres, create a database, and add four things to your .env:

DATABASE_URL=postgresql://litellm:PASSWORD@localhost:5432/litellm
UI_USERNAME=admin
UI_PASSWORD=<something strong>
STORE_MODEL_IN_DB=True

Restart and… it still failed for me, with ModuleNotFoundError: No module named 'prisma'. LiteLLM's
database layer uses Prisma, but the prisma Python package isn't pulled in by litellm[proxy], and
the tables aren't created automatically. Two commands fix both:

/opt/litellm/venv/bin/pip install prisma
# point this at LiteLLM's bundled schema.prisma:
/opt/litellm/venv/bin/prisma db push --schema <path/to/litellm>/schema.prisma

(The db push finishes by trying a code-gen step that may error with prisma-client-py: not found
— that part is harmless, the tables are already created by then.) After a restart, the UI logs in,
minting a virtual key writes to the DB, and you get real per-agent budgets and a spend dashboard.
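
Keys can also be minted without the UI. The sketch below assumes the proxy's /key/generate endpoint, authenticated with the master key as a bearer token; the request fields (key_alias, max_budget, models) and the key field in the response follow LiteLLM's virtual-key documentation as best understood here and should be checked against your installed version. LITELLM_BASE_URL is a hypothetical variable name introduced for this example.

```python
import json
import os
import urllib.request
from typing import List


def build_key_request(agent_name: str, max_budget_usd: float,
                      models: List[str]) -> urllib.request.Request:
    """Build (but do not send) a request for a per-agent virtual key."""
    # LITELLM_BASE_URL is a made-up name for this sketch; LITELLM_MASTER_KEY
    # is the variable the config above already reads.
    base_url = os.environ.get("LITELLM_BASE_URL", "http://100.x.y.z:4000")
    master_key = os.environ["LITELLM_MASTER_KEY"]
    body = {
        "key_alias": agent_name,       # assumed field: label for the key
        "max_budget": max_budget_usd,  # assumed field: spend cap in USD
        "models": models,              # assumed field: models the key may call
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/key/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def mint_key(agent_name: str, max_budget_usd: float, models: List[str]) -> str:
    """Send the request and return the new key (assumed to be under 'key')."""
    request = build_key_request(agent_name, max_budget_usd, models)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)["key"]
```

The returned key is what goes into an agent's OPENAI_API_KEY, so the master key itself never has to leave the gateway.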

What I ended up with

A single box on my private network that:

exposes one OpenAI-compatible endpoint for every LLM provider, so agents carry one key (or a
per-agent key with its own budget) instead of a pile of provider secrets;
routes everything else through a forward proxy from the same fixed IP, so providers see
one consistent origin instead of a dozen scattered ones;
listens only on the mesh, so there's nothing to attack from the internet;
and gives me a dashboard to see, per agent, exactly what's being spent.

It's all native packages and systemd — no containers, no orchestration, an afternoon of work. The
payoff was immediate: the next time I spun up an agent, "configure the AI provider" stopped being a
step. It just inherits the gateway, like every other machine on the mesh.

If you're past the point of one script with one API key — if you've got a fleet — this is the
piece that makes the fleet feel like one thing instead of a dozen separate liabilities.