Running a Local AI Model on My Homelab Kubernetes Cluster

Adding GPU-accelerated AI inference to a k3s cluster using Ollama, a GTX 1050 Ti, and Open WebUI — no cloud, no subscriptions, no data leaving the network.


Why Run AI Locally?

Cloud AI services are convenient, but they come with recurring costs, rate limits, and the fact that every prompt you send leaves your network. For a homelab, running a local model means full control — the data stays on your machines, it works offline, and once it's set up, it costs nothing to run.

I already had a k3s cluster built from old phones and a laptop handling monitoring and web apps. Adding local AI inference was the logical next step — I just needed a GPU.


The Hardware

My daily driver Windows PC had a spare GPU doing nothing most of the day: an NVIDIA GeForce GTX 1050 Ti with 4GB of VRAM.

The 1050 Ti is far from a data center GPU, but 4GB of VRAM is enough to run quantized 7B-parameter models at reasonable speed. That's roughly the capability of early ChatGPT — more than sufficient for a DevOps study assistant, code helper, or general Q&A tool.


Architecture

Windows PC (192.168.100.203)
  └── WSL2 (Ubuntu 24.04)
        ├── Ollama (GPU inference, port 11434)
        └── NVIDIA GPU passthrough via /dev/dxg

k3s Cluster (192.168.100.52)
  ├── Open WebUI pod → talks to Ollama API
  ├── MetalLB → exposes WebUI at 192.168.100.204
  └── k3s ExternalName service → routes to Ollama

The trick is that WSL2 doesn't expose /dev/nvidia* like native Linux — it uses DirectX's /dev/dxg for GPU passthrough. This means the standard NVIDIA k8s device plugin won't work. Instead, Ollama runs directly in WSL2 where it can access the GPU natively, and k3s pods communicate with it over the LAN.
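A quick way to confirm this once WSL2 is up (Step 1): the DirectX passthrough device is present, but there are no /dev/nvidia* nodes at all.

# Inside WSL2 Ubuntu
ls -l /dev/dxg
ls /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* devices"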


Step 1: Enable WSL2 with GPU Support

WSL2 needs Hyper-V enabled. If you have VirtualBox installed, Hyper-V may currently be turned off, so you may need to enable it explicitly:

# PowerShell as Administrator
dism /online /enable-feature /featurename:Microsoft-Hyper-V-All /all /norestart

Restart Windows, then install Ubuntu inside WSL2:

wsl --install -d Ubuntu-24.04

After creating a user, verify GPU passthrough works:

# Inside WSL2 Ubuntu
nvidia-smi

You should see your GTX 1050 Ti. If nvidia-smi is not found, update your NVIDIA Windows driver to the latest version — WSL2 GPU support requires driver 470+.
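If nvidia-smi works, a single query confirms the driver version and total VRAM in one line:

# Inside WSL2 Ubuntu
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv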


Step 2: Install Ollama

Inside WSL2 Ubuntu:

curl -fsSL https://ollama.com/install.sh | sh

By default Ollama listens on localhost only. To make it accessible from the k3s cluster, configure it to listen on all interfaces:

sudo systemctl stop ollama
sudo mkdir -p /etc/systemd/system/ollama.service.d
echo -e '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0"' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl start ollama
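Before moving on, it's worth checking that the override took effect and Ollama is actually listening on all interfaces. The /api/version endpoint just returns the installed version, so it makes a cheap connectivity test:

# Inside WSL2 Ubuntu
systemctl show ollama --property=Environment   # should include OLLAMA_HOST=0.0.0.0
sudo ss -tlnp | grep 11434                     # should show 0.0.0.0:11434
curl http://localhost:11434/api/version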

Pull a model:

ollama pull mistral

Mistral 7B (Q4 quantized) fits comfortably in 4GB VRAM, using about 3.2GB. It's fast, good at code, and solid for technical conversations.
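To see what actually lands in VRAM, load the model with a one-off prompt (it stays resident for a few minutes afterwards) and check:

# Inside WSL2 Ubuntu
ollama run mistral "Say hello in one short sentence."
ollama ps      # lists loaded models and whether they run on GPU or CPU
nvidia-smi     # VRAM usage should sit around 3.2GB with Mistral loaded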


Step 3: Expose Ollama to the LAN

WSL2 has its own virtual network — pods running on k3master can't reach WSL2's internal IP directly. We need to forward port 11434 from Windows to WSL2.

PowerShell as Administrator:

# Get WSL2's internal IP
wsl -d Ubuntu-24.04 hostname -I

# Forward the port (replace 172.x.x.x with the IP from above)
netsh interface portproxy add v4tov4 listenport=11434 listenaddress=0.0.0.0 connectport=11434 connectaddress=172.25.103.186

# Allow through firewall
netsh advfirewall firewall add rule name="Ollama" dir=in action=allow protocol=TCP localport=11434
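To double-check the forwarding rule (or review it later), the same portproxy context has a show subcommand:

# List active port proxies
netsh interface portproxy show v4tov4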

Verify from k3master:

curl http://192.168.100.203:11434/api/generate \
  -d '{"model":"mistral","prompt":"Hello","stream":false}' \
  -H "Content-Type: application/json"

You should get a JSON response with the model's reply.
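The reply text itself is in the response field; if jq is installed on k3master you can pull it out directly:

curl -s http://192.168.100.203:11434/api/generate \
  -d '{"model":"mistral","prompt":"Hello","stream":false}' \
  -H "Content-Type: application/json" | jq -r '.response'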


Step 4: Deploy Open WebUI on k3s

Open WebUI gives you a ChatGPT-like interface that talks to Ollama. Deploy it as a k3s pod with a MetalLB LoadBalancer:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      nodeSelector:
        kubernetes.io/hostname: k3master
      containers:
      - name: open-webui
        image: ghcr.io/open-webui/open-webui:main
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL
          value: "http://192.168.100.203:11434"
        resources:
          requests:
            memory: 1Gi
          limits:
            memory: 3Gi
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
  annotations:
    metallb.universe.tf/loadBalancerIPs: "192.168.100.204"
spec:
  type: LoadBalancer
  selector:
    app: open-webui
  ports:
  - port: 80
    targetPort: 8080

Apply and wait for it to start:

sudo kubectl apply -f open-webui.yaml
sudo kubectl get pods -w

Note: Open WebUI's container image is about 2GB and it downloads additional files on first startup. Give it a few minutes and make sure the memory limit is at least 3Gi — it will get OOMKilled at 1Gi.

Once running, open http://192.168.100.204 in your browser. Create an account (it's local, no external service), and start chatting with Mistral.
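If the page doesn't load, check that MetalLB handed out the requested address and that the pod is actually Ready:

sudo kubectl get svc open-webui                 # EXTERNAL-IP should be 192.168.100.204
sudo kubectl get pods -l app=open-webui
sudo kubectl logs deploy/open-webui --tail=50   # startup progress and errors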


What Fits in 4GB VRAM?

Not every model will fit on a GTX 1050 Ti. Here's what works:

Model                  Parameters   VRAM Usage   Good For
Mistral 7B (Q4)        7B           ~3.2GB       General chat, code, technical
Phi-3 Mini             3.8B         ~2.5GB       Fast responses, studying
CodeLlama 7B (Q4)      7B           ~3.2GB       Code generation, debugging
Deepseek Coder 6.7B    6.7B         ~3.2GB       Infrastructure and code
Llama 3.1 8B (Q4)      8B           ~3.8GB       General assistant

To try a different model:

# Inside WSL2
ollama pull phi3
ollama pull codellama

Then select it from the model dropdown in Open WebUI.
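ollama list shows everything that has been pulled and how much disk each model uses, which is worth checking before queuing another multi-gigabyte download:

# Inside WSL2 Ubuntu
ollama list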


Key Decisions

WSL2 instead of native Linux — The Windows PC is my daily driver. Dual-booting would give better GPU performance but would mean choosing between Windows and the AI server. WSL2 lets both coexist.

Ollama outside k3s — WSL2's GPU passthrough uses /dev/dxg instead of /dev/nvidia*, which means the standard NVIDIA k8s device plugin can't detect the GPU. Running Ollama directly in WSL2 and exposing it as a LAN service is simpler and more reliable.

Port forwarding with netsh — WSL2's virtual network is isolated from the physical LAN. The netsh interface portproxy command bridges the gap without needing additional software.
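One caveat: WSL2's internal IP can change after a reboot, so the connectaddress in the rule can go stale. A small PowerShell sketch (assuming hostname -I returns the WSL2 address first) re-reads the IP and refreshes the rule:

# PowerShell as Administrator
$wslIp = (wsl -d Ubuntu-24.04 hostname -I).Trim().Split(" ")[0]
netsh interface portproxy delete v4tov4 listenport=11434 listenaddress=0.0.0.0
netsh interface portproxy add v4tov4 listenport=11434 listenaddress=0.0.0.0 connectport=11434 connectaddress=$wslIp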

3Gi memory for Open WebUI — The container downloads embedding models and builds internal indexes on first start. With only 1Gi, it gets killed by the OOM killer before finishing initialization.
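If you ever want to confirm that memory was the culprit, the container's last state records the termination reason:

sudo kubectl get pod -l app=open-webui \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'
# Prints OOMKilled when the limit was too low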


The Complete Homelab

With this addition, the homelab now runs four services across two machines:

Service      URL                            Machine             Purpose
Grafana      http://192.168.100.201         k3master (laptop)   Cluster monitoring
Guitar App   http://192.168.100.202         k3master (laptop)   Audio stem separation
Ollama API   http://192.168.100.203:11434   Windows PC (GPU)    LLM inference
Open WebUI   http://192.168.100.204         k3master (laptop)   AI chat interface

Four nodes, 32 ARM cores, an i5 with a GPU, and about 30GB of RAM in total — all serving homelab workloads, from Kubernetes-hosted monitoring to GPU-backed AI inference, without spending a dollar on cloud services.

