Running a Local AI Model on My Homelab Kubernetes Cluster
Adding GPU-accelerated AI inference to a k3s cluster using Ollama, a GTX 1050 Ti, and Open WebUI — no cloud, no subscriptions, no data leaving the network.
Why Run AI Locally?
Cloud AI services are convenient, but they come with recurring costs, rate limits, and the fact that every prompt you send leaves your network. For a homelab, running a local model means full control — the data stays on your machines, it works offline, and once it's set up, it costs nothing to run.
I already had a k3s cluster built from old phones and a laptop handling monitoring and web apps. Adding local AI inference was the logical next step — I just needed a GPU.
The Hardware
My daily driver Windows PC had a spare GPU doing nothing most of the day:
- CPU: Intel i5-10400F — 6 cores, 12 threads
- RAM: 32GB DDR4
- GPU: NVIDIA GeForce GTX 1050 Ti — 4GB VRAM
The GTX 1050 Ti is far from a data center GPU, but 4GB of VRAM is enough to run quantized 7B-parameter models at usable speed — roughly early-ChatGPT-level capability, which is more than sufficient for a DevOps study assistant, code helper, or general Q&A tool.
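A rough back-of-envelope check shows why (assuming about 4 bits per weight for Q4 quantization):
# 7B parameters × ~4 bits/weight ≈ 7,000,000,000 × 0.5 bytes ≈ 3.5GB of weights
# which leaves a few hundred MB of the 4GB card for the KV cache and CUDA overhead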
Architecture
Windows PC (192.168.100.203)
└── WSL2 (Ubuntu 24.04)
    ├── Ollama (GPU inference, port 11434)
    └── NVIDIA GPU passthrough via /dev/dxg

k3s Cluster (192.168.100.52)
├── Open WebUI pod → talks to Ollama API
├── MetalLB → exposes WebUI at 192.168.100.204
└── k3s ExternalName service → routes to Ollama
The trick is that WSL2 doesn't expose /dev/nvidia* like native Linux — it uses DirectX's /dev/dxg device for GPU passthrough. This means the standard NVIDIA Kubernetes device plugin won't work. Instead, Ollama runs directly in WSL2, where it can access the GPU natively, and k3s pods communicate with it over the LAN.
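A quick check from inside WSL2 confirms this: the DirectX device node is present, while the usual NVIDIA nodes are not.
# Inside WSL2 Ubuntu
ls -l /dev/dxg
ls /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* nodes"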
Step 1: Enable WSL2 with GPU Support
WSL2 needs Hyper-V enabled. If you have VirtualBox installed, Hyper-V may be turned off, so you may need to enable it explicitly:
# PowerShell as Administrator
dism /online /enable-feature /featurename:Microsoft-Hyper-V-All /all /norestart
Restart Windows, then install Ubuntu inside WSL2:
wsl --install -d Ubuntu-24.04
After creating a user, verify GPU passthrough works:
# Inside WSL2 Ubuntu
nvidia-smi
You should see your GTX 1050 Ti. If nvidia-smi is not found, update your NVIDIA Windows driver to the latest version — WSL2 GPU support requires driver 470 or newer.
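It's also worth confirming that the distro actually runs under WSL version 2, since WSL 1 has no GPU passthrough:
# PowerShell
wsl -l -v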
Step 2: Install Ollama
Inside WSL2 Ubuntu:
curl -fsSL https://ollama.com/install.sh | sh
By default Ollama listens on localhost only. To make it accessible from the k3s cluster, configure it to listen on all interfaces:
sudo systemctl stop ollama
sudo mkdir -p /etc/systemd/system/ollama.service.d
echo -e '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0"' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl start ollama
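To confirm the override took effect, check that Ollama is now bound to all interfaces rather than just localhost:
# Should show 0.0.0.0:11434 (or *:11434), not 127.0.0.1:11434
ss -tlnp | grep 11434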
Pull a model:
ollama pull mistral
Mistral 7B (Q4 quantized) fits comfortably in 4GB VRAM, using about 3.2GB. It's fast, good at code, and solid for technical conversations.
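A quick sanity check that inference actually lands on the GPU: run a prompt in one terminal and watch VRAM in another.
# Inside WSL2
ollama run mistral "Explain what a Kubernetes pod is in one sentence."
# In a second terminal: VRAM usage should jump by roughly 3GB
watch -n 1 nvidia-smi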
Step 3: Expose Ollama to the LAN
WSL2 has its own virtual network — pods running on k3master can't reach WSL2's internal IP directly. We need to forward port 11434 from Windows to WSL2.
PowerShell as Administrator:
# Get WSL2's internal IP
wsl -d Ubuntu-24.04 hostname -I
# Forward the port (replace 172.x.x.x with the IP from above)
netsh interface portproxy add v4tov4 listenport=11434 listenaddress=0.0.0.0 connectport=11434 connectaddress=172.25.103.186
# Allow through firewall
netsh advfirewall firewall add rule name="Ollama" dir=in action=allow protocol=TCP localport=11434
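One caveat: WSL2 gets a new internal IP on every reboot, so this portproxy rule can silently go stale. Listing and replacing rules looks like this:
# Show current forwarding rules
netsh interface portproxy show all
# After a reboot, delete the stale rule and re-add it with the new WSL2 IP
netsh interface portproxy delete v4tov4 listenport=11434 listenaddress=0.0.0.0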
Verify from k3master:
curl http://192.168.100.203:11434/api/generate \
-d '{"model":"mistral","prompt":"Hello","stream":false}' \
-H "Content-Type: application/json"
You should get back a JSON object whose response field contains the model's reply.
Step 4: Deploy Open WebUI on k3s
Open WebUI gives you a ChatGPT-like interface that talks to Ollama. Deploy it as a k3s pod with a MetalLB LoadBalancer:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      nodeSelector:
        kubernetes.io/hostname: k3master
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          env:
            - name: OLLAMA_BASE_URL
              value: "http://192.168.100.203:11434"
          resources:
            requests:
              memory: 1Gi
            limits:
              memory: 3Gi
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
  annotations:
    metallb.universe.tf/loadBalancerIPs: "192.168.100.204"
spec:
  type: LoadBalancer
  selector:
    app: open-webui
  ports:
    - port: 80
      targetPort: 8080
Apply and wait for it to start:
sudo kubectl apply -f open-webui.yaml
sudo kubectl get pods -w
Note: Open WebUI's container image is about 2GB and it downloads additional files on first startup. Give it a few minutes and make sure the memory limit is at least 3Gi — it will get OOMKilled at 1Gi.
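To follow the first-start initialization instead of guessing, tail the pod logs:
sudo kubectl logs deployment/open-webui -f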
Once running, open http://192.168.100.204 in your browser. Create an account (it's local, no external service), and start chatting with Mistral.
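The architecture diagram mentions a cluster service that routes to Ollama. An ExternalName service strictly expects a DNS name rather than an IP, so for an IP target the usual pattern is a selector-less Service plus a manual Endpoints object. Here's a sketch (the service name ollama is my choice, not something already in the cluster):
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  ports:
    - port: 11434
---
apiVersion: v1
kind: Endpoints
metadata:
  name: ollama              # must match the Service name
subsets:
  - addresses:
      - ip: 192.168.100.203 # the Windows host running Ollama
    ports:
      - port: 11434
With that in place, OLLAMA_BASE_URL in the deployment could point at http://ollama:11434 instead of a hard-coded LAN IP.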
What Fits in 4GB VRAM?
Not every model will fit on a GTX 1050 Ti. Here's what works:
| Model | Parameters | VRAM Usage | Good For |
|---|---|---|---|
| Mistral 7B (Q4) | 7B | ~3.2GB | General chat, code, technical |
| Phi-3 Mini | 3.8B | ~2.5GB | Fast responses, studying |
| CodeLlama 7B (Q4) | 7B | ~3.2GB | Code generation, debugging |
| DeepSeek Coder 6.7B | 6.7B | ~3.2GB | Infrastructure and code |
| Llama 3.1 8B (Q4) | 8B | ~3.8GB | General assistant |
To try a different model:
# Inside WSL2
ollama pull phi3
ollama pull codellama
Then select it from the model dropdown in Open WebUI.
Key Decisions
WSL2 instead of native Linux — The Windows PC is my daily driver. Dual-booting would give better GPU performance but would mean choosing between Windows and the AI server. WSL2 lets both coexist.
Ollama outside k3s — WSL2's GPU passthrough uses /dev/dxg instead of /dev/nvidia*, which means the standard NVIDIA k8s device plugin can't detect the GPU. Running Ollama directly in WSL2 and exposing it as a LAN service is simpler and more reliable.
Port forwarding with netsh — WSL2's virtual network is isolated from the physical LAN. The netsh interface portproxy command bridges the gap without needing additional software.
3Gi memory for Open WebUI — The container downloads embedding models and builds internal indexes on first start. With only 1Gi, it gets killed by the OOM killer before finishing initialization.
The Complete Homelab
With this addition, the homelab now runs four services across two machines:
| Service | URL | Machine | Purpose |
|---|---|---|---|
| Grafana | http://192.168.100.201 | k3master (laptop) | Cluster monitoring |
| Guitar App | http://192.168.100.202 | k3master (laptop) | Audio stem separation |
| Ollama API | http://192.168.100.203:11434 | Windows PC (GPU) | LLM inference |
| Open WebUI | http://192.168.100.204 | k3master (laptop) | AI chat interface |
Four nodes, 32 ARM cores, an i5 with a GPU, about 30GB of RAM total — all running Kubernetes workloads from monitoring to AI inference, without spending a dollar on cloud services.
Credits
- Ollama — Local LLM runtime with GPU acceleration
- Open WebUI — Self-hosted ChatGPT-like interface
- Mistral AI — The Mistral 7B model