As someone exploring the frontier of local AI inference, I recently challenged myself to deploy a language model on a budget VPS (3 vCPUs, 8GB RAM) — completely GPU-free — using llama.cpp. The goal? Run a quantized LLM behind an OpenAI-compatible API using just CPU resources. Here's how I did it, step by step.
What is llama.cpp?

llama.cpp is a C++ implementation of Meta's LLaMA models optimized for CPU inference. It supports quantization (GGUF format), multi-threaded execution, and even serves as a drop-in OpenAI API replacement using llama.cpp/server.
It’s fast, minimal, and most importantly — doesn’t require a GPU.
I used a small virtual machine from Contabo. It did not cost much ($8), and setup and first login took only a few minutes.
This spec is tight, but with quantized models, it’s just enough to serve lightweight LLM tasks.
sudo apt update
sudo apt install -y build-essential cmake git
Then clone and build llama.cpp:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
This builds llama.cpp with CPU-only support by default.
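Note that more recent llama.cpp revisions have deprecated the Makefile build in favor of CMake. If make fails on the commit you cloned, a build along these lines should work (binary names and output paths can differ between versions):

# CMake-based build used by newer llama.cpp revisions
cmake -B build
cmake --build build --config Release -j
# Binaries land under build/bin/ (e.g. build/bin/llama-server)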
For CPU use, you want a small quantized model. I chose a 7B-parameter model quantized to Q4_0 in GGUF format.
Sites like Hugging Face (for example, TheBloke's GGUF repositories) provide many options. Download a model into a models/ folder.
mkdir models
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_0.gguf -O models/mistral.gguf
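Before starting the server, it's worth a quick sanity check that the file downloaded completely and will fit in memory; at Q4_0, a 7B model is roughly 4GB, which leaves headroom on an 8GB machine:

# Confirm the download and check available RAM before loading the model
ls -lh models/mistral.gguf
free -h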
Start the llama.cpp Server

Start the server like so:
./server -m models/mistral.gguf --host 0.0.0.0 --port 8000
This starts an OpenAI-compatible endpoint at:
http://<your-vps-ip>:8000/v1/completions
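Once it's up, you can confirm the server is responding; recent server builds expose a simple health endpoint (if yours doesn't, any request to the completions endpoint works as a smoke test):

# Quick liveness check against the running server
curl http://<your-vps-ip>:8000/health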
You can also use --threads (or -t) to control how many CPU threads to use.
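On this 3 vCPU box, for example, matching the thread count to the core count avoids oversubscription:

# Pin the server to 3 threads to match the 3 vCPUs
./server -m models/mistral.gguf --host 0.0.0.0 --port 8000 --threads 3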
Using curl:
curl http://<your-vps-ip>:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "mistral",
"prompt": "Define 'curl'",
"max_tokens": 1000,
"temperature": 0.7
}'
Or connect via tools like Postman, LangChain, or any OpenAI client.
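Most OpenAI client libraries actually talk to the chat completions endpoint, which the llama.cpp server also exposes; a chat-style request looks like this (the server answers with whatever GGUF it was started with, regardless of the model field):

curl http://<your-vps-ip>:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "mistral",
"messages": [{"role": "user", "content": "Define curl in one sentence"}],
"max_tokens": 200
}'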
With 3 vCPUs and 8GB of RAM, performance is modest, but it's enough for the lightweight tasks this setup targets.
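If you want a rough sense of throughput on your own box, the simplest check is to time a fixed-length completion:

# Rough throughput check: time a 128-token completion
time curl -s http://<your-vps-ip>:8000/v1/completions -H "Content-Type: application/json" -d '{
"prompt": "Write a short haiku about servers",
"max_tokens": 128
}' > /dev/null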
I used Nginx as a reverse proxy with HTTPS and llama.cpp's --api-key option, but you could also use Caddy with JWT or API key middleware.
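As a rough sketch of the --api-key route (the domain and key below are placeholders), keep the server bound to localhost and let the proxy terminate HTTPS in front of it:

# Bind to localhost behind the proxy and require an API key
./server -m models/mistral.gguf --host 127.0.0.1 --port 8000 --threads 3 --api-key "change-me"

# Clients then authenticate with a bearer token through the proxy
curl https://<your-domain>/v1/completions -H "Authorization: Bearer change-me" -H "Content-Type: application/json" -d '{
"prompt": "Hello",
"max_tokens": 32
}'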
Running your own LLM API on a small VPS is not only possible; it's surprisingly practical with tools like llama.cpp. You control your model, your data, and your costs: no API fees, no throttling.
This setup won’t beat OpenAI’s GPT-4, but for basic completions, assistants, and custom apps, it’s powerful and private.