Deploy a Language Model on a Server with No GPU
3-Aug-2025

As someone exploring the frontier of local AI inference, I recently challenged myself to deploy a language model on a budget VPS (3 vCPUs, 8GB RAM) — completely GPU-free — using llama.cpp. The goal? Run a quantized LLM behind an OpenAI-compatible API using just CPU resources. Here’s how I did it, step-by-step.


Why llama.cpp?

llama.cpp is a C/C++ inference engine, originally built around Meta's LLaMA models, that is optimized for CPU inference. It supports quantization (the GGUF format), multi-threaded execution, and ships with a built-in server that acts as a drop-in OpenAI-compatible API.

It’s fast, minimal, and most importantly — doesn’t require a GPU.


My VPS Specs

I used a small virtual machine from Contabo. It did not cost much ($8), and server setup and login took only a few minutes.

  • 3 vCPUs
  • 8 GB RAM
  • Ubuntu 22.04 LTS
  • No GPU

This spec is tight, but a 4-bit quantized 7B model weighs roughly 4 GB, which leaves room in 8 GB of RAM for the KV cache and the OS. That's just enough to serve lightweight LLM tasks.


Install Dependencies

sudo apt update
sudo apt install -y build-essential cmake git

Then clone and build llama.cpp:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

This builds the llama.cpp binaries, including the bundled HTTP server.
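
Note: recent llama.cpp releases have deprecated the Makefile build in favor of CMake. If make refuses to build, the CMake equivalent is roughly the following (the server binary then ends up at build/bin/llama-server rather than ./server):

# configure and build in a separate build/ directory
cmake -B build
cmake --build build --config Release -j 3

# if CMake complains about CURL, install libcurl4-openssl-dev
# or pass -DLLAMA_CURL=OFF to the configure step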


Download a Quantized Model (GGUF)

For CPU use, you want a small quantized model. I chose a 7B parameter model quantized to Q4_0 using GGUF format.

Hugging Face hosts many ready-made GGUF quantizations (TheBloke's model repositories are a popular source). Download one into a models/ folder; the .gguf file is used as-is, with no extraction needed.

mkdir models
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_0.gguf -O models/mistral.gguf
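
As a quick sanity check, the downloaded file should be a single GGUF of roughly 4 GB for a Q4_0 7B model (exact size varies by model):

ls -lh models/mistral.gguf
# a tiny or zero-byte file usually means the download failed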

Run the llama.cpp Server

Start the server like so:

./server -m models/mistral.gguf --host 0.0.0.0 --port 8000

This starts an OpenAI-compatible endpoint at:

http://<your-vps-ip>:8000/v1/completions

You can also use --threads (or -t) to control how many CPU threads to use.
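
For example, on this 3 vCPU box I pin the thread count explicitly; the context size below is only an illustrative starting value, not a tuned setting:

# 3 threads to match the vCPUs, 2048-token context to keep memory use modest
./server -m models/mistral.gguf --host 0.0.0.0 --port 8000 --threads 3 --ctx-size 2048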


Test the Endpoint

Using curl:

curl http://<your-vps-ip>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "prompt": "Define \"curl\"",
    "max_tokens": 1000,
    "temperature": 0.7
  }'

Or connect via tools like Postman, LangChain, or any OpenAI client.
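
The server also exposes the OpenAI-style chat endpoint, so chat-oriented clients can point at it unchanged. A minimal curl sketch against /v1/chat/completions:

curl http://<your-vps-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [{"role": "user", "content": "Summarize what curl does in one line."}],
    "max_tokens": 200
  }'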


Performance Observations

With 3 CPUs and 8GB RAM:

  • A 4-bit quantized 7B model runs comfortably, though slower than GPU inference.
  • Latency per response: ~1–3 seconds for small completions.
  • Perfect for prototyping, chatbots, or even offline assistants with lightweight traffic.

Bonus: Securing the Endpoint

I used Nginx as a reverse proxy with HTTPS plus llama.cpp's --api-key option (see the sketch after this list), but you could also use:

  • Caddy with JWT or API-key middleware
  • Docker + Traefik
  • localhost + an SSH tunnel for private use
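
For the API-key part on its own (independent of the reverse proxy), a rough sketch looks like this; the key value is just a placeholder:

# start the server requiring an API key (placeholder value)
./server -m models/mistral.gguf --host 0.0.0.0 --port 8000 --api-key "change-me"

# clients then send the key as a bearer token
curl http://<your-vps-ip>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer change-me" \
  -d '{"model": "mistral", "prompt": "ping", "max_tokens": 5}'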

Running your own LLM API on a small VPS is not only possible — it's surprisingly practical with tools like llama.cpp. You control your model, your data, and your costs — no API fees, no throttling.

This setup won’t beat OpenAI’s GPT-4, but for basic completions, assistants, and custom apps, it’s powerful and private.