Skip to main content

Command Palette

Search for a command to run...

Alright then...adding vLLM to the mix

Published
4 min read
Alright then...adding vLLM to the mix
A

Exploring AI in my free time while holding down a job working with startups at one of the Magnificent Seven

A few months ago I started using my DGX Spark, and coming in with minimal experience of running local models, I wanted a simple and easy to understand inference engine (as I wrote in this article). Plus I could see the experts on the DGX Spark user group spending hours getting vLLM to work properly, and they were going into details which were waaay beyond my league. So I installed llama.cpp, which has worked really well for my needs so far.

But recently I noticed that some of the things I wanted to do on my Spark were limited by the fact I only use llama.cpp. For example, the consensus on the NVIDIA user group is that AWQ quantisations are fastest on DGX Spark but these are mostly not available as GGUF. And the latest models usually take time to make it into GGUF format. Most of the users on the group seem to use vLLM so I felt like I should explore it for at least some of my use cases - in particular I am planning my first LoRA training, with a longish data extraction run required, and I wanted to use the most robust solution.

It was time for a model cleanup anyway, so I cleared out a bunch of stale GGUFs which weren’t being used, set aside a few hours on a rainy afternoon and went ahead and installed vLLM. Luckily the DGX spark community has made great progress in stabilising the container image since the Spark launched, so the process of installing is now relatively easy:

  1. A power contributor in the user forums (eugr) has created a repo with the container image and installation instructions. Your first step should be to install this according to his README.

  2. On that repo are recipes for commonly used models which will launch with preconfigured default parameters.

  3. I wanted to start with my poster project so installed the container and set it to run with: launch-cluster.sh --solo exec vllm serve QuantTrio/Qwen3-VL-32B-Instruct-AWQ --port 8004 --host 0.0.0.0 --gpu-memory-utilization 0.5 --load-format fastsafetensors --max_model_len 32000

vLLM gives you speed but consumes a lot of system memory!

I started by testing Qwen3-VL which I am using to analyse the text on poster images.

I first used the AWQ quantisation of the 32B parameter version of this model, which did a stellar job analysing poster text layout and style, and returning this successfully in JSON. As you’ll see above in the run command I restricted the memory utilisation, but it still used 76GB of system RAM, so this isn’t really an approach that will work well to run multiple models at the same time (the model itself is 20GB). The results are below - the analysis took 76 seconds and all the text was successfully extracted.

I also tried the same analysis using llama.cpp and the Unsloth 8 bit quantisation (with F32 mmproj file). The launch command was {llama-dir}/llama-server -m {model-dir}/Qwen/Qwen3-VL-8B-Thinking-Q8_0.gguf -c 32000 -b 8096 --no-mmap --jinja --flash-attn on --kv-unified --mmproj {model-dir}/Qwen/mmproj-F32.gguf -ctk q4_0 -ctv q4_0 -ngl 999 -ub 8096 --host 0.0.0.0 --port 8004.

Llama.cpp required a lot less system memory:

But there were some significant downsides:

  1. The speed of analysis was a lot longer - for the image above the time required was 142 seconds (twice as long as vLLM running a 4x larger model).

  2. It missed the word wrap on the main poster headline which is a problem for my use case.

  3. During inference of a second image - not particularly complicated from a text extraction perspective - my Spark completely crashed to black with what I assume was an OOM error. This had also happened previously when running inference on a very wordy poster image using llama.cpp. And my Spark was extremely hot.

My conclusion from this is that vLLM manages the inference process more gracefully for this use case. For the Glastonbury poster, even though it took 402s, it was able to manage the longer inference and produce an output successfully, possibly due to the reserved memory parameter which would avoid Out of Memory problems. Llama.cpp by contrast crashed my system. I subsequently sent multiple images through the vLLM model with no problem, so that will be my approach to extract the data for the LoRA training I am planning.