Setting up a local coding assistant on DGX Spark (4 months and still trying)

Coding with a local AI was something I was looking forward to when I bought my DGX Spark. I work alongside an Agentic Engineering specialist, so I’m well aware of how this can accelerate coding (and slow it down, depending on what practices you adopt). However getting it working acceptably on my DGX Spark was surprisingly challenging relative to using an Enterprise solution like Kiro, Codex or Claude Code.

In my day job I am spoilt with reliable MCP usage, insanely long context windows, the latest models and a bunch of really smart people around me with great ideas for using agentic coding assistants. The reality of what could/would work locally is totally different.

The Journey

On the suggestion of a colleague (and because it’s in the NVIDIA DGX Spark projects) Continue.dev was my first attempt. I tested the VSCode plugin with GPT-OSS:120B and Qwen3-Coder:30B using Ollama. Agentic actions were inconsistent and often failed entirely.

Ollama UI with Qwen3-coder:30B worked better. It was functional and fast, but tedious without autonomous task execution (everything was copy & paste).

I then tried Roo Code (a Cline fork) and this worked reasonably well, accepting OpenAI-compatible endpoints, including models running on llama.cpp. But I still could not get the performance or predictability I wanted from running models locally, and agentic actions were really unreliable.

I also tried llama-vscode but just could not work out how to configure and use it - this one I installed and uninstalled three times before giving up!

I tried Aider. I quite liked it as the system prompt is small and doesn’t take up too many valuable tokens, it also uses git diffs as a foundation which is useful when it does unintended things which you can easily roll back. But again I could not get agents to work, and found that Aider is very coding-task focused, lacking the planning & architectural side of things.

I tried the Claude Code VSCode plug-in but couldn’t get it to work without an Anthropic account or with a local LLM. I tried VSCode’s GitHub CoPilot extension too (which you have to explicitly turn off if you don’t want to use). I tried Claude Code CLI in a terminal window but was scratching my head to configure MCP and my models, plus it was slow. I tried to get Kiro CLI working but couldn’t get it working on my ARM64.

So after a couple of months I decided to give it a break. Maybe it was the models themselves which weren’t ready?

The Good Enough Workaround

As a workaround I started using a rotation of ChatGPT, Claude & Gemini for architectural and complex work, until my (free plan) tokens were consumed, and Qwen-3-Coder via llama-server’s Web UI for simpler requirements, like updating or creating html files.

Sometimes I would spin up Aider at the same time and access it via a terminal in VSCode running Qwen3-Coder-30B-3B 4 bit AWQxGGUF quant on llama.cpp.

This was OK, if a bit of a hacky cut & paste experience, mainly because I only get to play with my DGX Spark outside working hours, so the free limits are enough for the time I have available.

Back to Roo Code with vLLM (failed)

In February I added vLLM to my environment, and started to see better performance when running models this way. I thought it might be time to revisit this when I went back to the Roo Code website and saw they have easy MCP installation, plus I had just downloaded Qwen-3-Coder-Next FP8 to run in vLLM. I reinstalled the Roo Code VSCode extension in VSCode. It worked - OK. But it was quite slow, got confused numerous times and I didn’t fully trust it. If you look at the current Roo Code extension reviews, it seems a mixed bag. And the same goes for Continue.dev. There’s obviously still a massive difference between coding agents using frontier models on high spec GPUs and those smaller open source models trying to work on a local GPU - even if it is NVIDIA Blackwell.

This is what Open Web UI says about local models:

KISS applies in this domain (Keep it Simple Stupid!)

What I have noticed when trying to run models locally is that as soon as a massive system prompt and agent orchestration are involved - for example when you are running them via an IDE extension, they invariably slow down and get confused. The best performance I have seen is when I’ve run them in a web UI kind of environment where I have controlled the sequence of questions and steps.

And giving them not too challenging tasks - do not expect a local LLM to refactor across multiple code modules and get it right.

For now I have decided to continue with the web UI-based approach. As a partial upgrade to my system I have installed Open Web UI and connected to my coding model (Qwen3-Coder-Next-FP8 at the moment) running in vLLM. This is a big model with 80B parameters but uses Mixture of Experts with 3B activated, so it operates at a reasonable speed of 17 tokens/s and the Open WebUI is really nice to use:

It’s a decent model but it can’t cope with a large codebase, lots of modules or following a complicated train of thought. It’s still - sadly - better to reserve your ChatGPT, Claude, Gemini free tier tokens for the more difficult work.