Training a LoRA with Unsloth on DGX Spark

After a long detour through manually reviewing and finessing my training data, this weekend I was finally ready to train the LoRAs for my poster generation project. I have two planned:

One to generate the image prompt for Flux Klein, based on the user's input.
One to generate the text layout based on the text the user has entered (bounding box size & position for each text element).

I decided to use Unsloth for no particular reason other than that I knew of them and a colleague had suggested they work well and train faster. Initially I tried the Unsloth DGX Spark guide, which uses their Docker container, but this crashed. I then tried the NVIDIA version, but this did not work either.

What did work was installing the newly released Unsloth Studio following the instructions in their QuickStart. Unfortunately it does download llama.cpp which I already have on my system, but I figure it's not too big and might be useful in the future to use llama.cpp from this environment.

I needed to install Flash Attention into this environment, which takes hours. The instructions for doing this on a DGX Spark are in this post: Running vLLM-Omni for Qwen3-TTS(voice design, voice clone) on DGX Spark. You can restrict the build to the DGX Spark's Blackwell architecture:

export FLASH_ATTN_CUDA_ARCHS="120"

Training Data

If your training data is not in the correct format, you can use Unsloth Studio's data recipes to refine it. I used Claude Code to extract the subset of data needed to train the LoRA and put it in the correct format. I am training text-to-text so have instruction/output pairs in the alpaca instruction format: {"instruction": "", "input": "", "output": ""}. You can read more about Unsloth data formats on their website. Below is an example of a training data record in my data set.

{
    "instruction": {
      "canvas_width": 1028,
      "canvas_height": 1387,
      "orientation": "portrait",
      "user_subject": "An elegant poster for a dairy brand featuring a woman and a cow, emphasizing the freshness of their milk.",
      "design_brief": "The style should be strictly Art Nouveau with ornate borders and dominant curved forms, capturing the sophisticated Belle Époque aesthetic.",
      "typography_requirements": "Layout contains: dominant display text (headline): 6 words, 2 line(s), ~34 chars; dominant display text (headline): 7 words, 3 line(s), ~47 chars; dominant display text (headline): 9 words, 1 line(s), ~57 chars; dominant display text (headline): 1 words, 1 line(s), ~8 chars."
    },
    "input":"",
    "output": {
    "final_image_prompt": "Art Nouveau lithograph poster illustration, early 20th century. A young girl with a soft bob sits on the right in a wooden chair with woven backrests, wearing a flowing red dress and holding a decorative bowl to her lips while facing slightly left; along the left side forms a vertical column of three cats where a striped tabby stands on hind legs reaching toward her, beneath it a black cat sits upright tilting its head up, and at the bottom a calico cat faces left with its head turned upward toward the girl. The scene rests on light wood grain flooring against a warm aged cream background dominated by vibrant red, earthy browns, and subtle tonal variations showing slight cracking textures characteristic of lithographic prints. Soft diffused lighting casts gentle shadows creating an elegant Belle Époque atmosphere with fluid stylized curves and elongated forms. No text. No lettering. No typography.",
      "style_tags": ["art_nouveau", "lithograph", "elegant", "advertising"],
      "font_style_tags": "Elegant high-contrast script typeface with flowing curves, decorative swashes, and calligraphic flourishes, designed for display use with organic movement; Classic serif typeface with moderate contrast and clean, legible strokes, rendered in a smaller size for informational text",
      "palette": {
        "primary": [0.76, 0.25, 0.17, 1.0],
        "secondary": [0.85, 0.77, 0.68],
        "accent": [0.79, 0.53, 0.24, 1.0],
        "background_tone": "warm",
        "font_on_light": [0.1, 0.1, 0.1, 1.0],
        "font_on_dark": [0.85, 0.77, 0.68, 1.0]
      }
    }
  }
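The record above stores instruction and output as nested objects, whereas Alpaca-style fields are usually consumed as plain strings. Here is a minimal sketch of one way to flatten a record before writing it as a JSONL line, assuming the nested fields are simply serialized as JSON text. The helper name to_alpaca_line is hypothetical and this is not necessarily how my own data was prepared.

```python
import json

def to_alpaca_line(record: dict) -> str:
    """Serialize one record as a JSONL line with string-valued Alpaca fields."""
    flat = {}
    for key in ("instruction", "input", "output"):
        value = record.get(key, "")
        # Nested objects become JSON text; plain strings pass through unchanged.
        flat[key] = value if isinstance(value, str) else json.dumps(value, ensure_ascii=False)
    return json.dumps(flat, ensure_ascii=False)

# Shortened, illustrative record (not a full row from my data set).
example = {
    "instruction": {"canvas_width": 1028, "orientation": "portrait"},
    "input": "",
    "output": {"style_tags": ["art_nouveau", "lithograph"]},
}
line = to_alpaca_line(example)
```

Writing one such line per record gives a JSONL file where every field is a string, which is the safest shape to hand to a chat template.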

Training a LoRA with Unsloth Studio

Unsloth Studio is a simple UI where you can configure and kick off training runs. Below is the screenshot from my first LoRA training (yep it did take 7h!). I am creating LoRAs for Qwen3.5-9B initially as this will be faster at inference. However if the quality is not high enough, I may switch to ~~Qwen3.5-27B~~ instead (actually now looking at Qwen3.5-35B-A3B - see below).

When setting up your training run, you upload your training data (including from recipes), then select the model, the type of training and the training hyperparameters. My hyperparameter selection mainly came from Claude Code; you can see the selections below. I trained a 4-bit quantized LoRA, or QLoRA.

I have 3114 training pairs. My first LoRA only required a 1024 context window, whereas my second required 2048. Consequently my training time was 7 hours for the first LoRA and 17 hours for the second. Not sure whether that would be classified as un-slothed, but maybe I should just be pleased I can do this on my desk in the first place!

Weirdly, I came back to do a second training run on Qwen2.5-14B Instruct after finding Qwen3.5-9B a bit meh at following instructions. With identical training data 2 days later, it only took 4h for the first LoRA and 12h for the second. I believe Unsloth had shipped some updates in the meantime; either way it was a lot quicker the second time around. Here's the system resource usage while training:

Loading and Using the LoRA

vLLM

To export the LoRA you select the Export tab, choose the training run and the checkpoint you want to export. In both my training runs the loss increased quite a bit towards the end, so I used an earlier checkpoint with the lowest loss.
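Picking the checkpoint with the lowest loss can also be done from the training logs rather than by eye. The sketch below assumes a Hugging Face Trainer style trainer_state.json with a log_history list. Whether Unsloth Studio exposes that file for a given run is an assumption, and the loss values here are made up for illustration.

```python
import json
from pathlib import Path

def best_logged_step(log_history: list[dict], key: str = "loss") -> tuple[int, float]:
    """Return (step, value) for the lowest logged value of `key`."""
    points = [(e["step"], e[key]) for e in log_history if key in e and "step" in e]
    if not points:
        raise ValueError(f"no entries with {key!r} in log history")
    return min(points, key=lambda p: p[1])

def best_step_from_file(trainer_state_path: str, key: str = "loss") -> tuple[int, float]:
    """Same as above, reading log_history from a trainer_state.json file."""
    state = json.loads(Path(trainer_state_path).read_text())
    return best_logged_step(state["log_history"], key)

# Illustrative numbers only.
history = [
    {"step": 100, "loss": 0.92},
    {"step": 200, "loss": 0.61},
    {"step": 300, "loss": 0.74},
    {"step": 300, "eval_loss": 0.70},  # has no "loss" key, so it is skipped
]
print(best_logged_step(history))  # (200, 0.61)
```

Training loss is noisy from step to step, so if an evaluation loss is logged, passing key="eval_loss" is probably the better signal.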

To run vLLM I am using the DGX Spark community container located here. I had some challenges getting vLLM to load LoRAs for Qwen3.5, so you need to make sure you update vLLM to the latest version: ./build-and-copy.sh --rebuild-vllm.

My launch command is as follows:

vllm serve Qwen/Qwen3.5-9B \
--gpu-memory-utilization 0.4 \
--max-model-len 16384 \
--max-num-seqs 40 \
--max-num-batched-tokens 65536 \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-lora \
--trust-remote-code \
--dtype bfloat16 \
--attention-backend flashinfer \
--uvicorn-log-level info \
--lora-modules lora-name-1={path to first LoRA} lora-name-2={path to second LoRA} \
--port 8005

Key parameters are:

--gpu-memory-utilization: I want to run other processes as well as inference, so I am keeping it relatively low. For my use case there aren't many tokens per request and each request is new, so it doesn't need much memory.
--enable-lora: tells vLLM that one or more LoRAs will be used.
--lora-modules: provides the paths to the LoRAs and their labels.

To use the LoRAs, you specify the name of the one you want at inference in the API call: {"model": "lora-name-1", ...} or {"model": "lora-name-2", ...}. In this way vLLM hosts multiple LoRAs and dynamically applies them at inference.
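As a rough illustration of selecting an adapter per request, the sketch below builds a chat completion request against vLLM's OpenAI-compatible endpoint using only the standard library. The port and adapter name match the launch command above. The prompt, the temperature and the helper name build_chat_request are placeholders of mine.

```python
import json
import urllib.request

def build_chat_request(base_url: str, lora_name: str, prompt: str) -> urllib.request.Request:
    """Build a POST request that asks vLLM to answer with the named LoRA."""
    payload = {
        "model": lora_name,  # the label given in --lora-modules selects the adapter
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,  # placeholder value
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:8005", "lora-name-1", "A poster for a jazz festival")

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```

Switching to the second adapter is then just a matter of passing "lora-name-2" as the model name in the next request.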

Llama.cpp

You can also use llama.cpp, and depending on what you are doing this might be better. For example, if you are only running one inference call at a time, it's a lot faster to load than vLLM. vLLM is really great if you can manage to construct your pipeline to hit it with parallel requests, but if your calls are strictly sequential llama.cpp might suit better.

With llama.cpp you need to convert the LoRA to GGUF format first:

python /path/to/llama.cpp/convert_lora_to_gguf.py \
    --base-model-dir /path/to/base_model \
    --lora-model-dir /path/to/lora_adapter \
    --outfile /path/to/output_lora.gguf

Then provide the GGUF LoRA path/s when loading the model in llama-server:

/path/to/llama-server \
-m /path/to/model.gguf \
--host 0.0.0.0 --port 8004 \
-ngl 999 -c 128000 -b 8096 -ub 8096 \
--parallel 4 \
--flash-attn on \
--no-mmap \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--jinja \
--lora /path/to/lora1_design.gguf,/path/to/lora2_layout.gguf \
--lora-init-without-apply

Then, when calling the API, you nominate the scale for each adapter to activate or deactivate it. The example below activates the first LoRA in the load command and deactivates the second one: {"lora": [{"id": 0, "scale": 1.0}, {"id": 1, "scale": 0.0}]}

💡

I found that when swapping LoRAs in llama.cpp it was necessary to explicitly turn 'off' the one not required by setting the scale to 0.0 at the same time as turning 'on' the required LoRA with scale 1.0. When I didn't do this the supposedly inactive LoRA appeared to still be affecting the active LoRA and changing the output.
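To avoid forgetting the explicit 0.0, the scale list can be generated so that every loaded adapter always gets a value. This is a small sketch, assuming adapter ids follow the order of the paths passed to --lora. The helper name lora_scales and the example message are mine.

```python
def lora_scales(active_id: int, adapter_count: int) -> list[dict]:
    """Return a scale entry for every adapter: 1.0 for the active one, 0.0 for the rest."""
    if not 0 <= active_id < adapter_count:
        raise ValueError("active_id must index one of the loaded adapters")
    return [{"id": i, "scale": 1.0 if i == active_id else 0.0} for i in range(adapter_count)]

# Request body fragment that turns on the second adapter and turns off the first.
payload = {
    "messages": [{"role": "user", "content": "Lay out this headline"}],
    "lora": lora_scales(active_id=1, adapter_count=2),
}
print(payload["lora"])  # [{'id': 0, 'scale': 0.0}, {'id': 1, 'scale': 1.0}]
```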

LoRA Performance

In my initial experiments, both LoRAs seemed to improve the output of the model relative to base. This was particularly noticeable in the layout task I have implemented, which defines bboxes for text layout. Without the LoRA, the model ignored any prompts about the bbox coordinates having to be between 0-1000 (to stay visible), whereas the LoRA always returned visible bboxes. There was also better variety in image prompt creation when the design LoRA was loaded.
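A check like the following makes the "visible bbox" comparison between base model and LoRA measurable. It is only a sketch: the [x_min, y_min, x_max, y_max] layout and the element names are hypothetical, not my actual output schema. The only thing taken from the task is the 0-1000 coordinate range.

```python
def bbox_is_visible(bbox, lo=0, hi=1000):
    """True when the box lies fully inside the canvas range and has positive area."""
    x_min, y_min, x_max, y_max = bbox
    return lo <= x_min < x_max <= hi and lo <= y_min < y_max <= hi

def invisible_boxes(layout):
    """Names of text elements whose box falls outside the visible range."""
    return [name for name, bbox in layout.items() if not bbox_is_visible(bbox)]

# Illustrative layout: the tagline spills past the bottom edge (y_max = 1040).
layout = {"headline": [80, 60, 920, 240], "tagline": [100, 900, 900, 1040]}
print(invisible_boxes(layout))  # ['tagline']
```

Running this over a batch of generations with and without the adapter gives a simple failure rate to compare.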

I wasn't completely happy with the overall output quality however, even from a base model perspective. So the next logical step was to look for a larger model...

Training LoRAs for large models (50-80 GB)

My next port of call was to find a SOTA model with a higher parameter count and a Mixture of Experts (MoE) architecture. This combination gives by far the best and fastest performance on the DGX Spark due to the smaller number of active parameters at inference. At the time of writing, the two top candidates are Qwen3.5-35B-A3B (71.9 GB at full precision) and Gemma-4-26B-A4B (51.6 GB at full precision).
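As a back-of-envelope check on those sizes, weight memory is roughly parameter count times bytes per parameter. The sketch below assumes 16 bits per parameter for "full precision" and uses the rounded parameter counts from the model names, so it only approximates the published file sizes and ignores activations, optimizer state and KV cache.

```python
def weight_size_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billion * bits_per_param / 8

print(weight_size_gb(35, 16))  # 70.0, close to the 71.9 GB quoted for Qwen3.5-35B-A3B
print(weight_size_gb(26, 16))  # 52.0, close to the 51.6 GB quoted for Gemma-4-26B-A4B
print(weight_size_gb(35, 4))   # 17.5, the rough footprint of a 4-bit load
print(weight_size_gb(3, 16))   # 6.0, weights touched per token with ~3B active parameters
```

The last line is the intuition behind the MoE speed-up: the whole model has to fit in memory, but each token only reads the active experts.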

For both of these models, however, the way Unsloth loads the model for BF16 training results in OOM before any training happens; the reasons for this are documented here. This means you need to use 4-bit QLoRA or a custom training setup.

At present, Gemma 4 fine tuning on DGX Spark using Unsloth is not supported (feature request here), so none of the LoRA options are available yet (4-bit or higher). That leaves Qwen3.5-35B-A3B, but Unsloth says "It is not recommended to do QLoRA (4-bit) training on the Qwen3.5 models, no matter MoE or dense, due to higher than normal quantization differences."

So I am using the custom container outlined on the DGX Spark community forum, which enables full 16-bit LoRA training without OOM crashes. I made the following changes to get it working on my Spark (with help from Claude Code):

## Dockerfile                                                                   
- Base image: change python from 25.10-py3 -> 26.03-py3 (required for DGX Spark OS 7.5.0)
- `TORCH_CUDA_ARCH_LIST: 12.0` -> `TORCH_CUDA_ARCH_LIST: 12.1` (correct arch for GB10 sm121)

## train.py                                                          
- Added `import torch`                                       
- Added `device_map={"": torch.cuda.current_device()}` to `FastModel.from_pretrained`
- Added `attn_implementation="sdpa"` to `FastModel.from_pretrained`
- Added `from transformers import DataCollatorForSeq2Seq`
- Added `DataCollatorForSeq2Seq` collator to `SFTTrainer` to handle variable-length batches
- Added `dataset_kwargs={"skip_prepare_dataset": True}` to SFTConfig since dataset is pre-tokenized
- Changed `optim="adamw_8bit"` to `optim="adamw_torch"` as bitsandbytes has no pre-compiled binary for CUDA 13.2

I also had to change the training data to use the ShareGPT format required by the container: {"conversations": [ {"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}. Claude Code wrote me a script to achieve this.
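The conversion itself is simple. The script Claude Code wrote for me is not reproduced here, so the following is a minimal sketch of the idea, assuming the system prompt is supplied separately and that non-string fields are serialized as JSON text. The function name and the example values are hypothetical.

```python
import json

def alpaca_to_conversations(record: dict, system_prompt: str) -> dict:
    """Convert one Alpaca-style record into the conversations format."""
    def as_text(value):
        # Strings pass through; nested objects become JSON text.
        return value if isinstance(value, str) else json.dumps(value, ensure_ascii=False)

    user_text = as_text(record["instruction"])
    extra = as_text(record.get("input", ""))
    if extra:
        # Append the optional Alpaca "input" field to the user turn.
        user_text = f"{user_text}\n\n{extra}"
    return {
        "conversations": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": as_text(record["output"])},
        ]
    }

# Illustrative record and system prompt.
record = {"instruction": "Describe the poster.", "input": "", "output": {"style_tags": ["art_nouveau"]}}
converted = alpaca_to_conversations(record, "You are a poster design assistant.")
```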

The good news is this solution is now working and fine tuning the model at about 26 seconds per iteration. It will take around 8 hours to complete the first LoRA - one hour more than the Unsloth Studio training on Qwen3.5-9B where I started. It's also worth noting that my system RAM is nicely contained below OOM territory despite the large size of this model:

By the way when dealing with a large model, even when using the eager mode container, check the model shard sizes. I started loading Gemma4 for fine tuning and kept seeing memory increasing massively — I thought the eager logic was not working. Instead it was due to the fact that for some reason Google have provided the model as two shards, the first of which is 49.9GB in size! After resharding it I was then able to load it for training (which didn’t work but that’s for another reason outlined below).
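A quick way to spot this before loading is to list the shard files and flag any that are unexpectedly large. This sketch assumes the weights are stored as .safetensors files directly inside the model directory. The 10 GB threshold is an arbitrary choice of mine.

```python
from pathlib import Path

def oversized_shards(model_dir: str, limit_gb: float = 10.0) -> list[tuple[str, float]]:
    """Return (file name, size in GB) for every shard larger than limit_gb."""
    results = []
    for path in sorted(Path(model_dir).glob("*.safetensors")):
        size_gb = path.stat().st_size / 1e9  # 1 GB = 1e9 bytes
        if size_gb > limit_gb:
            results.append((path.name, round(size_gb, 1)))
    return results

# Example call (path is a placeholder):
# for name, size in oversized_shards("/path/to/model_dir"):
#     print(f"{name}: {size} GB")
```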

A Gotcha with vLLM and LoRA for MoE models!

I finally trained both my LoRAs and went to use them with Qwen3.5-35B-A3B, only to discover that they didn't work. This turns out to be a common challenge for the combination of MoE models, LoRAs and vLLM specifically. I also tried with Gemma4 (now available) and the same thing happened: basically garbage output.

Attempt 1: Qwen3.5-35B-A3B (MoE) with runtime LoRA in vLLM

Plan: train a LoRA, mount it in vLLM at serve time, swap adapters per request.

Training worked: Qwen3_5MoeForConditionalGeneration loaded with the _EagerSafeOpen patch. Unsloth’s FastModel.get_peft_model produced two adapters: one via target_modules for attention/shared-expert, one via target_parameters for the fused mlp.experts.gate_up_proj / mlp.experts.down_proj tensors. Merge-and-save ran end-to-end, bf16 merged model generated coherent text in transformers.

Serving failed: vLLM’s LoRA loader only parsed target_modules. target_parameters was silently ignored — adapter output was indistinguishable from the base.
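Because the failure is silent, it is worth inspecting the adapter before serving it. The sketch below assumes the adapter was saved in PEFT format and that its adapter_config.json carries target_modules and target_parameters keys as described above. The module names in the example config are illustrative, apart from the two expert tensors named earlier.

```python
import json
from pathlib import Path

def summarize_adapter_targets(config: dict) -> dict:
    """Report which targeting mechanisms an adapter config relies on."""
    modules = config.get("target_modules") or []
    params = config.get("target_parameters") or []
    if isinstance(modules, str):  # PEFT also allows a single regex string here
        modules = [modules]
    return {
        "target_modules": sorted(modules),
        "target_parameters": sorted(params),
        # True means part of the adapter is lost if the loader only reads target_modules.
        "uses_target_parameters": bool(params),
    }

def summarize_adapter_dir(adapter_dir: str) -> dict:
    """Same summary, read from <adapter_dir>/adapter_config.json."""
    config = json.loads((Path(adapter_dir) / "adapter_config.json").read_text())
    return summarize_adapter_targets(config)

example_config = {
    "target_modules": ["q_proj", "v_proj"],
    "target_parameters": ["mlp.experts.gate_up_proj", "mlp.experts.down_proj"],
}
summary = summarize_adapter_targets(example_config)
```

If uses_target_parameters is true and the serving stack only understands target_modules, output that looks identical to the base model is the expected symptom.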

Fallback attempt: merge + FP8 quantize for serving. llm-compressor produced structurally-correct FP8 output matching Qwen’s reference layout exactly. vLLM served it as garbage — !!! at temp=0, multilingual salad at temp>0. Root cause never isolated despite verifying quantization math (2.2% round-trip error, expected), unfuse logic, skip list (287/287 modules_to_not_convert matching), and chat template. The bf16 merged model worked in transformers but cost 72 GB runtime — no headroom for ComfyUI on the 128 GB pool.

Attempt 2: Gemma 4 26B-A4B (MoE) with runtime LoRA

Plan: pivot to Gemma 4 which Unsloth stated could be trained for LoRA with MoE using Unsloth Studio and exported for vLLM.

Training worked: with significant adaptation. Had to fix:

Three transformers 5.x issues on top of the existing UMA loader.
Unsloth’s Gemma 4 release ships one 47 GB shard — resharded to 11 x ~5 GB (peak per shard drops from ~94 GB to ~10 GB)
FastModel.from_pretrained bypasses the safetensors patch — load with plain AutoModelForCausalLM, then hand to FastModel.get_peft_model which wraps in-place
torchvision.nms missing C++ op kernels in the unsloth venv — register stubs via torch.library.Library
Gemma4ForConditionalGeneration.forward requires mm_token_type_ids in training — patch trainer.compute_loss to inject zeros
Unsloth’s fix_untrained_tokens crashes on meta tensors during MoE load — disable

Training completed with ~29M trainable, peak 64.6 GB, bf16.

Serving failed: three independent issues:

Gemma4ForConditionalGeneration missing SupportsLoRA mixin (vLLM PR #39291 open at the time)
get_expert_mapping not implemented for Gemma 4 — vLLM requires it for MoE expert LoRA
Text-only Gemma4ForCausalLM has LoRA (PR #38844 merged) but only for attn/MLP, not experts, and layer paths differ (model.layers.* vs model.language_model.layers.*) from the conditional-generation checkpoint.
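The path mismatch in the last point is easy to confirm by looking at the tensor names in the adapter. The sketch below only classifies key names by the two prefixes mentioned above. It is a diagnostic, not a fix, and the example keys are hypothetical.

```python
from collections import Counter

def layer_prefix_style(key: str) -> str:
    """Classify a tensor name by which layer path convention it uses."""
    # Check the longer pattern first: "language_model.layers." also contains "model.layers.".
    if "model.language_model.layers." in key:
        return "conditional-generation"
    if "model.layers." in key:
        return "text-only"
    return "other"

def count_styles(keys) -> Counter:
    """Count how many tensor names follow each convention."""
    return Counter(layer_prefix_style(k) for k in keys)

# Hypothetical adapter tensor names.
keys = [
    "model.language_model.layers.0.self_attn.q_proj.lora_A.weight",
    "model.language_model.layers.0.self_attn.q_proj.lora_B.weight",
    "model.layers.0.self_attn.q_proj.lora_A.weight",
]
styles = count_styles(keys)
```

An adapter trained against the conditional-generation checkpoint would be expected to show only the first style, which is why it does not line up with the text-only model class.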

After spending a couple of days trying various approaches with Claude Max’s assistance, I decided to wait until someone smarter than me has got this to work.

The basic problem is that the experts in MoE models add complexity to the architecture which needs to be fully supported from fine tuning through to inference and we just aren’t there yet for these SOTA models.

Back to Dense Models for Now

Where I ended up:

Unsloth Studio LoRA fine tuning works for smaller dense Qwen3.5 models on DGX Spark
You need a custom memory management solution for anything approaching 50GB in size
At the moment MoE + LoRA + vLLM does not work for training and serving SOTA models

My current approach is to use a Jupyter notebook, which borrows from the container I mentioned above, to train LoRAs for Qwen3.5-27B, so stay tuned for that outcome.
