Running multimodal GLM4.6V Flash on DGX Spark

I’m working on a poster generation project with a plan to train a model on typesetting layouts. To do this I am extracting example layouts from poster images using the multimodal model GLM4.6V Flash. It turns out this model is a little temperamental to get working, so I wanted to share some tips and tricks for anyone else attempting to run it locally.

Why GLM4.6V Flash?

I wanted a model which could analyse a poster image and return a text layout in JSON format, while being small enough to run on my DGX Spark. The GLM models are built by z.ai, a Chinese AI lab. GLM4.6V is their most recent multimodal model and GLM4.6V Flash is a 9B parameter version.

In my testing it did a decent job of ingesting an image, analysing it and producing a structured JSON output describing the image. It takes a couple of minutes to analyse each image (depending on complexity).

Serving GLM4.6V Flash locally using llama.cpp

I use llama.cpp as my local inference engine, so I need GGUF files. You can find them on Hugging Face; here is the link to the llama.cpp repository. Two files are needed to enable multimodal input:

The main model GGUF: GLM-4.6V-Flash-Q8_0.gguf
The mmproj GGUF: mmproj-GLM-4.6V-Flash-Q8_0.gguf

The 8-bit quant of this model is 11GB including the mmproj file so it runs quite happily on a DGX Spark. To run the model using llama-server I use the following:

/{your-llama-server-binary}/llama-server -m /{your-model-location}/GLM-4.6V-Flash-Q8_0.gguf -c 128000 -b 8096 --no-mmap --jinja --flash-attn on -ngl 999 -ub 8096 --host 0.0.0.0 --mmproj /{your-model-location}/mmproj-BF16.gguf --port 8051
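If the server is started from a pipeline script rather than a shell, the same command can be written as an argument list for Python's subprocess module. This is only a sketch: the function names and the three path arguments are hypothetical placeholders, and the flag values simply mirror the command above.

```python
import subprocess


def build_llama_server_args(server_bin, model_path, mmproj_path, port=8051):
    """Return the llama-server command above as an argv list."""
    return [
        server_bin,
        "-m", model_path,         # main model GGUF
        "--mmproj", mmproj_path,  # multimodal projector GGUF
        "-c", "128000",           # context size in tokens
        "-b", "8096",             # logical batch size
        "-ub", "8096",            # physical (micro) batch size
        "-ngl", "999",            # offload up to 999 layers (effectively all) to the GPU
        "--flash-attn", "on",
        "--no-mmap",              # load the model into memory instead of memory-mapping it
        "--jinja",                # use the model's Jinja chat template
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


def start_server(server_bin, model_path, mmproj_path, port=8051):
    """Start llama-server in the background and return the Popen handle."""
    args = build_llama_server_args(server_bin, model_path, mmproj_path, port)
    return subprocess.Popen(args)
```

Passing the arguments as a list avoids shell quoting problems if any of the paths contain spaces.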

Using the model in Python

Of course, with the recent llama.cpp web UI release you can load the model in the web UI and upload an image for analysis. The web UI has baked-in default parameters which work straight away.

However, I am using it as part of a pipeline, so I need to invoke the model from my code. This turned out to be tricky with standard payload parameters, and I struggled to get the model to return anything at all. In the end I uploaded an image in the web UI and inspected the payload to work out how llama.cpp handles this model.

Successful invocation in the web UI uses the following payload structure, where the content list holds the text prompt and the image I want analysed:

payload = {
    "messages": [{"role": "user", "content": content_list}],
    "thinking": "disabled",
    "stream": False,
    "return_progress": True,
    "temperature": 0.8,
    "max_tokens": -1,
    "dynatemp_range": 0,
    "dynatemp_exponent": 1,
    "top_k": 2,
    "top_p": 0.6,
    "min_p": 0.05,
    "xtc_probability": 0,
    "xtc_threshold": 0.1,
    "typ_p": 1,
    "repeat_last_n": 64,
    "repeat_penalty": 1.1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "dry_multiplier": 0,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "dry_penalty_last_n": -1,
    "samplers": ["penalties", "dry", "top_n_sigma", "top_k", "typ_p", "top_p", "min_p", "xtc", "temperature"],
    "timings_per_token": False
}
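The content list itself is not shown above, so here is a minimal sketch of how it could be built. It assumes the OpenAI-style message parts (a text part plus an image_url part carrying a base64 data URL), which llama.cpp's server accepts for multimodal models; the function names and the default MIME type are my own illustrative choices, not taken from the web UI payload.

```python
import base64


def image_to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_content_list(prompt, image_bytes, mime_type="image/jpeg"):
    """Return a two-part message content: the text prompt, then the image."""
    return [
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": image_to_data_url(image_bytes, mime_type)},
        },
    ]
```

In a pipeline the bytes would typically come from something like Path("poster.jpg").read_bytes(), with the MIME type adjusted to match the file.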

To call this in my code I use:

        response = requests.post(url, json=payload, timeout=MULTIMODAL_REQUEST_TIMEOUT)

url is the server endpoint http://127.0.0.1:8051/ (based on my launch command above)
payload is the prompt and parameters specified above
timeout is another critical parameter. For the complexity of my calls I set it to 10,000 seconds; with lower values I was getting timeouts and null responses from the model
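Putting those three pieces together, a call might look like the sketch below. It assumes the payload is posted to llama-server's OpenAI-compatible /v1/chat/completions route and that the reply text sits under choices[0].message.content; the text above only gives the base URL, so treat the route and the helper names as assumptions.

```python
def chat_completions_url(base_url):
    """Append the chat completions route to the server's base URL."""
    return base_url.rstrip("/") + "/v1/chat/completions"


def extract_reply(response_json):
    """Return the assistant's text from a chat completions response, or None."""
    choices = response_json.get("choices") or []
    if not choices:
        return None
    return choices[0].get("message", {}).get("content")


def call_model(base_url, payload, timeout_s=10_000):
    """POST the payload and return the reply text (None if the reply is empty)."""
    # Imported here so the two helpers above work without the third-party package.
    import requests

    response = requests.post(
        chat_completions_url(base_url), json=payload, timeout=timeout_s
    )
    response.raise_for_status()  # surface HTTP errors instead of a silent null
    return extract_reply(response.json())
```

Raising on HTTP errors makes it easier to tell a failed request apart from a genuinely empty model response.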

I also include “/nothink” in my prompt to cut down the reasoning time, since reasoning is known to cause loops and delays. This doesn’t seem to affect the output quality, although I haven’t tested that systematically. There is also a ‘thinking’ parameter in the payload which enables or disables chain-of-thought reasoning. I use both to minimise unnecessary reasoning time.
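As a rough illustration, the two helpers below add the “/nothink” marker to a prompt and remove any reasoning that still slips through. Both rest on assumptions of mine: that the marker is appended at the end of the prompt, and that leftover reasoning arrives wrapped in <think>...</think> tags.

```python
import re

# Assumed reasoning format: text wrapped in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def with_nothink(prompt):
    """Append the /nothink marker unless the prompt already ends with it."""
    stripped = prompt.rstrip()
    if stripped.endswith("/nothink"):
        return prompt
    return stripped + " /nothink"


def strip_thinking(reply):
    """Drop any <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", reply).strip()
```

Stripping on the way out is a cheap safety net, so that a stray reasoning block does not break JSON parsing further down the pipeline.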

Example Usage

I have used this poster by Julius Klinger for the Berlin Zoological Garden as a test image.

The prompt is pretty lengthy:

PROMPT = """Extract all text and layout data from this poster as valid JSON. For all text elements include a visual_description explaining the typography in natural language. 

OUTPUT FORMAT (copy this structure exactly):
{
  "poster_id": "filename",
  "canvas": {"width": 1200, "height": 1600},

  "background": {
    "description": "Comprehensive visual description WITHOUT any text. Describe all illustrations, shapes, patterns, photographs, textures. Include spatial relationships, layering, depth, perspective, and how elements relate to each other compositionally.",

    "gradient": {
      "type": "linear|radial",
      "angle": 135,
      "stops": [
        {"position": 0.0, "color": "#hex"},
        {"position": 0.5, "color": "#hex"},
        {"position": 1.0, "color": "#hex"}
      ]
    },

    "primary_elements": [
      {
        "type": "illustration|shape|pattern|photo|texture", 
        "description": "Detailed visual description including size, style, artistic technique", 
        "position": "center|left|right|upper_third|background_layer", 
        "color": "#hex", 
        "style": "iconic|geometric|organic|photographic|hand_drawn|abstract",
        "visual_weight": "dominant|supporting|subtle"
      }
    ],

    "color_palette": {
      "primary": "#hex", 
      "secondary": "#hex", 
      "accent": "#hex"
    },

    "composition": {
      "type": "symmetrical|centered|radial|diagonal|asymmetric", 
      "focal_point": {"x": 0.5, "y": 0.4},
      "depth": "flat|layered|dimensional",
      "balance": "symmetrical|weighted_left|weighted_right|dynamic"
    },

    "patterns": ["sunburst_rays", "stripes", "dots", "halftone", "geometric_grid"],

    "texture": {
      "type": "screen_print|smooth|distressed|grain|paper", 
      "intensity": "subtle|moderate|heavy"
    },

    "style": "mid_century_modern|art_deco|psychedelic|vintage|swiss|constructivist|pop_art",
    "mood": "vibrant, energetic, nostalgic",
    "visual_techniques": ["gradient|flat_color|halftone|layering|transparency|screen_tone"],
    "training_caption": "Style + elements + colors + composition + mood (2 sentences, NO text content)"
  },

  "typography_overview": {
    "overall_style": "Describe the typesetting approach: hierarchy strength, alignment patterns, spacing philosophy, visual rhythm",
    "dominant_characteristics": ["wood_type|geometric_sans|serif_display|script|mixed_styles"],
    "layout_strategy": "centered|asymmetric|grid_based|stacked|diagonal|organic",
    "text_treatment": "minimal|layered|decorative|dimensional|outlined",
    "type_color_relationship": "How text colors interact with background - contrast level, color harmony, readability approach",
    "historical_reference": "What typesetting era/movement does this reference (e.g., Victorian wood type, Swiss International, 1960s psychedelic display)"
  },

  "elements": [
    {
      "id": "headline_main|subhead_01|subhead_02|body_text",
      "element_type": "text_box|text_path|text_graphic",
      "content": "EXACT TEXT",

      "typography": {
        "role": "headline|subhead|body|caption",
        "classification": "sans|serif|slab|script|display",
        "weight": "light|regular|medium|bold|black",
        "width": "condensed|normal|extended",
        "case_usage": "all_caps|title_case|mixed|lowercase",
        "size": 96,
        "color": "#hex",
        "visual_description": "Describe any visual treatment applied to the text beyond plain text on background. Include how the text interacts with shapes, colors, textures, patterns, outlines, knockouts, blocks, stripes, or other graphic elements. Focus on visual intent, contrast strategy, and stylistic effect. Use natural language."
      },
      "bounds": {
        "left": 0.075,
        "top": 0.18,
        "right": 0.925,
        "bottom": 0.33
      },
      "alignment": "left|center|right",
      "rotation": 0,
      "letter_spacing": 0.05,
      "line_height": 1.1,
      "transform": "uppercase|lowercase|none",
      "z_index": 10,

      "shadow": {
        "offset_x": 4,
        "offset_y": 4,
        "blur": 8,
        "color": "#000000",
        "opacity": 0.5
      },

      "path": {
        "type": "arc|bezier|wave",
        "points": [[x,y], [x,y]],
        "direction": "up|down"
      },

      "graphic_style": "psychedelic_warped|melting|hand_drawn"
    }
  ],

  "_metadata": {
    "poster_info": {
      "year": "1965|1970s|circa 1950|unknown",
      "era": "1920s|1930s|1950s|1960s|1970s|1980s",
      "artist": "Artist name or unknown",
      "designer": "Designer/studio name or unknown",
      "origin_country": "USA|UK|Switzerland|Japan|etc"
    },
    "visual_style": ["centered", "bold_hierarchy"],
    "mood": ["energetic", "nostalgic"],
    "text_density": "low|medium|high",
    "complexity": "simple|moderate|complex",
    "design_movement": "art_deco|bauhaus|swiss_international|psychedelic|pop_art|constructivist|modernist",
    "color_metadata": {
      "color_count": "monochrome|2-color|3-color|full-color",
      "dominant_hues": ["red", "blue", "yellow"],
      "saturation": "muted|moderate|vibrant|neon"
    },
    "training_captions": {
      "lora_style": "[Era] [movement] poster with [visual elements], [color palette], [composition type], [mood]",
      "typesetter": "[Hierarchy] with [font styles], [layout strategy], [spacing approach] NO"
    }
  }
}

ELEMENT TYPE RULES:

text_box:
- Standard rectangular text (any rotation angle)
- REQUIRED: typography, position, bounds, letter_spacing, line_height, transform, visual_description
- OPTIONAL: shadow

text_path:
- Curved or flowing text following a path
- REQUIRED: typography, path, visual_description (bounds may be included as an approximate box)

text_graphic:
- Warped, organic, illustrative, or hand-drawn lettering
- REQUIRED: content, graphic_style, visual_description

EXTRACTION REQUIREMENTS:
Text spacing guidelines:
- letter_spacing: -0.05 (very tight) to 0.15 (very loose), 0.0 = normal
- line_height: 0.9 (compressed) to 1.8 (airy), 1.2 = normal

Shadow: Include if text has visible drop shadow, outline, or glow effect

Gradient: Describe background gradient with accurate colors and direction
- linear: specify angle (0=right, 90=down, 135=diagonal, 180=left)
- radial: center outward

Background:
- Describe visual composition WITHOUT any text
- List all significant colors as hex codes
- Create training_caption combining style + elements + colors + mood (exclude text)

Tags:
- visual_style: wood_type, serif, sans_serif, geometric, brutalist, minimal, high_contrast, maximalist, flat_design, duotone
- mood: energetic, nostalgic, playful, elegant, vintage, psychedelic, dramatic, futuristic, serene, authoritative

OUTPUT: Valid JSON only. No explanations. No markdown. Ensure matching { and }"""
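Even with the "Valid JSON only" instruction, it is worth parsing the reply defensively in case the model wraps the object in extra text or a code fence. The sketch below is one possible approach, not necessarily what my pipeline does: it keeps everything from the first opening brace to the last closing brace and hands that to the standard json module. The function name is hypothetical.

```python
import json


def parse_model_json(reply):
    """Extract and parse the outermost JSON object in a model reply."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    # json.JSONDecodeError is a subclass of ValueError, so callers can
    # catch ValueError for both a missing and a malformed object.
    return json.loads(reply[start:end + 1])
```

A reply that fails to parse can then be logged and retried rather than silently dropped.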

It took 142 seconds to process, with 2,547 tokens in the prompt and 7,538 in the response.

Below is an example of the JSON that the model created for the headline:

  "elements": [
    {
      "id": "headline_main",
      "element_type": "text_box",
      "content": "ZOOL OGISCHER GARTEN",
      "typography": {
        "role": "headline",
        "classification": "sans-serif_display",
        "weight": "bold",
        "width": "normal",
        "case_usage": "all_caps",
        "size": 84,
        "color": "#000000",
        "visual_description": "Large bold black sans-serif text centered at bottom of poster, creating strong visual impact against light background. Text is highly legible with thick strokes and clean geometric forms."
      },
      "bounds": {
        "left": 0.075,
        "top": 0.85,
        "right": 0.925,
        "bottom": 0.97
      },
      "alignment": "center",
      "rotation": 0,
      "letter_spacing": -0.02,
      "line_height": 1.2,
      "transform": "uppercase",
      "z_index": 20,
      "shadow": {
        "offset_x": 3,
        "offset_y": 3,
        "blur": 5,
        "color": "#000000",
        "opacity": 0.4
      }
    },

And here is the result with the red boxes indicating the JSON bounding boxes. I need to do some work on the font alignment and wrapping but the general extraction works pretty well.
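To turn the JSON bounds into boxes that can be drawn on the image, the fractional values need scaling to pixels. The sketch below assumes the bounds are fractions of the image width and height, which is how I read values such as 0.075 and 0.925; the function names are illustrative.

```python
def bounds_to_pixels(bounds, width, height):
    """Convert fractional bounds to a (left, top, right, bottom) pixel box."""
    return (
        round(bounds["left"] * width),
        round(bounds["top"] * height),
        round(bounds["right"] * width),
        round(bounds["bottom"] * height),
    )


def element_boxes(layout, width, height):
    """Return (element id, pixel box) pairs for every element with bounds."""
    return [
        (element.get("id"), bounds_to_pixels(element["bounds"], width, height))
        for element in layout.get("elements", [])
        if "bounds" in element
    ]


headline_bounds = {"left": 0.075, "top": 0.85, "right": 0.925, "bottom": 0.97}
print(bounds_to_pixels(headline_bounds, 1200, 1600))  # (90, 1360, 1110, 1552)
```

The 1200 x 1600 canvas here is just the size from the prompt template; in practice the real image dimensions should be used, and each box can then be passed to whatever drawing library the pipeline relies on.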