Strike the Pose
Branded short-video generator. The headline win: a vision-LLM gate lifts the usable-output rate from 63% to 88%.
One free-form prompt + a brand logo → a 4-second clip of a model walking in branded sneakers. The shipping pipeline decouples logo placement from scene composition, and a vision-LLM likeness gate retries the compose step until the logo reads.
- navy canvas, misty harbor at dawn, VHS grain
- cream low-top, minimalist studio, B&W
- all-black runners, wet pavement neon
- olive suede, forest trail
Product shot
Lands reliably. Logo likeness, shoe silhouette/colourway, canvas + leather materials. A VLM likeness gate after compose catches the off-likeness logos before they leave the stage.
Doesn't yet. Multi-logo composition, explicit texture control, explicit placement directives ("logo on heel"), suede + complex materials.
Decoupling logo placement from scene composition is the structural win. Earlier iterations tried to do both in one shot and the logo was always the first thing to break.
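A minimal sketch of the compose-then-gate retry loop, to make the control flow concrete. The names `compose_with_gate`, `GateResult`, `compose`, and `gate` are hypothetical and not taken from the project code. In the real pipeline, `compose` would wrap the image-edit step and `gate` would wrap the vision-LLM likeness check.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class GateResult:
    image: Any      # last composed image, whether or not it passed
    attempts: int   # total compose calls made
    passed: bool    # True if the gate accepted the final image


def compose_with_gate(
    compose: Callable[[int], Any],   # hypothetical: attempt number -> image
    gate: Callable[[Any], bool],     # hypothetical: image -> logo reads?
    max_retries: int = 3,
) -> GateResult:
    """Compose once, then retry up to max_retries times until the gate passes."""
    image = None
    total_attempts = max_retries + 1
    for attempt in range(1, total_attempts + 1):
        # Passing the attempt number lets the caller vary the seed per try.
        image = compose(attempt)
        if gate(image):
            return GateResult(image, attempt, True)
    # Every attempt failed the gate; return the last image, flagged as failed.
    return GateResult(image, total_attempts, False)
```

With stub callables, `compose_with_gate(lambda a: f"img-{a}", lambda img: img == "img-3")` returns a result with `attempts == 3` and `passed == True`. Returning the failed result instead of raising leaves the fallback decision to the caller.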
Scene seed
Lands reliably. Scene background, setting, and lighting. The rewriter routes demographic and pose hints into the right prompt slot.
Doesn't yet. Chained branding — putting the logo on a billboard and on the shoe simultaneously breaks both. The gate only watches one surface today; multi-surface gating is the next architectural step.
Video motion
4 s @ 16 fps, 1280×720
Lands reliably. Cross-frame identity, image sharpness, low motion blur.
Doesn't yet. Mid-clip pacing wobbles, hallucinated subtitle overlays (the VLM catches the obvious ones but not yet small corner text), camera movement beyond a single subtle push, aesthetic filters at strength.
Wan is a motion model, not a style model. Aesthetic directives land best when injected at the still-image step and carried forward.
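One way to picture that routing is a sketch in which style directives are appended to the still-image prompt and deliberately left out of the motion prompt. `build_prompts` and its argument names are hypothetical. The actual rewriter is an LLM call, not string concatenation.

```python
def build_prompts(scene: str, style: list[str], motion: str) -> dict[str, str]:
    """Put style directives in the still-image prompt only.

    The motion prompt carries movement instructions alone; the look is
    expected to carry forward through the still frame itself.
    """
    still_prompt = ", ".join([scene, *style])
    return {"still": still_prompt, "motion": motion}


prompts = build_prompts(
    scene="misty harbor at dawn",
    style=["VHS grain"],
    motion="model walks toward camera",
)
# prompts["still"]  -> "misty harbor at dawn, VHS grain"
# prompts["motion"] -> "model walks toward camera"
```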
Experiments
Two write-ups from the lab notebook. Each isolates a single load-bearing question.
| Report | Headline finding |
|---|---|
| Likeness-retry gate | VLM gate cuts the logo-failure rate from 50% to 12%. Three retries capture almost all of the win. |
| Setting an aesthetic | Style directives are invisible to the motion model, visible at the still-image step, and destructive in a separate restyle pass. Stronger paths: style LoRA, video-to-video filter. |
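A back-of-envelope model shows why returns on retries shrink quickly. It assumes each compose attempt fails independently with the same probability, which is a simplification and not something measured in the write-up. Real attempts on the same prompt are likely correlated, so observed rates need not match these numbers.

```python
def residual_failure(p_fail: float, retries: int) -> float:
    """Probability that the first attempt and every retry all fail,
    assuming independent attempts with identical failure rate p_fail."""
    return p_fail ** (retries + 1)


# Starting from the 50% ungated failure rate in the table:
rates = [residual_failure(0.5, k) for k in range(4)]
# rates -> [0.5, 0.25, 0.125, 0.0625]
```

Under this assumption each extra retry halves the remaining failures, so the absolute gain per retry drops from 25 points to about 6 points by the third.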
Next
- Multi-surface logo gating — likeness checks on every chained brand surface, not just the shoe.
- OCR pre-check on frames — the VLM misses corner text; OCR will not.
- Video-to-video stylization so aesthetic directives actually land at strength.
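The OCR pre-check could be sketched as follows, assuming the check only needs to look at the four corners of each 1280×720 frame. `corner_boxes`, `frames_with_corner_text`, and the `read_text` callable are hypothetical. `read_text` stands in for whatever OCR engine is chosen, and the 25% corner size is an arbitrary starting point.

```python
from typing import Any, Callable, Iterable

Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def corner_boxes(width: int = 1280, height: int = 720, frac: float = 0.25) -> list[Box]:
    """Return crop boxes for the four frame corners, each frac of the frame size."""
    w, h = int(width * frac), int(height * frac)
    return [
        (0, 0, w, h),                             # top-left
        (width - w, 0, width, h),                 # top-right
        (0, height - h, w, height),               # bottom-left
        (width - w, height - h, width, height),   # bottom-right
    ]


def frames_with_corner_text(
    frames: Iterable[Any],
    read_text: Callable[[Any, Box], str],  # hypothetical OCR wrapper
    boxes: list[Box],
) -> list[int]:
    """Return indices of frames where OCR finds non-blank text in any corner box."""
    flagged = []
    for index, frame in enumerate(frames):
        if any(read_text(frame, box).strip() for box in boxes):
            flagged.append(index)
    return flagged
```

Any flagged index could then trigger a regenerate or a manual review before the clip ships.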
Stack: Python · FLUX-dev · Qwen-Image-Edit-2511 · Wan 2.2 14B I2V · gpt-4o-mini (rewriter + likeness gate) · ComfyUI · OWLv2
- video-gen
- diffusion
- eval