Strike the Pose

One free-form prompt + a brand logo → a 4-second clip of a model walking in branded sneakers. The shipping pipeline decouples logo placement from scene composition, and a vision-LLM likeness gate retries the compose step until the logo reads.

Gallery: navy canvas, misty harbor at dawn, VHS grain · cream low-top, minimalist studio, B&W · all-black runners, wet pavement neon · olive suede, forest trail

Product shot

navy canvas shoe with arcade logo on lateral panel

Lands reliably. Logo likeness, shoe silhouette/colourway, canvas + leather materials. A VLM likeness gate after compose catches the off-likeness logos before they leave the stage.

Doesn't yet. Multi-logo composition, explicit texture control, explicit placement directives ("logo on heel"), suede + complex materials.

Decoupling logo placement from scene composition is the structural win. Earlier iterations tried to do both in one shot and the logo was always the first thing to break.
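The gate-and-retry loop around the compose step can be sketched roughly as below. This is a minimal illustration, not the pipeline's actual code: `compose`, `logo_reads`, `GateResult`, and the default of three attempts are all hypothetical stand-ins for the image-edit call and the vision-LLM likeness check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateResult:
    image: Optional[str]  # handle of the last candidate produced
    attempts: int         # how many compose calls were made
    passed: bool          # True if the gate accepted a candidate


def compose_with_gate(
    compose: Callable[[int], str],      # hypothetical: seed -> image handle
    logo_reads: Callable[[str], bool],  # hypothetical: VLM likeness verdict
    max_attempts: int = 3,              # assumed budget, not taken from the repo
) -> GateResult:
    """Re-run compose with a new seed until the gate accepts or the budget runs out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        last = compose(attempt)  # the attempt number doubles as the seed here
        if logo_reads(last):
            return GateResult(image=last, attempts=attempt, passed=True)
    # Every attempt failed the gate; hand back the last candidate, flagged.
    return GateResult(image=last, attempts=max_attempts, passed=False)


if __name__ == "__main__":
    # Fake stages: the gate only accepts the second candidate.
    result = compose_with_gate(lambda seed: f"img-{seed}", lambda img: img == "img-2")
    print(result)
```

Keeping the gate as an injected callable means the same loop could wrap a different checker later (for example, one verdict per brand surface) without touching the retry logic.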

Scene seed

model wearing the navy canvas shoes on a misty harbor dock

Lands reliably. Scene background, setting, and lighting. The rewriter routes demographic and pose hints into the right prompt slot.

Doesn't yet. Chained branding — putting the logo on a billboard and on the shoe simultaneously breaks both. The gate only watches one surface today; multi-surface gating is the next architectural step.

Video motion

4-second walk video preview (4 s @ 16 fps, 1280×720)

Lands reliably. Cross-frame identity, image sharpness, low motion blur.

Doesn't yet. Mid-clip pacing wobbles, hallucinated subtitle overlays (the VLM catches the obvious ones but not yet small corner text), camera movement beyond a single subtle push, aesthetic filters at full strength.

Wan is a motion model, not a style model. Aesthetic directives land best when injected at the still-image step and carried forward.
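One way to picture "inject at the still-image step and carry forward" is a small routing helper that attaches aesthetic directives to the still-image prompt only and leaves the motion prompt untouched. The names `StagePrompts` and `route_aesthetic` are hypothetical, and the real rewriter's slot layout is not described here, so treat this as a sketch under those assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StagePrompts:
    still: str   # prompt for the still-image step (where style is visible)
    motion: str  # prompt for the image-to-video step (motion only)


def route_aesthetic(scene: str, motion: str, aesthetics: List[str]) -> StagePrompts:
    """Append style directives to the still prompt; never to the motion prompt."""
    style = ", ".join(a.strip() for a in aesthetics if a.strip())
    still = f"{scene}, {style}" if style else scene
    return StagePrompts(still=still, motion=motion)


if __name__ == "__main__":
    prompts = route_aesthetic(
        scene="model wearing the navy canvas shoes on a misty harbor dock",
        motion="slow walk toward camera, single subtle push",
        aesthetics=["VHS grain"],
    )
    print(prompts.still)
    print(prompts.motion)
```

The motion model then inherits the look from the input frame rather than from its own prompt, which matches the observation that style directives are invisible at that stage.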

Experiments

Two write-ups from the lab notebook. Each isolates a single load-bearing question.

Report	Headline finding
Likeness-retry gate	VLM gate cuts the logo-failure rate from 50% to 12%. Three retries capture almost all of the win.
Setting an aesthetic	Style directives are invisible to the motion model, visible at the still-image step, and destructive in a separate restyle pass. Stronger paths: style LoRA, video-to-video filter.
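As a back-of-envelope check on the retry numbers, the sketch below assumes every compose attempt fails independently at the ungated 50% rate and that the gate never misjudges. Both are simplifying assumptions for illustration, not findings from the notebook, and the write-up does not say whether "three retries" counts the first attempt.

```python
def residual_failure(p_fail: float, attempts: int) -> float:
    """Chance that every one of `attempts` independent tries fails the gate."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be a probability")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return p_fail ** attempts


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, residual_failure(0.5, n))
    # 1 0.5
    # 2 0.25
    # 3 0.125
    # 4 0.0625
```

Under this toy model three attempts leave about 12.5%, in the neighborhood of the reported 12%, and each extra attempt buys half as much as the one before, which is consistent with most of the win arriving within the first few tries.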

Next

Multi-surface logo gating: likeness checks on every chained brand surface, not just the shoe.
OCR pre-check on frames: the VLM misses corner text; OCR will not.
Video-to-video stylization, so aesthetic directives actually land at strength.
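The OCR pre-check could take roughly the following shape: sample frames at a fixed stride, run an OCR function on each, and flag any frame that yields more than a few characters. Everything here is hypothetical, including `ocr_precheck`, the injected `read_text` callable (a stand-in for whichever OCR engine is chosen), and the `stride` and `min_chars` defaults.

```python
from typing import Callable, List, Sequence, Tuple


def ocr_precheck(
    frames: Sequence[object],
    read_text: Callable[[object], str],  # hypothetical OCR call: frame -> text
    stride: int = 4,                     # assumed sampling step, in frames
    min_chars: int = 3,                  # ignore one- or two-character noise
) -> List[Tuple[int, str]]:
    """Return (frame_index, text) for each sampled frame that contains text."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    hits: List[Tuple[int, str]] = []
    for index in range(0, len(frames), stride):
        text = read_text(frames[index]).strip()
        if len(text) >= min_chars:
            hits.append((index, text))
    return hits


if __name__ == "__main__":
    # Fake clip: strings stand in for frames, and the "OCR" just echoes them.
    clip = ["", "", "", "", "SUBSCRIBE", "", "", ""]
    print(ocr_precheck(clip, read_text=lambda frame: frame))
```

A clip with any hits could be rejected or regenerated before the VLM gate runs. Sampling trades cost for coverage: text that appears only on unsampled frames would slip through, so a stride of 1 is the conservative setting.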

Stack: Python · FLUX-dev · Qwen-Image-Edit-2511 · Wan 2.2 14B I2V · gpt-4o-mini (rewriter + likeness gate) · ComfyUI · OWLv2

Repo · Architecture · Deployment
