Seedance Image to Video: The Complete AI Animation Tutorial (2026) | Sora2U

Image-to-video is the highest-leverage feature in Seedance 2.0 and the one most people use wrong. Instead of gambling on what the model imagines from text, you hand it a frame that is already correct — your product, your face, your composition — and spend the prompt purely on motion and audio. In our testing, i2v cuts attempts-to-usable-clip roughly in half versus text-to-video for any subject that must look like a specific real thing.

This tutorial covers the full i2v workflow: when to choose it over text-to-video, how to prepare source images, the difference between an init image and Seedance's 12-asset reference system, motion-prompt patterns for products, portraits, and landscapes, and the artifact fixes we use daily. Everything here runs on the Sora2U Seedance generator — upload an image, write the motion, generate.

When image-to-video beats text-to-video

The rule of thumb: if the first frame must be exact, use i2v; if the camera move or an invented world is the point, use t2v. Concretely, i2v wins when:

A real product must stay recognizable (packshots, devices, apparel). Text descriptions of your product drift from the real thing; a photo fixes the first frame exactly.
A specific face or brand look is non-negotiable — founder portraits, recurring characters, a locked color grade.
You already paid for the still — product photography and real-estate shots become video at marginal cost.
Composition is precise — rule-of-thirds layouts, negative space for text overlays, exact framing for an ad slot.

Text-to-video stays the better tool for camera-driven shots (drone pullbacks, tracking shots) and scenes where the model should invent the world. For those, start with the Seedance prompt engineering guide — the 4-block structure there is the foundation this article builds on.

Preparing the source image

Seedance inherits everything from your source frame — sharpness, grain, color cast, and mistakes. Thirty seconds of prep saves three re-generations:

Resolution: upload an image at least as tall as your output, so 1080px or more for 1080p video. Upscale a soft image *before* upload; Seedance amplifies blur into smear.
Aspect ratio: crop to the target ratio yourself (16:9, 9:16, or 1:1). Auto-cropping decides for you and routinely beheads subjects in vertical output.
Headroom for motion: leave 10–15% of empty frame in the direction you plan to move — a dolly-in needs background to consume; a pan needs somewhere to go.
Clean edges: watermarks, borders, and timestamp overlays become writhing artifacts once motion starts. Remove them first — our watermark removal tool handles this in one pass.
One clear subject: i2v handles competing subjects badly. A frame with five equal subjects produces five half-animations.
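
The resolution and aspect-ratio checks above are easy to script before upload. The sketch below is a minimal example in plain Python. The function names `center_crop_box` and `is_tall_enough` are our own illustration, not part of any Seedance or Sora2U API, and the code assumes a centered crop, which you should adjust by hand when the subject sits off-center or you need headroom on one side.

```python
from fractions import Fraction


def center_crop_box(width: int, height: int, ratio_w: int, ratio_h: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered crop with ratio_w:ratio_h."""
    target = Fraction(ratio_w, ratio_h)
    if Fraction(width, height) > target:
        # Source is too wide: keep the full height and trim the sides.
        new_w = int(height * target)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Source is too tall (or already matches): keep the full width, trim top and bottom.
    new_h = int(width / target)
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)


def is_tall_enough(crop_box: tuple[int, int, int, int], output_height: int = 1080) -> bool:
    """True when the cropped image is at least as tall as the target video."""
    _, top, _, bottom = crop_box
    return bottom - top >= output_height
```

For a 4000×3000 photo and a 16:9 target, `center_crop_box(4000, 3000, 16, 9)` gives `(0, 375, 4000, 2625)`, a 4000×2250 crop that clears the 1080px bar. The tuple uses the same left, top, right, bottom order that Pillow's `Image.crop` expects, so it can be passed straight through if you crop with Pillow.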

Init image vs reference assets

Seedance gives you two ways to feed images, and confusing them is the most common i2v mistake. An init image is frame one — the video literally starts from it. Reference assets (up to 12 per generation: images, video snippets, audio) are guidance — they tell the model what a character, product, or style should look like *throughout*, without dictating the first frame.

Aspect	Init image	Reference assets
Role	Literal first frame	Identity and style guidance
How many	One	Up to 12 (images, video, audio)
Controls	Composition, framing, lighting of frame 1	How subjects look across all frames
Best for	Animating a specific photo	Consistent characters across shots and clips
Failure mode if misused	Identity drifts after ~5s	First frame ignores your composition

The pro pattern is to combine them: init image for the exact opening frame, plus 2–3 reference shots of the same subject from other angles. The references are what keep identity locked when the subject turns or the camera moves — frame one alone cannot teach the model what the back of a product looks like.
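
One way to keep the two roles straight is to model them as separate fields. The sketch below is a hypothetical request shape, not the Sora2U or Seedance API: the class `I2VRequest`, its field names, and the `validate` helper are all assumptions made for illustration. Only the 12-asset limit comes from this article.

```python
from dataclasses import dataclass, field

MAX_REFERENCE_ASSETS = 12  # limit stated in this article


@dataclass
class I2VRequest:
    """Hypothetical request shape; field names are illustrative, not a real API."""
    init_image: str          # path to the literal first frame
    motion_prompt: str       # what changes: subject motion, camera motion, audio
    reference_assets: list[str] = field(default_factory=list)  # identity and style guidance

    def validate(self) -> list[str]:
        """Return a list of warnings; an empty list means nothing obvious is wrong."""
        warnings = []
        count = len(self.reference_assets)
        if count > MAX_REFERENCE_ASSETS:
            warnings.append(f"too many reference assets: {count} > {MAX_REFERENCE_ASSETS}")
        if count == 0:
            warnings.append("no reference assets: identity may drift when the subject turns")
        return warnings
```

A request that follows the combined pattern, one init image plus two other angles, passes with no warnings: `I2VRequest("bottle_front.jpg", "The bottle rotates 90 degrees.", ["bottle_side.jpg", "bottle_back.jpg"]).validate()` returns an empty list.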

Writing motion prompts on top of an image

With i2v, the image already answers "what does it look like" — so a motion prompt that re-describes the picture wastes tokens and invites drift. Describe only what changes: subject motion, camera motion, and audio. Keep it under ~40 words.

Weak: "A beautiful woman with red hair in a yellow coat stands on a bridge in autumn..." (re-describing the image — the model may "correct" details you wanted kept).
Strong: "She turns toward the camera and smiles, hair lifting in the breeze. Slow dolly in. Audio: river below, distant birdsong."
Strong (product): "The bottle rotates 90 degrees as condensation beads roll down the glass. Static camera. Audio: soft fizz, ambient lounge."

Calibrate motion intensity explicitly — "subtle", "gentle", or "slow" for portraits and products; reserve "dynamic" for scenes that can survive deformation. Unprompted, Seedance defaults to more motion than most commercial shots want. More tested motion templates are tagged i2v in the Seedance prompt library.
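
The "describe only what changes" rule can be turned into a small template. The sketch below is a minimal example, assuming the three-part shape used in the strong prompts above (subject motion, camera motion, audio). The helper name `build_motion_prompt` and the hard failure on long prompts are our own choices, not a Seedance requirement.

```python
def build_motion_prompt(subject_motion: str, camera_motion: str,
                        audio: str = "", max_words: int = 40) -> str:
    """Join subject motion, camera motion, and optional audio into one short prompt."""
    parts = [
        subject_motion.strip().rstrip(".") + ".",
        camera_motion.strip().rstrip(".") + ".",
    ]
    if audio.strip():
        parts.append("Audio: " + audio.strip().rstrip(".") + ".")
    prompt = " ".join(parts)
    word_count = len(prompt.split())
    if word_count > max_words:
        # Long prompts usually mean the image is being re-described.
        raise ValueError(f"prompt has {word_count} words; keep it to about {max_words}")
    return prompt
```

Calling `build_motion_prompt("She turns toward the camera and smiles, hair lifting in the breeze", "Slow dolly in", "river below, distant birdsong")` reproduces the strong portrait prompt shown earlier. Put the intensity word ("subtle", "gentle", "slow") inside the subject or camera argument so it is never forgotten.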

Three playbooks: products, portraits, landscapes

Product shots

Animate one property per clip: a rotation, a lid opening, liquid pouring, fabric settling. Keep the camera static or a slow dolly-in, and let native audio sell the material — a click, a fizz, a fabric rustle. This is the cheapest product video pipeline in 2026: existing packshot in, 5-second hero clip out, for under a dollar in pay-as-you-go terms. The full funnel math is in our e-commerce video guide.

Portraits

Faces forgive the least. Ask for micro-motion only: a blink, a slow smile, a slight head turn under 30 degrees, hair in a breeze. Add 1–2 extra photos of the same face as reference assets, and keep clips at 5–10 seconds — identity hold degrades measurably past the 10-second mark. With dialogue, Seedance lip-syncs a portrait to scripted lines, but the head should stay near-frontal for clean phonemes.

Landscapes

Landscapes tolerate the most motion, so this is where slow camera moves shine: drift clouds, ripple water, sway grass, then add a slow pan or dolly. Describe the ambient audio bed ("wind through pines, distant surf") — a silent landscape clip reads as a cinemagraph, not video. Ten to fifteen seconds works here precisely because there is no identity to hold.

Keeping identity stable

Add 2–3 reference assets of the subject from different angles — the single highest-impact fix for drift.
Cap clips at 10 seconds when a face must hold; generate two clips and cut rather than one long one.
Keep subject motion under 90 degrees of rotation per clip — full turns force the model to invent the unseen side.
Name the subject consistently in the prompt ("the woman from the reference images") and never re-describe her features in words.
For a series, reuse the identical reference set and the same init-image style across every generation in the batch.
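
The 10-second cap is simple to apply mechanically when planning a longer sequence. The sketch below is one possible way to split a target runtime into clips. The function name `plan_clips` and the choice to balance clip lengths evenly are our assumptions; the article only says to cut shorter clips together rather than generate one long one.

```python
def plan_clips(total_seconds: int, cap: int = 10) -> list[int]:
    """Split a target runtime into clip lengths no longer than cap, as evenly as possible."""
    if total_seconds <= 0 or cap <= 0:
        raise ValueError("total_seconds and cap must be positive")
    count = -(-total_seconds // cap)  # ceiling division: fewest clips that respect the cap
    base, extra = divmod(total_seconds, count)
    # The first `extra` clips get one more second, so lengths differ by at most 1.
    return [base + 1 if i < extra else base for i in range(count)]
```

For a 15-second face shot, `plan_clips(15)` returns `[8, 7]`: two clips that each stay under the cap, generated with the same reference set and joined in the edit.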

If you need faster, cheaper iteration loops for non-identity shots, Seedance 1.5 at 10 credits/sec on Sora2U is the draft tier — full credit math on the pricing page. For how Seedance's i2v stacks up against Kling's, see the Seedance 2.0 vs Kling comparison.
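
To budget a draft loop, the per-second rate quoted above is enough for a rough estimate. The sketch below assumes billing scales linearly with clip length and attempt count, which is our simplification; check the pricing page for the actual rules. The function name `draft_credits` is illustrative.

```python
SEEDANCE_1_5_CREDITS_PER_SECOND = 10  # draft-tier rate quoted in this article


def draft_credits(clip_seconds: int, attempts: int = 1,
                  rate: int = SEEDANCE_1_5_CREDITS_PER_SECOND) -> int:
    """Estimated credits for `attempts` draft generations of one clip length."""
    return clip_seconds * attempts * rate
```

Under that assumption, one 5-second draft is `draft_credits(5)`, or 50 credits, and four attempts at the same length come to 200.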

Common artifacts and fixes

Face morphs mid-clip — add reference photos of the same face; shorten the clip to 5–8s; reduce requested motion.
Warping hands and fingers — keep hands still or out of frame in the source image; prompt hand motion only when essential.
Background "breathing" or wobble — the motion prompt is too aggressive for the scene; add "static camera, subtle motion" and re-run.
Logos and label text melting — keep the labeled side facing camera and rotation under 90 degrees; restore the logo in post for hero shots.
First-frame color shift — Seedance occasionally regrades frame 1; bake your color grade into the source and state "preserve original color grading".
Frozen subject, moving camera only — your prompt described only camera motion; give the subject one explicit verb.
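
Two of the fixes above are pure prompt additions, so they can be applied automatically on a re-run. The sketch below is a minimal example. The artifact keys and the helper `apply_prompt_fixes` are names we made up for illustration, and the other fixes in the list (reference photos, shorter clips, source-image changes) are left out on purpose because they cannot be solved with prompt text.

```python
# Prompt-level fixes taken from the list above; keys are illustrative labels.
PROMPT_FIXES = {
    "background_wobble": "Static camera, subtle motion.",
    "color_shift": "Preserve original color grading.",
}


def apply_prompt_fixes(prompt: str, artifacts: list[str]) -> str:
    """Append the prompt-level fix for each known artifact, skipping duplicates."""
    result = prompt.strip()
    for name in artifacts:
        fix = PROMPT_FIXES.get(name)
        # Unknown artifacts are ignored; fixes already present are not repeated.
        if fix and fix.lower() not in result.lower():
            result = f"{result} {fix}"
    return result
```

For example, `apply_prompt_fixes("The bottle rotates 90 degrees.", ["background_wobble"])` returns `"The bottle rotates 90 degrees. Static camera, subtle motion."`.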

Frequently Asked Questions

How do I turn a photo into a video with AI?

Upload the photo as an init image in an image-to-video model like Seedance 2.0 on the Sora2U generator, write a short motion prompt describing only what should change ("she turns and smiles, slow dolly in"), and generate. Clips run up to 15 seconds at 1080p with native audio.

When should I use image-to-video instead of text-to-video?

Use i2v whenever the first frame must be exact — real products, specific faces, locked compositions, or stills you already own. Use text-to-video when the model should invent the scene or the shot is camera-motion driven.

What resolution should my source image be for AI video?

At least the height of your target output — 1080px or more for 1080p video — and pre-cropped to the target aspect ratio (16:9, 9:16, or 1:1). Upscale soft images before upload, because i2v amplifies blur into smear.

How do I keep a face consistent in AI image-to-video?

Combine an init image with 2–3 reference photos of the same face from different angles, keep clips at 5–10 seconds, limit head rotation to under 30 degrees, and refer to "the person from the reference images" instead of re-describing features in text.

Can Seedance animate product photos for ads?

Yes — this is one of its strongest commercial uses. Upload a packshot, animate one property per clip (rotation, lid opening, pour), keep the camera static, and let native audio add the click or fizz. A 5-second hero clip costs well under a dollar in pay-as-you-go terms.