The Complete AI Video Workflow: Script to Publish in 2026

A repeatable AI video production pipeline — scene-by-scene scripting, shot lists as prompts, draft-cheap/finalize-premium strategy, stitching, captions, export specs, and a real 60-second example with costs.

June 10, 2026 · 15 min read · Sora2U Team

Anyone can generate a pretty 10-second clip. The gap between that and a finished, publishable video is where most AI video projects die: scripts written like blog posts instead of shot lists, clips that refuse to cut together, audio that jumps 6 dB between scenes, and on-screen text that came out as alien hieroglyphics. None of these are model problems — they are workflow problems.

This guide is the full pipeline we use for client deliverables: writing for AI in scene-sized units, turning the shot list directly into prompts, the draft-cheap/finalize-premium model strategy, stitching clips so the seams disappear, adding titles and captions in post, export specs per platform — and a complete worked example taking a 60-second brand video from blank page to publish with every dollar accounted for.

Step 1: Write for AI — scenes of 15 seconds or less

AI video is generated in clips, so the script must be written in clips. The hard ceiling in 2026: Seedance 2.0 tops out at 15 seconds, Kling at 10, Veo 3 at 8, Sora 2 at 4. Write every scene to fit inside the cap of the model that will render it, and the script becomes a 1:1 map to generations.

  • One scene = one location, one camera idea, one action, one beat of meaning. If a sentence contains "and then", it is two scenes.
  • Write what the camera sees, not what the viewer should feel. "She hesitates at the door, hand on the handle" beats "she is nervous".
  • Script the audio per scene too — with native-audio models the ambience is generated jointly, so it belongs in the script, not as an afterthought.
  • Read the narration aloud with a timer: comfortable speaking pace is ~2.4 words per second, so a 15-second scene carries at most ~36 spoken words.
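The word budget in the last bullet is easy to check mechanically. The sketch below is a hypothetical helper, not part of any tool named in this guide: it takes the ~2.4 words-per-second pace quoted above, computes the word ceiling for a scene length, and flags narration that runs over. The function names and the scene dictionary layout are assumptions made for illustration.

```python
import math

WORDS_PER_SECOND = 2.4  # comfortable speaking pace quoted in the guide


def max_words(scene_seconds: float, pace: float = WORDS_PER_SECOND) -> int:
    """Largest whole number of spoken words that fits in a scene."""
    # The small epsilon guards against float results like 35.999999.
    return math.floor(scene_seconds * pace + 1e-9)


def over_budget(scenes: dict[str, tuple[float, str]]) -> list[str]:
    """Return the names of scenes whose narration exceeds the word ceiling.

    `scenes` maps a scene name to (duration in seconds, narration text).
    Words are counted by splitting on whitespace, which is a rough proxy.
    """
    return [
        name
        for name, (seconds, narration) in scenes.items()
        if len(narration.split()) > max_words(seconds)
    ]


if __name__ == "__main__":
    script = {
        "window": (15, "Mornings worth slowing down for."),
        "too_long": (4, "This line has far too many words to fit in four seconds."),
    }
    print(max_words(15))        # word ceiling for a 15-second scene
    print(over_budget(script))  # scenes that need trimming
```

Reading aloud with a timer is still the real test; the count only catches scenes that are obviously overloaded before you spend credits on them.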

Step 2: The shot list IS the prompt list

Convert each scene into a prompt using a fixed five-slot pattern — subject, action, environment, camera, audio — and append the same style block to every single one ("warm amber grade, 35mm film grain, soft window light"). The repeated style block is what makes ten separately generated clips look like one film. Number the prompts to match scene numbers and keep them in a spreadsheet; this file, not the edit timeline, is your project's source of truth.

A worked conversion — Scene 3 of the example below: "Vertical 9:16. Slow push-in: a ceramic mug of black coffee on a wooden counter, steam rising, morning light raking from the left, blurred kitchen background. Style: warm amber grade, 35mm film grain. Audio: gentle pour, distant kettle hum." Ready-made patterns for this format live in the Seedance prompt library.
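One way to keep the five slots and the shared style block honest is to assemble prompts from structured rows rather than typing them freehand. The sketch below assumes a hypothetical `Shot` record and `build_prompt` function (neither comes from a Sora2U or Seedance API) and shows that the Scene 3 prompt above can be rebuilt from its slots.

```python
from dataclasses import dataclass

# Shared style block appended to every prompt (taken from the Scene 3 example).
STYLE_BLOCK = "warm amber grade, 35mm film grain"


@dataclass(frozen=True)
class Shot:
    """One row of the shot list: the five slots plus framing."""
    number: int
    subject: str
    action: str
    environment: str
    camera: str
    audio: str
    framing: str = "Vertical 9:16"


def build_prompt(shot: Shot, style: str = STYLE_BLOCK) -> str:
    """Assemble one prompt string; the slot order here is an assumption."""
    return (
        f"{shot.framing}. {shot.camera}: {shot.subject}, {shot.action}, "
        f"{shot.environment}. Style: {style}. Audio: {shot.audio}."
    )


scene_3 = Shot(
    number=3,
    subject="a ceramic mug of black coffee on a wooden counter",
    action="steam rising",
    environment="morning light raking from the left, blurred kitchen background",
    camera="Slow push-in",
    audio="gentle pour, distant kettle hum",
)

if __name__ == "__main__":
    print(build_prompt(scene_3))
```

Because the style block lives in one constant, changing the look of the whole film is a one-line edit followed by a re-run of the locked prompts.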

Step 3: Draft cheap, finalize premium

The single biggest cost lever in AI video is not negotiating prices — it is never paying premium rates for exploration. Generate every scene first on a cheap, fast tier; iterate prompts until composition and motion are right; then re-run only the locked prompts on a premium model:

  • Draft tier: Seedance 1.5 at 10 credits/sec on Sora2U, or Kling 2.0 with sub-90-second generations, or free tiers (Pika, Luma) when budget is zero. Expect 2–4 takes per scene.
  • Finalize tier: Seedance 2.0 at 20 credits/sec for anything with dialogue or where native audio matters; Veo 3 (9.2/10) for hero cinematic shots under 8 seconds. One take, occasionally two.
  • The discipline: a draft is approved when you would publish its composition, not its quality. Only then does it earn a premium pass. This alone cuts project costs 60–70% versus finalizing everything.

Step 4: Stitching — cuts, color, audio

AI clips are generated independently, so the edit's job is hiding the seams. Three rules cover 90% of it:

  • Hard cuts on motion. Cut while something moves — a head turn, a pour, a step. Crossfades and wipes draw the eye straight to the seam; cuts on action hide it. Trim the first and last ~10 frames of every AI clip, where artifacts cluster.
  • One color pass over everything. Even with a fixed style block, clips drift in exposure and white balance. Match shots against the best clip using scopes (or your editor's auto color match), then drop a single LUT across the whole timeline.
  • Level the audio to one target. Native-audio clips arrive at different loudness; normalize dialogue to around -14 LUFS for social, keep music 12–18 dB under speech, and lay one continuous room-tone or music bed under everything — a continuous bed is the cheapest trick for making separate generations feel like one scene.

CapCut and DaVinci Resolve both do all of the above for free; the multi-shot techniques in our Seedance 2.0 tutorial reduce how many seams you need to hide in the first place.
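For batch work outside an editor, the frame trim and the loudness target can also be scripted with ffmpeg. The sketch below only builds the command line; it assumes 30 fps clips, uses ffmpeg's `loudnorm` filter with the -14 LUFS target from the rules above, and picks true-peak and loudness-range values (-1.5 dBTP, 11 LU) as illustrative assumptions rather than recommendations from this guide. The function names and the CRF value are likewise hypothetical.

```python
TARGET_LUFS = -14.0  # social loudness target from the rules above


def trim_window(duration_s: float, fps: float = 30.0, frames: int = 10) -> tuple[float, float]:
    """Return (start, length) in seconds after dropping `frames` from each end."""
    pad = frames / fps
    if duration_s <= 2 * pad:
        raise ValueError("clip is too short to trim both ends")
    return pad, duration_s - 2 * pad


def clean_clip_cmd(src: str, dst: str, duration_s: float, fps: float = 30.0) -> list[str]:
    """Build an ffmpeg command that trims both ends and normalizes loudness."""
    start, length = trim_window(duration_s, fps)
    return [
        "ffmpeg",
        "-ss", f"{start:.3f}",   # skip the first ~10 frames
        "-i", src,
        "-t", f"{length:.3f}",   # stop ~10 frames before the end
        "-af", f"loudnorm=I={TARGET_LUFS:g}:TP=-1.5:LRA=11",
        "-c:v", "libx264", "-crf", "16",  # re-encode so the cut is frame-accurate
        "-c:a", "aac",
        dst,
    ]


if __name__ == "__main__":
    # Pass the list to subprocess.run(...) to actually execute it.
    print(" ".join(clean_clip_cmd("scene_03.mp4", "scene_03_clean.mp4", 15.0)))
```

Single-pass `loudnorm` is an approximation; a two-pass measurement or the editor's own loudness meter on the final mix is the safer check before publishing.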

Step 5: Titles and captions belong in post

Every 2026 model still mangles on-screen text — logos melt, letters invent themselves. So never prompt for text; generate clean footage and add type in the editor. Burned-in captions are non-negotiable for social (most feed viewing starts muted): auto-transcribe in CapCut or Resolve, correct names and numbers by hand, set them in a heavy sans-serif with high contrast, two lines maximum, and keep them inside the platform safe zones — the exact pixel margins per platform are in our TikTok and Reels ads guide.
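If captions are corrected by hand in a text file rather than in the editor, the two-line rule can be enforced while writing SubRip (.srt) cues. The sketch below is a minimal illustration: the 24-character line width is an assumed value that suits a heavy sans-serif on a vertical frame, not a platform requirement, and the function names are hypothetical.

```python
import textwrap


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def caption_lines(text: str, max_chars: int = 24, max_lines: int = 2) -> list[str]:
    """Wrap caption text and refuse anything longer than `max_lines`."""
    lines = textwrap.wrap(text, width=max_chars)
    if len(lines) > max_lines:
        raise ValueError(f"caption needs {len(lines)} lines; split it into two cues")
    return lines


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue: index, time range, then the wrapped caption."""
    body = "\n".join(caption_lines(text))
    return f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{body}\n"


if __name__ == "__main__":
    print(srt_cue(1, 46.5, 49.0, "Mornings worth slowing down for."))
```

A cue that raises the error is a signal to split the line at a natural pause, which usually reads better on a phone screen anyway.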

Step 6: Export specs per platform

| Destination | Aspect & resolution | Format | Notes |
|---|---|---|---|
| TikTok / Reels / Shorts | 9:16 · 1080×1920 | H.264 MP4, 10–12 Mbps, 30 fps | Burned-in captions; audio ~-14 LUFS |
| YouTube long-form | 16:9 · 1920×1080 or 4K | H.264/H.265 MP4, 35–45 Mbps for 4K | Upscaled 4K masters get a better transcode than native 1080p |
| Website / landing page | 16:9 or 1:1 · 1080p | H.264 MP4 + WebM fallback, under ~10 MB | Assume muted autoplay; the video must work silent |
| Paid ads (Meta / TikTok) | 9:16 master + 1:1 crop | H.264 MP4, under 500 MB | Export the 1:1 from the same master; re-check safe zones |

Export one master at the highest quality, then derive platform versions from it. Never re-compress a compressed export — generation loss stacks fast on AI footage because the codec is already fighting synthetic grain.
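Deriving every platform version from the master is simple to script. The sketch below builds ffmpeg commands from a small preset table; the two presets mirror rows of the spec table (12 Mbps at 30 fps for vertical social, 40 Mbps for 4K YouTube as a midpoint of the stated range), while the preset names, the 192k audio bitrate, and the `export_cmd` function are assumptions for illustration.

```python
# Preset values follow the export table; names and audio bitrate are assumed.
PRESETS = {
    "social_vertical": {"size": (1080, 1920), "bitrate": "12M", "fps": 30},
    "youtube_4k": {"size": (3840, 2160), "bitrate": "40M", "fps": None},
}


def export_cmd(master: str, out: str, preset: str) -> list[str]:
    """Build an ffmpeg command that encodes `master` to one platform preset."""
    p = PRESETS[preset]
    width, height = p["size"]
    cmd = ["ffmpeg", "-i", master, "-vf", f"scale={width}:{height}"]
    if p["fps"] is not None:
        cmd += ["-r", str(p["fps"])]  # only force a frame rate when the spec names one
    cmd += [
        "-c:v", "libx264", "-b:v", p["bitrate"],
        "-pix_fmt", "yuv420p",        # widest player compatibility
        "-c:a", "aac", "-b:a", "192k",
        "-movflags", "+faststart",    # lets playback start before the full download
        out,
    ]
    return cmd


if __name__ == "__main__":
    for name in PRESETS:
        # Every version reads from the same master, never from another export.
        print(" ".join(export_cmd("master.mov", f"{name}.mp4", name)))
```

Every command takes the master as its input, so no platform file is ever an encode of another encode.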

Worked example: a 60-second brand video, blank page to publish

The brief: a 60-second vertical launch video for a fictional specialty coffee brand, "Driftwood Coffee" — mood-driven, one spoken line, captions throughout. Here is the actual production log:

  1. Script (45 min, $0): four scenes × 15 seconds — dawn shoreline establishing shot; beans tumbling in a roaster; the slow pour in a kitchen; a woman at a window saying “Mornings worth slowing down for.”
  2. Prompts (30 min, $0): four prompts in the five-slot pattern, shared style block "muted dawn palette, soft film grain, gentle handheld".
  3. Draft pass ($0 cash / 1,800 credits): 3 takes per scene on Seedance 1.5 — 12 clips × 15s × 10 credits. Two scenes locked on take one; the pour needed all three.
  4. Finalize pass ($0 cash / 1,500 credits): four locked prompts re-run on Seedance 2.0 at 20 credits/sec (1,200 credits), plus one re-roll of the dialogue scene to fix a lip-sync drift (300 credits). Native audio carried the waves, the roaster, the pour, and the spoken line.
  5. Edit (90 min, $0): hard cuts on motion, one LUT, music bed 16 dB under the generated ambience, captions in CapCut, loudness normalized to -14 LUFS.
  6. Export and publish (15 min): one 1080×1920 master; TikTok, Reels, and Shorts versions with the caption block repositioned per safe zone.

Totals: about 3,300 credits on Sora2U — roughly $10–17 in pay-as-you-go terms — and around four hours of human time, most of it script and edit. The same deliverable from a small production company starts near $2,000 and takes two weeks. That is the entire argument for owning this workflow.
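The credit totals in the log can be re-derived from the per-second rates. The sketch below uses only numbers stated in the production log (10 and 20 credits per second, 15-second clips, 12 draft clips, 4 finals, 1 re-roll); the function and variable names are hypothetical, and the dollar conversion is left out because it depends on the credit package purchased.

```python
DRAFT_RATE = 10   # credits per second, Seedance 1.5 on Sora2U (from the log)
FINAL_RATE = 20   # credits per second, Seedance 2.0 on Sora2U (from the log)
CLIP_SECONDS = 15


def credits(clips: int, seconds: int, rate: int) -> int:
    """Credits consumed by `clips` generations of `seconds` each at `rate`."""
    return clips * seconds * rate


draft_pass = credits(clips=12, seconds=CLIP_SECONDS, rate=DRAFT_RATE)   # 4 scenes x 3 takes
final_pass = credits(clips=4, seconds=CLIP_SECONDS, rate=FINAL_RATE)    # one take per scene
reroll = credits(clips=1, seconds=CLIP_SECONDS, rate=FINAL_RATE)        # dialogue scene redo
total = draft_pass + final_pass + reroll

if __name__ == "__main__":
    print(f"draft {draft_pass}, final {final_pass}, re-roll {reroll}, total {total}")
```

Swapping in your own scene count, take count, and rates gives a quick budget before any generation is run.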

Frequently Asked Questions

How do I make an AI video from a script?

Split the script into scenes of 15 seconds or less, convert each scene into a structured prompt (subject, action, environment, camera, audio) with a shared style block, draft every scene on a cheap model, finalize locked prompts on a premium one, then stitch with hard cuts, one color pass, and leveled audio.

How much does a 60-second AI video cost in 2026?

With a draft-cheap/finalize-premium workflow, roughly $10–17 in pay-as-you-go terms (about 3,300 credits on Sora2U): drafts on Seedance 1.5 at 10 credits/sec and finals on Seedance 2.0 at 20 credits/sec. Finalizing everything at premium rates without drafting costs 2–3× more.

Why do my AI video clips look inconsistent when edited together?

Because each clip is generated independently. Fix it at the prompt level with a repeated style block and reference assets, then in post with a single LUT across the timeline, cuts on motion instead of crossfades, and one continuous music or room-tone bed.

Can AI video generators render on-screen text?

Not reliably — every mainstream 2026 model still garbles legible type. Generate clean footage and add all titles, captions, and logos in an editor like CapCut or DaVinci Resolve, keeping them inside each platform's safe zones.

What export settings should I use for AI video?

H.264 MP4 almost everywhere: 1080×1920 at 10–12 Mbps for TikTok/Reels/Shorts, 1080p or upscaled 4K at 35–45 Mbps for YouTube, and a sub-10 MB muted-autoplay version for websites. Always derive platform versions from one high-quality master instead of re-compressing exports.