WaifuStack

Adding I2V Video Generation to AI Characters

Your AI character can smile, pout, or strike a pose — in a still image. But what if she could wiggle her toes, toss her hair in the wind, or slowly cross her legs?

That’s what we built this week. Image-to-Video (I2V) generation takes a character’s existing portrait and adds natural motion to it. The result: 5-second animated clips that feel like the character is alive.

Here’s what worked, what didn’t, and the architecture we landed on.

Before & After

The source image (left) and the I2V-generated video (right):

Source character image

I2V generated animation

The Landscape: I2V Models in 2026

Image-to-Video has matured rapidly. The key player for our use case is WAN 2.1 I2V (by Alibaba), available through multiple API providers:

| Provider | Endpoint | Cost (720p, 5s) | Speed | Quality |
| --- | --- | --- | --- | --- |
| RunPod | wan-2-1-i2v-720 | $0.30 | ~3 min | Good |
| fal.ai | fal-ai/wan-i2v | ~$0.25 | ~1 min | Better |

Both run the same underlying model, but fal.ai’s infrastructure produced noticeably better motion and expression quality in our testing. More on that later.


What I2V Can and Can’t Do

We burned a few hours learning what I2V can’t do. Here’s the short version.

Works great

Doesn’t work (yet)

This is a critical distinction. I2V excels at animating what’s already there — it can’t reliably add or remove objects from the scene. We tested this extensively with a shoe removal scenario (6 iterations), and the model consistently failed to maintain the “shoe-free” state after the removal motion.

So: generate the state first (via img2img), then animate it. Fighting I2V on state changes is a waste of API credits. (For more on our image generation pipeline, see Auto-Generating Character Portraits with LoRA.)


Architecture: The Two-Step Workflow

Given I2V’s limitations, the most effective workflow is:

Step 1: img2img → Generate the desired character state
Step 2: I2V    → Animate that static image

For example, to create a “barefoot toe wiggle” video:

  1. Start with a character image wearing shoes
  2. Use img2img (Qwen Image Edit) to generate a barefoot version with the same pose
  3. Feed the barefoot image into I2V with a motion prompt

This separation of concerns produces dramatically better results than trying to make I2V handle the state change itself. We use a similar approach for dynamic character visuals — generate the right base image, then build on top of it.
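
A minimal sketch of that ordering, in case it helps to see it as code. The function name `make_state_then_animate` and the two callables are hypothetical stand-ins (they are not taken from our script); the only point is that the img2img output, not the original image, is what gets handed to I2V.

```python
from pathlib import Path
from typing import Callable

# Hypothetical signatures: each step takes an input image and a prompt and
# returns the path of the file it produced.
EditFn = Callable[[Path, str], Path]
AnimateFn = Callable[[Path, str], Path]


def make_state_then_animate(
    source_image: Path,
    state_prompt: str,
    motion_prompt: str,
    edit_image: EditFn,
    animate_image: AnimateFn,
) -> Path:
    """Run img2img first, then feed its output into I2V."""
    # Step 1: img2img produces the desired static state (e.g. barefoot).
    state_image = edit_image(source_image, state_prompt)
    # Step 2: I2V only animates what is already present in that image.
    return animate_image(state_image, motion_prompt)


if __name__ == "__main__":
    # Stand-in steps so the sketch runs without any API calls.
    def fake_edit(image: Path, prompt: str) -> Path:
        return image.with_name(image.stem + "_edited.png")

    def fake_animate(image: Path, prompt: str) -> Path:
        return image.with_suffix(".mp4")

    video = make_state_then_animate(
        Path("character.png"),
        "same pose, barefoot",
        "wiggling toes playfully, static camera",
        fake_edit,
        fake_animate,
    )
    print(video)
```

Passing the two steps in as callables keeps the ordering testable without touching either API.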


Implementation

The Script

Our video generation script (generate_video.py) handles the full pipeline:

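# Simplified flow; download_with_retry is a helper defined elsewhere in the script
import fal_client

def generate_video_fal(cfg, image_path, prompt, duration, output_path, **kwargs):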
    # 1. Upload source image
    image_url = fal_client.upload_file(str(image_path))

    # 2. Build arguments
    arguments = {
        "prompt": prompt,
        "image_url": image_url,
        "num_inference_steps": 30,
        "num_frames": max(41, int(duration * 16) + 1),
        "guidance_scale": 5.0,
        "resolution": "720p",
        "enable_safety_checker": False,
    }

    # 3. Submit and wait
    result = fal_client.subscribe("fal-ai/wan-i2v", arguments=arguments)

    # 4. Download result
    video_url = result["video"]["url"]
    download_with_retry(video_url, output_path)
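
The `num_frames` expression above is worth a quick worked example. The sketch below pulls it into a hypothetical helper, `frames_for_duration`, and reads the multiplier as 16 frames per second with a floor of 41 frames. That reading is an inference from the expression itself, and whether a provider accepts every resulting frame count is not something this sketch checks.

```python
def frames_for_duration(duration_s: float, fps: int = 16, min_frames: int = 41) -> int:
    """Mirror of max(41, int(duration * 16) + 1) from the script above."""
    # The +1 adds the starting frame on top of duration * fps.
    return max(min_frames, int(duration_s * fps) + 1)


for seconds in (2, 3, 5, 10):
    print(f"{seconds}s -> {frames_for_duration(seconds)} frames")
```

For the 5-second clips discussed in this post, this works out to 81 frames; anything under 2.5 seconds is clamped to the 41-frame floor.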

Key design decisions:

Retry and Recovery

API reliability varies. Our production setup includes:

# Upload: 3 retries with exponential backoff
# Generation: 2 submission attempts (re-queue on failure)
# Download: 5 retries (CDN 409 errors happen ~5% of the time)
# Fallback: If all fal.ai attempts fail → auto-switch to RunPod

This layered approach means a single CDN hiccup doesn’t waste a $0.25 generation that already completed successfully.
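
One way the layers could be expressed, as a rough sketch rather than our production code: `with_retries` and `first_successful` are hypothetical helper names, the base delay is an assumed value, and the provider calls are left as plain callables.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def first_successful(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first result that succeeds."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return provider()
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

With helpers like these, the upload could be wrapped with `attempts=3`, the download with `attempts=5`, and the fal.ai and RunPod generation calls passed to `first_successful` in that order. Retrying the download separately is what keeps a finished generation from being thrown away.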

Web UI Integration

Users can generate videos directly from the image gallery:

  1. Open any character image in the lightbox
  2. Click the 🎬 button
  3. Select a motion preset or write a custom prompt
  4. Choose duration (3-10 seconds)
  5. Hit generate — video is sent to Telegram when ready

Generate Video dialog with motion presets

The preset system stores reusable motion prompts in config.yaml:

video:
  presets:
    wiggle_toes:
      label: "Toe Wiggle"
      prompt: "bare feet playfully wiggling toes, relaxed sitting pose,
               minimal body movement, static camera"
    hair_wind:
      label: "Wind in Hair"
      prompt: "wind gently blowing through hair, soft natural sway,
               relaxed expression, static camera, smooth motion"
    leg_cross:
      label: "Leg Cross"
      prompt: "slowly crossing and uncrossing legs, elegant confident pose,
               minimal upper body movement, static camera"

This eliminates prompt engineering friction — select a preset, click generate, done.
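
A small sketch of how a preset lookup might sit behind the dialog. The `PRESETS` dict stands in for the `video.presets` section after the YAML has been parsed, and `resolve_prompt` is a hypothetical helper. Letting a custom prompt override the preset is an assumption made for this sketch, not a description of how the UI actually resolves the two.

```python
from typing import Dict, Optional

# Stand-in for the parsed video.presets section of config.yaml.
PRESETS: Dict[str, Dict[str, str]] = {
    "wiggle_toes": {
        "label": "Toe Wiggle",
        "prompt": "bare feet playfully wiggling toes, relaxed sitting pose, "
                  "minimal body movement, static camera",
    },
    "hair_wind": {
        "label": "Wind in Hair",
        "prompt": "wind gently blowing through hair, soft natural sway, "
                  "relaxed expression, static camera, smooth motion",
    },
    "leg_cross": {
        "label": "Leg Cross",
        "prompt": "slowly crossing and uncrossing legs, elegant confident pose, "
                  "minimal upper body movement, static camera",
    },
}


def resolve_prompt(
    presets: Dict[str, Dict[str, str]],
    preset_key: Optional[str] = None,
    custom_prompt: Optional[str] = None,
) -> str:
    """Return the motion prompt to send to I2V."""
    # Assumption: a non-empty custom prompt takes priority over any preset.
    if custom_prompt and custom_prompt.strip():
        return custom_prompt.strip()
    if preset_key is None:
        raise ValueError("either a preset or a custom prompt is required")
    if preset_key not in presets:
        raise ValueError(f"unknown preset: {preset_key}")
    return presets[preset_key]["prompt"]


print(resolve_prompt(PRESETS, preset_key="hair_wind"))
```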


Provider Comparison: RunPod vs fal.ai

We ran identical tests (same image, same prompt, same parameters) on both providers:

| Metric | RunPod | fal.ai |
| --- | --- | --- |
| Generation time | 184s | 39-67s |
| Motion quality | Good | Better (more natural) |
| Expression quality | Good | Better (subtle emotions) |
| Cost per video | $0.30 | ~$0.25 |
| API stability | Very stable | Occasional CDN issues |
| Cold start | Rare | Can add 10+ min |

fal.ai wins on quality and speed. But without retry logic, you’ll hit CDN errors often enough to regret not having a RunPod fallback.


Prompt Engineering for Motion

Video prompts are fundamentally different from image prompts. Key principles:

1. Always specify camera behavior

❌ "she waves hello"
✅ "waving hand gently toward camera, static camera"

Without static camera, I2V may introduce dramatic panning or zooming.

2. Constrain body movement

❌ "dancing happily"
✅ "subtle body sway, minimal movement, relaxed pose"

I2V handles subtle motion well but produces artifacts with large movements.

3. Describe the motion, not the result

❌ "barefoot, no shoes" (describes a state)
✅ "wiggling toes playfully, feet swaying back and forth" (describes motion)

The model animates what you describe. State descriptions don’t give it motion to work with.

4. Add “smooth” and “slow motion” for quality

In our tests, these keywords consistently reduced jittery or overly fast animation.
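
The principles above can be folded into a small helper so the keywords are never forgotten. This is only a sketch: `build_motion_prompt` is a hypothetical name, and the default keyword list is simply the terms mentioned in this section.

```python
from typing import Sequence


def build_motion_prompt(
    motion: str,
    extras: Sequence[str] = ("static camera", "smooth", "slow motion"),
) -> str:
    """Append camera and pacing keywords that the motion description lacks."""
    parts = [part.strip() for part in motion.split(",") if part.strip()]
    lowered = motion.lower()
    for keyword in extras:
        # Skip keywords that already appear, e.g. "smooth" in "smooth motion".
        if keyword not in lowered:
            parts.append(keyword)
    return ", ".join(parts)


print(build_motion_prompt("waving hand gently toward camera"))
print(build_motion_prompt("wind gently blowing through hair, static camera, smooth motion"))
```

The helper only adds keywords; it cannot turn a state description such as "barefoot, no shoes" into a motion description, so principle 3 still has to be applied by hand.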


What About Frame Interpolation (FLF2V)?

We also tested First-Last-Frame-to-Video (FLF2V) — a model that takes two images (start state and end state) and generates a video transitioning between them.

The idea was compelling: generate a “shoes on” image and a “shoes off” image, then let FLF2V create the removal animation.

Result: it doesn’t work for state changes. FLF2V treats the two images as visual keyframes and morphs between them — it doesn’t understand the semantic meaning of “removing shoes.” In our tests, shoes flew through the air or feet disappeared entirely.

FLF2V is useful for pose transitions (sitting → standing) or expression changes (neutral → smile), but not for object manipulation.


Cost Analysis

At our current usage (~5 videos/day):

| Provider | Monthly Cost | Notes |
| --- | --- | --- |
| fal.ai I2V | ~$37.50 | 5 videos × 30 days × $0.25 |
| RunPod I2V | ~$45.00 | Same volume × $0.30 |

Video generation adds roughly $1.25/day to our infrastructure costs. For the engagement value it provides (characters that move feel dramatically more alive), this is a clear win. For context on our overall cost structure, see Running an AI Bot on $50 a Month.
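
The arithmetic behind those figures, as a quick sketch for plugging in your own volume. `monthly_cost` is a hypothetical helper, and the 30-day month is the same simplification used in the table.

```python
def monthly_cost(videos_per_day: int, price_per_video: float, days: int = 30) -> float:
    """Videos per day x days x price per video, rounded to cents."""
    return round(videos_per_day * days * price_per_video, 2)


print(monthly_cost(5, 0.25))  # fal.ai at ~$0.25 per video
print(monthly_cost(5, 0.30))  # RunPod at $0.30 per video
```

These totals do not include retries or failed generations that still get billed.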


What’s Next


Key Takeaways

I2V is production-ready for character animation — the quality is there and the cost is under $40/month at moderate usage. The main lesson: don’t fight the model on state changes. Generate the state you want with img2img first, then animate it. Use fal.ai for quality, keep RunPod as a fallback, and save your working prompts as presets so you’re not re-engineering motion descriptions every time.

