One of our characters in Suzune has a secret: she’s beautiful, but she hides it.
Mao is designed as the “plain office worker” archetype — no makeup, thick glasses, hair with no styling, clothes buttoned up to the collar. But when she decides to dress up? Different person. Crimson lipstick, smokey eyeshadow, hair styled, glasses off. Same woman, completely different energy.
The technical challenge: how do you make an AI image generation system produce two visually distinct versions of the same character, automatically, based on the story context?
Here’s how we solved it.
The Problem: One Face, Two Modes
Most AI character image systems use one of two approaches:
- LoRA models — A fine-tuned model that always produces the same character look
- Appearance prompts — Text descriptions that guide generation (inconsistent across images)
Neither handles the “transformation” use case well:
- A LoRA trained on “plain Mao” can’t produce “glamorous Mao” — it always generates the trained look
- Pure text prompts (“add makeup”) produce wildly inconsistent results, often changing facial features entirely
What we needed: a system that swaps the character’s visual foundation based on what’s happening in the story.
The Solution: img2img Base Image Variants
Instead of LoRA, Mao uses img2img generation — the model takes a base image as input and transforms it according to the prompt while preserving the core facial structure.
The key insight: use different base images for different character states.
```
characters/mao/
├── base_image.png          ← Default: glasses, no makeup, plain
├── base_image_serious.png  ← Activated: no glasses, makeup-ready face
└── character.yaml
```
Both images are the same person with the same facial structure. But the base composition is different:
| Feature | Default (base_image.png) | Serious (base_image_serious.png) |
|---|---|---|
| Glasses | On | Off |
| Expression | Neutral, slightly stiff | Confident, composed |
| Hair | Unstyled, parted in middle | Slightly refined |
| Makeup | None | Minimal (ready for prompt to add more) |
The img2img pipeline then applies the scene prompt on top of whichever base is active. The result: consistent facial identity with dramatically different vibes.
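As a concrete sketch, the img2img request might be assembled like this. The payload fields are illustrative only: RunPod serverless workers define their own schemas, so the actual field names depend on the deployed worker, and `build_img2img_payload` is a hypothetical helper, not our production code.

```python
import base64
from pathlib import Path

def build_img2img_payload(base_image: Path, scene_prompt: str,
                          outfit_prompt: str, strength: float = 0.55) -> dict:
    """Assemble an img2img request: active base image + combined prompt.

    The base image is sent base64-encoded; 0.5-0.6 denoising keeps the
    face recognizable while still letting the outfit change.
    """
    return {
        "init_image": base64.b64encode(base_image.read_bytes()).decode("ascii"),
        "prompt": f"{scene_prompt}, {outfit_prompt}",
        "denoising_strength": strength,
    }
```

Whichever base image the keyword check selected simply gets passed in as `base_image`; nothing downstream needs to know a swap happened.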
Automatic Switching Logic
The base image swap is triggered by outfit keywords. When the character’s current outfit includes makeup items, the system automatically switches to the serious base image.
Here’s the core logic:
```python
# Detect makeup in current outfit → switch base image
makeup_keywords = (
    "lipstick", "eyeshadow", "mascara",
    "eyeliner", "makeup", "cosmetics",
)

if any(kw in clothing.lower() for kw in makeup_keywords):
    serious_path = base_image.parent / "base_image_serious.png"
    if serious_path.exists():
        base_image = serious_path  # swap!
```
That’s it. The detection is deliberately simple — keyword matching on the outfit string. No ML, no complex logic. Just: does the outfit mention makeup? → use the makeup-ready face.
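Wrapped up as a self-contained function (the name `select_base_image` is illustrative, not our production API), the entire check fits in a few lines:

```python
from pathlib import Path

# Keywords from the outfit string that indicate makeup is in play.
MAKEUP_KEYWORDS = (
    "lipstick", "eyeshadow", "mascara",
    "eyeliner", "makeup", "cosmetics",
)

def select_base_image(base_image: Path, clothing: str) -> Path:
    """Return the serious variant if the outfit mentions makeup and the file exists."""
    if any(kw in clothing.lower() for kw in MAKEUP_KEYWORDS):
        serious_path = base_image.parent / "base_image_serious.png"
        if serious_path.exists():
            return serious_path
    return base_image
```

The `exists()` guard means characters without a serious variant silently fall back to their default base, so the same code path works for every character.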
Why Keywords Instead of Something Smarter?
Because outfit descriptions are generated by our own system (the wardrobe engine), so we control the vocabulary. We don’t need fuzzy matching when we write the prompts ourselves.
The wardrobe entry for Mao’s “queen mode” makeup looks like this:
```json
{
  "name": "Seductive Queen Makeup (Mao exclusive)",
  "prompt": "seductive queen makeup, no glasses, dark crimson lipstick, heavy mascara, smokey eyeshadow, sharp eyeliner, flawless porcelain skin",
  "tags": ["makeup", "seductive", "queen", "mao"],
  "exclusive": ["mao"]
}
```
When this wardrobe item is active, the clothing string contains “lipstick” and “eyeshadow” → the keyword check fires → base image swaps.
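To make that connection concrete, here is a sketch of how an active wardrobe item's prompt ends up in the clothing string the keyword check sees. The `build_clothing_string` helper is hypothetical; the assumption (consistent with the snippet above) is that the clothing string is just the concatenated prompts of the active items.

```python
MAKEUP_KEYWORDS = ("lipstick", "eyeshadow", "mascara", "eyeliner", "makeup", "cosmetics")

def build_clothing_string(active_items: list[dict]) -> str:
    # Clothing string = concatenated prompts of the active wardrobe items.
    return ", ".join(item["prompt"] for item in active_items)

queen_mode = {
    "name": "Seductive Queen Makeup (Mao exclusive)",
    "prompt": ("seductive queen makeup, no glasses, dark crimson lipstick, "
               "heavy mascara, smokey eyeshadow, sharp eyeliner, flawless porcelain skin"),
    "tags": ["makeup", "seductive", "queen", "mao"],
}

clothing = build_clothing_string([queen_mode])
triggered = any(kw in clothing.lower() for kw in MAKEUP_KEYWORDS)  # True: "lipstick" matches
```

Because the wardrobe engine writes these prompts, a trigger word appearing in the string is a deliberate authoring decision, not a lucky substring match.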
The Visual Pipeline
Here’s the full flow when Mao generates a selfie:
```
1. LLM decides to send a selfie (tool call)
        │
2. Load daily outfit from daily_outfit.json
        │
3. Check outfit for makeup keywords
        │
   ┌────┴────┐
   │ No      │ Yes
   ▼         ▼
base_image  base_image_serious
   .png        .png
   │         │
   └────┬────┘
        │
4. Encode base image as input to SDXL img2img
        │
5. Apply scene prompt + outfit prompt
        │
6. RunPod generates image → send to chat
```
The beauty of this approach: the switching is invisible to the LLM. The character doesn’t need to know which base image is being used. It just describes the scene, and the image system handles the visual consistency.
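The six steps can be sketched as a single orchestration function. Everything here is illustrative rather than our production pipeline: the `daily_outfit.json` schema with a `clothing` key is an assumption, and the actual RunPod call is abstracted behind `generate_fn`.

```python
import json
from pathlib import Path

MAKEUP_KEYWORDS = ("lipstick", "eyeshadow", "mascara", "eyeliner", "makeup", "cosmetics")

def generate_selfie(character_dir: Path, scene_prompt: str, generate_fn) -> bytes:
    # 2. Load the daily outfit (assumed schema: {"clothing": "..."}).
    outfit = json.loads((character_dir / "daily_outfit.json").read_text())
    clothing = outfit["clothing"]

    # 3. Check for makeup keywords and pick the base image accordingly.
    base_image = character_dir / "base_image.png"
    serious = character_dir / "base_image_serious.png"
    if any(kw in clothing.lower() for kw in MAKEUP_KEYWORDS) and serious.exists():
        base_image = serious

    # 4-6. Hand the chosen base image and combined prompt to the img2img backend.
    return generate_fn(base_image=base_image, prompt=f"{scene_prompt}, {clothing}")
```

Note that the LLM only contributes `scene_prompt`; the base-image decision happens entirely inside this function.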
Designing the Base Images
The hardest part isn’t the code — it’s creating base images that work well with img2img.
Rules for Good Base Images
1. Same face, different energy
Both base images must be unmistakably the same person. Use the same:
- Facial proportions and structure
- Skin tone
- Hair color and approximate length
- Eye color and shape
Change only:
- Expression
- Accessories (glasses on/off)
- Hair styling
- Makeup level
2. Neutral enough for img2img to work with
Base images shouldn’t be too detailed or specific. The img2img pipeline needs room to apply the scene prompt. If the base image is already wearing a red dress, it’s harder for the model to generate her in a white one.
Keep base images in simple clothing or a neutral composition.
3. High denoising strength for outfit changes, low for facial consistency
This is the balancing act of img2img:
| Denoising Strength | Effect |
|---|---|
| 0.3–0.4 | Face stays very consistent, but outfits barely change |
| 0.5–0.6 | Good balance — face recognizable, outfits change well |
| 0.7–0.8 | Outfits change dramatically, but face may drift |
We typically use 0.5–0.6 for general scenes and 0.4–0.5 for close-up portraits where facial consistency matters most.
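One way to encode that rule of thumb in code (the shot-type categories and exact values are our own convention, not an SDXL parameter):

```python
def denoise_strength(shot: str) -> float:
    """Pick an img2img denoising strength per shot type.

    Close-ups favor facial consistency; general scenes favor outfit flexibility.
    """
    strengths = {
        "closeup": 0.45,  # 0.4-0.5: face locked in, subtle outfit changes
        "general": 0.55,  # 0.5-0.6: good balance for full scenes
        "costume": 0.70,  # 0.7+: dramatic outfit swaps, face may drift
    }
    return strengths.get(shot, 0.55)  # default to the balanced setting
```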
Why This Matters for Character Design
The base image variant system isn’t just a technical feature — it’s a character design tool.
Mao’s “transformation” is part of her character arc. She’s designed as someone who doesn’t care about appearances, but when the moment calls for it — a date, a confrontation, or a moment of confidence — she transforms.
This mirrors a popular character archetype in anime and manga: the “hidden beauty” (隠れ美人). The gap between her daily appearance and her full potential is part of what makes her compelling.
Without dynamic visuals, this character concept falls flat. You can describe the transformation in text, but showing it in generated images makes it visceral.
The Gap Effect
In character design, “gap moe” (ギャップ萌え) — the appeal of contrast — is one of the most powerful tools:
- A tough character showing vulnerability
- A serious character laughing unexpectedly
- A plain character revealing hidden beauty
The base image switching system lets us express gap moe visually, not just textually. And users absolutely notice.
Extending the Pattern
While Mao uses “plain → glamorous,” the same pattern works for many character transformations:
| Character Concept | Default Base | Variant Base | Trigger |
|---|---|---|---|
| Hidden beauty (Mao) | Glasses, no makeup | No glasses, confident | Makeup keywords |
| Warrior/fighter | Casual clothes | Battle-ready, intense | Weapon/armor keywords |
| Shy character | Averted gaze, closed posture | Eye contact, open posture | High affection score |
| Idol/performer | Offstage casual | Stage costume, spotlights | Performance scene |
The trigger doesn’t have to be outfit-based either. You could switch base images based on:
- Affection score — character’s expression softens as they warm up to you
- Time of day — sleepy base image at night, energetic during day
- Emotional state — happy/sad/angry base images selected by the emotion detection system
We haven’t implemented all of these yet, but the architecture supports them with minimal changes.
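Generalizing the trigger beyond outfit keywords is mostly a data change. A sketch of what that could look like, with hypothetical predicates and variant names (none of these rules exist in our system yet, per the table above):

```python
from pathlib import Path

# Each rule: (predicate over the generation context, variant suffix).
# First matching rule whose variant file exists wins; otherwise fall
# through to the default base image.
VARIANT_RULES = [
    (lambda ctx: "lipstick" in ctx.get("clothing", "").lower(), "serious"),
    (lambda ctx: ctx.get("affection", 0) >= 80, "warm"),
    (lambda ctx: ctx.get("hour", 12) >= 23, "sleepy"),
]

def pick_base_image(character_dir: Path, ctx: dict) -> Path:
    for predicate, suffix in VARIANT_RULES:
        candidate = character_dir / f"base_image_{suffix}.png"
        if predicate(ctx) and candidate.exists():
            return candidate
    return character_dir / "base_image.png"
```

Because missing variant files simply fall through, characters can opt into new states one image at a time.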
Implementation Checklist
If you want to add this to your own bot:
- Create 2+ base images of your character with consistent facial features but different compositions
- Name them consistently — `base_image.png` (default), `base_image_[variant].png` (variants)
- Add keyword detection in your image generation pipeline — check the outfit/scene string for trigger words
- Swap the base image path before passing it to your img2img model
- Test denoising strength — find the sweet spot between outfit flexibility and facial consistency
The code change is genuinely small (< 10 lines). The character design work is where the real effort goes.
Tools We Use
- SDXL via RunPod for img2img generation
- Custom character base images (hand-generated, then refined)
- Python + Telegram for the bot pipeline
If you’re not ready to build your own image pipeline, platforms like Candy AI and DreamGF offer built-in character customization with appearance variants — not as flexible as a custom system, but a good starting point.
This article is part of WaifuStack’s series on building AI roleplay bots. See also: Prompt Engineering for Immersive Roleplay and From Idea to Production.
Working on something similar? Share your approach on X.