Composite Multi-Character Image Generation

You have 40+ characters. Each has a different face, body type, and daily outfit. You want to show four of them relaxing together in a cafe — one image, posted automatically every evening.

The obvious approach: generate four individual images, stitch them into a grid. Four API calls, four queue waits, four chances of failure. At $0.02 per generation, that’s $0.08 per day, $2.40 per month. Not terrible, but not elegant either.

Here’s how we do it with one API call.

Table of contents

The Feature: Break Room Cafe
The Naive Approach (And Why It’s Wasteful)
The Composite Approach
The Result
Cost Comparison
When This Doesn’t Work
Scheduling and Automation
Time-of-Day Lighting
The Blog-Worthy Takeaway

The Feature: Break Room Cafe

Every evening at 6 PM JST, Suzune automatically selects four characters, generates a single image of them sitting in a cafe with their current day’s outfits, and posts it to the collection channel. The goal is twofold:

Rediscovery — Characters you haven’t chatted with recently get selected, reminding you they exist
Outfit showcase — Characters with high affection scores get selected too, showing off their best looks

The selection algorithm is simple: take the top two characters by combined affection and excitement score, then add two random picks for variety. Only characters that have an outfit for today and a base image qualify.
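
The selection rule above can be sketched in a few lines. This is a minimal illustration, not Suzune's actual code: the `Candidate` record and its field names are hypothetical, and the "affection + excitement" score is assumed to be a plain sum.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical record; field names are illustrative only.
    name: str
    affection: float
    excitement: float
    has_outfit_today: bool
    has_base_image: bool


def select_characters(candidates, top_n=2, random_n=2, rng=None):
    """Pick top_n by affection + excitement, then random_n from the remainder."""
    rng = rng or random.Random()
    # Only characters with today's outfit and a base image qualify.
    eligible = [c for c in candidates if c.has_outfit_today and c.has_base_image]
    ranked = sorted(eligible, key=lambda c: c.affection + c.excitement, reverse=True)
    top, rest = ranked[:top_n], ranked[top_n:]
    # Sample without replacement so nobody is picked twice.
    extras = rng.sample(rest, min(random_n, len(rest)))
    return top + extras
```

Passing a seeded `random.Random` as `rng` makes the random picks reproducible, which helps when debugging a particular evening's lineup.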

The Naive Approach (And Why It’s Wasteful)

The straightforward solution looks like this:

For each of 4 characters:
  1. Load character's base_image
  2. Build prompt with appearance + outfit
  3. Call img2img API → wait for result
  4. Save individual image
Stitch 4 images into grid with Pillow
Post to channel

This works. But it has problems:

4 API calls at $0.02 each = $0.08/run
4 queue waits — if the endpoint is cold, each job queues separately
Partial failure handling — what if 1 of 4 fails? Do you post a 3-image grid?
No scene coherence — each character is generated independently, so lighting, perspective, and atmosphere don’t match
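
To put a number on the partial-failure problem: if each call fails independently with probability p, the chance that at least one of n calls fails is 1 - (1 - p)^n. The sketch below assumes independence, and the 5% failure rate in the usage note is purely illustrative, not a measured figure.

```python
def prob_any_failure(p_single: float, n_calls: int) -> float:
    """Chance that at least one of n independent calls fails."""
    return 1.0 - (1.0 - p_single) ** n_calls
```

With an illustrative p of 0.05, `prob_any_failure(0.05, 4)` is about 0.185, compared with 0.05 for a single call.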

The Composite Approach

Here’s what we actually do. One API call:

1. Select 4 characters
2. Tile their base_images into a single horizontal strip
3. Build one prompt with positional descriptions
4. Single img2img call with the composite as input
5. Post the result

The key insight: img2img doesn’t care that your input image is a composite. If you tile four face images side by side and describe what each position should look like, the model transforms the entire composite as one scene.

Step 1: Build the Composite Base Image

Each character in Suzune has a base_image.png — a reference photo that anchors their face and body type during img2img generation. We take these four images and tile them horizontally:

from PIL import Image

def build_composite_base(char_infos, output_path):
    cell_w, cell_h = 384, 1024  # narrow portrait slices
    n = len(char_infos)
    # Dark fill color; it is fully covered once every cell is pasted
    canvas = Image.new("RGB", (cell_w * n, cell_h), (60, 50, 45))

    for i, info in enumerate(char_infos):
        img = Image.open(info["char"].base_image).convert("RGB")
        # Center-crop to narrow portrait ratio
        orig_w, orig_h = img.size
        target_ratio = cell_w / cell_h
        orig_ratio = orig_w / orig_h
        if orig_ratio > target_ratio:
            # Source is wider than the cell: trim left and right equally
            new_w = int(orig_h * target_ratio)
            left = (orig_w - new_w) // 2
            img = img.crop((left, 0, left + new_w, orig_h))
        else:
            # Source is narrower than the cell: trim top and bottom equally
            new_h = int(orig_w / target_ratio)
            top = (orig_h - new_h) // 2
            img = img.crop((0, top, orig_w, top + new_h))
        img = img.resize((cell_w, cell_h), Image.LANCZOS)
        canvas.paste(img, (i * cell_w, 0))

    # quality only affects lossy formats such as JPEG
    canvas.save(output_path, quality=95)

The result is a 1536x1024 image with four narrow portraits side by side. This becomes the input to img2img.

Why center-crop to narrow slices? Because each character’s base image is a standard portrait (768x1024). If we tiled them at full width, the composite would be 3072 pixels wide — too large for most endpoints. Cropping to 384px keeps each character’s face centered while fitting four into a 1536px canvas.
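
To check the crop arithmetic without loading any images, the box computation can be pulled out into a pure function. This is a sketch that mirrors the branch logic of build_composite_base; the name `center_crop_box` is made up for illustration.

```python
def center_crop_box(orig_w, orig_h, cell_w=384, cell_h=1024):
    """Return the (left, top, right, bottom) box for a centered crop."""
    target_ratio = cell_w / cell_h
    if orig_w / orig_h > target_ratio:
        # Wider than the cell: keep full height, trim the sides.
        new_w = int(orig_h * target_ratio)
        left = (orig_w - new_w) // 2
        return (left, 0, left + new_w, orig_h)
    # Narrower than the cell: keep full width, trim top and bottom.
    new_h = int(orig_w / target_ratio)
    top = (orig_h - new_h) // 2
    return (0, top, orig_w, top + new_h)
```

For a standard 768x1024 base image this gives `(192, 0, 576, 1024)`: the middle 384 columns are kept at full height, so the subsequent resize to 384x1024 does not rescale anything.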

Step 2: Build Positional Prompts

This step tells the model who goes where. Instead of one character description, we describe all four with positional labels:

POSITION_LABELS = {
    2: ["left", "right"],
    3: ["left", "center", "right"],
    4: ["far left", "center-left", "center-right", "far right"],
}

def build_prompt(char_infos):
    n = len(char_infos)
    positions = POSITION_LABELS.get(n)
    if positions is None:
        # Without this guard, positions[i] below fails with a TypeError
        raise ValueError(f"no position labels defined for {n} characters")
    person_descs = []

    for i, info in enumerate(char_infos):
        char = info["char"]
        pos = positions[i]
        # Face/hair from appearance_prompt, clothing stripped
        face = strip_clothing_from_appearance(char.appearance_prompt)
        outfit = summarize_clothing(info["clothing"])
        desc = f"{pos}: {face}, relaxed smile, wearing {outfit}"
        person_descs.append(desc)

    prompt = (
        f"{n}girls, candid wide-angle photo of a cafe interior, "
        + ", ".join(person_descs)
        + ", in a stylish modern Japanese cafe, each at separate tables, "
        "natural relaxed poses, photorealistic"
    )
    return prompt

A real prompt looks like this (abbreviated):

4girls, candid wide-angle photo of a cafe interior,
far left: platinum blonde long hair, gyaru makeup, wearing black top and skinny jeans,
center-left: short dark hair, demon costume with horns,
center-right: chubby curvy, milk tea blonde hair, wearing tailored jacket,
far right: straight black hair, thick glasses, wearing navy business suit,
in a stylish modern Japanese cafe, each at separate tables,
natural relaxed poses, photorealistic
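
The two helpers used by build_prompt, strip_clothing_from_appearance and summarize_clothing, are not shown in this post. The stand-ins below are hypothetical: they assume both inputs are comma-separated tag strings, and the keyword list is invented for illustration. They are only meant to make build_prompt runnable in isolation.

```python
# Hypothetical keyword list; a real implementation would need a fuller one.
CLOTHING_HINTS = (
    "wearing", "dress", "shirt", "jacket", "skirt",
    "jeans", "suit", "costume", "uniform",
)


def strip_clothing_from_appearance(appearance_prompt: str) -> str:
    """Drop comma-separated tags that look like clothing (stand-in logic)."""
    tags = [t.strip() for t in appearance_prompt.split(",")]
    kept = [
        t for t in tags
        if t and not any(hint in t.lower() for hint in CLOTHING_HINTS)
    ]
    return ", ".join(kept)


def summarize_clothing(clothing: str, max_items: int = 2) -> str:
    """Keep the first few outfit tags so the prompt stays short (stand-in logic)."""
    items = [t.strip() for t in clothing.split(",") if t.strip()]
    return " and ".join(items[:max_items])
```

Under these assumptions, `summarize_clothing("black top, skinny jeans, sneakers")` yields `"black top and skinny jeans"`, matching the shape of the far-left description in the example prompt.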

Step 3: Generate

One subprocess call to our existing generate.py pipeline:

# python holds the interpreter path (for example sys.executable);
# cmd is then run as a subprocess.
cmd = [
    python, "scripts/generate.py",
    "--prompt", prompt,
    "--image", str(composite_base_path),
    "--width", "1536",
    "--height", "1024",
    "--strength", "0.55",
    "--negative-prompt", ANTI_FIGURINE_NEGATIVES,
    "--output", str(output_path),
]

The critical parameter is --strength 0.55. This controls how much the model deviates from the input image:

Strength	Effect
0.30	Too rigid — characters stay in their original poses, no cafe scene
0.45	Standard for single-character img2img
0.55	Good balance — faces preserved, cafe scene applied, outfits transformed
0.65	Too much deviation — faces start blending between characters

We also add anti-figurine negative prompts (figurine, doll, plastic skin, wax figure, mannequin, 3d render) because composite inputs sometimes trigger the model into producing doll-like renders instead of photorealistic people.
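
A runnable sketch of the command assembly follows. The flags are the ones shown above; everything else is an assumption: the function name `build_generate_cmd` is made up, the negative prompt is assumed to be a single comma-joined string, and `sys.executable` stands in for whatever interpreter path the real pipeline uses.

```python
import sys

# Assumed format: the terms listed above, joined into one string.
ANTI_FIGURINE_NEGATIVES = (
    "figurine, doll, plastic skin, wax figure, mannequin, 3d render"
)


def build_generate_cmd(prompt, composite_base_path, output_path,
                       n_chars=4, cell_w=384, height=1024, strength=0.55):
    """Assemble the argument list for scripts/generate.py."""
    return [
        sys.executable, "scripts/generate.py",
        "--prompt", prompt,
        "--image", str(composite_base_path),
        # Output width tracks the composite: one cell per character.
        "--width", str(cell_w * n_chars),
        "--height", str(height),
        "--strength", str(strength),
        "--negative-prompt", ANTI_FIGURINE_NEGATIVES,
        "--output", str(output_path),
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which raises `CalledProcessError` if the script exits with a non-zero status. Deriving the width from the character count keeps the output dimensions in step with the composite when fewer than four characters are selected.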

The Result

From the composite base (four separate portraits tiled), the model produces a coherent cafe scene where each character sits at their own table, wearing their assigned outfit, with consistent lighting and perspective. One API call, ~14 seconds of generation time, $0.02.

Cost Comparison

Approach	API Calls	Cost/Run	Monthly (30 daily runs)
Individual + Grid	4	$0.08	$2.40
Composite	1	$0.02	$0.60
Group txt2img (no base)	1	$0.02	$0.60

The third option — pure txt2img with no base image — costs the same but produces inconsistent faces. Without base images anchoring each character’s identity, the model tends to make everyone look similar or blend features between characters.
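
The table's arithmetic, worked in integer cents to avoid floating-point rounding. The $0.02 price and the 30-day month come from the figures above; the function name is illustrative.

```python
CENTS_PER_CALL = 2  # $0.02 per generation


def monthly_cost_cents(calls_per_run, runs_per_day=1, days=30):
    """Monthly spend in cents for a given number of API calls per run."""
    return CENTS_PER_CALL * calls_per_run * runs_per_day * days


individual = monthly_cost_cents(4)  # four calls per run
composite = monthly_cost_cents(1)   # one call per run
savings = individual - composite
```

Here `individual` is 240 cents ($2.40), `composite` is 60 cents ($0.60), and the difference is $1.80 per month, a 75% reduction.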

When This Doesn’t Work

The composite approach has limitations:

Characters without base images can’t participate. If a character only has a LoRA model or a text-only appearance prompt, they can’t be tiled into the composite. In Suzune, 5 out of 49 characters fall into this category — we simply exclude them from selection.

Face blending at high strength. Above 0.60 strength, the model starts mixing facial features between adjacent characters. The narrow crop helps (more spacing between faces), but it’s still a risk.

Fixed aspect ratio. The output must match the composite’s dimensions (1536x1024 for 4 characters). You can’t easily produce a square or tall output without rethinking the tiling.

Maximum ~4-5 characters. At 384px per character, 5 characters = 1920px wide. Beyond that, each character gets too narrow for the model to preserve identity.
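
The size limits above can be enforced with a small guard before tiling. This is a sketch: the lower bound of 2 is taken from POSITION_LABELS, the upper bound of 5 from the rough limit stated above, and the function name is hypothetical. POSITION_LABELS as shown has no entry for 5 characters, so one would need to be added before going that wide.

```python
def composite_size(n_chars, cell_w=384, cell_h=1024, min_chars=2, max_chars=5):
    """Return (width, height) of the composite, or raise if n_chars is out of range."""
    if not min_chars <= n_chars <= max_chars:
        raise ValueError(
            f"expected {min_chars}-{max_chars} characters, got {n_chars}"
        )
    return (cell_w * n_chars, cell_h)
```

`composite_size(4)` gives `(1536, 1024)` and `composite_size(5)` gives `(1920, 1024)`, matching the dimensions quoted in this section.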

Scheduling and Automation

The feature runs as part of Suzune’s daily scheduler:

BREAK_ROOM_HOUR = 18  # JST 6 PM

async def daily_scheduler_loop(...):
    ran_break_room_today = False
    while True:
        await asyncio.sleep(300)
        now = datetime.now(JST)
        if now.hour != BREAK_ROOM_HOUR:
            ran_break_room_today = False
        if now.hour == BREAK_ROOM_HOUR and not ran_break_room_today:
            ran_break_room_today = True
            await _schedule_break_room(slog)
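
The once-per-day gating in that loop can be isolated as a pure function, which makes it easy to test without waiting on asyncio.sleep. This is a sketch: the name `poll` is made up, and JST is defined here as a fixed UTC+9 offset, which may differ from how the real code defines it.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))  # Japan Standard Time: UTC+9, no DST
BREAK_ROOM_HOUR = 18  # JST 6 PM


def poll(now, ran_today):
    """Return (should_run, new_ran_today) for one pass through the loop."""
    if now.hour != BREAK_ROOM_HOUR:
        # Outside the target hour: reset the flag for the next day.
        return False, False
    if not ran_today:
        # First poll inside the target hour: fire once.
        return True, True
    return False, True
```

Because the loop sleeps 300 seconds between checks, the job fires on the first poll that lands inside the 18:00 hour, up to five minutes late. The flag lives in memory, so a process restart during that hour would trigger a second run unless the state is persisted somewhere.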

It also supports manual triggering through the GM console (break_room tool) and CLI (python scripts/break_room.py --chars rina,mao,anri).
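
A minimal sketch of how the `--chars` flag might be parsed with argparse. Only the flag name and the comma-separated format come from the command above; the fallback to automatic selection when the flag is omitted is an assumption.

```python
import argparse


def split_names(value):
    """Turn 'rina,mao,anri' into ['rina', 'mao', 'anri']."""
    return [name.strip() for name in value.split(",") if name.strip()]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Generate a break room image.")
    parser.add_argument(
        "--chars",
        type=split_names,
        default=None,  # assumption: None means "select characters automatically"
        help="comma-separated character names, e.g. rina,mao,anri",
    )
    return parser.parse_args(argv)
```

Calling `parse_args(["--chars", "rina,mao,anri"]).chars` returns `["rina", "mao", "anri"]`.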

Time-of-Day Lighting

Since the image is generated at a specific time, we inject time-appropriate lighting into the scene prompt:

TIME_OF_DAY_LIGHTING = {
    "morning": "soft warm morning sunlight, golden hour glow",
    "afternoon": "bright natural daylight, clear sky light",
    "evening": "warm orange sunset lighting, golden hour",
    "night": "city lights in background, cool blue ambient light",
    "late_night": "dim moody lighting, soft warm lamp light only",
}

An 18:00 JST post gets the evening preset — warm sunset light pouring through the cafe windows. A manual midday trigger gets bright daylight. The same code is shared with our single-character image pipeline, ensuring visual consistency across the system.
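
Mapping the current hour to one of those presets might look like the sketch below. The hour boundaries are assumptions chosen so that 18:00 maps to "evening" and midday to "afternoon", as described above; the real cutoffs are not given in this post.

```python
def time_of_day_key(hour):
    """Map an hour (0-23) to a TIME_OF_DAY_LIGHTING key. Boundaries are assumed."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be 0-23, got {hour}")
    if 5 <= hour < 11:
        return "morning"
    if 11 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 20:
        return "evening"
    if hour >= 20:
        return "night"
    return "late_night"  # 00:00 to 04:59
```

The returned key indexes straight into the dictionary, for example `TIME_OF_DAY_LIGHTING[time_of_day_key(now.hour)]`.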

The Blog-Worthy Takeaway

The composite base image trick is applicable anywhere you need multiple AI-generated characters in one scene:

Dating sim galleries — Show multiple characters side by side
Group chat avatars — Generate a group photo from individual portraits
NPC crowd scenes — Tile background character images and let img2img compose a crowd

The key formula: tile reference images as model input + positional text prompts = coherent multi-character scene at single-image cost.

It’s not perfect — you’re trading individual image quality for cost efficiency and scene coherence. But for a daily automated showcase feature? One API call beats four, every time.

Suzune generates these cafe scenes daily at 6 PM JST. Each image features four characters from our 40+ roster, selected by affection score and randomness, wearing whatever outfit the AI dressed them in that morning. The whole pipeline — selection, composite, generation, posting — takes about 20 seconds and costs two cents.