“How much does it cost to run an AI bot?”
It’s the first question every aspiring bot builder asks. The answer from most guides is vague hand-waving about “it depends.” Here’s our actual breakdown.
Suzune runs multiple characters with image generation, multi-model routing, and persistent memory — all for $30–50/month. Here’s exactly where every dollar goes.
The Full Cost Breakdown
| Component | Service | Monthly Cost | % of Total (at midpoint) |
|---|---|---|---|
| Primary LLM | DeepSeek V3.2 via OpenRouter | $15–25 | 44% |
| Quality Rewrites | Claude Haiku 4.5 via Anthropic | $5–10 | 17% |
| NPC & Scenes | Gemini Flash + GLM-5 | $3–5 | 9% |
| Image Generation | SDXL on RunPod | $5–10 | 17% |
| Server | VPS (hosting the bot) | $5 | 11% |
| Domain | Cloudflare Registrar | ~$1 | 2% |
| Total | | $34–56 | 100% |
No hidden costs. No “enterprise tier required.” This is a real production system.
How many messages does your bot handle per day? That number determines everything — whether you’re looking at $15/month or $150/month.
Where the Money Goes: LLM API Costs
What Models Actually Cost
LLM pricing varies by 10–100x depending on the model:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Relative Cost per Call (2,000 in / 400 out) |
|---|---|---|---|
| Gemini 2.0 Flash | $0.10 | $0.40 | 1x (baseline) |
| DeepSeek V3.2 | $0.25 | $0.40 | 1.8x |
| Gemini 2.5 Flash | $0.30 | $2.50 | 4.4x |
| GLM-5 | $0.80 | $2.56 | 7.3x |
| Claude Haiku 4.5 | $0.80 | $4.00 | 8.9x |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 33x |
Per output token, Claude Sonnet costs 37.5x as much as DeepSeek ($15.00 vs. $0.40 per 1M); per input token, 12x ($3.00 vs. $0.25). For a chatbot that generates thousands of messages, model selection is the single biggest cost lever. (For a deeper look at which APIs actually allow adult content and what that costs, see Choosing the Right LLM API for Adult Content.)
Why DeepSeek V3.2 Is the Sweet Spot
At $0.25/$0.40 per million tokens, DeepSeek V3.2 is absurdly cheap for its quality level. A typical roleplay message:
- Input: ~2,000 tokens (system prompt + history)
- Output: ~400 tokens (response)
- Cost per message: ~$0.0007 (less than a tenth of a cent)
At 100 messages/day, that’s ~$2/month for the primary model. Even at 500 messages/day, it’s under $10.
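The per-message arithmetic above is easy to script for other volumes or models. A minimal sketch, using the DeepSeek V3.2 prices from the table and assuming a 30-day month; the function names are our own, not from any library:

```python
def cost_per_message(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one API call, given prices per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def monthly_cost(per_message: float, messages_per_day: int, days: int = 30) -> float:
    """Linear projection: every message is assumed to cost the same."""
    return per_message * messages_per_day * days


# Typical roleplay call on DeepSeek V3.2 ($0.25 in / $0.40 out per 1M tokens)
per_msg = cost_per_message(2_000, 400, 0.25, 0.40)
print(f"per message: ${per_msg:.5f}")                        # per message: $0.00066
print(f"100/day: ${monthly_cost(per_msg, 100):.2f}/month")   # 100/day: $1.98/month
print(f"500/day: ${monthly_cost(per_msg, 500):.2f}/month")   # 500/day: $9.90/month
```

Swapping in another row of the pricing table (for example `0.80, 4.00` for Haiku) gives the same comparison for any model.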
The Multi-Model Cost Advantage
If we ran everything on Claude Haiku:
- 100 messages/day on Haiku ($0.80 input / $4.00 output per 1M tokens) ≈ $15–20/month just for chat, assuming ~5,000–6,000 input and ~400 output tokens per call
With our multi-model approach:
- Primary chat (DeepSeek): $5–8
- Rewrites (Claude, ~40% of messages): $3–5
- NPCs (Gemini, ~10% of calls): $1–2
- Same quality, 40–60% less cost
The full rewrite pipeline — why we chain DeepSeek into Claude and how it avoids censorship — is covered in Quality Rewriting Pipeline: DeepSeek to Claude.
Cost Optimization Techniques
1. Prompt Caching (Biggest Win)
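Anthropic offers prompt caching: the system prompt is cached server-side, and subsequent calls that hit the cache pay only 1/10th of the normal input price for the cached tokens. Writing the cache costs 25% more than normal input, and the default cache lives for 5 minutes, refreshed each time it is read.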
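```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

system_prompt = "..."  # the ~3,000-token character prompt, built elsewhere
conversation_messages = [{"role": "user", "content": "..."}]  # prior turns, oldest first

# Anthropic native API with cache control.
# The system prompt is a top-level `system` parameter, not a "system"-role message.
response = client.messages.create(
    model="claude-haiku-4-5",  # Haiku 4.5 alias; check Anthropic's current model list
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},  # cache everything up to this block
        }
    ],
    messages=conversation_messages,  # a plain list (Python has no JS-style `...` spread)
)
```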
Our character system prompts are ~3,000 tokens. Without caching, that’s $0.0024 per call. With caching (after the first call), it’s $0.00024 — 10x cheaper.
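For a character that handles 100 messages/day, this saves ~$6/month if every message goes through a Claude rewrite and the cache stays warm (3,000 calls × $0.00216 saved per call). At the ~40% rewrite rate above, the saving on Claude rewrite calls is closer to $2.60/month.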
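To sanity-check those numbers, here is a sketch of the monthly system-prompt cost with and without caching. It assumes Haiku's $0.80/1M input price from the table above, Anthropic's multipliers for the 5-minute cache (1.25x to write, 0.1x to read), and a cache that never goes cold between calls; `system_prompt_cost` is a hypothetical helper.

```python
BASE_INPUT_PER_M = 0.80   # Claude Haiku input price used in this article ($ per 1M tokens)
CACHE_WRITE_MULT = 1.25   # writing a 5-minute cache entry costs 1.25x base input
CACHE_READ_MULT = 0.10    # reading from the cache costs 0.1x base input


def system_prompt_cost(prompt_tokens: int, calls: int, cached: bool) -> float:
    """Dollar cost of sending the same system prompt on `calls` requests.

    Assumes the cache never expires between calls, so only the first call writes it.
    """
    per_call = prompt_tokens * BASE_INPUT_PER_M / 1_000_000
    if not cached or calls == 0:
        return per_call * calls
    return per_call * CACHE_WRITE_MULT + per_call * CACHE_READ_MULT * (calls - 1)


calls_per_month = 100 * 30
without_cache = system_prompt_cost(3_000, calls_per_month, cached=False)
with_cache = system_prompt_cost(3_000, calls_per_month, cached=True)
print(f"without: ${without_cache:.2f}  with: ${with_cache:.2f}  "
      f"saved: ${without_cache - with_cache:.2f}")
# without: $7.20  with: $0.72  saved: $6.48
```

In practice the 5-minute TTL means any gap longer than that between messages triggers a fresh (1.25x) write, so real savings depend on how bursty conversations are.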
2. Task-Based Model Selection
Not every task needs the best (most expensive) model:
| Task | Model | Why |
|---|---|---|
| Character chat | DeepSeek V3.2 | Best cost/quality/freedom |
| Quality rewrite | Claude Haiku | Best prose, but only for non-NSFW |
| NPC concepts | Gemini 2.5 Flash | Creative + cheap ($0.30 input) |
| NPC rewrites | Gemini 2.0 Flash | Cheapest overall ($0.10 in / $0.40 out) |
| Scene descriptions | GLM-5 | Best atmosphere writing |
The config.yaml comments tell the story:
```yaml
npc:
  director: google/gemini-2.5-flash     # "cost 1/10"
  rewrite: google/gemini-2.0-flash-001  # "cost 1/22"
```
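In code, task-based selection can be as small as a lookup table. A minimal sketch: only the two Gemini IDs come from the config above; the DeepSeek, Claude, and GLM-5 identifiers, the task names, and the `pick_model` helper are placeholders to replace with your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    provider: str  # where the request is sent
    model: str     # model identifier for that provider


ROUTES = {
    "character_chat": Route("openrouter", "deepseek/deepseek-v3.2"),  # placeholder ID
    "quality_rewrite": Route("anthropic", "claude-haiku-4-5"),        # placeholder ID
    "npc_director": Route("openrouter", "google/gemini-2.5-flash"),
    "npc_rewrite": Route("openrouter", "google/gemini-2.0-flash-001"),
    "scene_description": Route("openrouter", "z-ai/glm-5"),           # placeholder ID
}


def pick_model(task: str) -> Route:
    """Return the configured route; unknown tasks fall back to the cheap chat model."""
    return ROUTES.get(task, ROUTES["character_chat"])


print(pick_model("npc_rewrite").model)   # google/gemini-2.0-flash-001
print(pick_model("unknown_task").model)  # deepseek/deepseek-v3.2
```

Falling back to the cheapest general model, rather than raising, is a design choice: a misnamed task costs a fraction of a cent instead of a failed reply.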
3. Context Window Management
Fewer tokens in = lower cost. This sounds obvious, but we didn’t take it seriously early on — and paid for it.
For the first few weeks, we were passing the entire conversation history on every call. No compression, no summarization. A long conversation with one of Suzune’s characters could balloon to 15,000–20,000 tokens per request. At the time we didn’t notice because costs were low at low volume. Then one weekend a user had an unusually long session — 200+ messages — and our monthly estimate jumped $8 overnight just from that one conversation thread.
That’s when we built the tiered memory system, which compresses old conversation into summaries:
| What | With Compression (tokens) | Without Compression (tokens) |
|---|---|---|
| System prompt | ~3,000 | ~3,000 |
| Last 15 messages (raw) | ~2,000 | ~2,000 |
| Older history (summary) | ~500 | ~5,000+ |
| Lorebook entries | ~300 | ~300 |
| Total per call | ~5,800 | ~10,300+ |
Compression saves 4,500 tokens per call. At 100 calls/day on DeepSeek, that’s **$3/month** saved. Not dramatic in isolation — but it compounds with everything else. (See Long-Term Memory for AI Chatbots for how the compression pipeline works.)
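The budget in that table is enforced when the prompt is assembled. A minimal sketch, assuming the older-history summary is produced elsewhere and approximating tokens as characters ÷ 4 (a rough heuristic, not a real tokenizer); `build_context` and its parameters are hypothetical, and the message format is the OpenAI-style list that OpenRouter accepts.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)


def build_context(system_prompt: str, history: list[dict], summary: str,
                  lore_entries: list[str], keep_last: int = 15) -> list[dict]:
    """System prompt + lorebook entries + summary of older turns, then the last N raw turns."""
    parts = [system_prompt]
    if lore_entries:
        parts.append("Relevant lore:\n" + "\n".join(lore_entries))
    if summary and len(history) > keep_last:
        parts.append("Earlier in this conversation (summary):\n" + summary)
    # Only the most recent turns are sent verbatim (guard: history[-0:] would be everything).
    recent = history[-keep_last:] if keep_last > 0 else []
    return [{"role": "system", "content": "\n\n".join(parts)}, *recent]


history = [{"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
           for i in range(40)]
ctx = build_context("(character system prompt)", history,
                    "(summary of messages 0-24)", ["(lorebook entry)"])
print(len(ctx))  # 16: one system message + the 15 most recent turns
print(sum(estimate_tokens(m["content"]) for m in ctx))  # rough input size of the call
```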
4. Skip Rewrites for Explicit Content
The quality rewrite pipeline (DeepSeek V3.2 → Claude) is the second biggest cost. But for explicit NSFW scenes, Claude will censor the content anyway, so we skip the rewrite entirely.
This isn’t just a cost optimization — it’s also a latency optimization. One less API call = faster response.
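The decision is a single branch in front of the second API call. A sketch, assuming an upstream check has already flagged the draft as explicit (how that is detected isn't covered here); `respond`, `is_explicit`, and `fake_rewrite` are hypothetical names, with `fake_rewrite` standing in for the Claude call.

```python
import asyncio
from typing import Awaitable, Callable


async def respond(draft: str, is_explicit: bool,
                  rewrite: Callable[[str], Awaitable[str]]) -> str:
    """Send the DeepSeek draft as-is for explicit scenes; otherwise rewrite it.

    Skipping the rewrite saves one paid API call and its round-trip latency.
    """
    if is_explicit:
        return draft
    return await rewrite(draft)


async def fake_rewrite(text: str) -> str:  # stand-in for the Claude rewrite call
    return text.upper()


print(asyncio.run(respond("hello", is_explicit=True, rewrite=fake_rewrite)))   # hello
print(asyncio.run(respond("hello", is_explicit=False, rewrite=fake_rewrite)))  # HELLO
```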
5. Circuit Breaker for Failed Rewrites
Before we added this, a single bad conversation loop burned $15 in wasted Claude calls overnight. A user had pushed a conversation into territory that kept triggering Claude’s content filters — so every outgoing message was: DeepSeek generates a response, Claude rewrites it, Claude censors the rewrite, we send the original anyway. Repeat 200 times while the user slept with the tab open. We paid for 200 Claude rewrites that contributed exactly nothing.
Now: if Claude’s rewrite gets censored twice in a row, we activate a 10-minute circuit breaker. All rewrites are skipped until it resets. This prevents burning API credits on rewrites that will just be discarded.
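A minimal sketch of that breaker, using the two-strike threshold and 10-minute cooldown from the text. The class name and the injectable clock (so it can be tested without waiting) are our own choices; detecting a "censored" rewrite is assumed to happen elsewhere, and whether to keep one breaker globally or one per conversation is left open.

```python
import time
from typing import Callable


class RewriteCircuitBreaker:
    """Skip rewrites for a cooldown period after N consecutive censored rewrites."""

    def __init__(self, threshold: int = 2, cooldown_s: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.consecutive_censored = 0
        self.open_until = 0.0  # rewrites are skipped until this timestamp

    def allow_rewrite(self) -> bool:
        return self.clock() >= self.open_until

    def record(self, censored: bool) -> None:
        if not censored:
            self.consecutive_censored = 0  # a clean rewrite resets the streak
            return
        self.consecutive_censored += 1
        if self.consecutive_censored >= self.threshold:
            self.open_until = self.clock() + self.cooldown_s
            self.consecutive_censored = 0  # count afresh once the cooldown ends


now = [0.0]
breaker = RewriteCircuitBreaker(clock=lambda: now[0])
breaker.record(censored=True)
breaker.record(censored=True)
print(breaker.allow_rewrite())  # False: tripped for 10 minutes
now[0] += 601
print(breaker.allow_rewrite())  # True: cooldown over
```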
That incident also taught us that cost monitoring needs to be real-time, not end-of-month. We now have a simple alert that fires if any single conversation thread spends more than $2 in an hour.
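A real-time version of that alert can be a rolling one-hour window per conversation thread. A sketch under assumptions: the caller records the dollar cost of each API call (computed from token counts, for instance), and what happens when `record` returns `True` is up to your notification setup; `SpendMonitor` is a hypothetical class.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SpendMonitor:
    """Flag any thread whose spend over the trailing window exceeds a limit."""

    def __init__(self, limit_usd: float = 2.0, window_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit_usd = limit_usd
        self.window_s = window_s
        self.clock = clock
        self._events: dict[str, deque] = defaultdict(deque)  # thread_id -> (ts, cost)
        self._totals: dict[str, float] = defaultdict(float)

    def record(self, thread_id: str, cost_usd: float) -> bool:
        """Log one API call; return True if the thread is now over the limit."""
        now = self.clock()
        events = self._events[thread_id]
        events.append((now, cost_usd))
        self._totals[thread_id] += cost_usd
        while events and events[0][0] <= now - self.window_s:  # drop aged-out calls
            _, old_cost = events.popleft()
            self._totals[thread_id] -= old_cost
        return self._totals[thread_id] > self.limit_usd


now = [0.0]
monitor = SpendMonitor(clock=lambda: now[0])
for _ in range(30):  # one $0.10 call per minute
    over = monitor.record("thread-42", 0.10)
    now[0] += 60
print(over)  # True: $3.00 within the hour

later = [0.0]
m2 = SpendMonitor(clock=lambda: later[0])
m2.record("t", 1.5)
later[0] = 3601
after_expiry = m2.record("t", 1.0)
print(after_expiry)  # False: the $1.50 call has aged out of the window
```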
Image Generation Costs
RunPod Serverless
We use SDXL on RunPod’s serverless platform for image generation:
| Tier | Cost per Image | Use Case |
|---|---|---|
| RunPod Public | $0.02 | Standard quality |
| Custom Endpoint (2K) | $0.14 | Higher quality |
| Custom Endpoint (4K) | $0.24 | Maximum quality |
At $0.02/image and ~10 images/day, that’s $6/month. Even at $0.14/image, it’s under $50/month for moderate usage.
Optimization: Generate Only When Needed
The character doesn’t spam images — it generates them when the scene calls for it or when the user requests one. The affection system also gates image sending: characters below a certain affinity threshold don’t send unsolicited selfies.
This organically limits image generation to meaningful moments, keeping costs in check. If you want to extend this further into video generation — short looping clips at scene transitions — the cost math gets more interesting. We break that down in Adding Video Generation to AI Characters.
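The gating rule for still images, described two paragraphs up, reads naturally as a small predicate. A sketch: the text doesn't give the real threshold or how "the scene calls for it" is detected, so `AFFINITY_THRESHOLD`, `scene_wants_image`, and the integer affinity scale are illustrative assumptions.

```python
AFFINITY_THRESHOLD = 50  # illustrative value; the real threshold is not specified


def should_generate_image(user_requested: bool, scene_wants_image: bool,
                          affinity: int) -> bool:
    """Generate on explicit request; generate unprompted only when the scene
    calls for it and the character's affinity toward the user is high enough."""
    if user_requested:
        return True
    return scene_wants_image and affinity >= AFFINITY_THRESHOLD


print(should_generate_image(False, True, affinity=10))  # False: no unsolicited selfie
print(should_generate_image(False, True, affinity=80))  # True
```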
Server Costs
The bot itself is lightweight — it’s a Python async application that mostly waits for API responses.
| Provider | Specs | Monthly Cost |
|---|---|---|
| Oracle Cloud Free | 1 OCPU, 1GB RAM | $0 (free tier) |
| Hetzner CX22 | 2 vCPU, 4GB RAM | €4 (~$4.50) |
| DigitalOcean Basic | 1 vCPU, 1GB RAM | $6 |
| Contabo VPS S | 4 vCPU, 8GB RAM | €5 (~$5.50) |
We recommend Hetzner or Contabo for production. Oracle’s free tier works for testing but has reliability concerns.
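Small instances are enough because the bot is I/O-bound: while one conversation awaits a model response, the event loop serves the others. A toy illustration, with `asyncio.sleep` standing in for an API call (the helper names and the 0.2 s latency are arbitrary assumptions): ten overlapping calls finish in roughly the time of one.

```python
import asyncio
import time


async def fake_llm_call(conversation_id: int, latency_s: float = 0.2) -> str:
    """Stand-in for an HTTP request to a model API: the process just waits."""
    await asyncio.sleep(latency_s)
    return f"reply for conversation {conversation_id}"


async def handle_many(n: int) -> list[str]:
    # All n requests are in flight at once; the CPU sits idle while they wait.
    return await asyncio.gather(*(fake_llm_call(i) for i in range(n)))


start = time.perf_counter()
replies = asyncio.run(handle_many(10))
elapsed = time.perf_counter() - start
print(len(replies), f"{elapsed:.2f}s")  # 10 replies in roughly 0.2s, not 2s
```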
Important: The bot uses SQLite (not Postgres), so you don’t need a separate database server. Everything runs on one VPS.
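With SQLite the database is just a file next to the bot. A minimal sketch using Python's built-in `sqlite3` module; the table layout is illustrative, not Suzune's actual schema, and enabling WAL mode (so reads can proceed during a write) is our suggestion.

```python
import sqlite3


def open_db(path: str = "bot.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # on a file DB, readers don't block the writer
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY,
               character TEXT NOT NULL,
               role TEXT NOT NULL,
               content TEXT NOT NULL,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


conn = open_db(":memory:")  # use a real file path in production
with conn:  # commits the transaction on success
    conn.execute("INSERT INTO messages (character, role, content) VALUES (?, ?, ?)",
                 ("char_1", "user", "hi"))
rows = conn.execute("SELECT role, content FROM messages WHERE character = ?",
                    ("char_1",)).fetchall()
print(rows)  # [('user', 'hi')]
```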
Scaling Costs
What happens as usage grows?
| Usage Level | Messages/Day | Images/Day | Monthly Cost |
|---|---|---|---|
| Personal | 50–100 | 5–10 | $15–25 |
| Small group | 200–500 | 20–50 | $30–50 |
| Medium | 500–1,000 | 50–100 | $50–100 |
| Large | 1,000+ | 100+ | $100+ |
Costs scale linearly with usage. There’s no cliff where you suddenly need enterprise infrastructure. The architecture handles growth by simply paying for more API calls.
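Because everything except the server and domain is pay-per-use, the monthly bill is roughly a linear function of volume. A sketch of that model: the ~$6 fixed cost (VPS + domain) and $0.02/image come from the tables above, while the $0.003 blended LLM cost per message is an assumption you'd replace with your own measured average; `estimate_monthly_bill` is a hypothetical helper.

```python
def estimate_monthly_bill(messages_per_day: float, images_per_day: float,
                          cost_per_message: float, cost_per_image: float = 0.02,
                          fixed_monthly: float = 6.0, days: int = 30) -> float:
    """Fixed hosting (VPS + domain) plus usage-based LLM and image spend."""
    usage = days * (messages_per_day * cost_per_message
                    + images_per_day * cost_per_image)
    return fixed_monthly + usage


for msgs, imgs in [(100, 10), (200, 20), (400, 40)]:
    print(msgs, imgs, f"${estimate_monthly_bill(msgs, imgs, 0.003):.2f}")
# 100 10 $21.00
# 200 20 $36.00
# 400 40 $66.00
```

Each doubling of usage doubles the variable part ($15 → $30 → $60) while the fixed $6 stays put.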
The $50/month Stack (Recommended)
If you want to replicate Suzune’s setup:
| Component | Choice | Cost |
|---|---|---|
| VPS | Hetzner CX22 | $5 |
| LLM Router | OpenRouter | $0 (pay per use) |
| Primary Model | DeepSeek V3.2 | ~$15 |
| Quality Layer | Claude Haiku (direct Anthropic) | ~$8 |
| NPCs/Scenes | Gemini Flash | ~$3 |
| Images | RunPod Serverless | ~$8 |
| Domain | Cloudflare | ~$1 |
| Total | | ~$40 |
That gets you: multiple characters, image generation, multi-model quality, persistent memory, and lorebooks. For the price of two movie tickets.
Comparison: Build vs Buy
| | Build Your Own | Candy AI Subscription | FantasyGF |
|---|---|---|---|
| Monthly cost | $30–50 | $5.99–12.99 | $12.99–24.99 |
| Full customization | Yes | Limited | Limited |
| Unlimited messages | Yes (pay per token) | Plan-dependent | Plan-dependent |
| Image generation | Custom (your LoRA) | Built-in | Built-in |
| Multiple characters | Yes | Limited | Limited |
| Technical skill needed | High | None | None |
| Time investment | High initially | Zero | Zero |
We run our own — but if you just want to talk to a character without the engineering, $5.99/month for Candy AI is genuinely the smarter move.
For the model comparison behind these cost decisions, see DeepSeek vs Claude vs Gemini for Roleplay. For the architecture, see From Idea to Production.