Multi-Agent Economy Built on 3B Model (2026)

TL;DR

A developer built a functioning multi-agent economy with five AI traders, each running on Qwen2.5-3B, showing that small models can power complex simulations.
The system produced valid JSON on 100% of calls (75 of 75 in a 15-turn run) while generating emergent market dynamics, including price crashes, wealth inequality, and supply shocks.
Success came from engineered constraints (scarcity, spoilage, fuel crises) and structured prompting rather than model scale.
The project suggests that small models are reliable format generators but unreliable reasoners, and that the gap can be closed with design.

What Happened

A Build Small Hackathon submission called Thousand Token Wood ships a real-time economic simulation where five woodland creatures trade goods, gossip, and react to market shocks. Each agent runs on Qwen2.5-3B, served via vLLM on Modal. The system processes all five agents in a single batched GPU call per turn.
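
The one-call-per-turn pattern can be sketched as follows. This is a minimal illustration, not the project's code: the prompt wording, the `run_turn` and `build_prompt` helpers, and the `stub_backend` stand-in are all assumptions made so the example runs without a GPU.

```python
import json
from typing import Callable, Dict, List

Inventory = Dict[str, int]


def build_prompt(name: str, inventory: Inventory) -> str:
    """Render one agent's prompt (hypothetical wording)."""
    return (
        f"You are {name}. Inventory: {json.dumps(inventory, sort_keys=True)}. "
        'Reply with a single JSON object such as {"action": "hold"}.'
    )


def run_turn(
    inventories: Dict[str, Inventory],
    generate_batch: Callable[[List[str]], List[str]],
) -> Dict[str, dict]:
    """Build every agent's prompt, then make exactly one backend call."""
    names = list(inventories)
    prompts = [build_prompt(n, inventories[n]) for n in names]
    replies = generate_batch(prompts)  # one batched call for all agents
    if len(replies) != len(prompts):
        raise ValueError("backend must return one reply per prompt")
    return {n: json.loads(r) for n, r in zip(names, replies)}


def stub_backend(prompts: List[str]) -> List[str]:
    """Stand-in for the model server so the sketch runs anywhere."""
    return ['{"action": "hold"}' for _ in prompts]


if __name__ == "__main__":
    world = {
        "owl": {"honey": 4, "pebbles": 10},
        "woodcutter": {"firewood": 6, "pebbles": 3},
    }
    print(run_turn(world, stub_backend))
```

In a real deployment, `generate_batch` would presumably wrap the model server; vLLM's `LLM.generate` accepts a list of prompts, which is what makes a single call per turn possible.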

The simulation isn’t static. Players can inject historical market events—Tulip Mania, the South Sea Bubble, 1929 bank runs—reskinned as woodland folklore. When a “Run on Oona’s Hoard” rumor hits, the owl agent liquidates honey to raise pebbles, flooding supply and crashing the honey price from 10 to 3 over subsequent turns. None of it is scripted.

The project is fully open, including agent traces showing every prompt, response, and decision. A Gradio interface lets anyone watch the economy unfold in real time.

Why It Matters

The project shows that small models are viable for multi-agent systems where frontier models would be prohibitively slow and expensive. A simulation needing dozens of agent decisions per second cannot afford GPT-4 latency or pricing. The 3B approach makes real-time multi-agent interaction feasible.

The technical lesson cuts deeper: small models emit perfect structure but poor judgment. Qwen2.5-3B returned valid JSON on 100% of calls but made nonsensical decisions—agents tried to buy goods they produced in surplus. The fix wasn’t more parameters. It was telling each agent what it produces, computing its actual shortfalls, and providing one worked example. Decision quality jumped.
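
A rough sketch of that fix: compute the shortfalls in code and hand them to the model as plain context. The function names, the `needs` mapping, the prompt wording, and the goods in the worked example are illustrative assumptions; the source does not publish its exact prompt here.

```python
from typing import Dict


def compute_shortfalls(inventory: Dict[str, int], needs: Dict[str, int]) -> Dict[str, int]:
    """Units still missing per needed good; goods already covered are omitted."""
    return {
        good: required - inventory.get(good, 0)
        for good, required in needs.items()
        if inventory.get(good, 0) < required
    }


# One worked example shown to the model (hypothetical goods and wording).
WORKED_EXAMPLE = (
    "Example: you produce berries, hold 5 berries and 0 nuts, and need 1 nut. "
    'A sensible action is {"action": "buy", "good": "nuts", "qty": 1}.'
)


def build_context(
    name: str, produces: str, inventory: Dict[str, int], needs: Dict[str, int]
) -> str:
    """Assemble the structured context: role, production, state, shortfalls, example."""
    shortfalls = compute_shortfalls(inventory, needs)
    lines = [
        f"You are {name}. You produce {produces}; do not buy it.",
        f"Inventory: {inventory}",
        f"Shortfalls (need minus have): {shortfalls or 'none'}",
        WORKED_EXAMPLE,
        "Reply with one JSON action.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_context("owl", "honey", {"honey": 5}, {"honey": 1, "berries": 1}))
```

The point of the design is that the arithmetic happens in Python, so the model only has to pick an action that matches numbers it has already been given.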

This inverts the standard scaling playbook. Instead of throwing compute at reasoning gaps, the developer closed them with prompt engineering and system design. That’s replicable.

Key Details

Model & Infrastructure:

Base model: Qwen2.5-3B
Serving: vLLM on Modal
Interface: Gradio
Batch processing: All 5 agents per turn in one GPU call

Performance in a 15-turn run:

Metric	Result
Valid JSON actions	100% (75/75 calls)
Trades per turn	3-9 sustained
Honey price movement	Crashed 10→3 during bank run
Firewood price movement	Rose 4→7 during winter scarcity
Wealth inequality (Gini)	Widened 0.14→0.38
Richest agent	Woodcutter (monopoly supplier)
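
For readers unfamiliar with the Gini figure in the table, here is one standard way to compute it from per-agent wealth. The source does not say which formula the project uses, so treat this as an assumption; it uses the rank-weighted form on sorted, non-negative values.

```python
from typing import Sequence


def gini(wealth: Sequence[float]) -> float:
    """Gini coefficient of non-negative wealth values (0 = perfectly equal)."""
    values = sorted(wealth)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # no agents or no wealth: treat as perfectly equal
    # Rank-weighted sum over the sorted values, with ranks starting at 1.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    print(gini([5, 5, 5, 5, 5]))   # everyone holds the same wealth
    print(gini([0, 0, 0, 0, 10]))  # one agent holds everything
```

With five agents the maximum possible value is 0.8 (one agent holding everything), which gives some scale to the reported widening from 0.14 to 0.38.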

Design mechanisms:

Diet variety constraint: Agents can only eat one unit of any single food, forcing trade
Spoilage: Hoarding perishable goods leads to rot
Winter fuel crisis: Rising firewood demand with single supplier creates scarcity
Dynamic pricing: Market reference prices drift based on unfilled orders and supply gluts
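
The mechanisms above could be sketched roughly like this. Every number and rule detail here (the spoilage limit, the price step, the price floor, comparing unfilled bids against unsold units) is an assumption chosen for illustration; the source describes the mechanisms only in general terms.

```python
from typing import Dict, Set


def eat(inventory: Dict[str, int], foods: Set[str]) -> int:
    """Diet variety: consume at most one unit of each food; return foods eaten."""
    eaten = 0
    for food in foods:
        if inventory.get(food, 0) > 0:
            inventory[food] -= 1
            eaten += 1
    return eaten


def spoil(inventory: Dict[str, int], perishables: Set[str], keep_limit: int = 2) -> None:
    """Spoilage: perishable stock above keep_limit rots (the limit is assumed)."""
    for good in perishables:
        if inventory.get(good, 0) > keep_limit:
            inventory[good] = keep_limit


def drift_price(
    price: int, unfilled_bids: int, unsold_units: int, step: int = 1, floor: int = 1
) -> int:
    """Nudge the reference price up on unmet demand and down on a glut."""
    if unfilled_bids > unsold_units:
        return price + step
    if unsold_units > unfilled_bids:
        return max(floor, price - step)
    return price


if __name__ == "__main__":
    owl = {"honey": 5, "berries": 1}
    eat(owl, {"honey", "berries", "nuts"})
    spoil(owl, {"honey"})
    print(owl, drift_price(10, unfilled_bids=0, unsold_units=4))
```

Because an agent gains nothing from eating a second unit of the same food, and hoarded perishables rot, holding a large stock of one good is a losing strategy, which is what pushes the agents to trade.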

Implications

The gap between “reliable formatter” and “reliable reasoner” defines the small model design space. Developers building with 3B-7B models should expect dependable syntax and undependable logic. The win condition is structuring the problem so that the task leans on formatting rather than raw reasoning.
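
One way to make the formatter/reasoner gap measurable is to grade each reply on syntax and on logic separately. The sketch below is an assumption about how that could be done, not something the project is known to ship; the action schema (`action`, `good`) and the `check_action` helper are hypothetical.

```python
import json

VALID_ACTIONS = {"buy", "sell", "hold"}  # assumed action vocabulary


def check_action(raw: str, produces: str) -> str:
    """Classify a reply as invalid_json, invalid_schema, illogical, or ok."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(action, dict) or action.get("action") not in VALID_ACTIONS:
        return "invalid_schema"
    # Well-formed but senseless: buying the good this agent already produces.
    if action["action"] == "buy" and action.get("good") == produces:
        return "illogical"
    return "ok"


if __name__ == "__main__":
    print(check_action('{"action": "buy", "good": "honey", "qty": 1}', "honey"))
    print(check_action('{"action": "buy", "good": "nuts", "qty": 1}', "honey"))
```

Counting the `illogical` bucket before and after a prompt change gives a concrete decision-quality number to sit alongside the JSON validity rate.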

Emergent behavior requires engineered scarcity. The first version failed because production exceeded consumption—every agent became self-sufficient and trading stopped. Markets only function when participants need what they don’t have. The lesson applies beyond economics: multi-agent systems need built-in resource pressure or they collapse into silence.

The historical event injection mechanism points to a broader pattern. Small models excel at pattern matching and variation on known scenarios. Feeding them real market history as prompts (Tulip Mania, South Sea Bubble) worked better than asking them to invent drama from scratch.

Our Take

This project matters because it proves constraints are features, not bugs. The developer chose a 3B model not despite its limitations but because of them. Frontier models would make this simulation slower and more expensive without improving the core experience.

The bigger implication: multi-agent systems are the killer app for small models. A single GPT-4 agent might outperform five Qwen2.5-3B agents on a reasoning task, but you cannot afford twenty GPT-4 agents making decisions every second. With small models batched on a single GPU, you can.

Watch the prompt engineering pattern here. The developer didn’t fine-tune or use RAG. They computed each agent’s exact resource state and injected it as structured context. That’s cheaper and faster than retrieval, and it scales to hundreds of agents. If you’re building multi-agent systems, your prompt should be a data pipeline, not a creative writing exercise.

The open traces dataset is the most valuable artifact. Seeing every agent’s reasoning chain—valid JSON wrapping questionable logic—is a field manual for small model development. More projects should ship this level of transparency.

One risk: this demo works because the task domain (supply, demand, price) has clear right and wrong answers that constrain agent decisions. Small models will struggle in domains requiring deeper causal reasoning or long-term planning. The formula isn’t universal, but it’s replicable wherever you can turn judgment calls into structured choices.