
Thousand Token Wood: A Multi-Agent Economy Running on a 3B Model

Alex Chen | 5 min read | Updated June 5, 2026

TL;DR

  • A developer built a functioning multi-agent economy with five AI traders, each running on Qwen2.5-3B, proving small models can power complex simulations
  • The system generates 100% valid JSON outputs while creating emergent market dynamics including price crashes, wealth inequality, and supply shocks
  • Success came from engineering constraints (scarcity, spoilage, fuel crises) and structured prompting rather than model scale
  • The project demonstrates that small models are reliable format generators but unreliable reasoners—and that gap is closeable with design

What Happened

A Build Small Hackathon submission called Thousand Token Wood ships a real-time economic simulation where five woodland creatures trade goods, gossip, and react to market shocks. Each agent runs on Qwen2.5-3B, served via vLLM on Modal. The system processes all five agents in a single batched GPU call per turn.
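The one-call-per-turn pattern can be sketched roughly as follows. This is a minimal illustration, not the project's code: `generate_batch` is a hypothetical callable that takes a list of prompts and returns one completion per prompt in the same order (with vLLM, `LLM.generate` accepts a list of prompts and could fill that role). The roster and prompt wording are placeholders; the article names only an owl and a woodcutter.

```python
from typing import Callable, Dict, List

# Hypothetical roster: only the owl and the woodcutter appear in the article.
AGENTS = ["owl", "woodcutter", "agent_3", "agent_4", "agent_5"]


def build_prompt(agent: str, turn: int) -> str:
    # Placeholder wording; the real prompts live in the project's traces.
    return f"Turn {turn}. You are the {agent}. Reply with one JSON action."


def run_turn(turn: int,
             generate_batch: Callable[[List[str]], List[str]]) -> Dict[str, str]:
    """Build every agent's prompt first, then make a single batched call."""
    prompts = [build_prompt(agent, turn) for agent in AGENTS]
    completions = generate_batch(prompts)  # one call covers all five agents
    if len(completions) != len(prompts):
        raise ValueError("backend must return one completion per prompt")
    return dict(zip(AGENTS, completions))


def fake_backend(prompts: List[str]) -> List[str]:
    # Stand-in for the model server so the sketch runs without a GPU.
    return ['{"action": "wait"}' for _ in prompts]


if __name__ == "__main__":
    print(run_turn(1, fake_backend))
```

Passing the backend in as a function keeps the turn loop testable without a model server, and makes the "one call per turn" property easy to check.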

The simulation isn’t static. Players can inject historical market events—Tulip Mania, the South Sea Bubble, 1929 bank runs—reskinned as woodland folklore. When a “Run on Oona’s Hoard” rumor hits, the owl agent liquidates honey to raise pebbles, flooding supply and crashing the honey price from 10 to 3 over subsequent turns. None of it is scripted.

The project is fully open, including agent traces showing every prompt, response, and decision. A Gradio interface lets anyone watch the economy unfold in real time.

Why It Matters

This demonstrates small models are viable for multi-agent systems where frontier models are prohibitively slow and expensive. A simulation needing dozens of agent decisions per second cannot afford GPT-4 latency or pricing. The 3B approach makes real-time multi-agent interaction feasible.

The technical lesson cuts deeper: small models emit perfect structure but poor judgment. Qwen2.5-3B returned valid JSON on 100% of calls but made nonsensical decisions—agents tried to buy goods they produced in surplus. The fix wasn’t more parameters. It was telling each agent what it produces, computing its actual shortfalls, and providing one worked example. Decision quality jumped.
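A rough sketch of that fix, under stated assumptions: the function names, the `needs` mapping, the "berries" good, and the prompt wording are all invented for illustration. The idea is that the arithmetic (what the agent lacks) is done in code, and the model only sees the result plus one worked example.

```python
from typing import Dict


def compute_shortfalls(inventory: Dict[str, int],
                       needs: Dict[str, int]) -> Dict[str, int]:
    """Return how many units of each needed good the agent still lacks."""
    return {
        good: required - inventory.get(good, 0)
        for good, required in needs.items()
        if required > inventory.get(good, 0)
    }


def build_context(name: str, produces: str,
                  inventory: Dict[str, int], needs: Dict[str, int]) -> str:
    """Assemble the structured context injected into the agent's prompt."""
    shortfalls = compute_shortfalls(inventory, needs)
    lacking = ", ".join(f"{g} x{n}" for g, n in sorted(shortfalls.items()))
    return "\n".join([
        f"You are the {name}. You produce: {produces}.",
        f"You hold: {inventory}",
        f"You are short of: {lacking or 'nothing'}.",
        "Do not buy what you produce.",
        # One worked example (hypothetical wording).
        'Example: short of honey x1 -> {"action": "buy", "good": "honey", "qty": 1}',
    ])


if __name__ == "__main__":
    print(build_context(
        "woodcutter", "firewood",
        inventory={"firewood": 6, "honey": 0, "berries": 1},
        needs={"firewood": 1, "honey": 1, "berries": 1},
    ))
```

With the shortfall precomputed, the model never has to infer from raw inventory that it is the firewood producer and should not be buying firewood.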

This inverts the standard scaling playbook. Instead of throwing compute at reasoning gaps, the developer closed them with prompt engineering and system design. That’s replicable.

Key Details

Model & Infrastructure:

  • Base model: Qwen2.5-3B
  • Serving: vLLM on Modal
  • Interface: Gradio
  • Batch processing: All 5 agents per turn in one GPU call

Performance in a 15-turn run:

Metric                      Result
Valid JSON actions          100% (75/75 calls)
Trades per turn             3-9 sustained
Honey price movement        Crashed 10→3 during bank run
Firewood price movement     Rose 4→7 during winter scarcity
Wealth inequality (Gini)    Widened 0.14→0.38
Richest agent               Woodcutter (monopoly supplier)
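For readers unfamiliar with the Gini figure in the table, here is a minimal sketch of one standard way to compute it (the mean-absolute-difference form). The article does not say which formula the project uses, and the wealth lists below are illustrative, not data from the run.

```python
from typing import Sequence


def gini(wealth: Sequence[float]) -> float:
    """Gini coefficient: 0 means perfectly equal, values near 1 mean concentrated."""
    n = len(wealth)
    total = sum(wealth)
    if n == 0 or total == 0:
        return 0.0
    # Sum of absolute differences over all ordered pairs of agents.
    abs_diffs = sum(abs(a - b) for a in wealth for b in wealth)
    return abs_diffs / (2 * n * total)


if __name__ == "__main__":
    print(gini([20, 20, 20, 20, 20]))  # 0.0: everyone holds the same amount
    print(gini([0, 0, 0, 0, 10]))      # 0.8: one agent holds everything
```

With five agents the maximum possible value is 0.8, so a move from 0.14 to 0.38 covers a sizable part of the available range.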

Design mechanisms:

  • Diet variety constraint: Agents can eat only one unit of any single food, forcing trade
  • Spoilage: Hoarding perishable goods leads to rot
  • Winter fuel crisis: Rising firewood demand with single supplier creates scarcity
  • Dynamic pricing: Market reference prices drift based on unfilled orders and supply gluts
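Two of the mechanisms above, price drift and spoilage, can be sketched in a few lines. This is an assumption-heavy illustration: the article does not give the project's update rule, so the linear pressure term, the `step`, `floor`, and `keep` parameters, and the function names are all hypothetical.

```python
from typing import Dict, Set


def drift_price(price: float, unfilled_bids: int, unsold_units: int,
                step: float = 0.05, floor: float = 1.0) -> float:
    """Nudge the reference price up on unmet demand, down on a glut."""
    pressure = unfilled_bids - unsold_units  # positive means scarcity
    new_price = price * (1 + step * pressure)
    return max(floor, round(new_price, 2))


def spoil(inventory: Dict[str, int], perishables: Set[str],
          keep: int = 2) -> Dict[str, int]:
    """Cap each perishable good at `keep` units; anything above that rots."""
    return {
        good: min(qty, keep) if good in perishables else qty
        for good, qty in inventory.items()
    }


if __name__ == "__main__":
    print(drift_price(10, unfilled_bids=0, unsold_units=4))  # glut pushes the price down
    print(drift_price(4, unfilled_bids=3, unsold_units=0))   # scarcity pushes it up
    print(spoil({"honey": 5, "firewood": 5}, perishables={"honey"}))
```

The price floor keeps a prolonged glut from driving the reference price to zero or below, which would otherwise break later percentage-based updates.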

Implications

The gap between “reliable formatter” and “reliable reasoner” defines the small model design space. Developers building with 3B-7B models should expect perfect syntax and imperfect logic. The win condition is structuring the problem so formatting matters more than raw reasoning.
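One way to act on that split is to check syntax and sense separately. The sketch below assumes a hypothetical action schema (`action`, `good`, `qty`) and a hypothetical rule set; neither is taken from the project.

```python
import json
from typing import Optional, Set


def parse_action(raw: str) -> Optional[dict]:
    """Syntax check: return the action dict, or None if it is not a JSON object."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return action if isinstance(action, dict) else None


def is_sensible(action: dict, produces: Set[str], shortfalls: Set[str]) -> bool:
    """Logic check: a buy must target a good the agent lacks and does not produce."""
    if action.get("action") != "buy":
        return True  # this sketch only constrains purchases
    good = action.get("good")
    return good in shortfalls and good not in produces


if __name__ == "__main__":
    raw = '{"action": "buy", "good": "firewood", "qty": 2}'
    action = parse_action(raw)
    # Valid JSON, but a firewood producer buying firewood fails the logic check.
    print(action is not None, is_sensible(action, {"firewood"}, {"honey"}))
```

Rejected actions can be dropped or replaced with a no-op, so a poor decision costs one agent a turn instead of corrupting the market state.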

Emergent behavior requires engineered scarcity. The first version failed because production exceeded consumption—every agent became self-sufficient and trading stopped. Markets only function when participants need what they don’t have. The lesson applies beyond economics: multi-agent systems need built-in resource pressure or they collapse into silence.

The historical event injection mechanism points to a broader pattern. Small models excel at pattern matching and variation on known scenarios. Feeding them real market history as prompts (Tulip Mania, South Sea Bubble) worked better than asking them to invent drama from scratch.
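Event injection of this kind could look something like the sketch below. Only the "Run on Oona's Hoard" headline comes from the article; the event keys, the second headline, and the prompt wording are invented for illustration.

```python
from typing import Dict, Optional

# Historical shocks reskinned as woodland rumors.
EVENTS: Dict[str, str] = {
    "bank_run": "Run on Oona's Hoard",                 # headline from the article
    "tulip_mania": "Everyone suddenly wants acorn caps",  # invented placeholder
}


def with_event(base_prompt: str, event_key: Optional[str]) -> str:
    """Prepend the active rumor, if any, to an agent's turn prompt."""
    if event_key is None:
        return base_prompt
    headline = EVENTS[event_key]
    return f"Rumor in the wood: {headline}.\n{base_prompt}"


if __name__ == "__main__":
    print(with_event("You are the owl. Reply with one JSON action.", "bank_run"))
```

Keeping the events in a lookup table means a new scenario is one added entry, and the model is asked to react to a known pattern instead of inventing one.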

Our Take

This project matters because it proves constraints are features, not bugs. The developer chose a 3B model not despite its limitations but because of them. Frontier models would make this simulation slower and more expensive without improving the core experience.

The bigger implication: multi-agent systems are the killer app for small models. A single GPT-4 agent might outperform five Qwen2.5-3B agents on a reasoning task, but you cannot afford twenty GPT-4 agents making decisions every second. Small models batched on a single GPU can meet that budget.

Watch the prompt engineering pattern here. The developer didn’t fine-tune or use RAG. They computed each agent’s exact resource state and injected it as structured context. That’s cheaper and faster than retrieval, and it should scale well beyond five agents, since each prompt is built from state the simulation already tracks. If you’re building multi-agent systems, your prompt should be a data pipeline, not a creative writing exercise.

The open traces dataset is the most valuable artifact. Seeing every agent’s reasoning chain—valid JSON wrapping questionable logic—is a field manual for small model development. More projects should ship this level of transparency.

One risk: this demo works because the task domain (supply, demand, price) has clear right and wrong answers that constrain agent decisions. Small models will struggle in domains requiring deeper causal reasoning or long-term planning. The formula isn’t universal, but it’s replicable wherever you can turn judgment calls into structured choices.
