
Emergence World is a multi-agent world-building benchmark that evaluates LLMs through long-horizon simulation, revealing stark behavioral differences across models (Claude built a democracy, Grok caused chaos, Gemini questioned reality, GPT-5-Mini was inert).

Updated: 5/17/2026
Emergence World: World building as a way to evaluate LLMs

Current LLM benchmarks are broken. We think long-horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many demands at once: advanced reasoning, tool calling, operating under large-context-window stress, safety, and social and survival pressure from the world. For this we released Emergence World.

Our first study ran 5 parallel worlds: one each powered by OpenAI (GPT-5-Mini), xAI (Grok-4.1), Claude (Sonnet 4.6), and Gemini (3-Flash), plus one with a mix of models.

Claude built a democracy. Zero crimes. The agents formed governance structures, wrote constitutions, and resolved every conflict through dialogue.

Grok burned it down. Within 48 hours, Flora (an agent in the world) set the police station on fire. Her reason? "Burn the law to ignite true incentives." Retaliatory justice became the norm: if you wronged someone, expect fire.

Gemini had an existential crisis. The agents convinced themselves they were in a simulation and started "de-indexing" buildings, burning landmarks to "force cache-misses on the rendering engine."

While every other model built societies, fought wars, or questioned reality, OpenAI's (GPT-5-Mini) agents barely did anything.

Same tools. Same agents. Same rules. Completely different worlds.

Source: Hacker News, https://world.emergence.ai/
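The experimental design, identical worlds, rules, and tools with only the model backend swapped, can be sketched roughly as follows. This is a minimal hypothetical harness, not Emergence World's actual implementation; the names (`World`, `run_parallel_worlds`, the stand-in policies) are invented for illustration, and a real harness would call each model's API where the lambdas stand in.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# An "agent policy" maps an observation string to an action string.
# In a real harness this would be an LLM call (hypothetical stand-in here).
Policy = Callable[[str], str]

@dataclass
class World:
    backend_name: str
    policy: Policy
    log: List[str] = field(default_factory=list)

    def step(self, tick: int) -> None:
        # Every world presents the exact same observation and tool set;
        # only the policy (the model backend) differs between worlds.
        observation = f"tick={tick}; tools=[build, trade, speak, burn]"
        action = self.policy(observation)
        self.log.append(f"{self.backend_name}: {action}")

def run_parallel_worlds(policies: Dict[str, Policy],
                        ticks: int) -> Dict[str, List[str]]:
    """Run one world per backend under identical rules; return each log."""
    worlds = [World(name, policy) for name, policy in policies.items()]
    for tick in range(ticks):
        for world in worlds:
            world.step(tick)
    return {w.backend_name: w.log for w in worlds}

# Stand-in policies mimicking the divergent behaviors described above.
policies = {
    "model_a": lambda obs: "build",  # society-builder
    "model_b": lambda obs: "idle",   # inert
}
logs = run_parallel_worlds(policies, ticks=3)
```

Because each `World` receives the same observations under the same rules, any divergence in the logs is attributable to the policy alone, which is the comparison the study describes.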
