Demos

AI Worldview Census

What do frontier models think about the human condition? Frontier LLMs answer dozens of subjective questions on ethics, the future, human nature, and metaphysics.

USA Baby Names Analysis

What can 145 years of US baby names tell us about American culture? Patterns in conformity, gender, sound, contagion, and regional identity.

Sheet Music Bench

Can frontier multimodal LLMs read sheet music? Models answer 5 basic questions about 35 sheet-music images.

Eval Landscape (Capabilities)

How are frontier AI capability benchmarks born, saturated, and retired? Tracking 150+ model releases from 20+ labs.

Nym Bridge

Find baby names that sound or look like a set of reference names, searching 100K+ names from the SSA dataset.

Language Model Council (Paper)

Can LLMs evaluate each other on emotional intelligence? Browse scenarios, responses, and judging data from the original NAACL paper.

Language Model Council (Live)

What does a council of LLMs think of your prompt, and of each other's answers? Run one live with your own OpenRouter API key.

Balanced Accuracy Simulator

Why is Balanced Accuracy the right metric for selecting LLM judges in binary prevalence estimation?

Linguistic Noun Atlas

How do nouns behave across 39 languages? Gender, case, classifiers, honorifics, tone, animacy, and more.