
Language Model Council

Democratically Benchmarking Foundation Models on Highly Subjective Tasks


Language Model Council diagram

What if we let the models themselves decide who's the best?

Democratically Decided Leaderboard for Emotional Intelligence

Note: It's been a while since the council last met (the original data was collected in 2024). You'll notice almost all of these models are deprecated or nearing end-of-life. One day, a new council may convene to evaluate its own members again.

Scenario


Responses


Affinities (Raw)

Judge vs. Respondent

Raw affinities heatmap

Affinities (Council-Normalized)

Judge vs. Respondent

Council-normalized affinities

Expected Win Rates (Bradley-Terry)

Respondent vs. Respondent

LLM vs LLM win rates
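
For readers unfamiliar with Bradley-Terry, the sketch below shows one common way such expected win rates can be estimated from pairwise comparison counts. It is an illustration only and not necessarily the procedure used to produce the heatmap above: the function names `bradley_terry` and `expected_win_rate`, the `wins` matrix layout, and the fixed iteration count are all assumptions made for this example.

```python
def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from a matrix of pairwise win counts.

    wins[i][j] is the number of times respondent i was preferred over
    respondent j (hypothetical layout, assumed for this sketch).
    Uses the classic minorization-maximization update and returns
    strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = 0.0
            for j in range(n):
                games = wins[i][j] + wins[j][i]
                # Skip the diagonal and pairs that were never compared.
                if j != i and games:
                    denom += games / (p[i] + p[j])
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        # Leave strengths unchanged if there is no win data at all.
        p = [x / total for x in new_p] if total > 0 else p
    return p


def expected_win_rate(p, i, j):
    """Probability that respondent i beats respondent j under the fitted model."""
    return p[i] / (p[i] + p[j])


# Toy data: respondent 0 beat respondent 1 in 3 of 4 comparisons.
strengths = bradley_terry([[0, 3], [1, 0]])
rate = expected_win_rate(strengths, 0, 1)
```

With only two respondents, the fitted win rate simply reproduces the observed fraction of wins; the model becomes more informative when many respondents are compared and some pairs have few or no direct comparisons.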

Judge Agreement (Cohen's Kappa)

Judge vs. Judge

Judge agreement heatmap
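
As a rough guide to reading the judge agreement heatmap, the sketch below computes Cohen's kappa for two judges who labeled the same items. It is a minimal illustration under assumed inputs, not the code behind this page: the function name `cohens_kappa` and the toy verdict lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges' labels over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each judge's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        # Both judges always used the same single label; treated here as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy verdicts (hypothetical): which of two responses, "A" or "B", each judge preferred.
judge_1 = ["A", "A", "B", "B"]
judge_2 = ["A", "A", "B", "A"]
kappa = cohens_kappa(judge_1, judge_2)
```

A kappa of 1 means the two judges always agree, 0 means they agree no more often than chance would predict, and negative values mean they agree less often than chance.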