Explore how 20 LLMs evaluate each other on emotional intelligence tasks. Browse scenarios, responses, and judging data from the original LMC paper.
Monte Carlo simulator showing why Balanced Accuracy is the best metric for selecting LLM judges for binary prevalence estimation.
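
As a rough illustration of the kind of experiment such a simulator might run, the sketch below compares two hypothetical judges that have the same raw accuracy at 20% prevalence but different Balanced Accuracy. It is not the simulator's actual code. It assumes each judge's sensitivity and specificity are known exactly and that the raw positive rate is corrected with the standard Rogan-Gladen formula; all function names, judge parameters, and sample sizes are made up for the example.

```python
import random
import statistics


def balanced_accuracy(sens: float, spec: float) -> float:
    """Mean of sensitivity and specificity; independent of prevalence."""
    return (sens + spec) / 2


def raw_accuracy(sens: float, spec: float, prevalence: float) -> float:
    """Plain accuracy, which depends on the class balance."""
    return prevalence * sens + (1 - prevalence) * spec


def simulate_rmse(sens, spec, prevalence, n_items=500, n_trials=1000, seed=0):
    """RMSE of the corrected prevalence estimate over repeated simulated samples."""
    rng = random.Random(seed)
    youden_j = sens + spec - 1  # equals 2 * balanced_accuracy - 1
    if youden_j <= 0:
        raise ValueError("judge must be better than chance (sens + spec > 1)")
    squared_errors = []
    for _ in range(n_trials):
        flagged = 0
        for _ in range(n_items):
            # Draw the true label, then the judge's (noisy) verdict.
            is_positive = rng.random() < prevalence
            p_flag = sens if is_positive else 1 - spec
            flagged += rng.random() < p_flag
        raw_rate = flagged / n_items
        # Rogan-Gladen correction (an assumption of this sketch), clipped to [0, 1].
        corrected = (raw_rate + spec - 1) / youden_j
        corrected = min(1.0, max(0.0, corrected))
        squared_errors.append((corrected - prevalence) ** 2)
    return statistics.fmean(squared_errors) ** 0.5


if __name__ == "__main__":
    prevalence = 0.2
    # Hypothetical judges: (sensitivity, specificity).
    judges = {"conservative": (0.34, 0.99), "balanced": (0.90, 0.85)}
    for name, (sens, spec) in judges.items():
        print(
            name,
            "accuracy=%.3f" % raw_accuracy(sens, spec, prevalence),
            "balanced_accuracy=%.3f" % balanced_accuracy(sens, spec),
            "rmse=%.4f" % simulate_rmse(sens, spec, prevalence),
        )
```

In this particular setup the two judges tie on raw accuracy, while the judge with the higher Balanced Accuracy gives the lower-error prevalence estimate, because the correction divides by sensitivity + specificity - 1. This is a single hand-picked case under simplified assumptions, not a demonstration that the ranking holds in general.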