Monte Carlo Simulator: Why Balanced Accuracy Wins
This simulator answers one question: which binary classification metric is best for selecting the LLM judge that will most accurately rank models by the prevalence of a target property in their outputs?
It generates random scenarios with multiple models and judges, determines which judge is objectively best at ranking the models, and then checks whether each candidate metric (Balanced Accuracy, Accuracy, F1+, Macro F1), computed on a golden validation set alone, would have selected that best judge.
Based on Collot et al., "Balanced Accuracy: The Right Metric for Evaluating LLM Judges" (EACL 2026).
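The simulation loop described above can be sketched in plain Python. This is a minimal illustration, not the simulator's actual implementation: the judge model (each judge characterized by a sensitivity/specificity pair), the sample size of 200 items per model, the Kendall-style concordance used to define the "objectively best" judge, and all function names are assumptions made for the sketch.

```python
import random

def measured_prevalence(rng, prev, tpr, tnr, n=200):
    """Finite-sample prevalence a judge reports for a model whose true
    positive rate is `prev`, given the judge's sensitivity/specificity."""
    k = sum(rng.random() < prev for _ in range(n))                # truly positive items
    labeled = sum(rng.random() < tpr for _ in range(k))           # true positives kept
    labeled += sum(rng.random() < 1 - tnr for _ in range(n - k))  # false positives added
    return labeled / n

def ranking_score(true_prevs, est_prevs):
    """Fraction of model pairs ordered the same way by true and estimated
    prevalence (a Kendall-style concordance)."""
    n, pairs, concordant = len(true_prevs), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            if (true_prevs[i] - true_prevs[j]) * (est_prevs[i] - est_prevs[j]) > 0:
                concordant += 1
    return concordant / pairs

def metric_scores(tpr, tnr, val_prev):
    """Expected values of the four candidate metrics on a golden validation
    set with positive-class prevalence `val_prev`."""
    tp, fn = val_prev * tpr, val_prev * (1 - tpr)
    tn, fp = (1 - val_prev) * tnr, (1 - val_prev) * (1 - tnr)
    prec_p = tp / (tp + fp) if tp + fp else 0.0
    prec_n = tn / (tn + fn) if tn + fn else 0.0
    f1_p = 2 * prec_p * tpr / (prec_p + tpr) if prec_p + tpr else 0.0
    f1_n = 2 * prec_n * tnr / (prec_n + tnr) if prec_n + tnr else 0.0
    return {"balanced_accuracy": (tpr + tnr) / 2, "accuracy": tp + tn,
            "f1_plus": f1_p, "macro_f1": (f1_p + f1_n) / 2}

def run(trials=1000, n_models=5, n_judges=4, seed=0):
    """Return each metric's judge-selection success rate over random trials."""
    rng = random.Random(seed)
    wins = dict.fromkeys(("balanced_accuracy", "accuracy", "f1_plus", "macro_f1"), 0)
    for _ in range(trials):
        true_prevs = [rng.uniform(0.05, 0.95) for _ in range(n_models)]
        judges = [(rng.uniform(0.5, 1.0), rng.uniform(0.5, 1.0)) for _ in range(n_judges)]
        val_prev = rng.uniform(0.05, 0.95)
        # Objectively best judge: highest concordance with the true model ranking.
        best = max(judges, key=lambda j: ranking_score(
            true_prevs, [measured_prevalence(rng, p, *j) for p in true_prevs]))
        # Which judge would each metric have picked from the validation set alone?
        for name in wins:
            picked = max(judges, key=lambda j: metric_scores(*j, val_prev)[name])
            if picked == best:
                wins[name] += 1
    return {name: w / trials for name, w in wins.items()}
```

Calling `run(trials=1000)` yields per-metric success rates of the kind the simulator reports; ties in the argmax steps are broken arbitrarily in this sketch.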
[Interactive controls: Models, Judges, Simulation, with a trial progress counter (0 / 1000).]

[Results panels, populated after a run:
Metric Selection Success Rate — reported for Balanced Accuracy, Accuracy, F1+, and Macro F1.
Average Rank Gap Loss — reported for Balanced Accuracy, Accuracy, F1+, and Macro F1.]
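For reference, the four candidate metrics compared in the panels above can all be derived from a single binary confusion matrix. A minimal sketch follows; the function name and example counts are illustrative, not part of the simulator.

```python
def metrics(tp, fp, fn, tn):
    """Compute the four candidate judge-selection metrics from raw
    confusion-matrix counts (positives = tp + fn, negatives = tn + fp)."""
    tpr = tp / (tp + fn) if tp + fn else 0.0    # sensitivity / positive recall
    tnr = tn / (tn + fp) if tn + fp else 0.0    # specificity / negative recall
    prec_p = tp / (tp + fp) if tp + fp else 0.0
    prec_n = tn / (tn + fn) if tn + fn else 0.0
    f1_p = 2 * prec_p * tpr / (prec_p + tpr) if prec_p + tpr else 0.0
    f1_n = 2 * prec_n * tnr / (prec_n + tnr) if prec_n + tnr else 0.0
    return {
        "balanced_accuracy": (tpr + tnr) / 2,           # prevalence-invariant
        "accuracy": (tp + tn) / (tp + fp + fn + tn),    # prevalence-sensitive
        "f1_plus": f1_p,                                # F1 on the positive class
        "macro_f1": (f1_p + f1_n) / 2,                  # mean of per-class F1s
    }
```

Balanced Accuracy depends only on the per-class recalls, which is why it is unaffected by the validation set's class balance, whereas Accuracy and the F1 variants shift with prevalence.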