Eval Landscape (Capabilities)
150+ model releases from 20+ labs. This compilation focuses on foundation models intended for general-purpose use. Notable omissions: image generation, video generation, and speech models.
More context
A model release is a big moment. I wanted to track what labs self-report in release cards as a snapshot of the capabilities they consider important at the time. Benchmarks naturally have finite lifespans: when one saturates or stops being useful, it gets retired. Early on, it was all MMLU and math. As of April 2026, there's maybe one math benchmark per card and a lot more focus on agentic and tool-use evaluations. The landscape shifts quickly.
Much of this data was compiled with the help of Claude and other AI coding tools: scraping model cards, extracting scores from PDFs and blog-post images, and cross-referencing across sources. The data is not perfect, but the intention is to get a general sense of things. If you spot a major error, let me know.
1. Benchmark Lifecycles
When benchmarks first appeared and last appeared in official model cards. Row color indicates the benchmark's category — toggle "By lab" to color each report dot by the lab that reported it instead.
2. Saturation Curves
Best-known scores over time. Each point is a new record. Lines and points are colored by benchmark category — toggle "By lab" to color each data point by the lab that produced the score. The "Frontier" preset is a hand-picked shortlist of benchmarks still being actively reported in recent model cards.
3. Release Cadence
Model releases per lab over time. Hollow dots indicate releases with no public benchmarks. The "Other" row includes smaller labs, startups, and organizations that are no longer active.
4. Model Card Complexity
Number of distinct benchmarks reported per model release (Y axis). Circle size is the geometric mean of those benchmarks' sizes, the natural average when sizes span several orders of magnitude. Releases where no reported benchmark has a known size fall back to a neutral small circle.
5. Typical Benchmark Size by Lab
Geometric mean size (# of items) of benchmarks reported on each model card, for the major frontier labs. Log-scale Y axis. Each dot is one release; dashed line is a least-squares fit per lab. See tools/eval-landscape/NOTES.md for why we use geometric mean.
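The sizing logic in sections 4 and 5 can be sketched as follows. This is an illustrative example, not the site's actual code; the benchmark sizes below are made up for demonstration.

```python
import math

def geometric_mean(sizes):
    """Geometric mean computed as the exponential of the mean of logs,
    which stays numerically stable when sizes span several orders of
    magnitude (hundreds of items up to hundreds of thousands)."""
    if not sizes:
        return None  # no size-known benchmarks: caller falls back to a default circle size
    return math.exp(sum(math.log(s) for s in sizes) / len(sizes))

# Hypothetical sizes (# of items) for benchmarks on one model card:
# a small coding suite, a mid-sized QA set, a large knowledge benchmark.
sizes = [500, 8_000, 140_000]
gm = geometric_mean(sizes)
am = sum(sizes) / len(sizes)
print(f"geometric mean: {gm:,.0f}, arithmetic mean: {am:,.0f}")
```

The arithmetic mean here is dominated by the single 140k-item benchmark, while the geometric mean lands in the thousands, which is why it is the better "typical size" for a log-scale axis.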
Sources
Official model cards, technical reports, and blog posts used to compile this data.
Anthropic: Claude 1.3 / 2 · Instant 1.2 · Claude 3 · 3.5 Sonnet · 3.5v2 / Haiku · 3.7 Sonnet · Claude 4 · 4.5 Sonnet · 4.5 Opus · 4.6 Opus · 4.6 Sonnet · 4.7 Opus
OpenAI: GPT-2 · GPT-3 · ChatGPT · GPT-4 · 4 Turbo · GPT-4o · 4o mini · o1 · o3-mini · GPT-4.1 · o3 / o4-mini · GPT-5 · gpt-oss · 5.1 · 5.2 · 5.3 · 5.4
Google: PaLM · PaLM 2 · Gemini 1.0 · 1.5 Pro / Flash · 2.0 Flash · 2.5 Pro · 2.5 Flash · 3 Pro · 3 Flash · 3.1 Pro · Gemma 1 · 2 · 3 · 4
Mistral: Mistral 7B · Mixtral 8x7B · Large · 8x22B · Codestral · Large 2 · Pixtral 12B · Small 2 · Pixtral Large · Small 3 · Small 3.1 · Medium 3 · Magistral · Large 3 · Devstral 2 · Small 4
Meta: Llama 2 · Llama 3 · 3.1 · 3.2 · 3.3 · Llama 4 · Muse Spark
xAI: Grok-1 · 2 · 3 · 4.1 · 4.20
Alibaba (Qwen): Qwen · 1.5 · 2 · 2.5 · QwQ-32B · 3 · 3.5 · 3.6
Cohere: Command R · R+ · R7B · Command A · A Vision · A Reasoning
Microsoft: Phi-3 · Phi-4 · Phi-4-reasoning
Inflection: Pi · Inflection-2 · Inflection-2.5
Baichuan: Baichuan2-13B
InternLM (Shanghai AI Lab): InternLM2.5
Databricks: DBRX
NVIDIA: Nemotron-70B · Super-49B
AI2: OLMo-2-32B
Upstage: Solar Pro
Liquid AI: LFM
Xiaomi: MiMo-7B
Baidu: ERNIE 4.5
Writer: Palmyra X5
ByteDance: Seed-2.0
DeepSeek: LLM/Coder · MoE · V2 · V2.5 · V3 · R1 · R1-0528 · V3.1 · V3.2
Z.ai (GLM): GLM-130B · ChatGLM · GLM-4 · CogVLM · GLM-5 · 5.1