Eval Landscape (Capabilities)

150+ model releases from 20+ labs. This compilation focuses on foundation models intended for general-purpose use. Notable omissions: image generation, video generation, and speech models.

More context

A model release is a big moment. I was interested to track what labs self-report in release cards as a snapshot of the capabilities they consider important at the time. Benchmarks naturally have a finite lifespan: when one saturates or stops being useful, it gets retired. Early on, it was all MMLU and math. More recently (as of April 2026), there's maybe one math benchmark per card and a lot more focus on agentic and tool-use evaluations. The landscape shifts quickly.

Much of this data was compiled with the help of Claude and other AI coding tools, scraping model cards, extracting scores from PDFs and blog post images, and cross-referencing across sources. The data is not perfect, but the intention is to get a general sense of things. If you spot a major error, let me know.

1. Benchmark Lifecycles

When benchmarks first appeared and last appeared in official model cards. Row color indicates the benchmark's category — toggle "By lab" to color each report dot by the lab that reported it instead.
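Under the hood, a lifecycle range is just the earliest and latest date a benchmark shows up in any card. A minimal sketch of that aggregation (the tuple layout and field names here are illustrative, not the actual data schema):

    def lifecycle_ranges(reports):
        # reports: iterable of (benchmark_name, report_date) pairs, ISO date strings
        ranges = {}
        for name, date in reports:
            first, last = ranges.get(name, (date, date))
            ranges[name] = (min(first, date), max(last, date))
        return ranges  # {benchmark: (first_appeared, last_appeared)}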


2. Saturation Curves

Best-known scores over time. Each point is a new record. Lines and points are colored by benchmark category — toggle "By lab" to color each data point by the lab that produced the score. The "Frontier" preset is a hand-picked shortlist of benchmarks still being actively reported in recent model cards.
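The record points are a running maximum: a score only appears if it beats every earlier score on the same benchmark. A rough sketch, assuming per-benchmark lists of dated scores (a hypothetical shape, not the actual pipeline):

    def record_points(scores):
        # scores: list of (date, score) for one benchmark, ISO date strings
        best = float("-inf")
        records = []
        for date, score in sorted(scores):
            if score > best:
                best = score
                records.append((date, score))  # a new best-known score
        return records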


3. Release Cadence

Model releases per lab over time. Hollow dots indicate releases with no public benchmarks. The "Other" row includes smaller labs, startups, and organizations that are no longer active.


4. Model Card Complexity

Number of distinct benchmarks reported per model release (Y axis). Circle size is the geometric mean size of those benchmarks (the natural average when sizes span several orders of magnitude). Releases where no benchmark sizes are known fall back to a neutral small circle.
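The geometric mean is the exponential of the mean of the logs, so a single million-item benchmark can't swamp the small ones. A minimal sketch, with an assumed fallback constant (the real chart just uses a fixed small circle):

    import math

    FALLBACK_SIZE = 100  # illustrative neutral value, not the actual constant

    def geo_mean_size(sizes):
        # sizes: benchmark item counts for one release; unknown sizes excluded
        known = [s for s in sizes if s and s > 0]
        if not known:
            return FALLBACK_SIZE
        return math.exp(sum(math.log(s) for s in known) / len(known))

For example, a card with one 100-item and one 1,000,000-item benchmark has a geometric mean of 10,000 items, versus an arithmetic mean of roughly 500,000.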


5. Typical Benchmark Size by Lab

Geometric mean size (# of items) of benchmarks reported on each model card, for the major frontier labs. Log-scale Y axis. Each dot is one release; dashed line is a least-squares fit per lab. See tools/eval-landscape/NOTES.md for why we use geometric mean.
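A least-squares fit on a log-scale axis reads most naturally when done in log space: fit log10(size) against release date, then exponentiate back for plotting. A sketch under that assumption, with dates already converted to fractional years and at least two releases per lab:

    import math

    def log_trend(points):
        # points: list of (year, geo_mean_size) for one lab's releases
        xs = [x for x, _ in points]
        ys = [math.log10(y) for _, y in points]
        n = len(points)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        slope = sxy / sxx
        intercept = my - slope * mx
        return slope, intercept  # predicted size at year x: 10 ** (intercept + slope * x)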


Sources

Official model cards, technical reports, and blog posts used to compile this data.

Anthropic: Claude 1.3 / 2 · Instant 1.2 · Claude 3 · 3.5 Sonnet · 3.5v2 / Haiku · 3.7 Sonnet · Claude 4 · 4.5 Sonnet · 4.5 Opus · 4.6 Opus · 4.6 Sonnet · 4.7 Opus

OpenAI: GPT-2 · GPT-3 · ChatGPT · GPT-4 · 4 Turbo · GPT-4o · 4o mini · o1 · o3-mini · GPT-4.1 · o3 / o4-mini · GPT-5 · gpt-oss · 5.1 · 5.2 · 5.3 · 5.4

Google: PaLM · PaLM 2 · Gemini 1.0 · 1.5 Pro / Flash · 2.0 Flash · 2.5 Pro · 2.5 Flash · 3 Pro · 3 Flash · 3.1 Pro · Gemma 1 · 2 · 3 · 4

Mistral: Mistral 7B · Mixtral 8x7B · Large · 8x22B · Codestral · Large 2 · Pixtral 12B · Small 2 · Pixtral Large · Small 3 · Small 3.1 · Medium 3 · Magistral · Large 3 · Devstral 2 · Small 4

Meta: Llama 2 · Llama 3 · 3.1 · 3.2 · 3.3 · Llama 4 · Muse Spark

xAI: Grok-1 · 2 · 3 · 4.1 · 4.20

Alibaba (Qwen): Qwen · 1.5 · 2 · 2.5 · QwQ-32B · 3 · 3.5 · 3.6

Cohere: Command R · R+ · R7B · Command A · A Vision · A Reasoning

Microsoft: Phi-3 · Phi-4 · Phi-4-reasoning

Inflection: Pi · Inflection-2 · Inflection-2.5

01.AI: Yi-34B · Yi-1.5

Baichuan: Baichuan2-13B

InternLM (Shanghai AI Lab): InternLM2.5

Databricks: DBRX

Reka: Reka Core · Edge

NVIDIA: Nemotron-70B · Super-49B

AI2: OLMo-2-32B

Upstage: Solar Pro

Liquid AI: LFM

Xiaomi: MiMo-7B

Baidu: ERNIE 4.5

Writer: Palmyra X5

ByteDance: Seed-2.0

DeepSeek: LLM/Coder · MoE · V2 · V2.5 · V3 · R1 · R1-0528 · V3.1 · V3.2

Z.ai (GLM): GLM-130B · ChatGLM · GLM-4 · CogVLM · GLM-5 · 5.1

Moonshot (Kimi): Kimi Chat · K1.5 · K2 · K2.5

MiniMax: Text-01 · M1 · M2 · M2.5 · M2.7