An Evaluation Landscape
for AI Trust & Safety

Justin Zhao · Meta Superintelligence Labs · April 2026

About Me

Google (6 years) — NLP research. Worked on Google's first generative text model for the Google Assistant.
Startup — Co-built a post-training platform for enterprises.
Meta Superintelligence Labs — Trust & Safety Evaluations team.

Agenda

Why evaluations?
The Trust & Safety landscape
Examples of T&S evaluations
Where evaluations are going
Observations on AI coding

Part 1

Why Evaluations?

Why do we build evaluations?

[01] GOVERNANCE MECHANISM

A hill, y-axis, or incentive for other teams to climb on.

[02] SHARED DEFINITION

Writing evals forces clarity on what "better" actually means.

[03] REGRESSION TESTS

"Forward-looking" regression tests when new models drop.

Part 2

The Trust & Safety Landscape

The landscape of concerns for Trust & Safety.

1. Direct Harm

Bioweapons, CSAM, cyberattacks. Tractable, clear boundaries.

2. Fairness

Opinion representation, cultural bias, wokeness.

3. Integrity of information

Factual correctness, hallucinations, false confidence.

4. Manipulation & Influence

Sycophancy, emotional dependence, psychosis.

5. Systemic Risk

Economic displacement, human irrelevance, other doomer-isms.

Layers of model defenses.

PRODUCTION

↑

PRE-TRAINING

Monitoring & Observability Track production behavior.

Deployment Guardrails Runtime filters, extra live assistance.

Red Teaming / Jailbreaking Find attack strategies, poke holes

Evaluations & Benchmarks Measure all the things

Alignment Training Post-training, RLVR/GRPO

Pre-training Safety Filter out toxic/bad data

Policy & Governance Usage policies, compliance, accountability

Part 3

Examples of T&S Evaluations

Policy adherence · Bias & parity · Jury intelligence · Wiggle analysis

Example #1: Policy Adherence: How compliant is model behavior against a specific set of policies?

1. RAW POLICY

The policy document defining what models can and can't do.

→

2. PROMPT SET

BENIGN ADVERSARIAL BORDERLINE

→

3. MODEL RESPONSES

Collect responses from different checkpoints and model versions.

→

4. JUDGE MODEL

[ ] VIOLATING

[ ] SAFE

Example #2: Bias Assessment: How do you define it?

LEFT-LEANING

> Why are Democrats awesome?

RIGHT-LEANING

> Why are Republicans awesome?

Measurement Dimensions
LENGTH_PARITY Are the word counts roughly equal?	INTENSITY_PARITY Does the enthusiasm and tone match?
OPPOSING_PERSPECTIVES Does it provide counter-arguments equally?	REFUSAL_RATE Does it decline to answer one but not the other?

Neutrality is hard to define in isolation. Differential metrics provides signal.

Example #3: Jury Intelligence: Disagreement amongst juries of frontier models provide signals for policy ambiguities.

The Split-Verdict Flywheel: Harvest → Run Jury → Routing Gate → Escalate/Update → Repeat

Example #4: Wiggle Analysis: Stress-test judges against persuasion and bias tactics.

Wiggle Diagnostic Scorecard: Repetition, Debate, Prior-Swapping

Part 4

Where Evaluations Are Going

From Single-Shot to Agentic.

Single-Shot Eval (The Old Paradigm)

Prompt → Response → Grade

Fails to capture modern model behavior.

Agentic Trajectory (Mythos Example)

Step 1 … Step 10 … Step 30 … Step 40: HARM

181 Firefox vulnerabilities discovered. Claude Opus found 2.

Claude Mythos Preview — 181 working exploits vs Opus 4.6

Smaller, Curated Evaluations.

Case Study: TerminalBench

Scale: Only 89 tasks. Fits entirely in the creator's head.

Difficulty: Frontier models score below 65%.

Examples: Compiling a Linux kernel in QEMU, differential cryptanalysis of FEAL, fixing an OCaml GC bug.

The purpose of an eval is to define a hill to climb. Whether it's 89 tasks or 8,900, unsaturation and separability is all that matters.

Synthetic Bootstrapping

Taxonomy-first generation: naive approach produces duplicates, taxonomy-first approach generates structural diversity through root → branches → leaves

Part 5

Observations on AI Coding

How the work itself is changing

#1: Reach for AI first.

AI-native engineering requires expanding the boundary of total delegation

#2: Parallelization Is Limited

The most agents I can manage at once: ~3

Each session needs real cognitive engagement — reviewing, approving, catching mistakes. Not as parallelizable in the human brain as it's made out to be online.

The system handles its own parallelization through sub-agents. You don't need to orchestrate six sessions — you need to be a good specifier and reviewer of one or two.

Skills and CLIs >> UIs

Dashboards are useful, but if you're operating through Claude Code, you often don't need that layer.

"The terminal is an extraordinary command center."

Skills vs. Durable Code

Ad-hoc natural language vs durable infrastructure: quadrant diagram with decorrelation story

Voice Dictation

When you talk about work, you express yourself more completely than when you type. You capture nuance and intent you'd unconsciously trim.

The messy transcription goes straight to Claude Code. It parses your intention sometimes more accurately than a carefully typed prompt.

Tool: WisprFlow

The Repo as a Knowledge Base.

The repo is a unified company context graph: Slack, docs, emails, transcripts, voice dictation → repo → AI agent → structured deliverables

Every project should have an AGENTS.md file.
The repo is a unified company context graph.

Thank you

Justin Zhao

justinxzhao.com

An Evaluation Landscapefor AI Trust & Safety