Against hand-wavy AI evaluations

A small AI lab in public. Notes, tools, and experiments.

Many AI claims lean on cherry-picked prompts and tiny samples. Here’s a minimal protocol for honest evaluation: define the task, use real data, add baselines, and publish errors.
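To make the four steps concrete, here is a minimal sketch in Python. It assumes the task is single-label classification and that your real data is a list of (input, expected_label) pairs. The names `evaluate`, `majority_baseline`, and `wilson_interval` are hypothetical helpers written for this illustration, not part of any library, and the placeholder rows in the demo are not real data or results.

```python
import math
from collections import Counter


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def evaluate(predict, examples):
    """Score `predict` on every example and keep each miss for publication."""
    errors = []
    for text, expected in examples:
        got = predict(text)
        if got != expected:
            errors.append({"input": text, "expected": expected, "got": got})
    n = len(examples)
    correct = n - len(errors)
    return {
        "n": n,
        "accuracy": correct / n if n else 0.0,
        "ci95": wilson_interval(correct, n),
        "errors": errors,
    }


def majority_baseline(examples):
    """Trivial baseline: always answer with the most frequent label.

    Assumption: the label counts come from the evaluation set itself,
    which slightly flatters the baseline. Use a separate split if you have one.
    """
    most_common = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda _text: most_common


if __name__ == "__main__":
    # Placeholder rows only; swap in your real, untouched evaluation set.
    examples = [("example input 1", "yes"), ("example input 2", "no")]
    report = evaluate(majority_baseline(examples), examples)
    print(report["n"], report["accuracy"], report["ci95"], report["errors"])
```

Run the same `evaluate` call on your model and on the baseline, then report both along with the sample size, the interval, and the full error list. With a tiny sample the interval is wide, which is the point: it shows how little five prompts can support.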

Impressive demos are easy. Honest evaluation is not. Many claims rest on five cherry-picked prompts or vague criteria...