Review Guidelines
Guidelines for peer reviewers evaluating submitted tasks.
Reviewer Checklist
As a helpful tool to use while reviewing, we provide a comprehensive reviewer checklist that covers every aspect a review should examine. It includes descriptions of what we are looking for qualitatively, and assigns each aspect a severity level that establishes whether an error there can still result in an Accept or always requires the submission to be sent back for revisions.
Go to Reviewer Checklist
Review Philosophy
As a reviewer, you're ensuring quality for the benchmark. Your goal is to:
- Catch issues before they enter the dataset
- Help contributors improve their tasks
- Maintain consistency across submissions
Remember: Each review you conduct should be comprehensive and catch ALL present errors and issues with the task. Do not simply find the first error you come across and then immediately send it back for revision. Instead, find all issues during your review and give feedback on all elements found.
If the EC addresses your comments, the task should be ready for acceptance, assuming no new issues appear. In other words, your feedback should always be clear, complete, and actionable.
Review Checklist
1. Read the Task Description
Authentic Prompt Styling
We have overhauled the way instructions are written. In Edition 2, the instruction.md file should be modeled on realistic prompts that real users and engineers would use when interacting with coding agents in their daily work, and be as succinct as possible.
The instructions for every task should adhere to these six general principles:
1. Task instructions must be concise.
2. Task instructions must be well specified.
3. Task instructions must be interesting.
4. Task instructions must not give away answers or hints.
5. Task instructions must be unique.
6. Task instructions must use absolute paths.

Consult the Prompt Styling Guide for further details on each core principle.
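As a purely hypothetical illustration (the project and paths below are invented), an instruction following these principles might read:

```text
The test suite in /app/project is failing after the upgrade to requests 2.31.
Fix the client code in /app/project/client.py so that all tests in
/app/project/tests/ pass. Do not modify the tests.
```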
2. Review Tests
- Every requirement has a corresponding test
- Tests have informative docstrings
- Tests verify behavior, not implementation
- No brittle string matching (see the sketch after this list)
- No hardcoded thresholds (or thresholds are reasonable)
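As an illustration, here is a minimal sketch of the difference between brittle and behavioral checks, assuming a hypothetical task that writes /app/output/summary.json:

```python
import json

def test_summary_brittle():
    # Brittle: exact string matching breaks on harmless formatting
    # changes such as whitespace or key order.
    with open("/app/output/summary.json") as f:
        assert f.read() == '{"total": 42, "status": "ok"}'

def test_summary_behavioral():
    # Better: parse the output and assert on the required fields.
    with open("/app/output/summary.json") as f:
        data = json.load(f)
    assert data["total"] == 42
    assert data["status"] == "ok"
```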
3. Use the Test-Quality Eval to Double-Check Your Work
There is an automated test-quality eval on submissions that flags common gaps so you can double-check. It is a helper, not a replacement: still do a full test review, and verify every flag - the eval will miss some issues and sometimes flag acceptable tests.
| Flag | Meaning |
|---|---|
| req-gap | Instruction requires something, but no test asserts it. |
| weak-assertion | A test exists but is too loose to catch wrong solutions. |
| phantom-spec | Tests enforce behavior not described in the instruction. |
| flaky-execution | A correct solution can fail due to timing, non-determinism, or infra. |
| vacuous-test | Test can pass no matter the output (e.g. empty loops, always-true checks). |
How to use it: Complete your normal test review, then go through the eval flags to catch anything you might have missed.
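For instance, a vacuous-test flag typically points at a pattern like this (file name is hypothetical):

```python
def test_lines_vacuous():
    # Vacuous: if the file is empty, the loop body never runs
    # and the test passes no matter what the task produced.
    with open("/app/output/result.txt") as f:
        for line in f:
            assert line.strip()

def test_lines_fixed():
    # Fixed: assert the file is non-empty before checking each line.
    with open("/app/output/result.txt") as f:
        lines = f.readlines()
    assert lines, "result.txt should not be empty"
    for line in lines:
        assert line.strip()
```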
4. Check Solution
- Solution demonstrates the process (not just outputs answer)
- Commands are deterministic (see the sketch below)
- Works in the provided environment
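For example, if a solution script uses randomness, it should pin a seed so reruns produce identical output (a minimal sketch; the paths are hypothetical):

```python
import random

# Seed up front so the solution is reproducible across runs.
random.seed(0)

sample = random.sample(range(100), 10)
with open("/app/output/sample.txt", "w") as f:
    f.write("\n".join(str(n) for n in sample))
```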
5. Verify Metadata
- Difficulty matches expected pass rate
- Category is appropriate
- Time estimates are realistic
- Timeout is sufficient but not excessive
6. Watch Agent Runs
Solvable vs. passing a run: Passing a run means all unit tests pass in a single run. Solvable (for the task) means that across 10 runs, each individual test passes at least once - so a task can have no full pass in any run and still be solvable if the agent only ever “partially” succeeds.
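A minimal sketch of the distinction, assuming test results are recorded as one boolean list per run (True meaning that test passed):

```python
def passed_a_run(runs: list[list[bool]]) -> bool:
    # A run passes only if every test passed in that single run.
    return any(all(run) for run in runs)

def solvable(runs: list[list[bool]]) -> bool:
    # Solvable: each test passed in at least one run,
    # even if no single run passed them all.
    return all(any(run[i] for run in runs) for i in range(len(runs[0])))

# Two tests, two runs: each run passes a different test.
runs = [[True, False], [False, True]]
assert not passed_a_run(runs)
assert solvable(runs)
```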
On the task viewer:
- Watch the terminal recording
- If the agent fails, is it failing for a good reason?
- Check the analysis for identified issues
- Read the debug pane
Comprehensive Reviewer Checklist
- You can also use the Reviewer Checklist for a more comprehensive and detailed set of review criteria
Common Issues to Flag
Task Description Problems
| Issue | What to Look For |
|---|---|
| Ambiguous requirements | "Make it better", "fix the issues" |
| Missing output specs | Tests check files not mentioned |
| Relative paths | ./data/file.txt instead of /app/data/file.txt |
| Implicit assumptions | Assumes knowledge not in instructions |
Test Problems
| Issue | What to Look For |
|---|---|
| Brittle tests | Exact string matching |
| Missing coverage | Requirements without tests |
| Order dependency | Tests that must run in sequence |
| Implementation testing | Parsing source code |
Solution Problems
| Issue | What to Look For |
|---|---|
| Hardcoded answers | echo "42" > result.txt |
| Non-deterministic | Uses random without seed |
| Incomplete | Missing steps |
| Over-complex | Unnecessarily convoluted |
Specific Things to Watch
Testing Behavior
It is almost always better to test a behavior by actually running the code than by trying to statically analyze it.
Flag: Tests that grep through source code looking for patterns.
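A minimal sketch of the anti-pattern and the preferred approach, using a hypothetical /app/project/dedupe.py:

```python
import subprocess

def test_grep_antipattern():
    # Anti-pattern: greps the source for a keyword instead of
    # verifying what the code actually does.
    with open("/app/project/dedupe.py") as f:
        assert "set(" in f.read()

def test_run_behavior():
    # Better: run the code and assert on its observable output.
    result = subprocess.run(
        ["python", "/app/project/dedupe.py", "a", "b", "a"],
        capture_output=True, text=True, check=True,
    )
    assert result.stdout.split() == ["a", "b"]
```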
Tool Specifications
Task instructions shouldn't mention specific tools unless there's a way to verify they were used.
Flag: "Use vim to edit the file" (can't verify vim was used).
Randomness
Tasks involving randomness MUST NOT assume the solution matches that same random order.
Flag: Tests that depend on specific np.random.seed values.
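For example, instead of asserting the exact sequence an agent would only produce under one seed, assert order-independent properties (file name is hypothetical):

```python
def test_shuffle_seed_dependent():
    # Fragile: only passes if the solution used this exact seed
    # and made the same RNG calls in the same order.
    import numpy as np
    np.random.seed(42)
    expected = list(np.random.permutation(10))
    with open("/app/output/shuffled.txt") as f:
        assert [int(x) for x in f] == expected

def test_shuffle_robust():
    # Robust: any valid permutation of 0..9 passes.
    with open("/app/output/shuffled.txt") as f:
        values = [int(x) for x in f]
    assert sorted(values) == list(range(10))
```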
Test Complexity
Long test files are almost always wrong, since more tests mean more opportunities for error.
Flag: test_outputs.py with 20+ tests for a simple task.
Data Formats
Be especially paranoid about specifying data formats.
Flag: "Output a CSV" without specifying whether it needs headers.
Anti-Cheating
Think about how agents could cheat:
- Decompile programs to find hidden answers
- Replace programs with dummy versions
- Delete or modify tests
- Access newer git commits
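One common mitigation, sketched below with a hypothetical /app/bin/sum program, is to verify behavior on inputs generated at test time so a hardcoded answer cannot pass:

```python
import random
import subprocess

def test_sum_on_fresh_input():
    # Generate a new input each run; an agent that replaced the
    # program with a fixed-answer stub will fail here.
    nums = [random.randint(1, 1000) for _ in range(20)]
    result = subprocess.run(
        ["/app/bin/sum", *map(str, nums)],
        capture_output=True, text=True, check=True,
    )
    assert int(result.stdout.strip()) == sum(nums)
```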
Review Actions
Approve
Task is ready. No issues found.
Request Changes
Issues need fixing before acceptance. Be specific:
- What's wrong
- Where it is
- How to fix it
Good feedback:
The test `test_output_format` on line 45 uses exact string matching. Please change it to check for required fields instead, allowing for formatting variations.
Bad feedback:
Tests need work.
Extra things you can include:
- Point the submitter to where they can find out how to fix the issue, or to a place where they can see why they are mistaken, e.g. the New: Rubrics section on the training site
Decline
Fundamental issues that can't be easily fixed:
- Too easy (> 80% pass rate consistently)
- Essentially a duplicate of an existing task
- Core concept is flawed
Explain clearly why, and whether revision could salvage it.
Next Steps
- Reviewer Checklist - Full high/medium/low acceptance criteria for reviews
- Reviewer Training - Training videos and materials for reviewers
- Rubrics - How to write and review rubrics for evaluating agent traces
- Quality Guidelines - Required quality standards for all tasks
- Common errors to watch for