# What Makes a Good Task
Creating high-quality tasks is both an art and a science. This guide covers what separates excellent tasks from mediocre ones.
## The Core Principle
A good task is one that an expert human can solve confidently, but that challenges or stumps current AI coding agents.
We're not looking for trivia or trick questions. We want genuine engineering challenges that require:
- Multi-step reasoning
- Domain expertise
- Practical problem-solving skills
## Key Requirements

### 1. Difficulty Target

The average pass rate must be below 80% for GPT-5.2 or Claude Opus 4.6.
| Difficulty | Pass Rate | Description |
|---|---|---|
| Hard | <= 20% | Requires deep expertise, multi-step reasoning |
| Medium | 21-80% | Moderate complexity, some domain knowledge |
| Easy | > 80% | Straightforward; too easy to be accepted |
Tasks with 100% pass rate will NOT be accepted.
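To place a task in a difficulty band, the pass rate is estimated empirically over repeated runs. A minimal sketch, where `run_agent_on_task` is a hypothetical stand-in for a real evaluation harness (not part of any specific tooling):

```python
# Sketch: estimating a task's pass rate over repeated agent runs.
# `run_agent_on_task` is a hypothetical stand-in for an evaluation harness.
def estimate_pass_rate(run_agent_on_task, trials=10):
    passes = sum(1 for _ in range(trials) if run_agent_on_task())
    return passes / trials

# Stubbed agent that passes 2 of 10 runs -> 0.2, i.e. the "Hard" band.
outcomes = iter([True, False, False, True, False,
                 False, False, False, False, False])
rate = estimate_pass_rate(lambda: next(outcomes), trials=10)
```

More trials narrow the estimate; ten runs gives only a coarse picture in practice.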
### 2. Multi-Step Complexity
Tasks must require chaining multiple commands, handling intermediate states, and reasoning. Single-command tasks are too easy.
Good: "Debug the failing test suite, fix the three bugs causing failures, and verify all tests pass."
Bad: "Run the test suite." (Too simple)
### 3. Clear & Unambiguous
The task must be fully specified. An agent should understand exactly what to do without guessing.
Good:
"Implement a function `find_longest_palindrome(s: str) -> str` that returns the longest palindromic substring. If there are ties, return the first one."
Bad:
"Write code for palindromes." (Ambiguous)
### 4. Testable & Verifiable
Every task must have deterministic tests that verify completion.
```python
def test_longest_palindrome():
    assert find_longest_palindrome("babad") == "bab"  # tie with "aba": first occurrence wins
    assert find_longest_palindrome("cbbd") == "bb"
    assert find_longest_palindrome("") == ""
```
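For context, one way the function under test could be written is the classic expand-around-center approach; this is a sketch of one valid solution, not the required one:

```python
def find_longest_palindrome(s: str) -> str:
    # Expand-around-center: try every odd- and even-length center.
    best = s[:1]  # "" for empty input, first character otherwise
    for i in range(len(s)):
        for lo, hi in ((i, i), (i, i + 1)):
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo -= 1
                hi += 1
            candidate = s[lo + 1:hi]
            if len(candidate) > len(best):  # strict '>' keeps the first tie
                best = candidate
    return best
```

Using a strict `>` comparison is what makes ties resolve to the first occurrence, as the specification requires.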
### 5. No Cheating Opportunities
Think like a malicious agent. Ensure they cannot:

- Look inside test files for answers
- Edit data files to pass tests
- Delete tests to avoid failures
- Hardcode expected outputs
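One way to harden grading against data-file tampering is to record checksums at task-creation time and verify them before running tests. A minimal sketch; the file path and digest below are illustrative placeholders, not part of any real task:

```python
import hashlib
from pathlib import Path

# Digests recorded when the task was created (placeholder values here).
EXPECTED_SHA256 = {
    "data/input.csv": "0" * 64,  # illustrative placeholder digest
}

def data_files_untouched(expected=EXPECTED_SHA256) -> bool:
    """Return True only if every tracked file exists and matches its digest."""
    for rel_path, digest in expected.items():
        path = Path(rel_path)
        if not path.is_file():
            return False
        if hashlib.sha256(path.read_bytes()).hexdigest() != digest:
            return False
    return True
```

Running the tests from a read-only copy of the repository is a complementary guard against edited or deleted test files.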
## How to Make Tasks Harder

### Debugging-Style Tasks

When an agent must diagnose the root cause of an issue, the task inherently requires reasoning rather than pattern matching.
### Niche Knowledge
Use publicly available but rarely-trained knowledge. Example: Blockchain/NFT task
### Bespoke Rules
A custom rule buried among common rules confuses agents. Example: Bespoke rule PR
### Multi-Step Tasks

Each step carries its own chance of failure, so the overall failure rate compounds across steps.
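The compounding effect is easy to quantify: if an agent completes each step independently with probability p, it finishes an n-step task with probability roughly p ** n. A quick illustration:

```python
# A 90% per-step success rate compounds quickly over multiple steps.
p = 0.9
for n in (1, 3, 5, 10):
    print(f"{n:2d} steps -> {p ** n:.2f} overall success")
# A 90%-per-step agent finishes a 10-step task only about 35% of the time.
```

The independence assumption is a simplification, but it explains why chaining steps reliably lowers pass rates.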
## What to Avoid
| Anti-Pattern | Why It's Bad |
|---|---|
| Trivia questions | Tests memorization, not reasoning |
| Ambiguous requirements | Agents can't know what's expected |
| External dependencies | Requires API keys, network, etc. |
| Simple one-liners | Agent solves instantly |
| Brittle tests | String matching, hardcoded values |
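To make the "brittle tests" row concrete, here is the same check written both ways, against a made-up program output of key=value lines:

```python
# Hypothetical program output under test (illustrative only).
output = "count=3\nstatus=ok\n"

# Brittle: exact string comparison breaks if lines are reordered or respaced.
brittle_pass = output == "count=3\nstatus=ok\n"

# Robust: parse the output, then assert on the parsed values.
parsed = dict(line.split("=", 1) for line in output.strip().splitlines())
robust_pass = parsed["count"] == "3" and parsed["status"] == "ok"
```

Reordering the two output lines would fail the brittle check while the robust one still passes, because the parse does not depend on line order.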
## Quality Checklist
Before submitting, verify:
- Problem statement is clear and complete
- Pass rate is below 80%
- Multi-step reasoning required
- All constraints explicitly stated
- Test cases cover all requirements
- No cheating opportunities
- Human-written (not LLM-generated instruction.md)
## Next Steps
- Understand task components
- See the task taxonomy
- Review example tasks