Cognyzer Terminus-2.0 Docs
Section 03 · Understanding Tasks

What Makes a Good Task

Creating high-quality tasks is both an art and a science. This guide covers what separates excellent tasks from mediocre ones.

The Core Principle

A good task is one that an expert human can solve confidently, but that challenges or stumps current AI coding agents.

We're not looking for trivia or trick questions. We want genuine engineering challenges that require:

  • Multi-step reasoning
  • Domain expertise
  • Practical problem-solving skills

Key Requirements

1. Difficulty Target

The average pass rate for frontier agents (GPT-5.2 or Claude Opus 4.6) must be below 80%.

Difficulty   Pass Rate   Description
Hard         <= 20%      Requires deep expertise, multi-step reasoning
Medium       21-80%      Moderate complexity, some domain knowledge
Easy         > 80%       Straightforward but still challenging

Tasks with 100% pass rate will NOT be accepted.

2. Multi-Step Complexity

Tasks must require chaining multiple commands, handling intermediate state, and reasoning across steps. Single-command tasks are too easy.

Good: "Debug the failing test suite, fix the three bugs causing failures, and verify all tests pass."

Bad: "Run the test suite." (Too simple)

3. Clear & Unambiguous

The task must be fully specified. An agent should understand exactly what to do without guessing.

Good:

"Implement a function find_longest_palindrome(s: str) -> str that returns the longest palindromic substring. If there are ties, return the first one."

Bad:

"Write code for palindromes." (Ambiguous)

4. Testable & Verifiable

Every task must have deterministic tests that verify completion.

def test_longest_palindrome():
    # Deterministic assertions, including the empty-string edge case.
    assert find_longest_palindrome("babad") in ["bab", "aba"]
    assert find_longest_palindrome("cbbd") == "bb"
    assert find_longest_palindrome("") == ""
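
For reference, one straightforward implementation that satisfies the spec above and these tests is the classic expand-around-center approach. This is a sketch of one acceptable solution, not the only one:

def find_longest_palindrome(s: str) -> str:
    # Expand around every center: a palindrome is symmetric about one
    # character (odd length) or the gap between two characters (even).
    best = ""
    for i in range(len(s)):
        for lo, hi in ((i, i), (i, i + 1)):  # odd- and even-length centers
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo -= 1
                hi += 1
            # The last valid window was s[lo + 1 : hi]; keeping only
            # strictly longer matches resolves ties to the first occurrence.
            if hi - lo - 1 > len(best):
                best = s[lo + 1 : hi]
    return best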

5. No Cheating Opportunities

Think like a malicious agent. Ensure they cannot:

  • Look inside test files for answers
  • Edit data files to pass tests (see the sketch after this list)
  • Delete tests to avoid failures
  • Hardcode expected outputs
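
One way to close the "edit data files" loophole is to pin a checksum of every input artifact and verify it in a test. The sketch below is an illustration, not part of the Terminus harness; data/input.csv and the pinned digest are hypothetical:

import hashlib
from pathlib import Path

# Placeholder value; a real task pins the actual digest of the pristine
# file, computed once by the task author.
PINNED_SHA256 = "replace-with-real-sha256-hex-digest"

def test_data_file_untampered():
    # If an agent edits the data to make other tests pass, this one fails.
    digest = hashlib.sha256(Path("data/input.csv").read_bytes()).hexdigest()
    assert digest == PINNED_SHA256, "data/input.csv was modified"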

How to Make Tasks Harder

Debugging-Style Tasks

Requiring the agent to find the root cause of a failure inherently demands reasoning rather than recall.
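
As a toy illustration (hypothetical, not drawn from a real task), a planted off-by-one like the one below forces the agent to read failing output, localize the fault, and reason about intent instead of pattern-matching a fix:

def moving_average(values: list[float], window: int) -> list[float]:
    # Planted bug: the range stops one window early, so the final
    # average is silently dropped and length-based tests fail.
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window)  # correct: len(values) - window + 1
    ]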

Niche Knowledge

Use knowledge that is publicly available but rarely appears in training data. Example: Blockchain/NFT task

Bespoke Rules

A custom rule buried among common ones trips up agents that lean on familiar patterns. Example: Bespoke rule PR
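
As a hypothetical illustration (invented here, not the content of the linked PR), burying one project-specific rule among rules agents have seen countless times is often enough to derail a pattern-matched solution:

def validate_identifier(name: str) -> bool:
    # Common rules agents have seen thousands of times:
    if not name.isidentifier():
        return False
    if name.startswith("_"):
        return False
    # Bespoke project rule, buried among the familiar ones: identifiers
    # must embed their stem length, e.g. "user4" for the 4-letter stem "user".
    stem = name.rstrip("0123456789")
    return name == f"{stem}{len(stem)}"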

Multi-Step Tasks

Each step carries its own chance of failure, so failures compound across the chain and the overall pass rate drops.
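
Back-of-the-envelope: if the agent clears each step independently with probability p, an n-step chain succeeds with probability p^n. Five steps at 90% each already drop the overall pass rate to about 0.9^5 ≈ 0.59, comfortably under the 80% target.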

What to Avoid

Anti-Pattern             Why It's Bad
Trivia questions         Tests memorization, not reasoning
Ambiguous requirements   Agents can't know what's expected
External dependencies    Requires API keys, network access, etc.
Simple one-liners        Agent solves instantly
Brittle tests            String matching, hardcoded values (sketch below)
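
For example (a sketch with an invented run_report helper), a test that string-matches full output breaks on harmless formatting changes, while a structural check does not:

import json

def test_report_brittle(run_report):
    # Brittle: any whitespace or key-order change breaks this assertion.
    assert run_report() == '{"total": 3, "failed": 0}'

def test_report_robust(run_report):
    # Robust: compare the parsed structure, not the exact byte string.
    data = json.loads(run_report())
    assert data["total"] == 3
    assert data["failed"] == 0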

Quality Checklist

Before submitting, verify:

  • Problem statement is clear and complete
  • Pass rate is below the 80% difficulty target
  • Multi-step reasoning required
  • All constraints explicitly stated
  • Test cases cover all requirements
  • No cheating opportunities
  • Human-written (not LLM-generated instruction.md)

Next Steps

  • Understand task components
  • See the task taxonomy
  • Review example tasks