Cognyzer Terminus-2.0 Docs
Section 03 · Understanding Tasks

Difficulty Guidelines

Task difficulty is determined by the pass rate measured when the task is run against frontier AI models. This guide helps you design tasks at the right difficulty level.

Difficulty Levels

Difficulty | Pass Rate | Description
---------- | --------- | ----------------------------------------------
Hard       | <= 20%    | Requires deep expertise, multi-step reasoning
Medium     | 21-80%    | Moderate complexity, some domain knowledge
Easy       | > 80%     | Straightforward but still challenging

Important: Tasks with 100% pass rate will NOT be accepted.
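
To make the thresholds concrete, here is a minimal Python sketch of the mapping from a measured pass rate to a tier. The function name is illustrative and not part of the platform; the thresholds and the 100% rejection rule mirror the table and note above.

def classify_difficulty(pass_rate: float) -> str:
    # Illustrative helper: pass_rate is a fraction in [0, 1],
    # and the thresholds mirror the table above.
    if pass_rate >= 1.0:
        raise ValueError("Tasks with a 100% pass rate are not accepted")
    if pass_rate <= 0.20:
        return "Hard"
    if pass_rate <= 0.80:
        return "Medium"
    return "Easy"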

Evaluation Process

Each task is evaluated against:

  • GPT-5.2 with Codex agent
  • Claude Opus 4.6 with Claude Code agent
  • 5 runs each to determine the average pass rate

Difficulty is classified from the pass rate of whichever model performs better.
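
As a sketch of that aggregation, assuming each run is recorded as a simple pass/fail boolean (the data and variable names below are illustrative, not real harness output):

# Five pass/fail results per model (illustrative data).
runs = {
    "gpt-5.2":         [True, False, False, True, False],
    "claude-opus-4.6": [True, False, False, False, False],
}

# Pass rate per model = passing runs / total runs.
pass_rates = {model: sum(r) / len(r) for model, r in runs.items()}

# Classification uses whichever model scores higher.
best_rate = max(pass_rates.values())
print(pass_rates)   # {'gpt-5.2': 0.4, 'claude-opus-4.6': 0.2}
print(best_rate)    # 0.4 -> Medium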

Designing for Difficulty

For Hard Tasks (<= 20%)

Hard tasks require one or more of:

  • Deep domain expertise - Knowledge underrepresented in LLM training data
  • Complex multi-step reasoning - 10+ sequential steps
  • Subtle debugging - Root cause analysis required
  • Niche tools/languages - Less common technologies

Techniques:

  • Use bespoke rules buried in common patterns
  • Require understanding of obscure documentation
  • Create debugging tasks where the root cause isn't obvious
  • Use domain-specific knowledge (blockchains, scientific computing)

For Medium Tasks (21-80%)

Medium tasks typically involve:

  • Moderate complexity - 5-10 steps
  • Some domain knowledge - Common but not trivial
  • Clear requirements - But non-obvious solution

Techniques:

  • Combine multiple familiar concepts
  • Add edge cases that require careful handling
  • Include configuration that's easy to miss

For Easy Tasks (> 80%)

Easy tasks should still be:

  • Non-trivial - Not one-liners
  • Multi-step - At least 3-5 steps
  • Testable - Clear success criteria

Techniques:

  • Standard tasks with one or two tricky aspects
  • Well-known problems with specific constraints
  • Setup tasks with multiple requirements

Common Mistakes

Making Tasks Too Easy

Mistake               | Impact
--------------------- | --------------------------
Single-step solutions | Agents solve immediately
Obvious debugging     | Pattern matching succeeds
Common tutorials      | In training data
Simple API usage      | Well-documented

Making Tasks Unfair

Mistake                 | Impact
----------------------- | ------------------------
Impossible requirements | No one can solve
Ambiguous instructions  | Luck determines success
Time-dependent          | Results vary randomly
External dependencies   | Environment issues

Testing Your Difficulty

Before submitting, verify the difficulty with the following steps:

1. Run Against Oracle Agent

harbor run -a oracle -p <task-folder>

This should PASS. If it doesn't, the task itself (not its difficulty) likely has a problem; fix it before measuring pass rates.

2. Run Against Real Agents

# GPT-5.2
harbor run -a terminus-2 -m openai/@openai/gpt-5.2 -p <task-folder>

# Claude Opus 4.6
harbor run -a terminus-2 -m anthropic/@anthropic/claude-opus-4-6 -p <task-folder>

Run each model at least 2-3 times to gauge the pass rate.
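
If you want to automate the repeated runs, a minimal Python sketch follows. The command is copied verbatim from the example above; the script assumes harbor exits with status 0 on a passing run, which is an assumption rather than documented behavior, so adjust the check to match your actual output.

import subprocess

# Command from the example above; replace <task-folder> with a real path.
CMD = ["harbor", "run", "-a", "terminus-2",
       "-m", "openai/@openai/gpt-5.2", "-p", "<task-folder>"]

TRIALS = 3
passes = 0
for _ in range(TRIALS):
    # ASSUMPTION: exit code 0 means the task passed; verify for your setup.
    result = subprocess.run(CMD)
    passes += result.returncode == 0

print(f"pass rate: {passes}/{TRIALS} = {passes / TRIALS:.0%}")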

3. Analyze Failures

When agents fail, ask:

  • Is it for a good reason (task difficulty)?
  • Or a bad reason (unclear instructions)?

Good failures: Agent misunderstood complexity, missed an edge case
Bad failures: Ambiguous requirements, environment issues

Adjusting Difficulty

To Make Harder:

  • Add more steps
  • Include hidden requirements
  • Use niche knowledge
  • Create debugging scenarios
  • Add edge cases

To Make Easier:

  • Reduce step count
  • Make requirements more explicit
  • Use common technologies
  • Provide more hints in instructions
  • Simplify the environment

Next Steps

  • Start creating your task
  • Review CI checks