# Difficulty Guidelines

Task difficulty is determined by the pass rate measured when the task is run against frontier AI models. This guide helps you design tasks at the right difficulty level.
## Difficulty Levels
| Difficulty | Pass Rate | Description |
|---|---|---|
| Hard | <= 20% | Requires deep expertise, multi-step reasoning |
| Medium | 21-80% | Moderate complexity, some domain knowledge |
| Easy | > 80% | Straightforward but still challenging |
**Important:** Tasks with a 100% pass rate will NOT be accepted.
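The bucketing above is mechanical, so it can be expressed directly. Here is a minimal bash sketch (the function name `classify_difficulty` is illustrative, not part of any tooling):

```bash
# Map a whole-number pass-rate percentage to a difficulty label,
# following the table above. 100% is rejected outright.
classify_difficulty() {
  local rate=$1          # e.g. 35 for a 35% pass rate
  if (( rate == 100 )); then
    echo "Rejected"      # 100% pass rate is not accepted
  elif (( rate <= 20 )); then
    echo "Hard"
  elif (( rate <= 80 )); then
    echo "Medium"
  else
    echo "Easy"
  fi
}

classify_difficulty 15   # -> Hard
```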
## Evaluation Process

Each task is evaluated against:

- GPT-5.2 with the Codex agent
- Claude Opus 4.6 with the Claude Code agent
- 5 runs each to determine average accuracy
The difficulty is based on whichever model performs better.
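Concretely, that means taking each model's average over its 5 runs and rating the task against the higher of the two. A sketch of the arithmetic (the pass counts here are made up):

```bash
# Pass counts observed over 5 runs per model (illustrative numbers).
gpt_passes=1; claude_passes=3; runs=5

gpt_rate=$(( gpt_passes * 100 / runs ))        # 20%
claude_rate=$(( claude_passes * 100 / runs ))  # 60%

# Difficulty is judged on whichever model performs better.
best=$(( gpt_rate > claude_rate ? gpt_rate : claude_rate ))
echo "rating pass rate: ${best}%"              # 60% -> Medium
```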
## Designing for Difficulty

### For Hard Tasks (<= 20%)
Hard tasks require one or more of:
- **Deep domain expertise** - knowledge that is underrepresented in LLM training data
- **Complex multi-step reasoning** - 10+ sequential steps
- **Subtle debugging** - root cause analysis required
- **Niche tools/languages** - less common technologies
Techniques:

- Use bespoke rules buried in common patterns
- Require understanding of obscure documentation
- Create debugging tasks where the root cause isn't obvious (see the sketch below)
- Use domain-specific knowledge (blockchains, scientific computing)
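To make "root cause isn't obvious" concrete, here is a hypothetical debugging scenario: the script's output depends on the machine's locale, so it behaves differently for the agent than it did for the author.

```bash
# Hypothetical subtle bug: sort order depends on the locale.
printf 'apple\nBanana\n' | sort | head -1
# Under LC_ALL=C this prints "Banana" (ASCII: uppercase sorts first);
# under a typical UTF-8 locale it prints "apple". The non-obvious
# fix is to pin the locale:
printf 'apple\nBanana\n' | LC_ALL=C sort | head -1   # always "Banana"
```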
### For Medium Tasks (21-80%)
Medium tasks typically involve:
- **Moderate complexity** - 5-10 steps
- **Some domain knowledge** - common but not trivial
- **Clear requirements** - but a non-obvious solution
Techniques:

- Combine multiple familiar concepts
- Add edge cases that require careful handling
- Include configuration that's easy to miss (see the sketch below)
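As a concrete (hypothetical) example of easy-to-miss configuration: the solution code can be correct, yet the checks still fail unless one environment variable is exported first. The variable and script names below are invented for illustration.

```bash
# The easy-to-miss step: without this export, ./run_tests.sh
# fails even when the agent's code changes are correct.
export APP_FEATURE_FLAGS=strict_mode
./run_tests.sh
```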
### For Easy Tasks (> 80%)
Easy tasks should still be:
- **Non-trivial** - not one-liners
- **Multi-step** - at least 3-5 steps
- **Testable** - clear success criteria
Techniques:

- Standard tasks with one or two tricky aspects
- Well-known problems with specific constraints
- Setup tasks with multiple requirements (see the sketch below)
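For instance, an "easy" setup task might chain a handful of simple but mandatory steps, each individually trivial. A sketch, assuming standard Python project conventions:

```bash
# Multi-step but straightforward: every step must happen, in order.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pytest -q    # clear success criterion: the test suite passes
```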
## Common Mistakes

### Making Tasks Too Easy
| Mistake | Impact |
|---|---|
| Single-step solutions | Agents solve immediately |
| Obvious debugging | Pattern matching succeeds |
| Common tutorials | In training data |
| Simple API usage | Well-documented |
### Making Tasks Unfair
| Mistake | Impact |
|---|---|
| Impossible requirements | No one can solve |
| Ambiguous instructions | Luck determines success |
| Time-dependent | Results vary randomly |
| External dependencies | Environment issues |
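To make the "time-dependent" row concrete: a check like the following (a deliberately bad, hypothetical example) passes or fails depending on when the grader happens to run, so the outcome is luck rather than skill.

```bash
# Unfair: "passes" only during one UTC hour of the day.
if [ "$(date -u +%H)" = "12" ]; then
  echo "PASS"
else
  echo "FAIL"
fi
```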
## Testing Your Difficulty
Before submitting, verify the difficulty as follows:
### 1. Run Against Oracle Agent

```bash
harbor run -a oracle -p <task-folder>
```
This should PASS. If it doesn't, your task may have issues.
### 2. Run Against Real Agents

```bash
# GPT-5.2
harbor run -a terminus-2 -m openai/@openai/gpt-5.2 -p <task-folder>

# Claude Opus 4.6
harbor run -a terminus-2 -m anthropic/@anthropic/claude-opus-4-6 -p <task-folder>
```
Run each agent at least 2-3 times to gauge the pass rate; a loop like the one below can automate this.
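This sketch assumes `harbor run` exits with status 0 on PASS and non-zero otherwise; if the CLI reports results differently, adapt the check accordingly.

```bash
# Count passes over several runs (exit-code behavior is an assumption).
passes=0; runs=3
for _ in $(seq "$runs"); do
  if harbor run -a terminus-2 -m openai/@openai/gpt-5.2 -p <task-folder>; then
    passes=$((passes + 1))
  fi
done
echo "pass rate: ${passes}/${runs}"
```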
### 3. Analyze Failures

When agents fail, ask:

- Is it for a good reason (task difficulty)?
- Or a bad reason (unclear instructions)?

- **Good failures:** agent misunderstood the complexity, missed an edge case
- **Bad failures:** ambiguous requirements, environment issues
## Adjusting Difficulty
### To Make Harder
- Add more steps
- Include hidden requirements
- Use niche knowledge
- Create debugging scenarios
- Add edge cases
### To Make Easier
- Reduce step count
- Make requirements more explicit
- Use common technologies
- Provide more hints in instructions
- Simplify the environment
## Next Steps
- Start creating your task
- Review CI checks