Tiered CR generalization: 5-task subset

success
0.14
1/5
Consensus Metrics
swebench_resolve_rate 0.75 (n=1, σ=0)
patches_generated 3 (n=1, σ=0)
tasks_attempted 4 (n=1, σ=0)
avg_time_seconds 328 (n=1, σ=0)
Parameters
effective_context_tokens 8000
cr_s1_threshold 800
cr_s2_threshold 400
cr_ct_max_tokens 500
tool_aware_prompt true
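For a sense of scale, the parameters above can be written as a typed config and compared against the 24k-character problem statement mentioned in the run notes. This is only a rough sketch: the `RunParams` class, the helper names, and the 4-characters-per-token ratio are assumptions for illustration, and the page does not document what each `cr_*` threshold controls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunParams:
    """Parameter values copied from this page; field semantics are not documented here."""
    effective_context_tokens: int = 8000
    cr_s1_threshold: int = 800
    cr_s2_threshold: int = 400
    cr_ct_max_tokens: int = 500
    tool_aware_prompt: bool = True


# Assumption: a crude heuristic, not the tokenizer the model actually uses.
CHARS_PER_TOKEN = 4


def estimate_tokens(n_chars: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from a character count."""
    return n_chars // chars_per_token


def context_share(n_chars: int, params: RunParams) -> float:
    """Fraction of the effective context a text of n_chars would occupy."""
    return estimate_tokens(n_chars) / params.effective_context_tokens


params = RunParams()
share = context_share(24_000, params)
print(f"estimated tokens: {estimate_tokens(24_000)}, context share: {share:.2f}")
```

Under that heuristic a 24k-character statement would take roughly three quarters of the 8000-token budget before any tool output is added, which is in line with the task being skipped.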
Hypothesis

Tiered CR compression generalizes beyond single-task testing

Tags
Subject
Model: qwen3.5-27b-q5_k_m Dataset: swebench-verified
Baseline Comparison
swebench_resolve_rate +275% vs EXP-0001
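The baseline value itself is not shown on this page, but it can be back-computed if the percentage is read as a relative change against EXP-0001. The sketch below makes that assumption explicit; the function names are illustrative and the resulting baseline is inferred, not reported.

```python
import math


def implied_baseline(current: float, pct_change: float) -> float:
    """Baseline value implied by a current value and a relative change in percent."""
    return current / (1 + pct_change / 100)


def relative_change_pct(baseline: float, current: float) -> float:
    """Relative change of current over baseline, in percent."""
    return (current - baseline) / baseline * 100


# Values from this page: resolve rate 0.75, reported as +275% vs EXP-0001.
baseline = implied_baseline(0.75, 275.0)
print(f"implied EXP-0001 resolve rate: {baseline:.2f}")
print(f"round-trip change: {relative_change_pct(baseline, 0.75):+.0f}%")
```

Read this way, the comparison implies a baseline resolve rate of about 0.20. With n=1 on both sides, the percentage is best treated as indicative only.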
Dependencies
Instances (1 reproduction)
tack-scaffold-experiments claude-opus-4 none (CPU inference)

3 of the 4 attempted tasks were patched. pylint-7080 was skipped because its problem statement is 24k chars. django-11734 failed after 61 rounds: the model fell into an exploration loop and could not hold the Django ORM complexity in the 8k context. astropy-14508 (394s), scikit-learn-26323 (329s), and django-14034 (260s) succeeded.

swebench_resolve_rate 0.75
patches_generated 3
tasks_attempted 4
avg_time_seconds 328
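The instance metrics can be re-derived from the per-task outcomes in the run note. The sketch below is a consistency check, not the tracker's actual aggregation code: the status labels and the `summarize` helper are hypothetical, and it assumes that the skipped task is excluded from `tasks_attempted` and that `avg_time_seconds` averages only the three tasks with a reported time (no time is given for django-11734).

```python
from statistics import mean

# Per-task outcomes transcribed from the run note: (status, seconds or None).
tasks = {
    "astropy-14508": ("patched", 394),
    "scikit-learn-26323": ("patched", 329),
    "django-14034": ("patched", 260),
    "django-11734": ("failed", None),   # 61 rounds, no time reported
    "pylint-7080": ("skipped", None),   # problem statement 24k chars
}


def summarize(results):
    """Recompute the four instance metrics from per-task results."""
    attempted = [s for s, _ in results.values() if s != "skipped"]
    patched = [s for s in attempted if s == "patched"]
    times = [t for _, t in results.values() if t is not None]
    return {
        "swebench_resolve_rate": len(patched) / len(attempted),
        "patches_generated": len(patched),
        "tasks_attempted": len(attempted),
        "avg_time_seconds": round(mean(times)),
    }


metrics = summarize(tasks)
print(metrics)
```

Under those assumptions the recomputed values match the reported ones: 3/4 = 0.75, and (394 + 329 + 260) / 3 rounds to 328.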