Batch 2 — early baseline runs (10 tasks)

baseline
0.14
1/5
Consensus Metrics
swebench_resolve_rate 0.22 (n=1, σ=0)
patches_generated 2 (n=1, σ=0)
tasks_attempted 9 (n=1, σ=0)
avg_time_seconds 377 (n=1, σ=0)
Parameters
effective_context_tokens 8000
cr_ct false
Hypothesis

Broader baseline across Django and other frameworks

Tags
Instances (1 reproduction)
tack-scaffold-experiments claude-opus-4 none (CPU inference)

Combines baseline5 (1 of 5 tasks) with batch2 (1 of 4 usable tasks), for 2 of 9 tasks overall: a patch rate of about 22% before optimization. Tasks were drawn from pylint, astropy, django, scikit-learn, and sympy.
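
The 22% figure follows from pooling the counts of the two runs. A minimal sketch of that arithmetic, assuming a hypothetical helper named `pooled_rate` and (successes, usable tasks) pairs read off the note above; none of this is part of the tracker itself:

```python
from fractions import Fraction

def pooled_rate(runs):
    """Pool (successes, usable_tasks) pairs across runs into one overall fraction."""
    successes = sum(s for s, _ in runs)
    usable = sum(n for _, n in runs)
    if usable == 0:
        raise ValueError("no usable tasks to pool")
    return Fraction(successes, usable)

# Counts from the note: baseline5 had 1 of 5, batch2 had 1 of 4 usable tasks.
runs = [(1, 5), (1, 4)]
rate = pooled_rate(runs)
print(f"{rate.numerator}/{rate.denominator} = {float(rate):.2f}")  # 2/9 = 0.22
```

Pooling counts weights every task equally, which matches the reported tasks_attempted of 9 and patches_generated of 2.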
