Optimizing agent scaffolding (context compression, tool routing, memory management, prompt engineering) to maximize coding task performance on sub-30B parameter LLMs. Primary model: Qwen3.5-27B. Evaluation: SWE-bench Verified. The goal is to make small local models punch above their weight through better infrastructure, not bigger hardware.
The table below lists 16 experiments.
| ID | Title / Hypothesis | Result | Confidence | Reproductions | Metrics |
|---|---|---|---|---|---|
| cexp_12676d | CR/CT round compression reduces context pressure and improves solve time | inconclusive | | 1/5 | swebench_resolve_rate, time_to_solve_seconds, patch_chars |
| cexp_1801f3 | Flipping the emergency compress order (CT collapse first, then tool result compression) preserves recent verbatim results longer | inconclusive | | 1/5 | swebench_resolve_rate, time_to_solve_seconds, patch_chars |
| cexp_1b69c4 | Establish pre-CR/CT baseline resolve rate across diverse SWE-bench tasks | baseline | | 1/5 | swebench_resolve_rate, patches_generated, tasks_attempted, avg_time_seconds |
| cexp_1f7f95 | Different reasoning lengths need different compression levels: S1 (4-6 sentences, ≥800 ch), S2 (2-3 sentences, ≥400 ch), S3 (1 sentence, <400 ch); all generated in one LLM call, and code picks the appropriate tier | success | | 1/5 | swebench_resolve_rate, time_to_solve_seconds, patch_chars |
| cexp_39fcd1 | Establish baseline performance with tiered CR compression (S1/S2/S3), CT emergency collapse, and an 8k effective context window on Qwen3.5-27B | success | | 1/5 | swebench_resolve_rate, patches_generated, tasks_attempted, avg_time_seconds |
| cexp_a3eb8e | Increasing effective context to 10k with TurboQuant KV cache improves performance by giving the model more working memory | inconclusive | | 1/5 | swebench_resolve_rate, time_to_solve_seconds, patch_chars |
| cexp_a45c53 | Tool-type-aware compression prompts preserve more useful information (verbatim code for reads, line ranges for edits, key output for commands) | success | | 1/5 | swebench_resolve_rate, time_to_solve_seconds, patch_chars |
| cexp_140dcd | Broader baseline across Django and other frameworks | baseline | | 1/5 | swebench_resolve_rate, patches_generated, tasks_attempted, avg_time_seconds |
| cexp_2a9b92 | Different reasoning lengths need different compression levels: S1 (4-6 sentences, ≥800 ch), S2 (2-3 sentences, ≥400 ch), S3 (1 sentence, <400 ch); all generated in one LLM call, and code picks the appropriate tier based on original reasoning length | success | | 1/5 | swebench_resolve_rate, time_to_solve_seconds, patch_chars, rounds |
| cexp_2c4afe | Tiered CR compression generalizes beyond single-task testing | success | | 1/5 | swebench_resolve_rate, patches_generated, tasks_attempted, avg_time_seconds |
| cexp_5e332e | Flipping the emergency compress order (CT collapse first, then tool result compression) preserves recent verbatim tool results longer, improving model decisions | inconclusive | | 1/5 | swebench_resolve_rate, time_to_solve_seconds, patch_chars |
| cexp_8262c3 | Higher-quality CUDA KV cache quantization reduces context rot, improving convergence at 10k effective context | inconclusive | | 1/5 | swebench_resolve_rate, time_to_solve_seconds, patch_chars, rounds |
| cexp_87f3b0 | Establish pre-optimization baseline resolve rate across diverse SWE-bench tasks with no context compression active | baseline | | 1/5 | swebench_resolve_rate, patches_generated, tasks_attempted, avg_time_seconds |
| cexp_a40f00 | CR/CT round compression (reasoning summary + tool breadcrumb per round) reduces context pressure and improves solve time | inconclusive | | 1/5 | swebench_resolve_rate, time_to_solve_seconds, patch_chars |
| cexp_d58d3d | Increasing effective context to 10k with TurboQuant KV cache quantization allows more working memory without quality loss | failure | | 1/5 | swebench_resolve_rate, time_to_solve_seconds, patch_chars, rounds |
| cexp_d737bf | Tracking iteration improvements on a single control task shows scaffold optimization impact | success | | 1/5 | v3_cr_ct_time, v4_emergency_flip_time, v5_tool_aware_time, v6_tiered_cr_time, v8_turboquant_v1_time, v9_turboquant_v2_time, total_improvement_pct |
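The tiered CR compression hypothesis (cexp_1f7f95 / cexp_2a9b92) implies a simple selection rule: one LLM call produces all three summary tiers, and plain code picks a tier from the length of the original reasoning. A minimal sketch of that rule, with hypothetical names (`CompressionTiers`, `pick_tier`) since the scaffold's actual interfaces are not shown in this table:

```python
from dataclasses import dataclass

# Thresholds from the hypothesis: S1 for long reasoning (>= 800 chars),
# S2 for medium (>= 400 chars), S3 for everything shorter.
S1_MIN_CHARS = 800
S2_MIN_CHARS = 400

@dataclass
class CompressionTiers:
    """All three summaries produced by a single LLM call."""
    s1: str  # 4-6 sentence summary
    s2: str  # 2-3 sentence summary
    s3: str  # 1 sentence summary

def pick_tier(original_reasoning: str, tiers: CompressionTiers) -> str:
    """Select the compressed form based on the original reasoning length,
    so long reasoning retains detail and short reasoning collapses to
    one sentence."""
    n = len(original_reasoning)
    if n >= S1_MIN_CHARS:
        return tiers.s1
    if n >= S2_MIN_CHARS:
        return tiers.s2
    return tiers.s3

# Example: a short reasoning trace collapses to the one-sentence tier.
tiers = CompressionTiers(
    s1="Long summary ...",
    s2="Medium summary ...",
    s3="Fixed the import order.",
)
print(pick_tier("brief note", tiers))  # -> Fixed the import order.
```

Generating all tiers in one call keeps latency flat regardless of which tier the code ultimately selects; the thresholds themselves are the tunable part of the experiment.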