Calibrating on wikitext.train and evaluating on wikitext.test may leave residual domain leakage, since both splits come from the same corpus. Re-running calibration on C4 (general web) and the-stack-python (code) should give similar PPL on wikitext.test; if it is noticeably worse, the wikitext.train calibration is overfitting the eval domain.
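The comparison described above could be scripted roughly as follows. This is a minimal sketch, not the procedure used in this experiment: the function name `leakage_suspects`, the dictionary layout, and the 5% relative tolerance are all assumptions for illustration, and the PPL values in the usage lines are placeholders rather than measured results.

```python
def leakage_suspects(ppl_by_calib, in_domain="wikitext.train", rel_tol=0.05):
    """Flag calibration sets whose eval PPL is worse than the in-domain run.

    ppl_by_calib maps calibration-set name -> PPL measured on the same
    eval split (here assumed to be wikitext.test).
    rel_tol is a hypothetical threshold; choose it from run-to-run noise.
    Returns {name: relative PPL increase} for sets exceeding rel_tol.
    """
    base = ppl_by_calib[in_domain]
    suspects = {}
    for name, ppl in ppl_by_calib.items():
        if name == in_domain:
            continue
        rel_increase = ppl / base - 1.0
        if rel_increase > rel_tol:
            suspects[name] = rel_increase
    return suspects


# Placeholder numbers only, to show the call shape (not real measurements).
example = {"wikitext.train": 10.0, "c4": 10.2, "the-stack-python": 11.0}
flagged = leakage_suspects(example)
print(sorted(flagged))  # -> ['the-stack-python']
```

An empty result would be consistent with no measurable leakage at the chosen tolerance; any flagged set suggests the wikitext.train calibration benefits from matching the eval domain.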
EXP-0017