Attribution Method Comparison: AirRep vs STRIDE vs LoGRA
Task: detect benchmark contamination in Qwen2.5-0.5B fine-tuned on MATH. Five models are evaluated: 2 seeds at 0.5% and 3 seeds at 1%. Spearman ρ is computed between each method's attribution scores and the binary leaked/not-leaked label.
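As an illustration of how Spearman ρ against a binary label can be computed, here is a minimal pure-Python sketch. The names `average_ranks`, `pearson`, and `spearman_rho`, and the toy `scores` and `labels`, are hypothetical and not taken from the evaluation code; the sketch assumes tied values receive the mean of their rank positions, which is the usual convention.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_rho(scores, labels):
    """Spearman rho = Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(scores), average_ranks(labels))

# Hypothetical toy data: 1 = leaked, 0 = not leaked.
scores = [0.9, 0.1, 0.4, 0.8, 0.3, 0.2]
labels = [1, 0, 0, 1, 0, 0]
rho = spearman_rho(scores, labels)
print(round(rho, 3))  # 0.828
```

In this toy case the scores separate the two classes perfectly, yet ρ is about 0.828 rather than 1, because the binary label is heavily tied. That is worth keeping in mind when reading ρ values against a binary target.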
Mean metrics over 5 models
Summary
AirRep and STRIDE are nearly tied on Spearman ρ (0.117 vs 0.115), while LoGRA's ρ is approximately 0. STRIDE edges out AirRep on ROC-AUC; AirRep leads on AUPRC and MRR.
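For readers less familiar with the ranking metrics named above, the following is a rough pure-Python sketch of each. It rests on assumptions not stated in the source: AUPRC is approximated by average precision, MRR is taken as the mean of per-ranking reciprocal ranks of the first leaked item, and all function names and toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """Chance that a random leaked item outscores a random clean one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a leaked item appears."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

def reciprocal_rank(scores, labels):
    """1 / rank of the highest-scored leaked item, or 0.0 if there is none."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            return 1.0 / k
    return 0.0

def mean_reciprocal_rank(rankings):
    """Average reciprocal rank over a list of (scores, labels) pairs."""
    return sum(reciprocal_rank(s, y) for s, y in rankings) / len(rankings)

# Hypothetical toy data: 1 = leaked, 0 = not leaked.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 0, 1, 0]
print(roc_auc(scores, labels))            # 4 of 6 pairs ordered correctly
print(average_precision(scores, labels))  # (1/1 + 2/4) / 2 = 0.75
print(reciprocal_rank(scores, labels))    # first leaked item is at rank 1
```

ROC-AUC weighs every leaked/clean pair equally, whereas average precision and MRR reward placing leaked items at the very top of the ranking, so two methods can trade places across these metrics.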