One Ruler: A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tuebingen with a Parameter-Free Compression Baseline

2026-06-24 Wietse Stienstra
arXiv:2606.23767v1 Abstract: Headline accuracies on the Tuebingen cause-effect pairs are routinely compared across papers even though each is measured under its authors' own protocol: different pair subsets, weightings, model-selection procedures, and decision rates. We argue this is the wrong comparison and run the right one: a same-hands re-evaluation in which we run every method ourselves on the identical 102 pairs under one strict rule, with no tuning and a decision forced on every pair. As a clean reference point we introduce a deliberately minimal baseline, sorted-conditional compression, which feeds quantized, sorted, first-differenced data to an off-the-shelf compressor (bz2) and has zero fitted parameters. Under the common ruler, the ranking differs substantially from the one in the literature. Our baseline reaches 74.7% weighted accuracy (p = 3.7e-7); on the same 100 pairs on which SLOPE is evaluated it scores 76.0%, 1.2 percentage points below the authors' own forced-decision SLOPE (77.2%), a gap well within noise (McNemar p = 0.39). A faithful re-run of RECI lands at 70.7%, inside the original authors' reported error bar, not the 77.5% often quoted (which we trace to a mis-copied cell). SLOPE's published 82.4% is a decided-subset figure: scoring the authors' own stored output only on the pairs its significance test chose to answer reproduces 81.7%. Under the common ruler, the methods cluster in the mid-to-high 70s (percent accuracy), and the zero-parameter compressor ties the strongest of them. We document the mechanisms that inflate published figures (test-set model selection, significance-gated abstention) and contribute two further results: compression score magnitude is a model-free confounding flag (p = 2.8e-68), and a pre-registered falsification test fails in an instructive way that bounds the method's theoretical interpretation. Code, pre-registration, and per-pair outputs are released.
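To make the baseline concrete, here is a minimal Python sketch of one way a sorted-conditional compression score could be computed. It is not the authors' released code, and the abstract does not specify the details. The 256-level min-max quantization, the 16-bit encoding of the differences, the function names (`quantize`, `conditional_cost`, `infer_direction`), and the decision rule (prefer the direction with the smaller compressed size, ties going to "x->y") are all assumptions made for illustration. Only the ingredients named in the abstract are taken from the source: quantize, sort, first-difference, compress with bz2, fit no parameters, and force a decision.

```python
import bz2

import numpy as np


def quantize(values, levels=256):
    """Map values to integer bins 0..levels-1 by min-max scaling (assumed scheme)."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    if span == 0:
        return np.zeros(v.shape, dtype=np.int64)
    return np.floor((v - v.min()) / span * (levels - 1) + 0.5).astype(np.int64)


def conditional_cost(sort_key, target, levels=256):
    """Bytes bz2 needs for `target`, ordered by `sort_key`, after first-differencing."""
    order = np.argsort(sort_key, kind="stable")
    q = quantize(target, levels)[order]
    diffs = np.diff(q)  # each difference lies in [-(levels-1), levels-1]
    payload = diffs.astype("<i2").tobytes()  # fixed little-endian 16-bit encoding
    return len(bz2.compress(payload))


def infer_direction(x, y, levels=256):
    """Forced decision between 'x->y' and 'y->x' (assumed rule: smaller cost wins)."""
    cost_y_given_x = conditional_cost(x, y, levels)
    cost_x_given_y = conditional_cost(y, x, levels)
    direction = "x->y" if cost_y_given_x <= cost_x_given_y else "y->x"
    return direction, cost_y_given_x, cost_x_given_y


# Usage on synthetic data (illustrative only; not a Tuebingen pair).
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=500)
y = x ** 3 + 0.1 * rng.normal(size=500)
print(infer_direction(x, y))
```

The sketch returns a decision for every pair and never abstains, matching the forced-decision rule stated in the abstract. Whether the smaller compressed size should indicate the causal direction is an assumption here and should be checked against the released code.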