Skip to content

AN-013: Precision@k audit — temporal holdout vs in-sample

Intuition (plain-language)

In-sample precision is inflated because the screen has already seen the data it is graded on. Under an honest train-on-past, test-on-future split, precision@500 falls from 0.132 to 0.070 (53% retained) and lift from 11.5× to 6.1×. The honest operational number is about half the headline — and the paper reports both columns side by side, resting its claims on the lower one. Roughly half the in-sample ranking power came from cases already under investigation when the data were generated.

Question

What are the temporal-holdout precision@k and lift metrics, and how much does the in-sample evaluation inflate operational numbers? The audit documents the gap honestly.

Design

  • Sample: 16,843 always-losers in BEC 2009–2019.
  • Split: train 2009–2016, test 2017–2019.
  • Metrics: precision@k, recall@k, lift in both columns (in-sample, temporal holdout); retention = (temporal / in-sample).

Results

k precision@k (in-sample) precision@k (temporal holdout) retention lift (in-sample) lift (TH)
50 0.300 0.020 7% 26.2× 1.7×
100 0.170 0.070 41% 14.8× 6.1×
200 0.160 0.076 48% 14.0× 6.6×
500 0.132 0.070 53% 11.5× 6.1×
1000 0.097 0.066 68% 8.5× 5.8×

Recall@k (temporal holdout): @500 = 18.1%; @1000 = 34.2%.

Macros: \valPrecInSFivehu (0.132), \valPrecTHFivehu (0.070), \valLiftInSFivehu (11.5×), \valLiftTHFivehu (6.1×), \valOpRetentionFiveHund (53%), \valOpInflationShare (47%), \valOpRecallFiveHund (18%), \valOpRecallThou (34%).

AN-013 precision@k — in-sample vs temporal holdout

Figure: precision@k under in-sample evaluation (navy) vs temporal- holdout evaluation (red), k = 50 to 1,000. The gap is largest at k=50 (0.300 vs 0.020) and narrows at high k. At k=500, in-sample is 0.132 and temporal-holdout is 0.070 — retention 53%, the operational calibration number.

Interpretation

Verdict: INFLATED in-sample. Operational deployment metrics are roughly half the in-sample upper bounds:

  • precision@500 retains 53% under temporal holdout (0.070 vs 0.132);
  • lift retains 53% (6.1× vs 11.5×);
  • recall@500 is 18% (vs ~34% in-sample at the same k).

Source of inflation: ~47% of the top-500 ranking comes from 2017–2019 participation, after CADE investigation was already underway for some of the cartels. The screen is therefore half prospective, half retrospective in the in-sample regime.

The paper reports both columns in the operational metrics table; the text relies on the temporal-holdout column for the operational claim. AUC firm-level under temporal holdout (0.864, see AN-014) is the honest discriminating performance number.

Follow-ups

  • Compare with strict prospective-only deployment under alternative train windows (AN-006).
  • Headcount analysis at k = 500 cutoff (35 cobidders flagged operationally).
  • Sensitivity to k under each evaluation regime.