Superseded numbers — canonical-target re-estimation (June 4, 2026)

This analysis note documents a historical run under the earlier validation label. On June 4, 2026 the paper adopted a reproducible, non-circular target (651 always-loser cobidders; frequent-loser flag never used in the label) and re-estimated every result. Where this page conflicts with the paper or the changelog, the paper wins.

AN-036: Cross-validation precision stability

Intuition (plain-language)

Are the precision numbers a fluke of one lucky train/test split? Cross-validation says no: precision@k standard deviations are small in absolute terms (0.001–0.011 across k = 50–2,000), with coefficients of variation between 9% and 31%. The operational claim, that the ranked list concentrates investigative value, is stable across resamples and is not an artifact of one partition. A regulator drawing a priority list would get a list with materially the same hit rate each time.

Question

Are the precision@k metrics stable across cross-validation folds, or do they depend on a specific random split? The temporal-holdout audit (AN-013) gives one split (train 2009–2016 / test 2017–2019); cross-validation tests whether the precision numbers are sensitive to the particular partition.

Design

Sample: 16,843 always-loser firms in BEC 2009–2019, cobidder set N+ = 193.
K-fold partitioning: at the cobidder-firm level (firms in test fold do not appear in train fold).
Per-fold: train FL ranking on K-1 folds, evaluate precision@k on held-out fold.
Aggregation: precision_mean and precision_sd across folds; n_pos average per fold.
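
The design above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline behind output/operational/audit_precision_k_cv.csv: the function names, the `fit_ranker` callback, and the default of five folds are hypothetical (the note does not state K), and the toy data at the bottom is made up.

```python
import random
from statistics import mean, stdev


def precision_at_k(scores, labels, k):
    """Share of positives among the k highest-scored firms."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(labels[f] for f in top) / k


def firm_level_folds(firms, n_folds, seed=0):
    """Shuffle firms once and deal them into disjoint folds."""
    shuffled = list(firms)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cv_precision(firms, labels, fit_ranker, k, n_folds=5, seed=0):
    """Return (mean, sd, per-fold list) of precision@k over held-out folds."""
    folds = firm_level_folds(firms, n_folds, seed)
    per_fold = []
    for i, test in enumerate(folds):
        train = [f for j, fold in enumerate(folds) if j != i for f in fold]
        # Hypothetical callback: fit on training firms only, return firm -> score.
        score = fit_ranker(train, {f: labels[f] for f in train})
        held_out_scores = {f: score(f) for f in test}
        per_fold.append(precision_at_k(held_out_scores, labels, k))
    return mean(per_fold), stdev(per_fold), per_fold


# Toy data (made up): 200 firms, every 10th one is a positive.
toy_firms = list(range(200))
toy_labels = {f: int(f % 10 == 0) for f in toy_firms}


def toy_ranker(train, train_labels):
    # Stand-in for the real ranking model: ignores the training data entirely.
    return lambda firm: 1.0 if firm % 5 == 0 else 0.0


p_mean, p_sd, p_folds = cv_precision(toy_firms, toy_labels, toy_ranker, k=10)
```

Because the folds are dealt from a single shuffle of firm identifiers, no firm can appear in both the training and the held-out side of any split, which is the property the partitioning step requires.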

Results

k	precision_mean	precision_sd	recall_mean	n_pos_avg	CV coef
50	0.068	0.011	8.8%	3.4	16%
100	0.042	0.011	10.9%	4.2	26%
250	0.034	0.011	22.3%	8.6	31%
500	0.028	0.006	36.2%	14.0	22%
1,000	0.021	0.004	53.3%	20.6	21%
2,000	0.016	0.001	83.4%	32.2	9%

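The coefficient of variation (SD / mean, the "CV coef" column) is not monotone in k: it rises from 16% at k=50 to 31% at k=250, then falls to 9% at k=2,000. The decline at larger k is expected, since more cobidders are sampled into the top-k cell and the variance shrinks.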
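
The derived columns can be cross-checked from the rounded values printed in the table. The sketch below is illustrative only: it recomputes SD / mean, the implied number of hits (precision_mean × k), and the implied number of positives per held-out fold (n_pos_avg / recall_mean). Ratios recomputed from rounded inputs can differ by a few points from the printed column, which was presumably computed from unrounded values. Reading the implied positives per fold as 193 cobidders split over five folds is an inference; the note does not state K.

```python
# (k, precision_mean, precision_sd, recall_mean, n_pos_avg) copied from the table.
rows = [
    (50, 0.068, 0.011, 0.088, 3.4),
    (100, 0.042, 0.011, 0.109, 4.2),
    (250, 0.034, 0.011, 0.223, 8.6),
    (500, 0.028, 0.006, 0.362, 14.0),
    (1000, 0.021, 0.004, 0.533, 20.6),
    (2000, 0.016, 0.001, 0.834, 32.2),
]

checks = []
for k, p_mean, p_sd, recall, n_pos in rows:
    checks.append({
        "k": k,
        "coef_var": p_sd / p_mean,          # SD / mean from rounded inputs
        "implied_hits": p_mean * k,         # should be close to n_pos_avg
        "n_pos_avg": n_pos,
        "pos_per_fold": n_pos / recall,     # positives available in a held-out fold
    })

for c in checks:
    print(f"k={c['k']:>5}  coef_var={c['coef_var']:.1%}  "
          f"hits={c['implied_hits']:.1f} vs {c['n_pos_avg']:.1f}  "
          f"pos_per_fold={c['pos_per_fold']:.1f}")

# k at which the recomputed coefficient of variation peaks.
peak_k = max(checks, key=lambda c: c["coef_var"])["k"]
```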

Source: output/operational/audit_precision_k_cv.csv.

Figure: K-fold CV precision@k with ± 1 SD error bars, across k = 50 to 2,000. Precision_mean declines from 0.068 (k=50) to 0.016 (k=2000); SDs are tight (≤ 0.011) across all k. Precision estimates are not artifacts of a particular train/test split.

Interpretation

Two readings:

CV precision SDs are tight. At k=50, precision = 0.068 ± 0.011 means fold-to-fold precision varies roughly within [0.057, 0.079], tight enough that the in-sample, temporal-holdout, and CV regimes give a consistent operational read. The 0.070 temporal-holdout precision@500 (AN-013) does not sit inside the CV one-SD band at k=500 ([0.022, 0.034], around a mean of 0.028), because the two are computed differently. The temporal precision averages across the full 142 test-period cobidders; the CV precision averages across cobidder-folds. Both are internally stable; the magnitudes differ because the denominators differ.
The precision drop from in-sample to operational is robust under CV. In-sample precision@500 = 0.132 (full panel, N+ = 193); CV precision@500 = 0.028; temporal precision@500 = 0.070. The CV number is lower than the temporal-holdout number because CV is the more aggressive test: it disrupts the temporal information and the firm-history information simultaneously, while the temporal split preserves firm history and disrupts only time. The temporal-holdout number is the relevant operational metric for enforcement deployment (firms have history; timing is what changes between train and inference).

The CV stability test confirms that the operational precision numbers are not artifacts of a particular split. The triangulation across in-sample, temporal-holdout, and CV regimes gives the bounded precision band relevant for operational deployment.
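
The triangulation at k = 500 reduces to a few lines of arithmetic. The sketch below uses only the figures quoted in this note; the variable names are illustrative. The distance in SD units is descriptive only, because the temporal-holdout value is not a draw from the fold distribution.

```python
cv_mean, cv_sd = 0.028, 0.006        # K-fold CV precision@500 (mean, SD)
temporal, in_sample = 0.070, 0.132   # temporal-holdout and in-sample precision@500

# One-SD band around the CV mean.
band = (cv_mean - cv_sd, cv_mean + cv_sd)
temporal_in_band = band[0] <= temporal <= band[1]

# How far the temporal value sits above the CV mean, in CV-SD units.
sd_gap = (temporal - cv_mean) / cv_sd

# Share of the in-sample headline retained under each out-of-sample regime.
retained_cv = cv_mean / in_sample
retained_temporal = temporal / in_sample

print(f"CV one-SD band: [{band[0]:.3f}, {band[1]:.3f}]")
print(f"temporal inside band: {temporal_in_band}")
print(f"gap in CV SDs: {sd_gap:.1f}")
print(f"retained vs in-sample: CV {retained_cv:.0%}, temporal {retained_temporal:.0%}")
```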

For H:gatekeeping-cost-of-evidence, this is the stability check on the precision claim: under K-fold CV at the cobidder-firm level, precision_mean is in [0.016, 0.068] across k = 50 to 2,000, with SDs in [0.001, 0.011]. The cost-of-evidence claim relies on precision@500 = 0.07 (temporal holdout), which lies between the CV mean (0.028) and the in-sample headline (0.132) and is neither extreme.

Follow-ups

Direct comparison of CV-fold precision distribution with temporal-holdout single-split value at each k (KS-test or permutation).
Stratified CV by modality (Convite vs Pregão) to test architecture-level stability.
Cross-modality precision SD comparison.
Add macros \valCVPrecKFifty (= 0.068), \valCVPrecKFiveHund (= 0.028), \valCVPrecKOnek (= 0.021) to the scripts/99_make_paper_values.R pipeline.