Skip to content

Robustness

Intuition (plain-language)

Each check here exists to kill a boring explanation for the headline result — "it's just bidding volume," "it's overfitting," "it's a lucky threshold," "it's one weird year." The discrimination and architecture claims survive all of them, and what degrades does so predictably rather than catastrophically. The page also keeps the failures in plain sight: the price-as-damages reading does not survive overlap discipline, and the difference-in-differences designs are underpowered. A screen worth trusting is one whose limits are mapped as carefully as its strengths.

This page summarizes the robustness checks supporting each of the four empirical claims of the paper.


1. Discrimination Robustness (Claim 1: AUC under temporal holdout)

Leakage audit (table moved to main text)

The decomposition of the in-sample item-level AUC of 0.995 into structural and pure-leakage components is reported in the main text, not the appendix. This is the primary anti-tautology check.

Audit AUC Drop from in-sample Interpretation
In-sample (full pool) 0.995 Reference
5-fold CV at cobidder-firm level 0.891 \(-0.104\) Removes cobidder-included tautology
Temporal holdout (train 2009–2016) 0.864 \(-0.131\) Removes both tautology and same-period over-fitting
Same score, direct-CADE label (scope) 0.506 (random) Confirms loser-side, not winner-side

The 0.864 figure is the AUC the paper reports as headline.

Year-by-year temporal holdout (Online Appendix B)

Rolling-origin temporal holdout: training window 2009 through (test year \(-\)1), test year provides out-of-sample cobidder ground truth. Year-by-year AUCs reported in Online Appendix B; ROC curves separate cleanly from the diagonal across all six test years.

Adversarial robustness (Online Appendix F)

Monte Carlo simulation with adversarial adaptation (cover bidders strategically increase or decrease participation to evade the screen) confirms the screen retains discrimination AUC > 0.75 under realistic adaptation. The construct degrades predictably, not catastrophically.

Formal sham permutation against volume-only null

B = 2,000 sham permutations of the FL14 label preserving the marginal tenders_count distribution. Sham AUC distribution: mean 0.500, SD 0.013, q99 0.531, max 0.547. Observed FL14 AUC of 0.924 is ~32 σ above the sham mean (permutation p < 1/2,000; rejects volume-only null at 99%). The price coefficient under the same sham draws does not reject (observed +0.064 vs sham mean +0.144, p = 0.989) — the second piece of the scope-not-damages reading.

Universe-anchored scope matrix

8-row matrix varies the universe (always-losers vs full BEC firm registry) and the positive class (cobidders vs direct CADE defendants) systematically. The matrix confirms the score targets exactly what the loser-side framing predicts (AUC 0.924/0.939 on cobidders inside always-losers) and actively repels what it should not target (participation count on full BEC vs direct CADE = 0.383, below random). Rules out a "generic detector" reading.

Cross-validation precision stability

K-fold CV of precision@k at cobidder-firm level. Precision_mean across k = 50 to 2,000: [0.016, 0.068]; SDs [0.001, 0.011]; CV coefficient < 31% across all k. The precision estimates are not artifacts of a particular train/test split.


2. Architecture Robustness (Claim 2: 83% footprint reduction)

Sequential gatekeeper under temporal holdout

The architecture's data-envelope reduction is computed both in-sample and under temporal holdout. The 83% headline reduction is in-sample; under temporal holdout (smaller post-2016 cobidder pool), the analogous footprint reduction is recomputed and reported in Online Appendix B (Table tab_architecture_gatekeeper_th):

Quantity In-sample Temporal holdout
Pool size before triage 11,676 8,257
Cobidders to recover 193 142
Pool after triage (top-\(k\) rule) 1,985 similar relative reduction
Sequential rule recall 68% retained

The architecture's qualitative claim — substantial reduction in forensic pool with high recall preservation — survives the holdout.

Complementarity robustness

Complementarity with Imhof–Wallimann is verified on the same-sample audit (where all detectors evaluate the identical subset of firms for which Imhof features can be computed because bid microdata are available). The +0.035 AUC contribution is reported on this same-sample basis with DeLong \(p = 0.014\).


3. Frequent-Loser Construct Robustness (Claim 1 supporting)

IQR threshold sensitivity

The continuous primitive \(\log(1+\text{tenders\_count})\) is the score; the binary FL flag is its information-coarsening. The headline AUC is monotone in threshold across multiple cutoffs:

Multiplier FL Firms Discrimination preserved
1.0× 3,442 Yes
1.5× (baseline; med + 1.5 × IQR) 2,735 Yes
2.0× 2,093 Yes
3.0× 1,456 Yes (smaller pool)

Under continuous specification, the headline AUC is preserved across thresholds because the ranking is by the continuous score; the binary cutoff is an operational coarsening, not an identification step.

Horse-race against continuous score

A horse race between the binary FL flag and the continuous \(\log(1+\text{tenders\_count})\) shows the continuous score dominates discrimination. Under DeLong test, continuous-only AUC of 0.939 (in-sample) dominates the binary FL14 flag (0.924); the gap is 0.015 under the corrected FL14 (\(\geq 14\)) definition (DeLong \(Z = -4.38\), \(p = 1.2 \times 10^{-5}\); the earlier \(p = 1.7 \times 10^{-5}\) was computed under the superseded \(>14\) / FL15 cut). The framework treats the continuous primitive as the identification object; the binary rule is the deployable coarsening.

Cutoff-sweep robustness (full cutoff sweep)

Sweeping the FL cutoff (tenders_count \(> k\)) from k = 2 through k = 100 produces a clean inverted-U in cobidder AUC. Because the sweep indexes cutoffs as \(> k\), the paper's FL14 (\(\geq 14\)) is the \(> 13\) row — the peak of the sweep at AUC 0.924 (2,735 firms, 193 cobidders); the \(> 14\) (FL15) cutoff sits one bucket tighter at 0.911. The paper's IQR-defined choice (median + 1.5 × IQR statistic = 13.5) sits at the top of a broad high plateau, not on a fragile peak.

Subsample robustness (4 subsamples × 4 scores)

AUC across full / data-rich / low-bid-count / high-bid-count subsamples crossed with FL14 / continuous / Imhof CV / Imhof log_sd. FL14 AUC stays in the 0.89–0.91 band across all four subsamples; continuous AUC dominates in every cell. The signal does not track bid-microdata richness — important cross-cut for the cost-of-evidence and complementarity claims.

Matched-heterogeneity audit (honest negative)

Within-FL14 quadrant heterogeneity (HHI × pairs) under PS matching on tenders_count largely does not survive: Low-HHI × Low-pairs cell drops from unmatched +0.090 (p = 0.048) to matched +0.066 (p = 0.227). The first-time-FL channel similarly attenuates from +0.10 unconditional to +0.062 (p = 0.31) under PS matching. Reported transparently as the boundary of what the within-FL distinctness can claim — most of it is a volume effect; what survives matching is the bid-level conduct signature (median gap-to-winner d = −0.28).


4. Pricing Imprint Robustness (Claim 4: descriptive corroboration)

The paper does not rest on the pricing imprint. These checks are reported for transparency.

Specification stability (Tier-3 corroboration)

Check Coefficient \(N\) Note
OLS (item + year + PBU FE) 0.064 1,654,401 Baseline
CEM matching 0.077 969,751 Coarsened exact matching
IPW matching 0.055 830,194 Inverse probability weighting
Cross-fit 0.036 1,654,401 FL on odd years, regression on even (and v.v.)
Item × year FE 0.074 1,654,401 Tighter controls
Two-way clustering (item + PBU) 0.064 (SE = 0.024) 1,654,401 Significant under all clustering

Sensitivity to unobservables

Metric Value Interpretation
Cinelli–Hazlett \(RV_{q=1}\) 17.5% Confounder needs to explain ≥17.5% of residual variation in both FL and prices to nullify
Oster \(\hat{\delta}\) degenerate (261.6) PBU FE barely move \(R^2\) — design strength

Sign-reversal decomposition (Online Appendix B)

Tables tab_item_level_scope_match and tab_sign_reversal_decomp in Online Appendix B carry the overlap-restricted ATT estimates and the cell-dropping decomposition. Key facts:

  • Only ~1% of treated items lack a within-cell counterfactual and are strictly dropped under overlap restriction.
  • The remaining 99% participate in both broad and overlap estimates.
  • What changes is which untreated counterfactuals each estimator up-weights — a reweighting result, not a dropped-counterfactual result.
  • Within-quintile decomposition: Q4 positive (\(+0.046\) broad, \(+0.041\) ATT), Q1–Q3 negative.

5. Heterogeneity Robustness (Claim 4 corroboration)

Buyer-size gradient stability

The 12.5× extreme-quartile gradient (Q1 vs Q4) is preserved across alternative buyer-size measures:

  • Annual contract volume
  • Cumulative item count
  • Headcount of distinct purchasers

The intermediate quartiles (Q2, Q3) are imprecisely estimated and not statistically distinguishable from each other or from Q4. The pattern is direction-preserving across measures.

The 0.952 / 0.816 pregão / convite AUC contrast is corroborated by:

  • Bootstrap difference: \(-0.136\) (\(p \approx 0\))
  • Sample-size caveat: convite-modal cobidder pool is small (6 firms) — directional indicator only
  • Within-modal price coefficient: \(+0.089\) pregão vs \(+0.037\) convite

We treat the modal contrast as scope information for the screening object (the construct discriminates better in pregão environments) — not as a positive test of any institutional channel.


6. Identification Audits (Online Appendix B)

Permutation null

The participation-stratified permutation null behind the conservative pre-2020 benchmark places the empirical excess ratio of 3.2× in the upper tail of the null distribution at \(p < 0.001\). Random reassignment of cobidder labels to firms with comparable participation intensity does not produce comparable discrimination.

Leave-one-out IV placebo

A leave-one-out instrumental-variable placebo confirms that the construct's pricing imprint draws on the above-threshold cover-bidder population, not on generic always-loser supply. Random subsamples of below-threshold always-losers do not produce comparable discrimination.

CADE-exclusion robustness

Dropping the CADE-involved tender-items and re-estimating the within-PBU baseline yields \(\hat{\beta}\) virtually unchanged (\(N = 1,453,954\)). The screening signal does not depend on within-sample CADE adjudications for its content.

Direct-defendant within-firm exercise

Among 7 always-loser direct CADE defendants, 3 (43%) cross the frequent-loser threshold against a population baseline of 16% — these are the exception within a direct-defendant population that is otherwise frequent-winner-heavy. The asymmetry between cobidder discrimination (high) and direct-defendant discrimination (random) is the design's empirical signature of loser-side scope.


7. Operational Metrics: Why In-Sample Over-States

The operational-metrics temporal-holdout audit (main text, Table tab_operational_metrics) reports holdout precision/recall/lift as the headline, with in-sample as transparency:

Top-\(k\) Holdout Lift In-sample Lift In-sample inflation
500 6.1× 11.5× ~50% over-stated
1,000 5.8× 8.5× ~32% over-stated

Roughly 47% of the in-sample top-500 ranking comes from 2017–2019 participation, after CADE adjudications were already underway for some cartels. The screen is half prospective, half retrospective in-sample. Holdout column is the operational reference.


8. Staggered DiD Reported But Not Used (Online Appendix F)

The staggered difference-in-differences specifications attempted (Callaway–Sant'Anna, stacked DiD, TWFE event study) are reported as an honest accounting of failed-but-attempted designs rather than as supporting evidence:

Specification Coefficient Status
Callaway & Sant'Anna ATT 0.014 (SE = 0.039) Insignificant
Stacked DiD \(-0.006\) (SE = 0.014) Pre-trends preclude causal reading

Minimum-detectable-effect calculations show observed-to-MDE ratios well below one — the null findings are structurally underpowered, not refutations of the underlying mechanism. The paper's contribution does not rest on staggered DiD.


Robustness Summary

Claim Robustness check Result
Discrimination (AUC 0.864) Leakage audit; year-by-year holdout; adversarial simulation Survives all
Architecture (83% footprint) Holdout gatekeeper; complementarity DeLong Survives both
Construct (FL flag) IQR threshold sweep; horse race vs continuous Survives all
Pricing imprint (descriptive) Cinelli RV; cross-fit; CEM/IPW matching; sign-reversal decomposition Reported descriptively; sign reversal disclosed
Heterogeneity (buyer-size) Alternative buyer-size measures; modal-asymmetry caveats Direction-preserving; modal asymmetry flagged as scope info

The paper's primary contributions (Claims 1–3) survive all robustness checks. Claim 4 (pricing imprint) is reported descriptively with full disclosure of identification limits.