Robustness

This page summarizes the robustness checks supporting each of the four empirical claims of the paper.


1. Discrimination Robustness (Claim 1: AUC under temporal holdout)

Leakage audit (table moved to main text)

The decomposition of the in-sample item-level AUC of 0.995 into structural and pure-leakage components is reported in the main text, not the appendix. This is the primary anti-tautology check.

| Audit | AUC | Drop from in-sample | Interpretation |
|---|---|---|---|
| In-sample (full pool) | 0.995 | Reference | |
| 5-fold CV at cobidder-firm level | 0.891 | \(-0.104\) | Removes cobidder-included tautology |
| Temporal holdout (train 2009–2016) | 0.864 | \(-0.131\) | Removes both tautology and same-period over-fitting |
| Same score, direct-CADE label (scope) | 0.506 (random) | | Confirms loser-side, not winner-side |

The paper reports the temporal-holdout AUC of 0.864 as its headline figure.

Year-by-year temporal holdout (Online Appendix B)

Rolling-origin temporal holdout: the training window runs from 2009 through the year before the test year, and the test year provides the out-of-sample cobidder ground truth. Year-by-year AUCs are reported in Online Appendix B; the ROC curves separate cleanly from the diagonal in all six test years.
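The rolling-origin design can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names are hypothetical, the three test years are placeholders (the page states only that there are six), and the AUC is computed with the standard pairwise-rank definition.

```python
def rolling_origin_splits(first_train_year, test_years):
    """Yield (train_years, test_year); training always ends the year before the test year."""
    for test_year in test_years:
        yield list(range(first_train_year, test_year)), test_year


def rank_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Placeholder test years (assumption); the training window starts in 2009 as in the text.
splits = list(rolling_origin_splits(2009, [2017, 2018, 2019]))
for train_years, test_year in splits:
    print(f"train {train_years[0]}-{train_years[-1]} -> test {test_year}")

# Toy scores and cobidder labels, only to show the AUC call.
toy_auc = rank_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

In each split the score would be fitted on `train_years` only and `rank_auc` evaluated against cobidder labels that first appear in `test_year`.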

Adversarial robustness (Online Appendix F)

Monte Carlo simulation with adversarial adaptation (cover bidders strategically increase or decrease participation to evade the screen) confirms the screen retains discrimination AUC > 0.75 under realistic adaptation. The construct degrades predictably, not catastrophically.


2. Architecture Robustness (Claim 2: 83% footprint reduction)

Sequential gatekeeper under temporal holdout

The architecture's data-envelope reduction is computed both in-sample and under temporal holdout. The 83% headline reduction is in-sample; under temporal holdout (smaller post-2016 cobidder pool), the analogous footprint reduction is recomputed and reported in Online Appendix B (Table tab_architecture_gatekeeper_th):

| Quantity | In-sample | Temporal holdout |
|---|---|---|
| Pool size before triage | 11,676 | 8,257 |
| Cobidders to recover | 193 | 142 |
| Pool after triage (top-\(k\) rule) | 1,985 | similar relative reduction |
| Sequential rule recall | 68% | retained |

The architecture's qualitative claim — substantial reduction in forensic pool with high recall preservation — survives the holdout.
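The 83% headline follows directly from the in-sample pool sizes in the table. A minimal check, with a hypothetical helper name:

```python
def footprint_reduction(pool_before, pool_after):
    """Share of the forensic pool removed by triage."""
    return 1.0 - pool_after / pool_before


# In-sample pool sizes from the table above.
in_sample_reduction = footprint_reduction(11_676, 1_985)
print(f"In-sample footprint reduction: {in_sample_reduction:.1%}")
```

The holdout analogue cannot be recomputed from this page, because the post-triage holdout pool is described only as a "similar relative reduction".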

Complementarity robustness

Complementarity with Imhof–Wallimann is verified on the same-sample audit, in which all detectors are evaluated on the identical subset of firms: those with bid microdata available, for which the Imhof features can be computed. The +0.035 AUC contribution is reported on this same-sample basis with DeLong \(p = 0.014\).


3. Frequent-Loser Construct Robustness (Claim 1 supporting)

IQR threshold sensitivity

The continuous primitive \(\log(1+\text{tenders\_count})\) is the score; the binary FL flag is its information-coarsening. The headline AUC is monotone in threshold across multiple cutoffs:

| Multiplier | FL firms | Discrimination preserved |
|---|---|---|
| 1.0× | 3,442 | Yes |
| 1.5× (baseline; med + 1.5 × IQR) | 2,735 | Yes |
| 2.0× | 2,093 | Yes |
| 3.0× | 1,456 | Yes (smaller pool) |

Under continuous specification, the headline AUC is preserved across thresholds because the ranking is by the continuous score; the binary cutoff is an operational coarsening, not an identification step.
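The relationship between the continuous score and the binary flag can be sketched as follows. This is an illustration under stated assumptions: the cutoff is taken as median + multiplier × IQR of the log score (the page does not say whether the rule is applied to the raw count or the log score), the quartile method is Python's "inclusive" interpolation, and the firm counts are toy data.

```python
import math
import statistics


def fl_cutoff(scores, multiplier=1.5):
    """Cutoff = median + multiplier * IQR of the supplied scores (assumed rule)."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return median + multiplier * (q3 - q1)


def fl_flags(tenders_counts, multiplier=1.5):
    """Return the continuous score log(1 + tenders_count) and the binary FL flag."""
    scores = [math.log1p(c) for c in tenders_counts]
    cutoff = fl_cutoff(scores, multiplier)
    return scores, [s > cutoff for s in scores]


# Toy participation counts (hypothetical), not the paper's data.
counts = [1, 2, 2, 3, 3, 4, 5, 8, 40, 120]
flagged = {m: sum(fl_flags(counts, m)[1]) for m in (1.0, 1.5, 2.0, 3.0)}
print(flagged)
```

Raising the multiplier can only shrink the flagged set, while the ranking by `scores` is untouched, which is why the cutoff acts as a coarsening rather than an identification step.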

Horse-race against continuous score

A horse race between the binary FL flag and the continuous \(\log(1+\text{tenders\_count})\) shows the continuous score dominates discrimination. Under DeLong test, continuous-only AUC of 0.939 (in-sample) is statistically larger than binary-only AUC of 0.911 (\(p = 1.7 \times 10^{-5}\)). The framework treats the continuous primitive as the identification object; the binary rule is the deployable coarsening.
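For readers unfamiliar with the DeLong comparison of two correlated AUCs, a compact pure-Python sketch is given below. It follows the standard placement-value construction; the function names and the toy data are hypothetical, and the paper's own implementation may differ. The function raises `ZeroDivisionError` if the two scores have identical placement values.

```python
from statistics import NormalDist


def _psi(x, y):
    """Pairwise kernel: 1 if positive outranks negative, 0.5 on ties, else 0."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _placements(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    v10 = [sum(_psi(p, q) for q in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, q) for p in pos) / len(pos) for q in neg]
    return v10, v01


def _cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(scores_a, scores_b, labels):
    """Two-sided DeLong test for AUC(a) - AUC(b) on the same sample."""
    a10, a01 = _placements(scores_a, labels)
    b10, b01 = _placements(scores_b, labels)
    m, n = len(a10), len(a01)
    auc_a, auc_b = sum(a10) / m, sum(b10) / m
    var = ((_cov(a10, a10) + _cov(b10, b10) - 2 * _cov(a10, b10)) / m
           + (_cov(a01, a01) + _cov(b01, b01) - 2 * _cov(a01, b01)) / n)
    z = (auc_a - auc_b) / var ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return auc_a, auc_b, z, p_value


# Toy data: a continuous score and its binary coarsening at an arbitrary cutoff of 2.0.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
continuous = [4.1, 3.2, 2.9, 1.5, 2.5, 1.8, 1.2, 1.0, 0.9, 0.7]
binary = [1 if s > 2.0 else 0 for s in continuous]
auc_cont, auc_bin, z_stat, p_val = delong_test(continuous, binary, labels)
```

Because the binary flag collapses all scores on each side of the cutoff into ties, its AUC can never exceed that of the continuous score it is derived from; the test quantifies whether the loss is statistically distinguishable from zero.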


4. Pricing Imprint Robustness (Claim 4: descriptive corroboration)

The paper does not rest on the pricing imprint. These checks are reported for transparency.

Specification stability (Tier-3 corroboration)

| Check | Coefficient | \(N\) | Note |
|---|---|---|---|
| OLS (item + year + PBU FE) | 0.064 | 1,654,401 | Baseline |
| CEM matching | 0.077 | 969,751 | Coarsened exact matching |
| IPW matching | 0.055 | 830,194 | Inverse probability weighting |
| Cross-fit | 0.036 | 1,654,401 | FL defined on odd years, regression on even years (and vice versa) |
| Item × year FE | 0.074 | 1,654,401 | Tighter controls |
| Two-way clustering (item + PBU) | 0.064 (SE = 0.024) | 1,654,401 | Significant under all clustering |

Sensitivity to unobservables

| Metric | Value | Interpretation |
|---|---|---|
| Cinelli–Hazlett \(RV_{q=1}\) | 17.5% | Confounder needs to explain ≥17.5% of residual variation in both FL and prices to nullify |
| Oster \(\hat{\delta}\) | degenerate (261.6) | PBU FE barely move \(R^2\) (design strength) |
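The robustness value has a closed form in the partial Cohen's \(f\) of the treatment, \(RV_q = \tfrac{1}{2}\bigl(\sqrt{f_q^4 + 4 f_q^2} - f_q^2\bigr)\) with \(f_q = q\,|f|\), which can be inverted as \(f_q^2 = RV_q^2 / (1 - RV_q)\). The sketch below implements both directions; the helper names are hypothetical, and the page does not report the partial \(f\) that the paper actually used.

```python
import math


def robustness_value(partial_f, q=1.0):
    """Cinelli-Hazlett RV_q from the treatment's partial Cohen's f."""
    fq = q * abs(partial_f)
    return 0.5 * (math.sqrt(fq ** 4 + 4 * fq ** 2) - fq ** 2)


def partial_f_from_rv(rv, q=1.0):
    """Invert the formula: f_q^2 = RV^2 / (1 - RV)."""
    return math.sqrt(rv ** 2 / (1 - rv)) / q


# Partial f implied by the reported RV of 17.5% (a back-of-envelope inversion).
implied_f = partial_f_from_rv(0.175)
round_trip = robustness_value(implied_f)
print(f"implied partial f = {implied_f:.3f}, RV = {round_trip:.3f}")
```

Under this formula, an \(RV_{q=1}\) of 17.5% corresponds to a partial \(f\) of about 0.19.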

Sign-reversal decomposition (Online Appendix B)

Tables tab_item_level_scope_match and tab_sign_reversal_decomp in Online Appendix B carry the overlap-restricted ATT estimates and the cell-dropping decomposition. Key facts:

  • Only ~1% of treated items lack a within-cell counterfactual and are strictly dropped under overlap restriction.
  • The remaining 99% participate in both broad and overlap estimates.
  • What changes is which untreated counterfactuals each estimator up-weights — a reweighting result, not a dropped-counterfactual result.
  • Within-quintile decomposition: Q4 positive (\(+0.046\) broad, \(+0.041\) ATT), Q1–Q3 negative.

5. Heterogeneity Robustness (Claim 4 corroboration)

Buyer-size gradient stability

The 12.5× extreme-quartile gradient (Q1 vs Q4) is preserved across alternative buyer-size measures:

  • Annual contract volume
  • Cumulative item count
  • Headcount of distinct purchasers

The intermediate quartiles (Q2, Q3) are imprecisely estimated and not statistically distinguishable from each other or from Q4. The pattern is direction-preserving across measures.

The 0.952 / 0.816 pregão / convite AUC contrast is corroborated by:

  • Bootstrap difference: \(-0.136\) (\(p \approx 0\))
  • Sample-size caveat: convite-modal cobidder pool is small (6 firms) — directional indicator only
  • Within-modal price coefficient: \(+0.089\) pregão vs \(+0.037\) convite

We treat the modal contrast as scope information for the screening object (the construct discriminates better in pregão environments) — not as a positive test of any institutional channel.


6. Identification Audits (Online Appendix B)

Permutation null

The participation-stratified permutation null behind the conservative pre-2020 benchmark places the empirical excess ratio of 3.2× in the upper tail of the null distribution at \(p < 0.001\). Random reassignment of cobidder labels to firms with comparable participation intensity does not produce comparable discrimination.
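A participation-stratified permutation null of this kind can be sketched as below. Everything here is illustrative: the excess ratio is assumed to be the cobidder share among flagged firms divided by the population share, the strata stand in for participation-intensity bins, and the twelve firms are toy data.

```python
import random


def stratified_permutation_pvalue(labels, strata, statistic, n_perm=999, seed=0):
    """Shuffle labels within each stratum and compare the statistic to its observed value."""
    rng = random.Random(seed)
    observed = statistic(labels)
    by_stratum = {}
    for idx, stratum in enumerate(strata):
        by_stratum.setdefault(stratum, []).append(idx)
    exceed = 0
    for _ in range(n_perm):
        permuted = list(labels)
        for idxs in by_stratum.values():
            shuffled = [labels[i] for i in idxs]
            rng.shuffle(shuffled)
            for i, lab in zip(idxs, shuffled):
                permuted[i] = lab
        if statistic(permuted) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (1 + exceed) / (1 + n_perm)


# Toy data (hypothetical): two participation bins, a flag, and cobidder labels.
strata = ["low"] * 6 + ["high"] * 6
flagged = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0]
cobidder = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0]


def excess_ratio(labels):
    """Cobidder share among flagged firms relative to the overall share (assumed definition)."""
    share_flagged = sum(l for l, f in zip(labels, flagged) if f) / sum(flagged)
    return share_flagged / (sum(labels) / len(labels))


observed_ratio, p_value = stratified_permutation_pvalue(cobidder, strata, excess_ratio)
```

Shuffling within strata holds participation intensity fixed, so the null distribution reflects only the reassignment of cobidder labels among comparably active firms.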

Leave-one-out IV placebo

A leave-one-out instrumental-variable placebo confirms that the construct's pricing imprint draws on the above-threshold cover-bidder population, not on generic always-loser supply. Random subsamples of below-threshold always-losers do not produce comparable discrimination.

CADE-exclusion robustness

Dropping the CADE-involved tender-items and re-estimating the within-PBU baseline leaves \(\hat{\beta}\) virtually unchanged (\(N = 1{,}453{,}954\)). The screening signal does not depend on within-sample CADE adjudications for its content.

Direct-defendant within-firm exercise

Among 7 always-loser direct CADE defendants, 3 (43%) cross the frequent-loser threshold against a population baseline of 16% — these are the exception within a direct-defendant population that is otherwise frequent-winner-heavy. The asymmetry between cobidder discrimination (high) and direct-defendant discrimination (random) is the design's empirical signature of loser-side scope.


7. Operational Metrics: Why In-Sample Over-States

The operational-metrics temporal-holdout audit (main text, Table tab_operational_metrics) reports holdout precision, recall, and lift as the headline figures, with in-sample figures shown for transparency:

| Top-\(k\) | Holdout lift | In-sample lift | In-sample inflation |
|---|---|---|---|
| 500 | 6.1× | 11.5× | ~50% over-stated |
| 1,000 | 5.8× | 8.5× | ~32% over-stated |

Roughly 47% of the in-sample top-500 ranking comes from 2017–2019 participation, after CADE adjudications were already underway for some cartels. In-sample, the screen is therefore roughly half prospective and half retrospective. The holdout column is the operational reference.
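One reading of the inflation column is the share of the in-sample lift that does not survive the holdout. That definition is an assumption; it reproduces the ~32% figure at top-1,000 and gives about 47% at top-500 from the rounded lifts, in line with the approximate ~50% shown in the table.

```python
# top-k: (holdout lift, in-sample lift), copied from the table above.
lifts = {500: (6.1, 11.5), 1000: (5.8, 8.5)}


def overstatement_share(holdout, in_sample):
    """Share of the in-sample lift that disappears under the temporal holdout."""
    return (in_sample - holdout) / in_sample


shares = {k: overstatement_share(h, i) for k, (h, i) in lifts.items()}
for k, share in shares.items():
    print(f"top-{k}: {share:.0%} of in-sample lift is not retained")
```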


8. Staggered DiD Reported But Not Used (Online Appendix F)

The staggered difference-in-differences specifications attempted (Callaway–Sant'Anna, stacked DiD, TWFE event study) are reported as an honest accounting of failed-but-attempted designs rather than as supporting evidence:

| Specification | Coefficient | Status |
|---|---|---|
| Callaway & Sant'Anna ATT | 0.014 (SE = 0.039) | Insignificant |
| Stacked DiD | \(-0.006\) (SE = 0.014) | Pre-trends preclude causal reading |

Minimum-detectable-effect calculations show observed-to-MDE ratios well below one — the null findings are structurally underpowered, not refutations of the underlying mechanism. The paper's contribution does not rest on staggered DiD.
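The observed-to-MDE comparison can be reproduced approximately from the table. The sketch assumes the textbook convention \(\text{MDE} = (z_{1-\alpha/2} + z_{\text{power}}) \times SE\) with a two-sided 5% test and 80% power; the paper's exact convention is not stated on this page.

```python
from statistics import NormalDist


def minimum_detectable_effect(se, alpha=0.05, power=0.80):
    """Smallest effect detectable at the given size and power, for a given standard error."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se


# (coefficient, standard error) from the table above.
estimates = {
    "Callaway & Sant'Anna ATT": (0.014, 0.039),
    "Stacked DiD": (-0.006, 0.014),
}
ratios = {
    name: abs(coef) / minimum_detectable_effect(se)
    for name, (coef, se) in estimates.items()
}
for name, ratio in ratios.items():
    print(f"{name}: observed/MDE = {ratio:.2f}")
```

Under these assumptions both ratios come out below 0.2, consistent with the statement that the designs are underpowered.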


Robustness Summary

| Claim | Robustness check | Result |
|---|---|---|
| Discrimination (AUC 0.864) | Leakage audit; year-by-year holdout; adversarial simulation | Survives all |
| Architecture (83% footprint) | Holdout gatekeeper; complementarity DeLong | Survives both |
| Construct (FL flag) | IQR threshold sweep; horse race vs continuous | Survives all |
| Pricing imprint (descriptive) | Cinelli RV; cross-fit; CEM/IPW matching; sign-reversal decomposition | Reported descriptively; sign reversal disclosed |
| Heterogeneity (buyer-size) | Alternative buyer-size measures; modal-asymmetry caveats | Direction-preserving; modal asymmetry flagged as scope info |

The paper's primary contributions (Claims 1–3) survive all robustness checks. Claim 4 (pricing imprint) is reported descriptively with full disclosure of identification limits.