
AN-026: Subsample robustness (data-rich vs low/high bid counts)

Intuition (plain-language)

Does the result only work for firms with rich bid histories? If so it would inherit the data dependence of the expensive screens it claims to replace. It doesn't: FL14 AUC stays in a tight band (0.887–0.912) across data-rich, low-bid, and high-bid subsamples. That data-independence is the whole operational selling point — the award-layer screen runs on cheap administrative records whether or not bid microdata exists.

Question

Does the cobidder concentration result survive across always-loser sub-populations defined by bid-microdata availability? This is the right cross-cut for the cost-of-evidence argument (H:gatekeeping-cost-of-evidence) and for the complementarity claim (H:award-bid-complementarity): if the FL signal were merely picking up bid-data availability, AUC would track availability across the subsamples. It does not.

Design

  • Full sample: 16,843 always-losers; 193 cobidders.
  • Subsamples:
      • full: all 16,843 firms.
      • data_rich: firms with bid microdata available (N = 12,575; 193 cobidders).
      • low_n_bids: firms with low bid count (N = 11,134; 132 cobidders).
      • high_n_bids: firms with high bid count (N = 5,709; 61 cobidders).
  • Scores evaluated per subsample:
      • is_fl: FL14 binary indicator;
      • tenders_count: continuous log_tc;
      • imhof_cv: bid coefficient of variation;
      • imhof_log_sd: log SD of bids.
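
The design above amounts to computing one AUC per (subsample, score) cell. A minimal Python sketch of that loop follows, assuming each firm is a record holding the score columns and a binary cobidder label. The names is_cobidder, has_bids, and n_bids, the toy records, and the low/high cutoff of 5 bids are illustrative assumptions, not taken from the source pipeline. AUC is computed with the standard rank-based (Mann-Whitney) formula using midranks for ties, which matters for the binary is_fl score.

```python
def auc(scores, labels):
    """Rank-based AUC: P(pos > neg) + 0.5 * P(tie), using midranks for ties."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += midrank * sum(lab for _, lab in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auc_by_subsample(firms, masks, score_cols, label_col="is_cobidder"):
    """Return {(subsample, score): AUC} for every subsample mask and score column."""
    out = {}
    for sub_name, keep in masks.items():
        rows = [f for f in firms if keep(f)]
        labels = [f[label_col] for f in rows]
        for col in score_cols:
            out[(sub_name, col)] = auc([f[col] for f in rows], labels)
    return out


# Toy records (hypothetical values, for illustration only).
TOY_FIRMS = [
    {"is_fl": 1, "log_tc": 3.2, "has_bids": True,  "n_bids": 9, "is_cobidder": 1},
    {"is_fl": 1, "log_tc": 2.9, "has_bids": True,  "n_bids": 2, "is_cobidder": 1},
    {"is_fl": 0, "log_tc": 1.1, "has_bids": True,  "n_bids": 8, "is_cobidder": 0},
    {"is_fl": 0, "log_tc": 0.7, "has_bids": False, "n_bids": 1, "is_cobidder": 0},
    {"is_fl": 1, "log_tc": 1.5, "has_bids": True,  "n_bids": 3, "is_cobidder": 0},
    {"is_fl": 0, "log_tc": 0.4, "has_bids": False, "n_bids": 7, "is_cobidder": 0},
]

BID_CUTOFF = 5  # assumed low/high split; the note does not state the real cutoff
MASKS = {
    "full":        lambda f: True,
    "data_rich":   lambda f: f["has_bids"],
    "low_n_bids":  lambda f: f["n_bids"] < BID_CUTOFF,
    "high_n_bids": lambda f: f["n_bids"] >= BID_CUTOFF,
}

results = auc_by_subsample(TOY_FIRMS, MASKS, ["is_fl", "log_tc"])
for (sub, score), value in sorted(results.items()):
    print(f"{sub:12s} {score:7s} {value:.3f}")
```

For a binary score such as is_fl, this AUC equals the average of sensitivity and specificity, so the FL14 column can be read as balanced accuracy of the indicator within each subsample.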

Results

AUC × subsample × score (selected; full table in source CSV):

Subsample     FL14    log_tc   Imhof CV   Imhof log_sd   N        N+ (cobidders)
full          0.924   0.939    0.854      0.790          16,843   193
data_rich     0.887   0.918    0.840      0.781          12,575   193
low_n_bids    0.909   0.933    0.847      0.765          11,134   132
high_n_bids   0.912   0.962    0.865      0.824           5,709    61

All AUCs come with 95% bootstrap CIs in the source CSV. Selected CIs:

  • FL14 full: [0.898, 0.925]
  • log_tc full: [0.932, 0.946]
  • log_tc high_n_bids: [0.954, 0.970], a narrow interval (width 0.016) in the high-bid-count cell, where the score discriminates most.
  • FL14 high_n_bids: [0.875, 0.950] — widest in the smallest subsample.

Source: output/auc_by_subsample/auc_subsample.csv.
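
The note reports 95% bootstrap CIs but does not say which bootstrap variant produced them. As a hedged illustration, the sketch below shows one common choice: a percentile bootstrap that resamples firms with replacement. The function name bootstrap_auc_ci, the defaults n_boot=2000 and seed=0, and the non-stratified resampling are all assumptions, not a description of the source pipeline. The pair-counting AUC is only suitable for small inputs.

```python
import random


def auc(scores, labels):
    """AUC as P(pos > neg) + 0.5 * P(tie), by direct pair counting (small inputs only)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for AUC, resampling observations with replacement."""
    n = len(scores)
    if not 0 < sum(labels) < n:
        raise ValueError("need both classes in the original sample")
    rng = random.Random(seed)
    draws = []
    while len(draws) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one of the classes
            draws.append(auc([scores[i] for i in idx], ys))
    draws.sort()
    alpha = (1 - level) / 2
    lo = draws[int(alpha * (n_boot - 1))]
    hi = draws[int(round((1 - alpha) * (n_boot - 1)))]
    return lo, hi


# Synthetic example: 20 positives with higher mean score, 80 negatives.
gen = random.Random(1)
y = [1] * 20 + [0] * 80
s = [gen.gauss(1.5, 1.0) for _ in range(20)] + [gen.gauss(0.0, 1.0) for _ in range(80)]
low, high = bootstrap_auc_ci(s, y, n_boot=500)
print(f"AUC = {auc(s, y):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Interval width depends mainly on the number of positives in a cell, which is consistent with the FL14 interval being much wider in high_n_bids (61 cobidders) than in the full sample (193 cobidders).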

AUC across four subsamples × four scores

Figure: AUC point estimates with 95% CIs across the four subsamples (full / data_rich / low_n_bids / high_n_bids) crossed with the four scores (FL14, log_tc, Imhof CV, Imhof log_sd). FL14 AUC is stable in the 0.89–0.91 band across the three bid-data subsamples; log_tc dominates in every cell; the Imhof scores sit below the award-layer scores throughout.

Interpretation

Four readings, all supporting H1:

  1. FL signal does not depend on bid-microdata richness. FL14 AUC is 0.924 in the full sample, 0.887 in data-rich, 0.909 in low-bid, and 0.912 in high-bid. The range across the three subsamples is 0.025 (0.037 if the full sample is included). The signal carries across the data-richness margin: the screen is not a proxy for "we have lots of bid data on this firm".

  2. Continuous score dominates in every subsample. log_tc AUC exceeds FL14 in all four cells (0.939 > 0.924 full; 0.918 > 0.887 data-rich; 0.933 > 0.909 low-bid; 0.962 > 0.912 high-bid). The horse-race result of AN-011 is not driven by any one subsample.

  3. Imhof CV is informative but lower across the board. Imhof CV AUC is 0.84–0.87 across subsamples: well above chance, but consistently below the award-layer scores. This is the same complementarity pattern as AN-010 and supports the "two layers operate at different evidentiary stages" framing.

  4. High-bid-count cell is the tightest discriminator. log_tc AUC reaches 0.962 in the high-bid-count subsample. This is where the joint scoring of AN-010 approaches its full-observability ceiling. The cell is small (N = 5,709) but gives the cleanest read.
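
The arithmetic behind readings 1 to 3 can be checked directly from the point estimates in the Results table. The short Python sketch below hard-codes those reported values (it does not read output/auc_by_subsample/auc_subsample.csv, whose column layout is not shown here) and recomputes the FL14 range and the dominance comparisons. The dictionary layout and the helper name auc_range are illustrative.

```python
# Point estimates copied from the Results table of this note.
AUC = {
    "full":        {"FL14": 0.924, "log_tc": 0.939, "imhof_cv": 0.854, "imhof_log_sd": 0.790},
    "data_rich":   {"FL14": 0.887, "log_tc": 0.918, "imhof_cv": 0.840, "imhof_log_sd": 0.781},
    "low_n_bids":  {"FL14": 0.909, "log_tc": 0.933, "imhof_cv": 0.847, "imhof_log_sd": 0.765},
    "high_n_bids": {"FL14": 0.912, "log_tc": 0.962, "imhof_cv": 0.865, "imhof_log_sd": 0.824},
}
SUBSAMPLES = ["data_rich", "low_n_bids", "high_n_bids"]  # excludes the full sample


def auc_range(score, cells):
    """Max minus min AUC of one score over the given cells, rounded to 3 decimals."""
    vals = [AUC[c][score] for c in cells]
    return round(max(vals) - min(vals), 3)


fl14_range_sub = auc_range("FL14", SUBSAMPLES)   # 0.912 - 0.887
fl14_range_all = auc_range("FL14", list(AUC))    # 0.924 - 0.887

# Reading 2: log_tc beats FL14 in every cell.
log_tc_dominates = all(AUC[c]["log_tc"] > AUC[c]["FL14"] for c in AUC)

# Reading 3: Imhof CV sits below both award-layer scores in every cell.
imhof_below = all(
    AUC[c]["imhof_cv"] < min(AUC[c]["FL14"], AUC[c]["log_tc"]) for c in AUC
)

print(fl14_range_sub, fl14_range_all, log_tc_dominates, imhof_below)
# prints: 0.025 0.037 True True
```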

The subsample sweep is the cross-cut that the JLEO reviewer will ask for: "does your result come from the same firms that have rich bid data?" The answer is no — the screen carries across the data-availability margin.

Follow-ups

  • Decomposition by procurement modality crossed with bid-data richness.
  • Sub-period × subsample crossover.
  • Add macros \valAUCSubsampleFull, \valAUCSubsampleDataRich, \valAUCSubsampleLowBids, \valAUCSubsampleHighBids (per score) to scripts/99_make_paper_values.R.