Replication¶
Everything is reproducible, from the raw data to the final PDF. The frequent-loser construct is built entirely from contract-award records, and the analytical pipeline runs on a single workstation in roughly ten minutes.
Replication Package¶
The replication archive contains the analysis pipeline (R + Python), the construction scripts for the derived datasets, and the LaTeX sources for the manuscript. It will be released through the journal's data-availability portal at publication.
The repository is organized under paper3-frequent-losers/ in the source monorepo:
| Directory | Contents |
|---|---|
| scripts/ | Python data-build scripts (numbered 00_*.py) and R analysis scripts (numbered 01_*.R through 64_*.R) |
| data/processed/ | Parquet files produced by the data build (git-ignored; built from raw inputs) |
| output/tables/ | All regression tables in LaTeX |
| output/figures/ | All figures in PDF |
| work/v17-editor/ | LaTeX manuscript sources for the current version (v17) |
| work/v16-editor/ | v16 byte-frozen as recoverable hedge |
Software Requirements¶
Python (data build)¶
| Software | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Data preprocessing |
| pandas | latest | Data manipulation |
| pyarrow | latest | Parquet I/O |
R (analysis pipeline)¶
| Software | Version | Purpose |
|---|---|---|
| R | 4.5+ | Statistical computing |
| fixest | latest | High-dimensional fixed effects, IV, two-way clustering |
| data.table | latest | Fast data manipulation |
| arrow | latest | Parquet readers |
| modelsummary | latest | Regression tables |
| kableExtra | latest | LaTeX table formatting |
| ggplot2 | latest | Figures |
| sensemakr | latest | Cinelli–Hazlett robustness value |
| MatchIt | latest | CEM matching, IPW, common-support trimming |
| pROC | latest | ROC curves and DeLong AUC tests |
| randomForest | latest | Imhof full-pipeline classifier |
Manuscript¶
| Software | Version | Purpose |
|---|---|---|
| LaTeX | TeX Live 2024+ | Document typesetting |
| elsarticle | latest | Document class |
| chicago | latest | Bibliography style (natbib + bibtex) |
Data Sources¶
Primary datasets¶
The pipeline reads from data/processed/:
| File | Rows | Description |
|---|---|---|
| firm_loss_stats.parquet | 41,444 | Per-firm aggregated stats (win rate, tenders count, always-loser flag) |
| firm_tender_map.parquet | 16.8M | Firm × tender-item participation with win flag |
| LOSERS_rebuilt.parquet | 85K | Frequent-loser counts per (OC, item) |
| FREQ_PARTICIP_rebuilt.parquet | 16,843 | Always-losers with participation counts |
| BEC_collapse_final.parquet | 4.5M | Collapsed tender-item dataset |
| Firms_final.parquet | 39.6K | Firm registry (CNPJ, CNAE, size, location) |
| bid_level_full.parquet | 40M | Raw bid-level data (used by some robustness checks) |
CADE validation data¶
| File | Rows | Description |
|---|---|---|
| cade_carteis_licitacoes_2009_2019.csv | 65 | CADE cartel-defendant firm records relevant to the BEC sample |
| cade_bec_crossmatch.csv | 47 | CADE-defendant firms active in BEC during the sample window |
| cade_fl_cobidders.csv | 193 | Always-loser firms that co-participate with CADE defendants in the same tender-items |
Data access
The BEC procurement records are publicly available through the São Paulo state transparency portal. CADE adjudications are public records hosted at gov.br/cade. The processed Parquet files are built from raw .dta and .csv exports via scripts/00_build_bidlevel.py.
Key variables¶
| Variable | Description |
|---|---|
| losers | Frequent-loser presence indicator (1 if at least one FL participant in the tender-item) |
| lneg_price | Log negotiated unit price (headline DV) |
| ln_firms, ln_bids, ln_firms_excl | Auxiliary outcomes (number of firms, number of bids, number of non-FL firms) |
| tenders_count | Per-firm participation count over the sample window |
| win_rate | Per-firm wins divided by participations |
| item_f, year_f, pbu_f | Item, year, and procuring-unit fixed effects |
| convite | Procurement modality indicator (1 = convite, 0 = pregão) |
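The per-firm variables above (tenders_count, win_rate, and the always-loser flag) are simple aggregates of the firm × tender-item participation records. A minimal sketch in plain Python, assuming a list of (firm, tender_item, won) tuples; the actual build aggregates firm_tender_map.parquet with pandas/pyarrow, and the function name here is illustrative:

```python
from collections import defaultdict

# Sketch: derive per-firm stats (tenders_count, win_rate, always-loser flag)
# from (firm, tender_item, won) participation records. The real pipeline
# does this over firm_tender_map.parquet; names here are hypothetical.
def firm_loss_stats(records):
    raw = defaultdict(lambda: {"tenders_count": 0, "wins": 0})
    for firm, _tender_item, won in records:
        raw[firm]["tenders_count"] += 1
        raw[firm]["wins"] += int(won)
    stats = {}
    for firm, s in raw.items():
        stats[firm] = {
            "tenders_count": s["tenders_count"],
            "win_rate": s["wins"] / s["tenders_count"],
            "always_loser": s["wins"] == 0,  # FL candidate: never wins
        }
    return stats

records = [
    ("A", 1, True), ("A", 2, False),
    ("B", 1, False), ("B", 2, False), ("B", 3, False),
]
stats = firm_loss_stats(records)
```

Firm A wins one of two participations (win_rate 0.5); firm B never wins and is flagged as an always-loser.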
Running the Analysis¶
Step 1 — data build (Python, ~6 min)¶
```bash
cd paper3-frequent-losers
python3 scripts/00_build_bidlevel.py
```
Only needed when the Parquet files in data/processed/ are missing. The script reads the raw exports from BEC and writes the four primary parquets plus the bid-level archive.
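A quick way to decide whether this step is needed is to check for the primary parquets before running the build. A sketch under the directory layout above; the guard itself is illustrative, not part of the scripts:

```python
from pathlib import Path

# Sketch: the data build only needs to run when primary parquets are
# missing from data/processed/. File names come from the tables above.
PRIMARY = [
    "firm_loss_stats.parquet",
    "firm_tender_map.parquet",
    "BEC_collapse_final.parquet",
    "Firms_final.parquet",
]

def missing_parquets(processed_dir):
    root = Path(processed_dir)
    return [f for f in PRIMARY if not (root / f).exists()]

# Non-empty result -> run scripts/00_build_bidlevel.py first.
missing = missing_parquets("data/processed")
```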
Step 2 — analysis pipeline (R, ~10 min on 16 cores)¶
```bash
cd paper3-frequent-losers
Rscript scripts/00_master.R
```
The master script runs the analysis blocks in order. The blocks below correspond to the v17 manuscript sections:
| Block | Scripts | Manuscript section |
|---|---|---|
| Data prep + FL classification | 01_clean.R | §4 (data) |
| Headline regressions (4 specs, 4 DVs) | 02_analysis.R, 03_tables.R | §6.1 (broad-sample β) |
| Figures (FL distribution, IQR, oversight gradient) | 04_figures.R, 41_fix_figures.R | §6, appendix |
| Threshold and clustering robustness | 05_robustness.R | §9.1 |
| Modal falsification + within-PBU oversight | 07_heterogeneity.R | §6.1, §6.3 |
| Auxiliary outcomes | 08_additional_dvs.R | §6 (auxiliary) |
| CEM, IPW, IV diagnostics | 09_matching.R | §9.1, §9.2 |
| FL firm characteristics | 10_fl_characteristics.R | §4.4 |
| Item-value panel + RDD diagnostic | 12_build_item_value.R, 13_rdd_cap.R | §5.1 |
| DiD with 2018 cap raise | 14_did_decreto_2018.R | §5.1, appendix F |
| Imhof full-pipeline horse race | 31_imhof_full_pipeline.R, 49_imhof_incremental_value.R | §10 (architecture) |
| Direct-CADE AUC (47 defendants) | 33_auc_direct_cade.R | §7, §10.2 |
| Horse race binary vs continuous | 34_horse_race_fl_continuous.R, 36_gate_d1_harmonized.R | §8.1, §9.1 |
| Modal-by-modal AUC (gate D2) | 37_gate_d2_modal_auc.R | §8.2 |
| Continuous loser-side discrimination (gate D3) | 38_gate_d3_continuous_only.R | §8.1 |
| CADE winner-heavy diagnostic (gate D4) | 39_gate_d4_cade_winner_heavy.R | §7 |
| Anti-leakage audit | 40_leakage_audit_d3.R | §9.2 |
| Operational metrics (in-sample + temporal holdout) | 42_operational_metrics.R, 43_precision_at_k_audit.R | §9.3 |
| Strict-overlap matching | 51_item_level_scope_match.R | §6.2, §9.2 |
| Strict-train threshold (temporal) | 53_strict_train_period_threshold.R | §4.2 |
| Threshold table (Q3 IQR alt) | 54_threshold_table_q3iqr.R | §9.1, appendix |
| Sign-reversal segment decomposition (Q4 finding + trim sensitivity) | 61_sign_reversal_segment_decomp.R | §6.2 (segment-level decomposition) |
| Bid-level theory bridge (R1/R2 distinction) | 62_theory_bridge_bidlevel.R | §7 (within-stratum bid-level bridge) |
| Architecture / sequential gatekeeper (precision/recall envelope) | 63_architecture_gatekeeper.R | §10 (gatekeeper rule) |
| Temporal-holdout audit of the gatekeeper | 64_gatekeeper_temporal_holdout.R | §10 (out-of-time robustness) |
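The orchestration pattern behind 00_master.R is run-in-order, stop-on-first-failure, so a failed block never feeds stale inputs to the blocks after it. A sketch of that pattern in Python for illustration (the real master is an R script that sources each block); the stand-in command below replaces Rscript so the example is self-contained:

```python
import subprocess
import sys

# Sketch: run each analysis block in order; abort at the first failure
# so downstream blocks never consume stale or partial inputs.
def run_pipeline(scripts, cmd=("Rscript",)):
    done = []
    for script in scripts:
        result = subprocess.run([*cmd, script])
        if result.returncode != 0:
            raise RuntimeError(f"block failed: {script}")
        done.append(script)
    return done

# Example with a harmless stand-in command instead of Rscript:
run_pipeline(["01_clean.R"], cmd=(sys.executable, "-c", "pass"))
```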
Step 3 — manuscript compilation¶
```bash
cd paper3-frequent-losers/work/v17-editor
pdflatex -interaction=nonstopmode paper_v17editor_online_appendix.tex
bibtex paper_v17editor_online_appendix
pdflatex paper_v17editor_online_appendix.tex
pdflatex paper_v17editor.tex
bibtex paper_v17editor
pdflatex paper_v17editor.tex
pdflatex paper_v17editor.tex
pdflatex paper_v17editor_online_appendix.tex
```
Bibliography and cross-references
The manuscript uses natbib + bibtex (not biblatex/biber). Run bibtex for the bibliography pass. The xr package cross-references between main paper and online appendix require the OA .aux to exist before the main paper compiles for the first time, hence the OA-then-main sequence above.
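The xr wiring that forces this compile order would look roughly like the following in the main paper's preamble (a sketch; the exact preamble lines in paper_v17editor.tex may differ):

```latex
% Sketch of the xr setup assumed by the compile order above: the main
% paper reads labels from the online appendix's .aux file, so the OA
% must have been compiled at least once before the main paper.
\usepackage{xr}
\externaldocument{paper_v17editor_online_appendix}
```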
The outputs are paper_v17editor.pdf (47 pages, body) and paper_v17editor_online_appendix.pdf (20 pages, online appendix containing the four formal results, supporting tables, identification audits, threshold robustness, generalisation audits, adversarial-adaptation simulation, and the staggered-design failures).
Output Files¶
Tables¶
output/tables/ contains 30+ LaTeX table fragments (booktabs + threeparttable). The tables actually used by the v17 manuscript and appendix are those referenced by \input{...} directives in the manuscript source.
Figures¶
output/figures/ contains the PDF figures. The two figures included in the published manuscript are:
| Figure | File | Manuscript section |
|---|---|---|
| FL participation distribution | fig_01_losses_distribution.pdf | Appendix B |
| IQR threshold identification | fig_02_iqr_identification.pdf | Appendix B |
Manuscript¶
| File | Pages | Description |
|---|---|---|
| work/v17-editor/paper_v17editor.pdf | 47 | Main manuscript (current version) |
| work/v17-editor/paper_v17editor_online_appendix.pdf | 20 | Online appendix |
| work/v16-editor/paper_v16editor.pdf | 47 | v16 byte-frozen as recoverable hedge |
Computational Environment¶
| Component | Specification |
|---|---|
| OS | Ubuntu 24.04 (WSL2 on Windows) |
| CPU | 14 threads (Intel i7-1260P, 4 P-cores + 8 E-cores) |
| RAM | 21 GB |
| R | 4.5+ |
| fixest | OpenMP, 12 threads (saturating CPU within the workstation budget) |
| data.table | Multi-threaded via setDTthreads(12) |
| Parquet | DuckDB engine (PRAGMA threads=12, memory_limit='14GB') for any non-trivial read/filter/join |
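The thread and memory caps named in the table are ordinary DuckDB settings; as a configuration fragment they would be issued at the top of a session, before any heavy query (the scan path below is illustrative):

```sql
-- Thread and memory caps from the environment table, issued before
-- any non-trivial read/filter/join over the parquets.
PRAGMA threads=12;
PRAGMA memory_limit='14GB';
-- Example scan (path illustrative):
-- SELECT count(*) FROM read_parquet('data/processed/BEC_collapse_final.parquet');
```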
Runtime
The full pipeline takes approximately 10 minutes on the reference workstation. The most time-intensive steps are the bid-level horse race (31_imhof_full_pipeline.R) and the strict-overlap matching (51_item_level_scope_match.R), each of which processes a few million rows under random forests or propensity-score trimming.
Caching
Intermediate data is cached at /tmp/ for fast reload. The prepared dataset (/tmp/p3_prepared.rds) is created by 01_clean.R and reused by all subsequent scripts. Cache invalidation is manual: delete the rds files when the underlying parquets change.
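The manual invalidation rule (delete the rds when the parquets change) can be stated as a staleness check: the cache is stale if it is missing or older than any source parquet. A sketch under the paths named above; the check itself is illustrative, not part of the pipeline:

```python
from pathlib import Path

# Sketch of the manual cache-invalidation rule as an automatic check:
# the cached rds is stale if it is missing, or if any source parquet
# has been modified more recently than the cache.
def cache_is_stale(cache_path, source_paths):
    cache = Path(cache_path)
    if not cache.exists():
        return True
    cache_mtime = cache.stat().st_mtime
    return any(
        Path(p).exists() and Path(p).stat().st_mtime > cache_mtime
        for p in source_paths
    )

# e.g. cache_is_stale("/tmp/p3_prepared.rds",
#                     ["data/processed/BEC_collapse_final.parquet"])
```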
Reproducibility Note¶
The numerical claims in the paper come from a single pipeline run on the canonical 2009–2019 BEC parquet files plus the CADE crossmatch CSV. The pipeline applies a strict 14-character CNPJ zero-padding convention before every merge: this padding is required to recover the full 193 FL–CADE co-bidder set, and alternative conventions yield a smaller subset that appears in earlier draft tables. The convention is flagged at the top of scripts/00_build_bidlevel.py so future replications reproduce either count deterministically.
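The padding convention itself is a one-liner: raw exports that route CNPJs through numeric columns silently drop leading zeros, and left-padding to 14 characters restores the canonical merge key. A minimal sketch (the function name is illustrative; the actual convention lives in scripts/00_build_bidlevel.py):

```python
# The 14-character CNPJ zero-padding convention applied before every
# merge. Numeric columns in raw exports drop leading zeros; zfill(14)
# restores the canonical 14-character key.
def pad_cnpj(raw):
    return str(raw).strip().zfill(14)

assert pad_cnpj(191) == "00000000000191"  # numeric column dropped the zeros
assert pad_cnpj("00000000000191") == "00000000000191"  # already canonical
```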
The headline AUC of 0.924 against the 193 co-bidders corresponds to a point estimate of 0.9389 on the always-loser pool of 16,843 firms; the 95% confidence interval [0.921, 0.926] uses DeLong's method, with stratified percentile and BCa bootstrap intervals (B = 5,000 and 2,000 respectively) agreeing to three decimal places. The temporal-holdout AUC of 0.864 (train 2009–2016, test 2017–2019) is computed in 40_leakage_audit_d3.R. The full sequence of commits that produced the final tables and figures is preserved in the replication archive's git log.
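For readers checking these numbers, the AUC is the probability that a randomly chosen positive (e.g. a CADE co-bidder) scores above a randomly chosen negative, with ties counted half. A minimal sketch of that Mann–Whitney form; this is the quantity pROC's DeLong machinery works with, though the CI computation is not reproduced here:

```python
# Minimal AUC sketch (Mann-Whitney form): fraction of positive/negative
# pairs where the positive scores higher, counting ties as half.
# O(n*m) on purpose for clarity; pROC uses a rank-based computation.
def auc(scores_pos, scores_neg):
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Perfect separation gives 1.0; identical scores give 0.5.
assert auc([0.9, 0.8], [0.2, 0.1]) == 1.0
assert auc([0.5], [0.5]) == 0.5
```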