Replication¶

This page describes how to reproduce every table and figure in the paper.

Replication Package¶

The package contains all code, processed datasets, and manuscript source files needed to reproduce the results. The pipeline runs in two stages: a Python data build that produces the analysis Parquets once, and an R analysis pipeline driven by a single master script.

Repository structure

Materials live under paper3-frequent-losers/. scripts/ contains the Python build and the numbered R pipeline; data/processed/ holds the Parquet inputs (git-ignored); output/tables/ and output/figures/ hold generated artifacts; the submission-clean manuscript source is in work/v20-editor/submission_clean/.

Software Requirements¶

Python (data build)¶

Software	Version	Purpose
Python	3.12	Data preprocessing
`pandas` / `pyarrow`	latest	Parquet I/O
`duckdb`	latest	Out-of-core joins and aggregation

R (analysis pipeline)¶

Software	Version	Purpose
R	4.5+	Statistical computing
`fixest`	latest	High-dimensional fixed effects (OpenMP, 12 threads)
`data.table` + `arrow`	latest	Fast manipulation and Parquet reads
`duckdb`	latest	Joins and aggregation over Parquet
`modelsummary` + `kableExtra`	latest	LaTeX regression tables
`ggplot2`	latest	Publication-quality figures
`pROC`	latest	AUC and DeLong tests
`sensemakr`	latest	Cinelli & Hazlett (2020) sensitivity
`MatchIt`	latest	CEM / IPW matching
`did`	latest	Callaway & Sant'Anna (2021) staggered DiD

Manuscript¶

Software	Version	Purpose
LaTeX	TeX Live 2024+	Typesetting
`elsarticle`	latest	Journal document class
`chicago`	latest	Bibliography style (natbib + bibtex)

Data Sources¶

Primary datasets¶

The pipeline reads from data/processed/ (built once by the Python stage):

File	Rows	Description
`BEC_collapse_final.parquet`	4.5M	Collapsed tender-item dataset
`Firms_final.parquet`	39.6K	Firm registry (CNPJ, CNAE, size, location)
`LOSERS_rebuilt.parquet`	85K	FL counts per tender-item
`FREQ_PARTICIP_rebuilt.parquet`	16.8K	Always-losers with participation counts
`firm_tender_map.parquet`	16.8M	Firm × tender participation + won flag
`firm_loss_stats.parquet`	41K	Per-firm aggregated stats (win rate, always-loser flag)
`bid_level_full.parquet`	40M	Raw bid-level data (forensic-recoverable layer)

CADE validation data¶

File	Rows	Description
`cade_carteis_licitacoes_2009_2019.csv`	65	CADE cartel convictions
`cade_bec_crossmatch.csv`	49	CADE firms matched to BEC
`cade_fl_cobidders.csv`	193	Always-loser firms co-bidding with CADE defendants

Data access

BEC procurement data is publicly available through the São Paulo state transparency portal. The processed Parquet files are built from the raw Stata .dta export via 00_build_bidlevel.py.

Frequent-loser construct¶

Step	Definition
Always-losers	Firms with `win_rate == 0` across all 2009–2019 tenders (≈16,843 firms)
IQR threshold	`median + 1.5 × IQR` of always-loser participation counts ≈ 14
Frequent losers	Always-losers above the threshold → 2,735 firms
Treatment	`losers = 1` if a tender-item has ≥1 FL participant

Threshold convention

The threshold is median + 1.5 × IQR (not the standard Tukey Q3 + 1.5 × IQR). This is intentional and is preserved across 00_build_bidlevel.py, 04_figures.R, and 05_robustness.R.

Running the Analysis¶

Step 1 — data build (Python, ~6 min, only if Parquets are missing)¶

cd paper3-frequent-losers
python3 scripts/00_build_bidlevel.py

Step 2 — full R pipeline (~8 min on 16 cores)¶

cd paper3-frequent-losers
Rscript scripts/00_master.R

The master script runs the core pipeline sequentially as subprocesses:

Script	Purpose	Key output
`01_clean.R`	Load Parquets, extract BEC keys, merge LOSERS, filter	`/tmp/p3_prepared.rds`
`02_analysis.R`	Main regressions (4 DVs × 4 specs)	`/tmp/p3_models.rds`
`03_tables.R`	LaTeX tables	`output/tables/tab_*.tex`
`04_figures.R`	FL distribution, IQR threshold, summary figures	`output/figures/fig_*.pdf`
`05_robustness.R`	Threshold sensitivity, continuous treatment, clustering, sensemakr, placebo	`tab_threshold_*.tex`
`06_did_temporal.R`	Callaway & Sant'Anna staggered DiD	`tab_did_temporal.tex`
`07_heterogeneity.R`	By procedure type, buyer size, item group	`tab_heterogeneity*.tex`
`08_additional_dvs.R`	Price ratio, procedure duration	`tab_additional_dvs.tex`
`09_matching.R`	CEM + IPW matching, balance table	`tab_matching.tex`
`10_fl_characteristics.R`	FL firm characterization (size, age, CNAE)	`tab_fl_characteristics.tex`

Validation, gate, and architecture scripts

The discrimination, leakage-audit, horse-race, gate-diagnostic, and sequential-gatekeeping results are produced by additional numbered scripts (e.g. 31_imhof_full_pipeline.R, 40_leakage_audit_d3.R, 43_precision_at_k_audit.R, and the architecture/gatekeeper scripts). Each \val* macro in values.tex carries an explicit % src: line naming the producing script.

Step 3 — manuscript compilation¶

cd paper3-frequent-losers/work/v20-editor/submission_clean
pdflatex -interaction=nonstopmode paper_submission_clean.tex
bibtex paper_submission_clean
pdflatex paper_submission_clean.tex
pdflatex paper_submission_clean.tex

Bibliography

The manuscript uses natbib + bibtex with the chicago style (NOT biblatex/biber). Use bibtex for the bibliography pass.

Reproducibility Discipline¶

Component	Specification
Macro binding	Every numeric claim is bound to a `\val*` macro in `values.tex`; no numerals are hardcoded into prose, captions, or table cells.
Provenance	Each macro carries a `% src:` comment naming the script and output CSV that produced it.
Caching	`/tmp/p3_prepared.rds` (analysis dataset) and `/tmp/p3_models.rds` (fitted models) for fast reload.
Threads	`fixest` and `data.table` use `min(detectCores(logical = FALSE), 16)`; DuckDB joins use `PRAGMA threads = 12`.

Computational Environment¶

Component	Specification
OS	WSL2 on Windows (kernel 6.6)
CPU	12–16 cores
RAM	21 GiB
R	4.5
Python	3.12

Runtime

The full pipeline takes approximately 8 minutes on the reference system. The most memory-intensive steps operate on the 40M-row bid-level data; these use DuckDB for out-of-core processing.