Replication
This page describes how to replicate the results presented in the paper.
Replication Package
The full replication package includes all code, processed datasets, and manuscript source files needed to reproduce every table and figure in the paper.
Repository Structure
The replication materials are organized in version-specific directories. The v4 directory contains the current primary analysis; v2 and v3 contain earlier implementations.
Software Requirements
Python (Data Build)
| Software | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Data preprocessing |
| `pandas` | latest | Data manipulation |
| `pyarrow` | latest | Parquet I/O |
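Before running the data build, it can help to confirm that the interpreter and packages listed above are present. The following is a minimal sketch, not part of the replication package: the function name `check_python_environment` is hypothetical, and the only facts taken from this page are the Python 3.10+ requirement and the two package names.

```python
import sys
from importlib import metadata

# Minimum interpreter version and package names taken from the table above.
MIN_PYTHON = (3, 10)
REQUIRED_PACKAGES = ["pandas", "pyarrow"]


def check_python_environment(packages=REQUIRED_PACKAGES, min_python=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    for name in packages:
        try:
            metadata.version(name)  # raises if the distribution is not installed
        except metadata.PackageNotFoundError:
            problems.append(f"missing package: {name}")
    return problems


if __name__ == "__main__":
    for line in check_python_environment() or ["environment looks OK"]:
        print(line)
```

Because the table pins packages only to "latest", the sketch checks presence rather than specific versions.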
R (Analysis Pipeline)
| Software | Version | Purpose |
|---|---|---|
| R | 4.5+ | Statistical computing |
| `fixest` | latest | High-dimensional fixed effects, IV, Sun & Abraham |
| `data.table` | latest | Fast data manipulation |
| `arrow` | latest | Reading Parquet files |
| `modelsummary` | latest | Regression tables |
| `kableExtra` | latest | LaTeX table formatting |
| `ggplot2` | latest | Publication-quality figures |
| `sensemakr` | latest | Cinelli & Hazlett (2020) sensitivity |
| `did` | latest | Callaway & Sant'Anna (2021) DiD |
| `HonestDiD` | latest | Rambachan & Roth (2023) bounds |
| `MatchIt` | latest | Matching estimators |
Manuscript
| Software | Version | Purpose |
|---|---|---|
| LaTeX | TeX Live 2024+ | Document typesetting |
| `elsarticle` | latest | Journal document class |
| `chicago` | latest | Bibliography style |
Data Sources
Primary Datasets
The pipeline reads from data/processed/:
| File | Rows | Description |
|---|---|---|
| `BEC_collapse_final.parquet` | 4.5M | Collapsed tender-item dataset |
| `Firms_final.parquet` | 39.6K | Firm registry (CNPJ, CNAE, size) |
| `LOSERS_rebuilt.parquet` | 85K | Frequent-loser (FL) counts per tender-item |
| `FREQ_PARTICIP_rebuilt.parquet` | 16.8K | Always-losers with participation counts |
| `firm_tender_map.parquet` | 16.8M | Firm x tender participation + won flag |
| `firm_loss_stats.parquet` | 41K | Per-firm aggregated stats |
| `bid_level_full.parquet` | 40M | Raw bid-level data |
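The row counts above are rounded, so a quick sanity check can only be approximate. The sketch below parses the rounded figures and compares them with the row counts stored in each file's Parquet metadata. It is illustrative only: the helper names and the 5% tolerance are assumptions, not part of the replication package, and `check_row_counts` needs `pyarrow` installed.

```python
from pathlib import Path

# Approximate row counts copied from the table above.
APPROX_ROWS = {
    "BEC_collapse_final.parquet": "4.5M",
    "Firms_final.parquet": "39.6K",
    "LOSERS_rebuilt.parquet": "85K",
    "FREQ_PARTICIP_rebuilt.parquet": "16.8K",
    "firm_tender_map.parquet": "16.8M",
    "firm_loss_stats.parquet": "41K",
    "bid_level_full.parquet": "40M",
}


def parse_count(text):
    """Turn a rounded count such as '4.5M' or '85K' into an integer."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = text.strip().upper()
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


def roughly_matches(actual, approx_text, rel_tol=0.05):
    """True if `actual` is within rel_tol (assumed 5%) of the rounded count."""
    expected = parse_count(approx_text)
    return abs(actual - expected) <= rel_tol * expected


def check_row_counts(data_dir="data/processed", approx_rows=APPROX_ROWS):
    """Compare Parquet metadata row counts with the documented figures."""
    import pyarrow.parquet as pq  # imported lazily; only this function needs it

    results = {}
    for name, approx in approx_rows.items():
        path = Path(data_dir) / name
        if not path.is_file():
            results[name] = "missing"
            continue
        n_rows = pq.ParquetFile(path).metadata.num_rows  # reads the footer only
        results[name] = "ok" if roughly_matches(n_rows, approx) else f"unexpected: {n_rows} rows"
    return results
```

Reading the count from metadata avoids loading the 40M-row bid-level file into memory.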
CADE Validation Data
| File | Rows | Description |
|---|---|---|
| `cade_carteis_licitacoes_2009_2019.csv` | 65 | CADE cartel convictions |
| `cade_bec_crossmatch.csv` | 49 | CADE firms matched to BEC |
| `cade_fl_cobidders.csv` | 193 | FL firms co-bidding with CADE cartelists |
Data Access
The BEC procurement data is publicly available through the São Paulo state transparency portal. The processed Parquet files are built from raw Stata .dta files via 00_build_bidlevel.py.
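For readers unfamiliar with the Stata-to-Parquet step, the sketch below shows the general pattern with `pandas.read_stata` and `DataFrame.to_parquet`. It is not the contents of 00_build_bidlevel.py, which may filter, merge, or rename columns; the function names and the one-to-one file mapping are assumptions made for illustration.

```python
from pathlib import Path


def parquet_path_for(dta_path, out_dir="data/processed"):
    """Map a raw .dta path to a .parquet path with the same stem (assumed naming)."""
    return Path(out_dir) / (Path(dta_path).stem + ".parquet")


def convert_dta_to_parquet(dta_path, out_dir="data/processed"):
    """Read one Stata file and write it unchanged as Parquet."""
    import pandas as pd  # imported lazily; requires pandas and pyarrow

    target = parquet_path_for(dta_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    frame = pd.read_stata(dta_path)
    frame.to_parquet(target, engine="pyarrow", index=False)
    return target
```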
Key Variables
| Variable | Description |
|---|---|
| `losers` | FL presence indicator (1 = at least one FL bidder) |
| `lneg_price` | Log negotiated price (dependent variable) |
| `ln_firms` | Log number of firms |
| `ln_bids` | Log number of bids |
| `ln_firms_excl` | Log number of non-FL firms |
| `fl_supply_loo` | Leave-one-out FL supply instrument |
| `item_f` | Item fixed effect |
| `year_f` | Year fixed effect |
| `pbu_f` | PBU fixed effect |
| `convite` | Procurement modality indicator |
Running the Analysis
Step 1: Data Build (Python, ~6 min)

    # Only needed if Parquet files are missing
    cd paper3-frequent-losers
    python3 scripts/00_build_bidlevel.py
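The "only if missing" condition can be automated. The sketch below runs the build script only when at least one of the primary Parquet files is absent. It assumes that `data/processed/` sits directly under the `paper3-frequent-losers` directory; the helper names and the `dry_run` flag are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# File names taken from the Primary Datasets table.
EXPECTED_FILES = [
    "BEC_collapse_final.parquet",
    "Firms_final.parquet",
    "LOSERS_rebuilt.parquet",
    "FREQ_PARTICIP_rebuilt.parquet",
    "firm_tender_map.parquet",
    "firm_loss_stats.parquet",
    "bid_level_full.parquet",
]


def build_needed(data_dir, expected=EXPECTED_FILES):
    """True if any expected Parquet file is absent from data_dir."""
    root = Path(data_dir)
    return any(not (root / name).is_file() for name in expected)


def run_build_if_needed(repo_root="paper3-frequent-losers", dry_run=False):
    """Run the data build only when needed; return the command, or None if skipped."""
    repo = Path(repo_root)
    if not build_needed(repo / "data" / "processed"):
        return None
    command = [sys.executable, "scripts/00_build_bidlevel.py"]
    if not dry_run:
        # cwd mirrors the `cd paper3-frequent-losers` step above.
        subprocess.run(command, cwd=repo, check=True)
    return command
```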
Step 2: Full R Pipeline (~8 min on 16 cores)

    Rscript v4/code/00_master.R
This executes 13 scripts sequentially:
| Script | Purpose | Key Output |
|---|---|---|
| `01_data_prep.R` | Load Parquets, merge, filter | /tmp/p3_prepared.rds |
| `02_network_analysis.R` | Co-bidding networks, FL classification | Network metrics |
| `03_cade_validation.R` | CADE cross-match, permutation test | tab_cade_permutation.tex |
| `04_iv_regressions.R` | 2SLS, balance tests, placebo IV | tab_iv_main.tex |
| `05_main_regressions.R` | OLS, network split, interactions | tab_prices.tex |
| `06_bajari_ye_test.R` | Exchangeability, independence, placebo | tab_bajari_ye.tex |
| `07_mechanisms.R` | Selection, calibration, reverse causality | tab_mechanisms.tex |
| `08_did_revised.R` | Callaway & Sant'Anna, Rambachan-Roth | tab_did_revised.tex |
| `09_regime_test.R` | Regime 1 vs 2 simulation | tab_regime_test.tex |
| `10_robustness.R` | Threshold, matching, clustering, CV | Multiple tables |
| `11_welfare_bounds.R` | OLS/IV/cross-fit welfare estimates | tab_welfare_bounds.tex |
| `12_tables.R` | Compile publication-ready tables | All .tex tables |
| `13_figures.R` | Generate all figures | All .pdf figures |
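When a run fails partway, it can be useful to execute the scripts one at a time and record how long each takes. The sketch below does that from Python. It is not a substitute for 00_master.R: it assumes the 13 scripts live next to the master script in `v4/code/` and can each be launched with `Rscript` from the repository root, neither of which this page confirms. The `runner` parameter is a hypothetical hook for testing.

```python
import subprocess
import time
from pathlib import Path

# Script names in the order listed in the table above.
SCRIPTS = [
    "01_data_prep.R", "02_network_analysis.R", "03_cade_validation.R",
    "04_iv_regressions.R", "05_main_regressions.R", "06_bajari_ye_test.R",
    "07_mechanisms.R", "08_did_revised.R", "09_regime_test.R",
    "10_robustness.R", "11_welfare_bounds.R", "12_tables.R", "13_figures.R",
]


def run_pipeline(code_dir="v4/code", scripts=SCRIPTS, runner=subprocess.run):
    """Run each script in order; stop at the first failure; return timings."""
    timings = []
    for name in scripts:
        start = time.perf_counter()
        # check=True makes subprocess.run raise CalledProcessError on failure.
        runner(["Rscript", str(Path(code_dir) / name)], check=True)
        timings.append((name, time.perf_counter() - start))
    return timings


if __name__ == "__main__":
    for script, seconds in run_pipeline():
        print(f"{script}: {seconds:.1f} s")
```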
Step 3: Manuscript Compilation

    cd v4/manuscript
    pdflatex -interaction=nonstopmode paper_v4.tex
    bibtex paper_v4
    pdflatex paper_v4.tex
    pdflatex paper_v4.tex
Output Files
Tables
| Directory | Format | Count | Description |
|---|---|---|---|
| `v4/output/tables/` | LaTeX (.tex) | 35 | All regression tables (booktabs + threeparttable) |
Figures
| Directory | Format | Count | Description |
|---|---|---|---|
| `v4/output/figures/` | PDF (.pdf) | 17 | All publication-quality figures |
Manuscript
| File | Pages | Description |
|---|---|---|
| `v4/manuscript/paper_v4.pdf` | 59 | Complete manuscript with appendix |
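A simple way to confirm that a run produced everything is to count the files in the two output directories and compare against the figures above (35 tables, 17 figures). The sketch below does this. The `.pdf` pattern for figures is an assumption drawn from the "All .pdf figures" entry in the script table, and the function names are hypothetical.

```python
from pathlib import Path

# Directory -> (glob pattern, expected file count), per the tables above.
EXPECTED_OUTPUTS = {
    "v4/output/tables": ("*.tex", 35),
    "v4/output/figures": ("*.pdf", 17),
}


def count_outputs(root=".", expected=EXPECTED_OUTPUTS):
    """Return {directory: (found, expected)} for each output directory."""
    report = {}
    for subdir, (pattern, want) in expected.items():
        directory = Path(root) / subdir
        found = len(list(directory.glob(pattern))) if directory.is_dir() else 0
        report[subdir] = (found, want)
    return report


def mismatches(report):
    """Keep only the directories whose file count differs from the expected one."""
    return {d: pair for d, pair in report.items() if pair[0] != pair[1]}


if __name__ == "__main__":
    problems = mismatches(count_outputs())
    print(problems or "all output counts match")
```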
Computational Environment
| Component | Specification |
|---|---|
| OS | Ubuntu 24.04 (WSL2 on Windows) |
| CPU | 16 cores |
| RAM | 15 GB |
| R | 4.5 |
| fixest | OpenMP for parallel estimation (16 threads) |
| data.table | Multi-threaded (setDTthreads(16)) |
Runtime
The full pipeline (00_master.R) takes approximately 8 minutes on the reference system. The most time-intensive steps are 06_bajari_ye_test.R (bid-level analysis on 40M rows) and 08_did_revised.R (staggered DiD with 144K market-year observations).
Caching
Intermediate data is cached in /tmp/ for fast reload. The prepared dataset (/tmp/p3_prepared.rds, ~800 MB) is created by 01_data_prep.R and reused by all subsequent scripts.
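Because later scripts depend on the cached file, it is worth checking that it exists before running any of them on its own, and that /tmp/ has room for it otherwise. A minimal sketch follows; the function name `cache_status` is hypothetical and the check is not part of the pipeline.

```python
import shutil
from pathlib import Path


def cache_status(cache_path="/tmp/p3_prepared.rds"):
    """Report whether the prepared dataset is cached, with sizes in MB."""
    path = Path(cache_path)
    if path.is_file():
        return {"exists": True, "size_mb": path.stat().st_size / 1e6}
    # Not cached yet: report free space in the cache directory instead.
    free_bytes = shutil.disk_usage(path.parent).free
    return {"exists": False, "free_mb": free_bytes / 1e6}


if __name__ == "__main__":
    print(cache_status())
```

If `exists` is false, rerunning 01_data_prep.R recreates the file; roughly 800 MB of free space is needed, per the size quoted above.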