Replication

This page describes how to replicate the results presented in the paper.

Replication Package

The full replication package includes all code, processed datasets, and manuscript source files needed to reproduce every table and figure in the paper.

Repository Structure

The replication materials are organized in version-specific directories. The `v4` directory contains the current primary analysis; `v2` and `v3` contain earlier implementations.

Software Requirements

Python (Data Build)

Software	Version	Purpose
Python	3.10+	Data preprocessing
`pandas`	latest	Data manipulation
`pyarrow`	latest	Parquet I/O
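
Before running the data build, it can help to confirm that the interpreter and packages listed above are available. The following is a minimal sketch using only the standard library; the function name `check_environment` is hypothetical and not part of the replication package.

```python
import importlib.util
import sys

# Package names and the version floor are taken from the table above.
REQUIRED_PACKAGES = ("pandas", "pyarrow")
MIN_PYTHON = (3, 10)


def check_environment(packages=REQUIRED_PACKAGES, min_python=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info[0]}.{sys.version_info[1]}"
        )
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks complete"]:
        print(line)
```

The check only tests whether each package can be located, not its version, because the table pins no versions beyond "latest".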

R (Analysis Pipeline)

Software	Version	Purpose
R	4.5+	Statistical computing
`fixest`	latest	High-dimensional fixed effects, IV, Sun & Abraham
`data.table`	latest	Fast data manipulation
`arrow`	latest	Reading Parquet files
`modelsummary`	latest	Regression tables
`kableExtra`	latest	LaTeX table formatting
`ggplot2`	latest	Publication-quality figures
`sensemakr`	latest	Cinelli & Hazlett (2020) sensitivity
`did`	latest	Callaway & Sant'Anna (2021) DiD
`HonestDiD`	latest	Rambachan & Roth (2023) bounds
`MatchIt`	latest	Matching estimators

Manuscript

Software	Version	Purpose
LaTeX	TeX Live 2024+	Document typesetting
`elsarticle`	latest	Journal document class
`chicago`	latest	Bibliography style

Data Sources

Primary Datasets

The pipeline reads from `data/processed/`:

File	Rows	Description
`BEC_collapse_final.parquet`	4.5M	Collapsed tender-item dataset
`Firms_final.parquet`	39.6K	Firm registry (CNPJ, CNAE, size)
`LOSERS_rebuilt.parquet`	85K	Frequent-loser (FL) counts per tender-item
`FREQ_PARTICIP_rebuilt.parquet`	16.8K	Always-losers with participation counts
`firm_tender_map.parquet`	16.8M	Firm x tender participation + won flag
`firm_loss_stats.parquet`	41K	Per-firm aggregated stats
`bid_level_full.parquet`	40M	Raw bid-level data
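
A quick way to see whether the data build needs to be run is to check that all seven files are present. The sketch below assumes it is executed from the repository root, so that `data/processed` resolves correctly; the helper name `missing_inputs` is hypothetical.

```python
from pathlib import Path

# File names copied from the table above.
EXPECTED_FILES = (
    "BEC_collapse_final.parquet",
    "Firms_final.parquet",
    "LOSERS_rebuilt.parquet",
    "FREQ_PARTICIP_rebuilt.parquet",
    "firm_tender_map.parquet",
    "firm_loss_stats.parquet",
    "bid_level_full.parquet",
)


def missing_inputs(data_dir="data/processed", expected=EXPECTED_FILES):
    """Return the expected file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing inputs:", ", ".join(missing))
    else:
        print("All primary datasets are present.")
```

This checks presence only. Comparing row counts against the table would require reading the Parquet metadata, which is left out here to keep the sketch dependency-free.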

CADE Validation Data

File	Rows	Description
`cade_carteis_licitacoes_2009_2019.csv`	65	CADE cartel convictions
`cade_bec_crossmatch.csv`	49	CADE firms matched to BEC
`cade_fl_cobidders.csv`	193	FL firms co-bidding with CADE cartelists

Data Access

The BEC procurement data is publicly available through the São Paulo state transparency portal. The processed Parquet files are built from raw Stata `.dta` files via `00_build_bidlevel.py`.

Key Variables

Variable	Description
`losers`	FL presence indicator (1 = at least one FL bidder)
`lneg_price`	Log negotiated price (dependent variable)
`ln_firms`	Log number of firms
`ln_bids`	Log number of bids
`ln_firms_excl`	Log number of non-FL firms
`fl_supply_loo`	Leave-one-out FL supply instrument
`item_f`	Item fixed effect
`year_f`	Year fixed effect
`pbu_f`	PBU fixed effect
`convite`	Procurement modality indicator
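
This page does not spell out how `fl_supply_loo` is constructed. As a generic illustration of the leave-one-out idea only, the sketch below computes, for each observation, the mean of a variable over the other observations in the same group. The grouping column (`region`) and the use of `losers` as the averaged variable are assumptions made for the toy data, not the paper's definition.

```python
from collections import defaultdict


def leave_one_out_mean(rows, group_key, value_key):
    """For each row, average value_key over the other rows in its group.

    Rows that are alone in their group get None, since no other
    observations are available.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1

    result = []
    for row in rows:
        group = row[group_key]
        n_others = counts[group] - 1
        if n_others == 0:
            result.append(None)
        else:
            # Subtract the row's own value so it cannot influence its instrument.
            result.append((totals[group] - row[value_key]) / n_others)
    return result


# Toy data: "region" and "losers" are illustrative column choices only.
toy = [
    {"region": "A", "losers": 1},
    {"region": "A", "losers": 0},
    {"region": "A", "losers": 1},
    {"region": "B", "losers": 1},
]
print(leave_one_out_mean(toy, "region", "losers"))  # [0.5, 1.0, 0.5, None]
```

Excluding the observation's own value is what keeps a leave-one-out measure from being mechanically correlated with that observation's outcome.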

Running the Analysis

Step 1: Data Build (Python, ~6 min)

# Only needed if Parquet files are missing
cd paper3-frequent-losers
python3 scripts/00_build_bidlevel.py

Step 2: Full R Pipeline (~8 min on 16 cores)

Rscript v4/code/00_master.R

`00_master.R` runs the following 13 scripts in sequence:

Script	Purpose	Key Output
`01_data_prep.R`	Load Parquets, merge, filter	`/tmp/p3_prepared.rds`
`02_network_analysis.R`	Co-bidding networks, FL classification	Network metrics
`03_cade_validation.R`	CADE cross-match, permutation test	`tab_cade_permutation.tex`
`04_iv_regressions.R`	2SLS, balance tests, placebo IV	`tab_iv_main.tex`
`05_main_regressions.R`	OLS, network split, interactions	`tab_prices.tex`
`06_bajari_ye_test.R`	Exchangeability, independence, placebo	`tab_bajari_ye.tex`
`07_mechanisms.R`	Selection, calibration, reverse causality	`tab_mechanisms.tex`
`08_did_revised.R`	Callaway & Sant'Anna, Rambachan & Roth	`tab_did_revised.tex`
`09_regime_test.R`	Regime 1 vs 2 simulation	`tab_regime_test.tex`
`10_robustness.R`	Threshold, matching, clustering, CV	Multiple tables
`11_welfare_bounds.R`	OLS/IV/cross-fit welfare estimates	`tab_welfare_bounds.tex`
`12_tables.R`	Compile publication-ready tables	All `.tex` tables
`13_figures.R`	Generate all figures	All `.pdf` figures
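
When debugging a single failing stage, it can be useful to run the scripts one at a time rather than through the master script. The following is a rough sketch of such a runner, not a description of how `00_master.R` works internally. It assumes the numbered scripts sit in `v4/code/` next to `00_master.R` and that `Rscript` is on the `PATH`; `build_commands` and `run_pipeline` are hypothetical names.

```python
import subprocess
from pathlib import Path

# Script names copied from the table above, in execution order.
SCRIPTS = (
    "01_data_prep.R",
    "02_network_analysis.R",
    "03_cade_validation.R",
    "04_iv_regressions.R",
    "05_main_regressions.R",
    "06_bajari_ye_test.R",
    "07_mechanisms.R",
    "08_did_revised.R",
    "09_regime_test.R",
    "10_robustness.R",
    "11_welfare_bounds.R",
    "12_tables.R",
    "13_figures.R",
)


def build_commands(code_dir="v4/code", scripts=SCRIPTS):
    """Return one Rscript command (as an argv list) per script."""
    return [["Rscript", Path(code_dir, name).as_posix()] for name in scripts]


def run_pipeline(commands, runner=subprocess.run):
    """Run commands in order and stop at the first non-zero exit code.

    Returns the list of commands that completed successfully.
    """
    completed = []
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            print("Failed:", " ".join(cmd))
            break
        completed.append(cmd)
    return completed
```

Calling `run_pipeline(build_commands())` would run all 13 stages; slicing the command list, for example `build_commands()[3:]`, resumes from a later stage, provided the earlier outputs it depends on are still on disk.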

Step 3: Manuscript Compilation

cd v4/manuscript
pdflatex -interaction=nonstopmode paper_v4.tex
bibtex paper_v4
pdflatex paper_v4.tex
pdflatex paper_v4.tex

Output Files

Tables

Directory	Format	Count	Description
`v4/output/tables/`	LaTeX (.tex)	35	All regression tables (booktabs + threeparttable)

Figures

Directory	Format	Count	Description
`v4/output/figures/`	PDF	17	All publication-quality figures

Manuscript

File	Pages	Description
`v4/manuscript/paper_v4.pdf`	59	Complete manuscript with appendix
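
After a full run, the file counts listed for tables and figures give a simple completeness check. The sketch below counts files by extension in each output directory and reports them next to the expected numbers. It assumes the outputs sit directly in those directories (no subfolders) and that it is run from the repository root; `count_outputs` is a hypothetical helper.

```python
from pathlib import Path

# Directory, file suffix, and expected count, copied from the tables above.
EXPECTED_OUTPUTS = (
    ("v4/output/tables", ".tex", 35),
    ("v4/output/figures", ".pdf", 17),
)


def count_outputs(root=".", expected=EXPECTED_OUTPUTS):
    """Return {directory: (found, expected)} for each output directory."""
    report = {}
    for rel_dir, suffix, n_expected in expected:
        folder = Path(root) / rel_dir
        if folder.is_dir():
            found = sum(1 for p in folder.glob(f"*{suffix}") if p.is_file())
        else:
            found = 0  # a missing directory counts as zero outputs
        report[rel_dir] = (found, n_expected)
    return report


if __name__ == "__main__":
    for rel_dir, (found, n_expected) in count_outputs().items():
        status = "OK" if found == n_expected else "MISMATCH"
        print(f"{rel_dir}: found {found}, expected {n_expected} [{status}]")
```

A mismatch does not say which file is missing, but it is a cheap first signal that a stage failed or was skipped.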

Computational Environment

Component	Specification
OS	Ubuntu 24.04 (WSL2 on Windows)
CPU	16 cores
RAM	15 GB
R	4.5
fixest	OpenMP for parallel estimation (16 threads)
data.table	Multi-threaded (`setDTthreads(16)`)

Runtime

The full pipeline (`00_master.R`) takes approximately 8 minutes on the reference system. The most time-intensive steps are `06_bajari_ye_test.R` (bid-level analysis on 40M rows) and `08_did_revised.R` (staggered DiD with 144K market-year observations).

Caching

Intermediate data is cached in `/tmp/` for fast reloading. The prepared dataset (`/tmp/p3_prepared.rds`, ~800 MB) is created by `01_data_prep.R` and reused by all subsequent scripts.
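
Because the cache lives in `/tmp/`, it may be missing after a reboot or out of date after the Parquet inputs are rebuilt. The sketch below shows one way to test for that by comparing modification times. Whether the pipeline performs such a check itself is not stated here, and `cache_is_fresh` is a hypothetical helper.

```python
from pathlib import Path


def cache_is_fresh(cache_path, input_paths):
    """True if cache_path exists and is at least as new as every existing input."""
    cache = Path(cache_path)
    if not cache.is_file():
        return False
    cache_mtime = cache.stat().st_mtime
    return all(
        Path(p).stat().st_mtime <= cache_mtime
        for p in input_paths
        if Path(p).exists()
    )


if __name__ == "__main__":
    # The input listed here is one of the primary datasets; extend as needed.
    inputs = ["data/processed/BEC_collapse_final.parquet"]
    if cache_is_fresh("/tmp/p3_prepared.rds", inputs):
        print("Cache is present and newer than the inputs.")
    else:
        print("Cache is missing or stale; rerun 01_data_prep.R.")
```

If the cache is reported as stale, rerunning `01_data_prep.R` regenerates it before the later scripts are run.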