Replication¶
Everything is reproducible---from raw data to final PDF. Start here.
Replication Package¶
The replication package contains all code, processed datasets, and manuscript sources needed to reproduce every table and figure from scratch.
Repository Structure
The replication materials are organized under papers_finais/paper1_screening/. The code/ directory contains the analysis pipeline, manuscript/v4/ the LaTeX source, and output/ the generated tables and figures.
Software Requirements¶
Python (Data Build)¶
| Software | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Data preprocessing |
pandas |
latest | Data manipulation |
pyarrow |
latest | Parquet I/O |
R (Analysis Pipeline)¶
| Software | Version | Purpose |
|---|---|---|
| R | 4.5+ | Statistical computing |
fixest |
latest | High-dimensional fixed effects, IV, Sun & Abraham |
data.table |
latest | Fast data manipulation |
arrow |
latest | Reading Parquet files |
modelsummary |
latest | Regression tables |
kableExtra |
latest | LaTeX table formatting |
ggplot2 |
latest | Publication-quality figures |
sensemakr |
latest | Cinelli & Hazlett (2020) sensitivity |
did |
latest | Callaway & Sant'Anna (2021) DiD |
HonestDiD |
latest | Rambachan & Roth (2023) bounds |
MatchIt |
latest | Matching estimators |
survival |
latest | Cox proportional hazards |
Manuscript¶
| Software | Version | Purpose |
|---|---|---|
| LaTeX | TeX Live 2024+ | Document typesetting |
elsarticle |
latest | Journal document class |
chicago |
latest | Bibliography style (natbib) |
Data Sources¶
Primary Datasets¶
The pipeline reads from data/processed/:
| File | Rows | Description |
|---|---|---|
BEC_collapse_final.parquet |
4.5M | Collapsed tender-item dataset |
Firms_final.parquet |
39.6K | Firm registry (CNPJ, CNAE, size) |
LOSERS_rebuilt.parquet |
85K | FL counts per tender-item |
FREQ_PARTICIP_rebuilt.parquet |
16.8K | Always-losers with participation counts |
firm_tender_map.parquet |
16.8M | Firm x tender participation + won flag |
firm_loss_stats.parquet |
41K | Per-firm aggregated stats |
bid_level_full.parquet |
40M | Raw bid-level data |
CADE Validation Data¶
| File | Rows | Description |
|---|---|---|
cade_carteis_licitacoes_2009_2019.csv |
65 | CADE cartel convictions |
cade_bec_crossmatch.csv |
49 | CADE firms matched to BEC |
cade_fl_cobidders.csv |
193 | FL firms co-bidding with CADE cartelists |
Data Access
The BEC procurement data is publicly available through the Sao Paulo state transparency portal. The processed Parquet files are built from raw Stata .dta files via 00_build_bidlevel.py.
Key Variables¶
| Variable | Description |
|---|---|
losers |
FL presence indicator (1 = at least one FL bidder) |
lneg_price |
Log negotiated price (DV) |
ln_firms |
Log number of firms |
ln_bids |
Log number of bids |
ln_firms_excl |
Log number of non-FL firms |
fl_supply_loo |
Leave-one-out FL supply instrument |
item_f |
Item fixed effect |
year_f |
Year fixed effect |
pbu_f |
PBU fixed effect |
convite |
Procurement modality indicator |
Running the Analysis¶
Step 1: Data Build (Python, ~6 min)¶
# Only needed if Parquet files are missing
cd paper3-frequent-losers
python3 scripts/00_build_bidlevel.py
Step 2: Full R Pipeline (~8 min on 16 cores)¶
cd papers_finais/paper1_screening
Rscript code/00_master_v4.R
This executes 22 scripts sequentially, plus figures_new.R for the v5 figures:
| Script | Purpose | Key Output |
|---|---|---|
00_setup.R |
Package loading, thread config, paths | Environment setup |
01_data_prep.R |
Load Parquets, merge, filter | /tmp/p3v4_prepared.rds |
02_network_analysis.R |
Co-bidding networks, FL classification | Network metrics |
03_iv_construction.R |
Build leave-one-out instrument | IV variables |
04_iv_regressions.R |
2SLS, balance tests, placebo IV | tab_iv_main.tex, tab_iv_balance.tex |
05_main_regressions.R |
OLS, network split, interactions | tab_prices.tex, tab_network_split.tex |
06_bajari_ye_test.R |
Exchangeability, independence, placebo | tab_bajari_ye_corrected.tex |
07_mechanism_tests.R |
Selection, calibration, reverse causality | tab_mechanisms.tex |
08_did_revised.R |
Callaway & Sant'Anna, Rambachan--Roth | tab_did_revised.tex |
09_cade_permutation.R |
CADE cross-match, permutation test | tab_cade_permutation.tex |
10_robustness.R |
Threshold, matching, clustering, CV | Multiple tables |
11_welfare_bounds.R |
Cost-benefit calculations | tab_welfare_bounds.tex |
12_tables.R |
Compile publication-ready tables | All .tex tables |
13_figures.R |
Generate all figures | All .pdf figures |
14_quality_checks.R |
Consistency diagnostics | Validation logs |
15_fl_definition_robustness.R |
Low-win-rate variants, cross-fit | tab_fl_lowwinrate.tex |
16_regime_test.R |
Regime 1 vs 2 simulation and BIC | tab_structural_params.tex |
17_stacked_did.R |
Stacked DiD estimator | tab_stacked_did.tex |
18_oster_delta.R |
Oster coefficient stability | tab_oster_delta.tex |
19_dyadic_permutation.R |
FL--winner pair permutation test | tab_dyadic_permutation.tex |
20_cox_survival.R |
Cox proportional hazards model | tab_cox_survival.tex |
21_network_graph.R |
Network visualization | fig_network_graph.pdf |
22_threshold_heatmap.R |
2D threshold sensitivity heatmap | fig_threshold_heatmap.pdf |
figures_new.R |
Corner solution, dispersion paradox | New v5 figures |
Step 3: Manuscript Compilation¶
cd manuscript/v4
pdflatex -interaction=nonstopmode paper_screening_v4.tex
bibtex paper_screening_v4
pdflatex paper_screening_v4.tex
pdflatex paper_screening_v4.tex
Bibliography
The manuscript uses natbib/bibtex (NOT biblatex/biber). Use bibtex for the bibliography pass.
Output Files¶
Tables¶
| Directory | Format | Count | Description |
|---|---|---|---|
output/tables/ |
LaTeX (.tex) | 35+ | All regression tables (booktabs + threeparttable) |
Figures¶
| Directory | Format | Count | Description |
|---|---|---|---|
output/figures/ |
26 | All publication-quality figures |
Manuscript¶
| File | Pages | Description |
|---|---|---|
manuscript/v4/paper_screening_v4.pdf |
~56 | Complete manuscript with appendix |
Computational Environment¶
| Component | Specification |
|---|---|
| OS | Ubuntu 24.04 (WSL2 on Windows) |
| CPU | 16 cores |
| RAM | 15 GB |
| R | 4.5 |
| fixest | OpenMP for parallel estimation (16 threads) |
| data.table | Multi-threaded (setDTthreads(16)) |
Runtime
The full pipeline takes approximately 8 minutes on the reference system. The most time-intensive steps are 06_bajari_ye_test.R (bid-level analysis on 40M rows) and 08_did_revised.R (staggered DiD with 144K market-year observations).
Caching
Intermediate data is cached at /tmp/ for fast reload. The prepared dataset (/tmp/p3v4_prepared.rds, ~800 MB) is created by 01_data_prep.R and reused by all subsequent scripts.