
Replication

This page describes how to reproduce every table and figure in the paper.


Replication Package

The package contains all code, processed datasets, and manuscript source files needed to reproduce the results. The pipeline runs in two stages: a Python data build that produces the analysis Parquets once, and an R analysis pipeline driven by a single master script.

Repository structure

Materials live under paper3-frequent-losers/. scripts/ contains the Python build and the numbered R pipeline; data/processed/ holds the Parquet inputs (git-ignored); output/tables/ and output/figures/ hold generated artifacts; the submission-clean manuscript source is in work/v20-editor/submission_clean/.


Software Requirements

Python (data build)

| Software | Version | Purpose |
| --- | --- | --- |
| Python | 3.12 | Data preprocessing |
| pandas / pyarrow | latest | Parquet I/O |
| duckdb | latest | Out-of-core joins and aggregation |

R (analysis pipeline)

| Software | Version | Purpose |
| --- | --- | --- |
| R | 4.5+ | Statistical computing |
| fixest | latest | High-dimensional fixed effects (OpenMP, 12 threads) |
| data.table + arrow | latest | Fast manipulation and Parquet reads |
| duckdb | latest | Joins and aggregation over Parquet |
| modelsummary + kableExtra | latest | LaTeX regression tables |
| ggplot2 | latest | Publication-quality figures |
| pROC | latest | AUC and DeLong tests |
| sensemakr | latest | Cinelli & Hazlett (2020) sensitivity |
| MatchIt | latest | CEM / IPW matching |
| did | latest | Callaway & Sant'Anna (2021) staggered DiD |

Manuscript

| Software | Version | Purpose |
| --- | --- | --- |
| LaTeX | TeX Live 2024+ | Typesetting |
| elsarticle | latest | Journal document class |
| chicago | latest | Bibliography style (natbib + bibtex) |

Data Sources

Primary datasets

The pipeline reads from data/processed/ (built once by the Python stage):

| File | Rows | Description |
| --- | --- | --- |
| BEC_collapse_final.parquet | 4.5M | Collapsed tender-item dataset |
| Firms_final.parquet | 39.6K | Firm registry (CNPJ, CNAE, size, location) |
| LOSERS_rebuilt.parquet | 85K | FL counts per tender-item |
| FREQ_PARTICIP_rebuilt.parquet | 16.8K | Always-losers with participation counts |
| firm_tender_map.parquet | 16.8M | Firm × tender participation + won flag |
| firm_loss_stats.parquet | 41K | Per-firm aggregated stats (win rate, always-loser flag) |
| bid_level_full.parquet | 40M | Raw bid-level data (forensic-recoverable layer) |

CADE validation data

| File | Rows | Description |
| --- | --- | --- |
| cade_carteis_licitacoes_2009_2019.csv | 65 | CADE cartel convictions |
| cade_bec_crossmatch.csv | 49 | CADE firms matched to BEC |
| cade_fl_cobidders.csv | 193 | Always-loser firms co-bidding with CADE defendants |

Data access

BEC procurement data is publicly available through the São Paulo state transparency portal. The processed Parquet files are built from the raw Stata .dta export via 00_build_bidlevel.py.

Frequent-loser construct

| Step | Definition |
| --- | --- |
| Always-losers | Firms with win_rate == 0 across all 2009–2019 tenders (≈16,843 firms) |
| IQR threshold | median + 1.5 × IQR of the always-losers' participation counts, which evaluates to ≈ 14 |
| Frequent losers (FL) | Always-losers whose participation count is above the threshold → 2,735 firms |
| Treatment | losers = 1 if a tender-item has ≥1 FL participant, 0 otherwise |

Threshold convention

The threshold is median + 1.5 × IQR (not the standard Tukey Q3 + 1.5 × IQR). This is intentional and is preserved across 00_build_bidlevel.py, 04_figures.R, and 05_robustness.R.
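
The construct above can be sketched in a few lines of Python. This is a minimal illustration, not the code in 00_build_bidlevel.py: the function names, the input layout (a dict of firm id to participation and win counts), and the "inclusive" quantile method are all assumptions, since the page does not state which quantile interpolation the pipeline uses.

```python
from statistics import median, quantiles


def _iqr(counts):
    # Quartiles with linear interpolation over the observed range
    # (assumed convention; the real pipeline may differ).
    q1, _, q3 = quantiles(counts, n=4, method="inclusive")
    return q1, q3, q3 - q1


def fl_threshold(counts):
    """Convention used on this page: median + 1.5 * IQR."""
    _, _, iqr = _iqr(counts)
    return median(counts) + 1.5 * iqr


def tukey_fence(counts):
    """Standard Tukey upper fence, Q3 + 1.5 * IQR, shown for contrast only."""
    _, q3, iqr = _iqr(counts)
    return q3 + 1.5 * iqr


def frequent_losers(firm_stats):
    """firm_stats maps firm id -> (n_participations, n_wins) (hypothetical layout)."""
    # Always-losers: firms that never won, i.e. win_rate == 0.
    always = {f: n for f, (n, wins) in firm_stats.items() if wins == 0}
    cutoff = fl_threshold(list(always.values()))
    # Frequent losers: always-losers strictly above the cutoff.
    return {f for f, n in always.items() if n > cutoff}


def treatment_flag(participants, fl_firms):
    """losers = 1 if the tender-item has at least one FL participant."""
    return int(any(f in fl_firms for f in participants))
```

Because the median is never larger than Q3, this cutoff is never higher than the Tukey fence computed on the same counts, so it flags at least as many firms. On toy counts 1 through 8 plus one firm with 100 participations, the cutoff is 11 under this convention and 13 under Tukey's.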


Running the Analysis

Step 1 — data build (Python, ~6 min, only if Parquets are missing)

cd paper3-frequent-losers
python3 scripts/00_build_bidlevel.py
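
To decide whether Step 1 is needed, one option is to check data/processed/ for the files listed under "Primary datasets". The helper below is a hypothetical sketch, not part of the package; it assumes those seven files are the complete set of inputs the R pipeline expects.

```python
from pathlib import Path

# File names copied from the "Primary datasets" table.
EXPECTED = [
    "BEC_collapse_final.parquet",
    "Firms_final.parquet",
    "LOSERS_rebuilt.parquet",
    "FREQ_PARTICIP_rebuilt.parquet",
    "firm_tender_map.parquet",
    "firm_loss_stats.parquet",
    "bid_level_full.parquet",
]


def missing_parquets(processed_dir, expected=EXPECTED):
    """Return the expected Parquet files that are absent from processed_dir."""
    root = Path(processed_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling missing_parquets("data/processed") from the paper3-frequent-losers directory returns an empty list when the build can be skipped; any non-empty result means 00_build_bidlevel.py should be run first.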

Step 2 — full R pipeline (~8 min on 16 cores)

cd paper3-frequent-losers
Rscript scripts/00_master.R

The master script runs the core pipeline sequentially as subprocesses:

| Script | Purpose | Key output |
| --- | --- | --- |
| 01_clean.R | Load Parquets, extract BEC keys, merge LOSERS, filter | /tmp/p3_prepared.rds |
| 02_analysis.R | Main regressions (4 DVs × 4 specs) | /tmp/p3_models.rds |
| 03_tables.R | LaTeX tables | output/tables/tab_*.tex |
| 04_figures.R | FL distribution, IQR threshold, summary figures | output/figures/fig_*.pdf |
| 05_robustness.R | Threshold sensitivity, continuous treatment, clustering, sensemakr, placebo | tab_threshold_*.tex |
| 06_did_temporal.R | Callaway & Sant'Anna staggered DiD | tab_did_temporal.tex |
| 07_heterogeneity.R | By procedure type, buyer size, item group | tab_heterogeneity*.tex |
| 08_additional_dvs.R | Price ratio, procedure duration | tab_additional_dvs.tex |
| 09_matching.R | CEM + IPW matching, balance table | tab_matching.tex |
| 10_fl_characteristics.R | FL firm characterization (size, age, CNAE) | tab_fl_characteristics.tex |
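
The run-each-script-as-a-subprocess pattern can be illustrated as follows. 00_master.R is an R script and its source is not reproduced here, so this Python sketch only mirrors the pattern; the function name, the injectable run argument, and the stop-at-first-failure behavior are assumptions.

```python
import subprocess

# Core scripts in the order listed in the table above.
CORE_SCRIPTS = [
    "01_clean.R",
    "02_analysis.R",
    "03_tables.R",
    "04_figures.R",
    "05_robustness.R",
    "06_did_temporal.R",
    "07_heterogeneity.R",
    "08_additional_dvs.R",
    "09_matching.R",
    "10_fl_characteristics.R",
]


def run_pipeline(scripts=CORE_SCRIPTS, script_dir="scripts", run=subprocess.run):
    """Run each script with Rscript, in order; raise on the first non-zero exit."""
    completed = []
    for name in scripts:
        result = run(["Rscript", f"{script_dir}/{name}"])
        if result.returncode != 0:
            # Later scripts read outputs of earlier ones (e.g. the cached .rds
            # files), so continuing after a failure would not be meaningful.
            raise RuntimeError(f"{name} exited with code {result.returncode}")
        completed.append(name)
    return completed
```

Running each script in its own process keeps memory from one stage from accumulating into the next, at the cost of passing state through files such as /tmp/p3_prepared.rds and /tmp/p3_models.rds.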

Validation, gate, and architecture scripts

The discrimination, leakage-audit, horse-race, gate-diagnostic, and sequential-gatekeeping results are produced by additional numbered scripts (e.g. 31_imhof_full_pipeline.R, 40_leakage_audit_d3.R, 43_precision_at_k_audit.R, and the architecture/gatekeeper scripts). Each \val* macro in values.tex carries an explicit % src: line naming the producing script.

Step 3 — manuscript compilation

cd paper3-frequent-losers/work/v20-editor/submission_clean
pdflatex -interaction=nonstopmode paper_submission_clean.tex
bibtex paper_submission_clean
pdflatex paper_submission_clean.tex
pdflatex paper_submission_clean.tex

Bibliography

The manuscript uses natbib + bibtex with the chicago style (NOT biblatex/biber). Use bibtex for the bibliography pass.


Reproducibility Discipline

| Component | Specification |
| --- | --- |
| Macro binding | Every numeric claim is bound to a \val* macro in values.tex; no numerals are hardcoded into prose, captions, or table cells. |
| Provenance | Each macro carries a % src: comment naming the script and output CSV that produced it. |
| Caching | The analysis dataset is cached in /tmp/p3_prepared.rds and the fitted models in /tmp/p3_models.rds for fast reload. |
| Threads | fixest and data.table use min(detectCores(logical = FALSE), 16); DuckDB joins use PRAGMA threads = 12. |
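
The provenance rule lends itself to an automated check. The sketch below is hypothetical and not part of the package: it assumes each macro is defined with \newcommand or \def on a single line, and that its % src: comment sits either on that line or on the line immediately before it. The actual layout of values.tex may differ.

```python
import re

# Matches definitions such as \newcommand{\valFoo}{...} or \def\valFoo{...}.
MACRO_RE = re.compile(r"\\(?:newcommand|def)\s*\{?\\(val[A-Za-z]+)\}?")
# Matches a provenance comment such as "% src: script.R -> file.csv".
SRC_RE = re.compile(r"%\s*src:\s*(.+)$")


def macros_without_src(tex):
    """Return names of \\val* macros that have no accompanying % src: comment."""
    missing = []
    prev_line_was_src = False
    for line in tex.splitlines():
        macro = MACRO_RE.search(line)
        src = SRC_RE.search(line)
        if macro and not (src or prev_line_was_src):
            missing.append(macro.group(1))
        # A standalone src comment applies to the definition on the next line.
        prev_line_was_src = bool(src) and not macro
    return missing
```

Passing the contents of values.tex to macros_without_src should return an empty list if the discipline holds; any names returned are macros whose producing script cannot be traced.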

Computational Environment

| Component | Specification |
| --- | --- |
| OS | WSL2 on Windows (kernel 6.6) |
| CPU | 12–16 cores |
| RAM | 21 GiB |
| R | 4.5 |
| Python | 3.12 |

Runtime

The full pipeline takes approximately 8 minutes on the reference system. The most memory-intensive steps operate on the 40M-row bid-level data; these use DuckDB for out-of-core processing.