Skip to content

Replication

Requirements

Component Version
R 4.5+
fixest 0.12+
data.table 1.15+
arrow 14+
ggplot2 3.5+
scales 1.3+
grf 2.3+
quantreg 5.98+
gridExtra 2.3+

Data

The primary dataset is administrative data from BEC (Bolsa Eletronica de Compras), the electronic procurement platform for the state of Sao Paulo. The data contains 373 columns covering all standardized goods procurement from January 2016 to December 2019.

Data access

The raw data files are not publicly available due to confidentiality agreements. Researchers interested in replication should contact the authors directly.

Pipeline

The full analysis pipeline runs from a single master script:

# From the project root directory
Rscript scripts/00_master.R

This executes the following scripts in sequence, each as a separate R subprocess:

Script Purpose Duration
01_clean.R CSV to parquet conversion, variable creation ~5 min (first run)
02_analysis.R 24 DiDiR regressions + 4 event studies ~30 sec
05_robustness.R Placebo, alt. clustering, winsorization, permutation ~45 sec
06_extensions.R Real prices, extensive margin, efficiency, heterogeneity ~30 sec
07_advanced.R HonestDiD, Lee bounds, causal forest, quantile DiD, Gelbach ~3 min
03_tables.R 18 LaTeX tables ~5 sec
04_figures.R 15 PDF figures ~10 sec

Output Structure

output/
├── tables/           # 18 .tex files (threeparttable + booktabs)
│   ├── tab_desc_stats.tex
│   ├── tab_prices.tex
│   ├── tab_participants.tex
│   ├── tab_validbids.tex
│   ├── tab_distance.tex
│   ├── tab_placebo.tex
│   ├── tab_altcluster.tex
│   ├── tab_winsorize.tex
│   ├── tab_prices_real.tex
│   ├── tab_extensive.tex
│   ├── tab_efficiency.tex
│   ├── tab_sme_winner.tex
│   ├── tab_heterog_pbu.tex
│   ├── tab_heterog_value.tex
│   ├── tab_lee_bounds.tex
│   ├── tab_cforest.tex
│   ├── tab_quantile_did.tex
│   └── tab_mediation.tex
└── figures/          # 15 .pdf files (grayscale, cairo)
    ├── fig_01_logprices_es.pdf
    ├── fig_02_distance_es.pdf
    ├── fig_03_numfirms_es.pdf
    ├── fig_04_numbids_es.pdf
    ├── fig_05_trends_prices.pdf
    ├── fig_06_trends_firms.pdf
    ├── fig_07_trends_bids.pdf
    ├── fig_08_trends_distance.pdf
    ├── fig_09_permutation.pdf
    ├── fig_10_sme_share.pdf
    ├── fig_11_honestdid.pdf
    ├── fig_12_cforest_varimp.pdf
    ├── fig_13_cforest_gate.pdf
    ├── fig_14_quantile_did.pdf
    └── fig_15_mediation.pdf

Manuscript Compilation

cd manuscript
pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex

Technical Notes

  • Memory management: Each pipeline script runs as a separate R subprocess to prevent OOM on systems with 15 GB RAM. The fixest lean estimation mode (setFixest_estimation(lean = TRUE)) reduces model storage from ~4 GB to ~2.5 MB.

  • Parquet cache: The first run reads the 6.4 GB CSV file and creates a parquet cache (~73 columns). Subsequent runs load directly from parquet (~5 seconds vs. ~5 minutes).

  • Thread configuration: Both fixest and data.table use 16 threads by default. Adjust in scripts/utils.R if running on a machine with fewer cores.