Replication¶
Requirements¶
| Component | Version |
|---|---|
| R | 4.5+ |
| `fixest` | 0.12+ |
| `data.table` | 1.15+ |
| `arrow` | 14+ |
| `ggplot2` | 3.5+ |
| `scales` | 1.3+ |
| `grf` | 2.3+ |
| `quantreg` | 5.98+ |
| `gridExtra` | 2.3+ |
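The minimum versions listed above can be checked before running anything. The following is a minimal sketch, not part of the project's scripts: the helper name `meets_minimum` is hypothetical, and the version strings are simply transcribed from the requirements table (with `14+` written as `14.0`, because `package_version()` needs at least two components).

```r
# Minimum versions transcribed from the requirements table.
required <- c(
  fixest = "0.12", data.table = "1.15", arrow = "14.0",
  ggplot2 = "3.5", scales = "1.3", grf = "2.3",
  quantreg = "5.98", gridExtra = "2.3"
)

# Hypothetical helper: TRUE when `pkg` is installed at `min_version` or newer.
meets_minimum <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= package_version(min_version)
}

# Named logical vector, one entry per required package.
ok <- mapply(meets_minimum, names(required), required)

if (getRversion() < "4.5") {
  message("R 4.5+ is required; found ", as.character(getRversion()))
}
if (!all(ok)) {
  message("Missing or outdated: ", paste(names(ok)[!ok], collapse = ", "))
}
```

Packages flagged by the second message can then be installed or updated before the pipeline is started.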
Data¶
The primary dataset is administrative data from BEC (Bolsa Eletronica de Compras), the electronic procurement platform of the state of Sao Paulo. The raw data contain 373 columns and cover all standardized goods procurement from January 2016 to December 2019.
Data access
The raw data files are not publicly available due to confidentiality agreements. Researchers interested in replication should contact the authors directly.
Pipeline¶
The full analysis pipeline runs from a single master script:
```bash
# From the project root directory
Rscript scripts/00_master.R
```
This executes the following scripts in sequence, each as a separate R subprocess:
| Script | Purpose | Duration |
|---|---|---|
| `01_clean.R` | CSV to parquet conversion, variable creation | ~5 min (first run) |
| `02_analysis.R` | 24 DiDiR regressions + 4 event studies | ~30 sec |
| `05_robustness.R` | Placebo, alt. clustering, winsorization, permutation | ~45 sec |
| `06_extensions.R` | Real prices, extensive margin, efficiency, heterogeneity | ~30 sec |
| `07_advanced.R` | HonestDiD, Lee bounds, causal forest, quantile DiD, Gelbach | ~3 min |
| `03_tables.R` | 18 LaTeX tables | ~5 sec |
| `04_figures.R` | 15 PDF figures | ~10 sec |
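The subprocess-per-script design can be illustrated with a short sketch. This is an assumption about how such a master script could be written, not the contents of `00_master.R`: the function `run_script` and its arguments are hypothetical, and only the script names and their order come from the table above.

```r
# Execution order taken from the pipeline table.
scripts <- c(
  "01_clean.R", "02_analysis.R", "05_robustness.R", "06_extensions.R",
  "07_advanced.R", "03_tables.R", "04_figures.R"
)

# Hypothetical runner: launch one script in a fresh R process, stop on failure,
# and return the elapsed time in seconds.
run_script <- function(script, dir = "scripts",
                       rscript = file.path(R.home("bin"), "Rscript")) {
  path <- file.path(dir, script)
  started <- Sys.time()
  status <- system2(rscript, shQuote(path))
  if (status != 0) {
    stop("Script failed: ", path, " (exit status ", status, ")")
  }
  elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
  message(sprintf("%-18s finished in %.1f s", script, elapsed))
  invisible(elapsed)
}

# Hypothetical entry point: run every script in order from the project root.
run_pipeline <- function(dir = "scripts") {
  for (s in scripts) run_script(s, dir = dir)
}
```

Because each call to `system2()` starts a new process, memory held by one stage is returned to the operating system before the next stage begins.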
Output Structure¶
```text
output/
├── tables/                  # 18 .tex files (threeparttable + booktabs)
│   ├── tab_desc_stats.tex
│   ├── tab_prices.tex
│   ├── tab_participants.tex
│   ├── tab_validbids.tex
│   ├── tab_distance.tex
│   ├── tab_placebo.tex
│   ├── tab_altcluster.tex
│   ├── tab_winsorize.tex
│   ├── tab_prices_real.tex
│   ├── tab_extensive.tex
│   ├── tab_efficiency.tex
│   ├── tab_sme_winner.tex
│   ├── tab_heterog_pbu.tex
│   ├── tab_heterog_value.tex
│   ├── tab_lee_bounds.tex
│   ├── tab_cforest.tex
│   ├── tab_quantile_did.tex
│   └── tab_mediation.tex
└── figures/                 # 15 .pdf files (grayscale, cairo)
    ├── fig_01_logprices_es.pdf
    ├── fig_02_distance_es.pdf
    ├── fig_03_numfirms_es.pdf
    ├── fig_04_numbids_es.pdf
    ├── fig_05_trends_prices.pdf
    ├── fig_06_trends_firms.pdf
    ├── fig_07_trends_bids.pdf
    ├── fig_08_trends_distance.pdf
    ├── fig_09_permutation.pdf
    ├── fig_10_sme_share.pdf
    ├── fig_11_honestdid.pdf
    ├── fig_12_cforest_varimp.pdf
    ├── fig_13_cforest_gate.pdf
    ├── fig_14_quantile_did.pdf
    └── fig_15_mediation.pdf
```
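After a run, the output tree can be compared against the documented totals (18 tables, 15 figures). The sketch below is illustrative only: `check_outputs` is a hypothetical helper, and it counts files by extension rather than checking each file name.

```r
# Hypothetical helper: count generated files under `root` and compare the
# counts with the documented totals (18 .tex tables, 15 .pdf figures).
check_outputs <- function(root = "output") {
  found <- c(
    tables  = length(list.files(file.path(root, "tables"),  pattern = "\\.tex$")),
    figures = length(list.files(file.path(root, "figures"), pattern = "\\.pdf$"))
  )
  expected <- c(tables = 18L, figures = 15L)
  data.frame(
    kind     = names(expected),
    expected = unname(expected),
    found    = unname(found),
    complete = unname(found == expected)
  )
}
```

Calling `check_outputs()` from the project root returns one row per output type; a `FALSE` in the `complete` column suggests that a stage did not finish.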
Manuscript Compilation¶
```bash
cd manuscript
pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex
```
Technical Notes¶
- Memory management: Each pipeline script runs as a separate R subprocess to prevent out-of-memory (OOM) failures on systems with 15 GB RAM. The `fixest` lean estimation mode (`setFixest_estimation(lean = TRUE)`) reduces model storage from ~4 GB to ~2.5 MB.
- Parquet cache: The first run reads the 6.4 GB CSV file and writes a parquet cache with ~73 columns. Subsequent runs load directly from parquet (~5 seconds vs. ~5 minutes).
- Thread configuration: Both `fixest` and `data.table` use 16 threads by default. Adjust this in `scripts/utils.R` if running on a machine with fewer cores.
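The three notes above can be sketched together in a few lines of R. This is a hedged illustration, not the contents of `scripts/utils.R` or `01_clean.R`: the helpers `pick_threads` and `load_cached`, their arguments, and any file paths passed to them are hypothetical. Only the 16-thread default and the `lean = TRUE` setting come from the notes.

```r
# Hypothetical helper: cap the documented 16-thread default at the number of
# cores actually available on this machine.
pick_threads <- function(requested = 16L, available = parallel::detectCores()) {
  if (is.na(available)) available <- 1L
  as.integer(max(1L, min(requested, available)))
}

# Hypothetical helper: load the parquet cache if it exists; otherwise read the
# CSV (optionally keeping only the columns in `keep`) and write the cache.
load_cached <- function(csv_path, parquet_path, keep = NULL) {
  if (file.exists(parquet_path)) {
    return(data.table::as.data.table(arrow::read_parquet(parquet_path)))
  }
  dt <- data.table::fread(csv_path, select = keep)
  arrow::write_parquet(dt, parquet_path)
  dt
}

n_threads <- pick_threads()

if (requireNamespace("fixest", quietly = TRUE)) {
  fixest::setFixest_nthreads(n_threads)
  # Lean mode drops bulky components from stored model objects.
  fixest::setFixest_estimation(lean = TRUE)
}
if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::setDTthreads(n_threads)
}
```

One trade-off worth keeping in mind: lean `fixest` objects are much smaller, but some post-estimation methods that rely on the dropped components may no longer be available on them.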