Skip to content

Replication

This page describes how to reproduce every table and figure in the paper, and the data-access and confidentiality position that governs what can be shared.


Data Access and Confidentiality

BEC microdata are not publicly redistributable

The BEC-SP procurement microdata underlying this paper are administrative records governed by the data provider's terms and are not publicly redistributable. The authors cannot post the raw or processed firm-level microdata as part of a public replication package. No raw CNPJ (firm-tax-identifier) values appear in any public material, including the derived frames described below.

What can be shared:

  • Full analysis code and the complete output → script map, so that every table, figure, and \val* macro is traceable to the program that produced it.
  • Derived, anonymized firm frames (e.g. per-firm participation counts, always-loser flags, and score values with identifiers replaced by opaque keys) sufficient to reproduce the downstream analysis without exposing raw identifiers.
  • CADE rulings, which are public: the adjudication anchors come from published CADE decisions and can be cited and re-derived from the public record.
  • Seeds, logs, and the computational environment described below.

What cannot be shared:

  • Raw or processed BEC microdata with firm identifiers.
  • Any artifact containing raw CNPJ values.

JLEO data policy and proprietary exemption

Consistent with JLEO policy, the authors will post data, programs, and logs within three years of publication unless a proprietary exemption applies. The BEC microdata fall under such a proprietary exemption: they are administrative records the authors are not licensed to redistribute. The authors cooperate fully with bona fide replication requests — including sharing code, the output → script map, anonymized derived frames, and guidance for obtaining the underlying data from the original provider.


Replication Package

The shareable package contains all code, the output → script map, anonymized derived frames, and manuscript source files needed to reproduce the results from a licensed copy of the BEC microdata. The pipeline runs in two stages: a Python data build that produces the analysis Parquets once, and an R analysis pipeline driven by a single master script.

Repository structure

Materials live under paper3-frequent-losers/. scripts/ contains the Python build and the numbered R pipeline; data/processed/ holds the Parquet inputs (git-ignored, not redistributable — built locally from a licensed BEC copy); output/tables/ and output/figures/ hold generated artifacts; the submission-clean manuscript source is in work/v22-editor/submission_clean/.

The submission is now a three-document package

The manuscript ships as three documents, each compiled separately from the same values.tex macro file and cross-referenced via xr:

Document Source Length Holds
Paper paper_submission_clean.tex ~44 pp Body: the over-crediting characterization (lead contribution), the modest enforcer stopping rule, the BEC audit, the federal portability leg
Online Appendix online_appendix_submission_clean.tex ~28 pp Data/labels, theory + survival, opportunity & timing audits, profile/score diagnostics, bid benchmark + cost–recall, price scope, federal audit battery
Online Supplement online_supplement_submission.tex ~23 pp Full robustness grids, permutation draws, threshold sweeps, the complete bid-feature dictionary, fold audits, the full price grid, the full federal-construction detail, and the long proofs — the heavy material migrated out of the body and appendix

The Online Supplement is new in this version: it absorbs the full grids and proofs so the paper and appendix stay tight. Compile the appendix and supplement before the paper so the xr cross-references resolve.


Software Requirements

Python (data build)

Software Version Purpose
Python 3.12 Data preprocessing
pandas / pyarrow latest Parquet I/O
duckdb latest Out-of-core joins and aggregation

R (analysis pipeline)

Software Version Purpose
R 4.5+ Statistical computing
fixest latest High-dimensional fixed effects (OpenMP, 12 threads)
data.table + arrow latest Fast manipulation and Parquet reads
duckdb latest Joins and aggregation over Parquet
modelsummary + kableExtra latest LaTeX regression tables
ggplot2 latest Publication-quality figures
pROC latest AUC and DeLong incremental tests
ranger latest Random forests for the bid-distribution / combined benchmark
survival latest Exit-margin / persistence model (Appendix B)
sensemakr latest Cinelli & Hazlett (2020) sensitivity
MatchIt latest CEM / IPW matching
did latest Callaway & Sant'Anna (2021) staggered DiD

Manuscript

Software Version Purpose
LaTeX TeX Live 2024+ Typesetting
elsarticle latest Journal document class
chicago latest Bibliography style (natbib + bibtex)

Data Sources

Primary datasets (built locally, not redistributable)

The pipeline reads from data/processed/ (built once by the Python stage from a licensed BEC microdata copy). These files contain firm identifiers and are git-ignored and not redistributable:

File Rows Description
BEC_collapse_final.parquet 4.5M Collapsed tender-item dataset
Firms_final.parquet 39.6K Firm registry (anonymized key, CNAE, size, location)
LOSERS_rebuilt.parquet 85K FL counts per tender-item
FREQ_PARTICIP_rebuilt.parquet 16.8K Always-losers with participation counts
firm_tender_map.parquet 16.8M Firm × tender participation + won flag
firm_loss_stats.parquet 41K Per-firm aggregated stats (win rate, always-loser flag)
bid_level_full.parquet 40M Raw bid-level data (forensic-recoverable layer)

These files are not shared

The tables above describe the local build only. They contain firm-level administrative records (built from a licensed BEC copy) and are not part of the public replication package. The anonymized derived frames that are shareable replace any raw identifier with an opaque key; no raw CNPJ value appears in any shared artifact.

CADE adjudication anchors (public record)

The adjudication anchors are derived from published CADE rulings, which are part of the public record. In the shareable package, any firm appearing in a derived crossmatch is referenced by an opaque anonymized key, never by raw CNPJ:

Derived frame Rows Description
CADE cartel rulings (public) 12 cases Adjudicated procurement-cartel cases used as legal anchors
Direct CADE defendants Direct defendants named in the rulings (legal anchors)
Adjudication-anchored cobidders 651 Always-loser firms that share ≥1 BEC tender-item with a BEC-active direct defendant (anonymized keys; not cartel members)

The label rule (canonical, non-circular)

The main validation label is the broad adjudication-anchored cobidder target: a unique always-loser firm that shares at least one BEC tender-item with a BEC-active direct CADE defendant. Direct defendants are excluded, and the frequent-loser flag is never used to construct the label. There are 651 positives (341 frequent-loser, 310 non-frequent-loser — the composition shows the label is not flag-conditioned). A contact-intensity sensitivity (≥ 2 shared tender-items) gives 368 positives. The cobidders are firms with adjudication-anchored exposure — they co-appeared with direct defendants in adjudicated environments — not cartel members, and the label inherits CADE's selection of which cartels to adjudicate.

Frequent-loser construct

Step Definition
Score sᵢ = log(1 + Tᵢ), the continuous ordering over participation counts T
Always-losers Firms with win_rate == 0 across all 2009–2019 tenders (16,843 firms)
Administrative cutoff FL14 = (Tᵢ ≥ 14), i.e. median + 1.5 × IQR of always-loser participation counts
Frequent losers Always-losers at or above the cutoff → 2,735 firms
Treatment losers = 1 if a tender-item has ≥1 FL participant

FL14 is administrative, not structural

FL14 is an administrative, auditable simplification of the continuous ordering sᵢ — never an ontologically special threshold. The cutoff median + 1.5 × IQR (not the standard Tukey Q3 + 1.5 × IQR) is intentional and is preserved across 00_build_bidlevel.py, 04_figures.R, and 05_robustness.R. The construct orders forensic priority; the binary cutoff is a deployable convenience, not the object of inference.


Federal (ComprasNet) Replication — the Second Platform

One config, two sources — --source={bec,comprasnet}

As of v23 the audit pipeline runs on two procurement platforms from a single code base. Every analysis script accepts a --source= flag (default bec) and reads all of its data paths, constants, key-extraction lambdas, and output directories from one abstraction layer, scripts/utils/source_config.R. There is no forked per-platform logic: BEC behaviour is byte-identical to the single-platform release, and the federal pipeline differs only where the data genuinely differ. To run the federal leg, every numbered script is invoked with --source=comprasnet; the orchestration is in scripts/build/run_federal_chain.sh.

Federal data provenance (public federal records)

The federal panel is assembled from two public federal sources — no licensed microdata and, importantly, no federal bid microdata: the public federal data are participation-only, so no bid-distribution / Imhof forensic benchmark is constructible federally (a platform-observability difference, documented as on-thesis).

Source Role Coverage
Portal da Transparência (CGU) bulk download Participation panel — who participated / won, buyer (UASG), item 2013–2019 (item_level_panel.parquet)
compras.dados.gov.br API (item_pregao) Price signals only (menorLance, valorEstimadoItem, valorHomologadoItem) parsed into a price panel; not used for any federal price claim in this submission

No federal price claim is made

The federal price panel is built (12.4M rows) but is a future-upgrade hook only. The current submission makes zero federal price claim; the federal leg is participation + winner-flag only.

Federal pipeline chain

The federal chain runs the same numbered scripts as BEC, 00 → 12b, each with --source=comprasnet, plus one federal-only robustness leg:

Script Federal role
00 … 12b (the BEC chain) Identical audit battery, re-pathed and re-keyed via source_config.R
13_srp_stratified_validation.R Federal-only SRP leg — stratifies the two federal Pregão variants (regular po_phase_code = 5 vs SRP po_phase_code = 9999) and checks the loser-side signal is consistent across them (referee attack A7)

Federal key semantics differ from BEC

BEC and ComprasNet share the same column names but not the same key semantics, which is precisely why a single config exists. Federal numerodaoc is a 9-char string whose trailing digits are the numbering year — which differs from the award year for ≈ 23% of rows — so the federal year is never string-extracted; it comes from a (numerodaoc, codigoitem) → year lookup. The federal buyer (codigo_ug, 6-digit UASG) is a separate column, not a substring. Convite is extinct federally (pure Pregão), so the BEC modality stratification is replaced by the SRP-vs-regular contrast.

Canonical federal targets (same IQR rule, re-estimated cut)

The federal targets are built by the same rules as BEC — the FL cut is re-estimated on the federal data, not transported from BEC:

Quantity Federal value BEC analogue
Participation panel 51.0M rows / 92,600 firms
Always-losers 35,943 16,843
FL cut (median + 1.5 × IQR, ) 326,491 frequent losers 14 → 2,735
Broad-rule cobidders 3,850
Broad-AL cobidders (main target) 195 (94 FL / 101 non-FL) 651
CADE cases (numbered) 7 12
Window 2013–2019 2009–2019

Partially overlapping legal anchors

The federal target is establishment-anchored on the same family of CADE cartels as the BEC portfolio. The 7 federal cases are the same cartels as BEC's, partially overlapping — so the two legs are correlated, not fully independent. The federal leg tests the portability of the audit protocol and the loser-side construct, not an independent ground truth; it is a second-platform demonstration, provisional pending genuinely independent anchors, not a promotion to "Confirmed." The cobidder count was rebuilt establishment-anchored (CADE linkage v3), with the establishment-vs-raiz grain difference verified to be zero.


Running the Analysis

Step 1 — data build (Python, ~6 min, only if Parquets are missing)

cd paper3-frequent-losers
python3 scripts/00_build_bidlevel.py

Step 2 — full R pipeline (~8 min on 16 cores)

cd paper3-frequent-losers
Rscript scripts/00_master.R

The master script runs the core pipeline sequentially as subprocesses:

Script Purpose Key output
00_build_canonical_validation_targets.R Build the canonical non-circular validation label (651 always-loser cobidders sharing ≥1 BEC tender-item with a BEC-active direct CADE defendant; FL flag never used) outputs/targets/canonical_target_counts.csv
01_clean.R Load Parquets, extract BEC keys, merge LOSERS, filter /tmp/p3_prepared.rds
02_analysis.R Main regressions (4 DVs × 4 specs) /tmp/p3_models.rds
03_tables.R LaTeX tables output/tables/tab_*.tex
04_figures.R FL distribution, IQR threshold, summary figures output/figures/fig_*.pdf
05_robustness.R Threshold sensitivity, continuous treatment, clustering, sensemakr, placebo tab_threshold_*.tex
06_did_temporal.R Callaway & Sant'Anna staggered DiD tab_did_temporal.tex
07_heterogeneity.R By procedure type, buyer size, item group tab_heterogeneity*.tex
08_additional_dvs.R Price ratio, procedure duration tab_additional_dvs.tex
09_matching.R CEM + IPW matching, balance table tab_matching.tex
10_fl_characteristics.R FL firm characterization (size, age, CNAE) tab_fl_characteristics.tex
12_audit_armor.R Anchor-agnostic armor battery: exposure tiers (observed / plug-in / firm-LOO / label-blind), within-stratum granularity sweep + positive control, powered permutation, label-frozen timing, regenerated defendant roles outputs/diagnostics/audit_armor/
12b_audit_armor_fixup.R Fixups / regeneration for the armor pack (\valArmor* macros) outputs/diagnostics/audit_armor/
analysis/14_overcrediting_inflation_sim.R Over-crediting bias as an estimable object — the lead contribution. Builds the synthetic surface of the raw-AUC inflation \(\Delta = \mathrm{AUC}_{\text{raw}} - \mathrm{AUC}_{\text{opp-adj}}\) over a grid in \(\mathrm{CV}(T)\) × adjudicated base rate. The surface is synthetic and anchored at one empirical point per platform — it is not an estimated curve fit to data. No confidential microdata enters: the only empirical inputs are two published scalars per platform outputs/diagnostics/inflation_surface.csv, output/figures/fig_inflation_surface.pdf

Canonical-label, decomposition, audit, and frontier scripts

The canonical validation label is built by 00_build_canonical_validation_targets.R (651 positives; FL flag never used), and the contact-intensity sensitivity (≥ 2 shared tender-items, 368 positives) by 02b_opportunity_sensitivity_contact2.R. The opportunity decomposition (raw vs label-blind opportunity vs within-stratum AUC and the nested-increment DeLong test), the permutation and negative-control audits, the leakage / contamination and timing audits, the leave-largest-case-out single-case-concentration audit, the bid-distribution benchmark, and the cost–recall frontier (K1 grid, firm-vs-bid-row denominators) are produced by additional numbered scripts (e.g. 31_imhof_full_pipeline.R, 40_leakage_audit_d3.R, the decomposition/timing scripts, and the architecture/frontier scripts). The anchor-agnostic armor battery — exposure tiers (observed contact 0.905 / plug-in 0.985 / firm-LOO 0.855 / label-blind 0.553), the within-stratum granularity sweep with planted positive control (0.953), the powered permutation, and the label-frozen timing benchmark — is produced by 12_audit_armor.R and 12b_audit_armor_fixup.R, writing to outputs/diagnostics/audit_armor/ (macros \valArmor*). Each \val* macro in values.tex carries an explicit % src: line naming the producing script and output CSV — the output → script map that travels with the shareable package.

Two propositions are theory, not data

The reframe adds two positive objects to the appendix framework, and both are analytic — they consume no microdata and have no producing data script. (1) The over-crediting bias \(\Delta\) (Proposition, online Supplement), now the paper's lead contribution: a size-bias characterization of why a contact-anchored validation over-credits a volume-loaded score — stated as signs only (increasing in \(\mathrm{CV}(T)\), decreasing in the adjudicated base rate), no closed-form magnitude, with \(\mathrm{CV}(T)\) as a pre-bid-file leading-order sufficient statistic / diagnostic (not a fix). The magnitude is read off the synthetic surface in analysis/14_overcrediting_inflation_sim.R above — a surface anchored at one empirical point per platform, not an estimated curve; the two platforms are two points on that one synthetic surface. (2) The enforcer stopping rule (Proposition, Appendix B), stated modestly as a standard cost–benefit (MB = MC) tangency: the agency descends the award ranking until marginal recovery per unit forensic cost falls to the cost–value ratio \(c/V\); sweeping \(c/V\) traces the cost–recall frontier, so the absence of a single fixed cutoff follows from the budget-dependent optimum. No data script is needed to reproduce either proposition — they are mathematical statements proved in the appendix and supplement.

Seeds

Every script that draws random numbers (bootstrap permutations, cross-validation folds, ranger forests, matching) sets an explicit seed at the top, so the audit results — including the B = 2,000 sham permutation, the K-fold CV, and the bootstrap intervals — reproduce bit-for-bit on a given environment.

Step 3 — manuscript compilation

cd paper3-frequent-losers/work/v22-editor/submission_clean
# Compile appendix + supplement first so the paper's xr cross-refs resolve.
for doc in online_appendix_submission_clean online_supplement_submission paper_submission_clean; do
  pdflatex -interaction=nonstopmode "$doc.tex"
  bibtex "$doc"
  pdflatex "$doc.tex"
  pdflatex "$doc.tex"
done

Bibliography

The manuscript uses natbib + bibtex with the chicago style (NOT biblatex/biber). Use bibtex for the bibliography pass. The three documents (paper, appendix, supplement) share one values.tex and resolve cross-references through xr / \externaldocument, so compile the appendix and supplement before the paper.


Reproducibility Discipline

Component Specification
Macro binding Every numeric claim is bound to a \val* macro in values.tex; no numerals are hardcoded into prose, captions, or table cells.
Provenance Each macro carries a % src: comment naming the script and output CSV that produced it.
Caching /tmp/p3_prepared.rds (analysis dataset) and /tmp/p3_models.rds (fitted models) for fast reload.
Threads fixest and data.table use min(detectCores(logical = FALSE), 16); DuckDB joins use PRAGMA threads = 12.
Determinism The two-source package enforces deterministic ordering; bit-exact reproduction requires that enforced ordering. A DuckDB parallel-aggregation / emission-order nondeterminism was fixed by stabilizing tie-breaks to a total order, so the audit reproduces bit-for-bit on a given environment across both sources.

Computational Environment

Component Specification
OS WSL2 on Windows (kernel 6.6)
CPU 12–16 cores
RAM 21 GiB
R 4.5
Python 3.12

Runtime

The full pipeline takes approximately 8 minutes on the reference system. The most memory-intensive steps operate on the 40M-row bid-level data; these use DuckDB for out-of-core processing.