Economics tooling 2026-04-25 14 minute read

What makes a Deluair brief defensible: the methodology bar

Every memo we ship has to clear the same bar a hostile referee at the American Economic Review would apply. The phase-based workflow, the no-mock-data rule, the replication package, the bounds analysis, and the em-dash discipline all serve a single purpose: a brief that survives the most aggressive reading a competent skeptic can give it.

The defensibility of a research brief is not a writing question. It is a workflow question. This essay walks through the six-phase research workflow that governs every Deluair engagement (planning, data, estimation, writing, audit, submission), the first-principles discipline that anchors design choices in economic logic rather than convention, the no-mock-data rule that requires every claim to cite a primary source, the replication packages that ship with pinned versions and collection logs, the hostile-referee audit conducted by a colleague who did not touch the analysis, and the identification standards (HC1 standard errors, panel clustering, first-stage F statistics, balance tables, placebo tests, Manski and Rambachan-Roth bounds) that decide whether a causal claim is published or scrapped. The em-dash discipline closes the essay: a small, visible signal that the editorial care behind the prose is not optional.

The premise: defensibility is a workflow, not a paragraph

The fastest way to lose a board, a regulator, or a court is to publish a number that does not survive a competent reading. The fastest way to keep them is to make every number reproducible, every assumption legible, and every robustness check already done before a critic asks for it. Defensibility is not a stylistic property of the final memo. It is a property of the workflow that produced it.

Deluair runs every engagement through the same six-phase workflow we use for our own academic submissions through Delphi, our open-source research stack. The workflow is not a checklist that lives in a project plan and is forgotten. It is a sequence of gates, each with a written artifact that must clear review before the next phase begins. When a phase fails its gate, the engagement stops and re-scopes rather than papering over the failure with prose. That gate discipline is the first thing a serious reviewer looks for, and the first thing missing from most consulting work.

This brief walks through what the bar actually is. Where it comes from. What the gates look like. Why we do not use mock data. What ships in the replication package. What the hostile-referee audit covers. Which identification standards are non-negotiable. And why the em-dash rule is not about typography.

The six-phase workflow

The phases are: planning, data, estimation, writing, audit, and submission. Each has a defined entry artifact, a defined exit artifact, and a defined review gate. The structure is borrowed from the AEA submission process and the JEP replication policy, both of which assume that a publishable empirical paper is the product of a sequenced workflow rather than a single creative act.

Planning produces a written pre-analysis plan: the question in one paragraph, the contribution in one paragraph, the identification strategy in formal notation, the sample, the primary specification, and the inference rule. The plan is timestamped and shared with the client before any data is touched. It exists so that the analysis cannot drift toward whatever result the team or the client hoped to find.

Data produces a data quality report: schema validation, missingness audit, outlier flags, reconciliation against any known totals, and a collection log that records every download, every transformation, and every dropped observation with a stated reason. The vintage of every series is pinned (for example, World Bank WDI 2024Q2, FRED pull on 2026-03-14, BACI V202601). The raw files sit untouched in a read-only directory.

Estimation produces tables and figures plus a robustness battery. We always start with a baseline OLS specification, even when a more sophisticated estimator is the eventual answer. From the baseline we add the identification strategy the question demands, then we run the bounds and placebos that test whether the result is fragile.

Writing produces the memo. Five-paragraph introduction, active voice, acronyms defined at first use, numbers below ten spelled out in prose, exhibits captioned, and (the rule that catches the most prose) no em-dashes. Tables and figures are not embedded inline. They sit on dedicated pages so the prose has to stand on its own.

Audit produces an internal review log. A colleague who did not touch the analysis runs a coverage check, a necessity check, a hostile-referee pass, and a clean-machine reproduction of the full replication package. If the package does not reproduce byte-for-byte, the engagement does not ship.

Submission produces the four delivery artifacts: the memo, the dashboard, the replication package, and the readout deck. The same convention applies whether the audience is a journal or a chief risk officer. The replication package is the artifact that lets the result outlive the engagement.

First-principles discipline: derive from economic logic, not convention

The most common failure mode in applied work is borrowing a specification from a recent paper without re-deriving why it is the right specification for the question at hand. The result looks defensible because it cites a published precedent. It is not defensible, because the precedent answered a different question with different data and different assumptions.

Our standard is to derive the empirical specification from the economic model, not from the literature search. If the question is about pass-through, we begin from the firm's profit-maximization condition and work to a reduced form, then check whether the reduced form happens to match what previous work used. If it does, we cite. If it does not, we say why and we defend the deviation in the memo. The literature is a sanity check, not a source of authority.

The same discipline applies to functional form. A log-linear specification is not the right answer because it is the most common one. It is the right answer when the elasticity is the parameter of interest, when the data spans enough orders of magnitude that a level model misbehaves, and when the residual diagnostics support log-additive errors. We test those conditions explicitly. Where they fail, we use a level model, a Poisson pseudo-maximum-likelihood (Silva and Tenreyro 2006 for trade gravity), or a flexible Box-Cox, and we record the diagnostic that drove the choice.

The same discipline applies to standard errors. HC1 heteroskedasticity-robust errors are our default for cross-sectional regressions, following the AEA referee bar that treats classical errors as suspect by default. For panel data we cluster at the unit level (or at a higher level when treatment is assigned at that level, following Abadie, Athey, Imbens, and Wooldridge 2023), and we always state the clustering level in the table notes. Conventional rules of thumb (the famous 'cluster at the state level') are not a substitute for thinking about the design that generated the data.

The no-mock-data rule

Every variable in every regression in every Deluair memo traces back to a primary source. We do not use synthetic data for production work. We do not impute missing series with estimated growth rates. We do not extrapolate beyond observed coverage. We do not silently fill zeros where the data is missing. If a series we needed is unavailable, we say so in the memo, and we either narrow the question or we do not ship the claim.

The primary sources we lean on most often are the World Bank WDI, FRED, IMF International Financial Statistics, BACI from CEPII, UN Comtrade, FDIC call reports through the FFIEC system, BIS Locational Banking Statistics, FAOSTAT, EIA Form 860 and Form 923, and client-supplied administrative data with documented provenance. Each pull is timestamped, vintage-pinned, and stored as the raw file in a read-only directory before any cleaning. This rule exists because the alternative (a chart that nobody can reproduce because the underlying pull was a one-off) is the single most common defect in published consulting work.

When client data is involved, we apply the same discipline. We require a written data dictionary. We document the system of record, the extraction date, the filters applied at extraction, and the row counts at every step of the pipeline. If we receive a spreadsheet that has been hand-edited, we ask for the source query that produced the unedited version. If that query is not available, we treat the spreadsheet as a derived series and we say so in the memo.

The rule has a corollary: no fabricated quotes, no invented statistics, no AI-generated tables passed off as data. Generative models are excellent at writing fluent prose around plausible numbers. The numbers are still fabricated. Every numeric claim in a Deluair memo has to map to a row in a CSV that ships with the replication package. If it does not, the claim is removed before submission.
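
The claim-to-CSV mapping can be checked mechanically. The following is a minimal sketch, not the check Delphi actually runs: the function name `unsupported_claims`, the claim dictionary keys, and the idea of passing tables as in-memory CSV text are all assumptions made for illustration.

```python
import csv
import io


def unsupported_claims(claims, tables, tolerance=1e-9):
    """Return the ids of claims that no CSV row backs up.

    claims: list of dicts with hypothetical keys
        id, table, key_column, key, value_column, value
    tables: dict mapping a table name to its CSV text
    """
    failures = []
    for claim in claims:
        reader = csv.DictReader(io.StringIO(tables[claim["table"]]))
        # Find the row the claim points at, if it exists.
        match = next(
            (row for row in reader if row[claim["key_column"]] == claim["key"]),
            None,
        )
        if match is None:
            failures.append(claim["id"])
            continue
        # Compare the number in the prose with the number in the package.
        shipped = float(match[claim["value_column"]])
        if abs(shipped - claim["value"]) > tolerance:
            failures.append(claim["id"])
    return failures
```

A non-empty return value lists the claims that would have to be removed or corrected before submission.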

Replication packages, pinned versions, collection logs

The replication package is the artifact that makes a brief verifiable. It is also the artifact that makes the result useful to the client after the engagement ends. We follow the AEA Data and Code Availability Policy and the JEP replication standard, both of which require that a third party can rerun the analysis from raw data to final exhibit using only the artifacts in the package.

What ships in the package: the raw data files (or, where licensing prevents redistribution, the pull script and the collection log that documents what was pulled, when, and from where), the cleaning scripts in execution order, the analysis scripts, the table and figure generators, the LaTeX or Markdown source for the memo, the pre-analysis plan, the data quality report, and the audit log. Random seeds are set explicitly and recorded. Package versions are pinned in pyproject.toml (Python) and renv.lock (R), with the lockfile checked into the package and the exact interpreter version recorded in a README.

Reproducibility is tested on a clean machine. The auditor (a colleague who did not touch the analysis) clones the repository to a fresh container, installs the pinned environment, runs the master script, and confirms that every table and every figure regenerates byte-for-byte. Stochastic outputs are pinned by seed; outputs that depend on system locale or BLAS implementation are noted. If reproduction fails, the package is sent back to the analyst. The brief does not ship until reproduction succeeds.
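
One way to make the byte-for-byte comparison concrete is a hash manifest: hash every exhibit from the reference run, hash every exhibit from the clean-machine rerun, and list the differences. This is a sketch under assumptions. The function names are hypothetical, and the source does not say how the comparison is actually implemented.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(output_dir: Path) -> dict[str, str]:
    """Map each output file's relative path to its content hash."""
    return {
        p.relative_to(output_dir).as_posix(): sha256_of(p)
        for p in sorted(output_dir.rglob("*"))
        if p.is_file()
    }


def diff_manifests(reference: dict[str, str], rerun: dict[str, str]) -> list[str]:
    """List every exhibit that is missing, unexpected, or not identical."""
    problems = []
    for name in sorted(set(reference) | set(rerun)):
        if name not in rerun:
            problems.append(f"missing in rerun: {name}")
        elif name not in reference:
            problems.append(f"unexpected in rerun: {name}")
        elif reference[name] != rerun[name]:
            problems.append(f"hash mismatch: {name}")
    return problems
```

An empty list from `diff_manifests` is the pass condition. Outputs known to vary with locale or BLAS implementation would need to be excluded or compared with a tolerance instead of a hash.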

Collection logs deserve their own paragraph. Every data pull, whether a scheduled API call or a one-off download, is recorded in a JSONL log: timestamp, source URL, query parameters, vintage tag, response hash, row count, and the analyst initials. The log lives in the repository and is appended on every run. It exists because the most common defect we see in inherited datasets is a series whose provenance is unknown. The collection log makes provenance permanent.
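
A collection-log entry with the fields listed above could be appended as follows. This is a minimal sketch: the function name `log_pull` and the JSON key names are assumptions, not the schema Delphi uses.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def log_pull(log_path, source_url, params, vintage, payload, row_count, analyst):
    """Append one data-pull record to a JSONL collection log and return it.

    payload is the raw response as bytes; only its hash is stored.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source_url": source_url,
        "query_params": params,
        "vintage": vintage,
        "response_sha256": hashlib.sha256(payload).hexdigest(),
        "row_count": row_count,
        "analyst": analyst,
    }
    # Append mode: earlier records are never rewritten.
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Storing the response hash rather than the response keeps the log small while still letting an auditor confirm that the raw file in the read-only directory is the one that was pulled.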

The hostile-referee audit

Before a brief reaches the client, a colleague who did not touch the analysis runs an audit designed to simulate the most aggressive competent reviewer the work will ever face. The model is the AEA referee process, where a paper is read by two or three reviewers who are paid (in social capital, not cash) to find every reason it might be wrong. The audit translates that process into a one-week internal pass.

The audit covers five dimensions. Identification: is the strategy actually credible, what are the threats, and are they addressed in the memo. Data: could the headline result be a measurement artifact, a coverage artifact, or a sample-selection artifact. Econometrics: are the standard errors right, is the clustering level right, is the bandwidth right (for RDD), is the first-stage strong enough (for IV), is the parallel trends assumption defensible (for DiD). Results: are the magnitudes plausible against external benchmarks, and where they are not, is the difference explained. Contribution: is this differentiated from prior work, or is it restating an established result.

Pre-emptive defenses are written into the memo before submission. If the auditor flags a threat, the response is not 'remove the threat from the memo.' The response is 'add the test that addresses the threat, and if the test fails, weaken the claim.' This is the same discipline that drives a good rebuttal letter to a journal. We do it before submission because clients should not have to be the first reviewers.

The auditor also walks the replication package end to end on a clean machine. Every script runs from raw data to final exhibit. Every table and every figure regenerates. Every number cited in the memo prose maps to a cell in a regenerated table. If anything fails, the engagement does not ship until it is fixed. This is the most expensive single check in our process, and it is the one that has saved us the most embarrassment over five years of running it.

Identification standards: the non-negotiables

Causal claims live or die on identification. The standards below are the minimum bar for any causal claim in a Deluair brief. They are not aspirational. They are the gate.

Standard errors. HC1 heteroskedasticity-robust standard errors as the cross-sectional default, following MacKinnon and White (1985) and the AEA convention. For panel data, cluster-robust standard errors at the level at which treatment varies, following Abadie, Athey, Imbens, and Wooldridge (2023). The clustering level is stated in the table notes for every regression. Where the number of clusters is small (under 30 to 40), we use wild cluster bootstrap (Cameron, Gelbach, Miller 2008) instead of the asymptotic correction.
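
To show what the HC1 correction does, here is a sketch for the simplest case: one regressor plus an intercept, using only the standard library. The function name `ols_hc1` is hypothetical, and this is not EconAI's implementation. A multi-regressor version would use the full sandwich formula with a linear-algebra library.

```python
import math


def ols_hc1(x, y):
    """Bivariate OLS with intercept: slope, classical SE, and HC1 SE."""
    n = len(x)
    k = 2  # estimated parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical variance assumes one common error variance: s^2 / Sxx.
    s2 = sum(e * e for e in resid) / (n - k)
    se_classical = math.sqrt(s2 / sxx)

    # HC0 sandwich for the slope: sum(dx_i^2 * e_i^2) / Sxx^2.
    # HC1 rescales HC0 by the degrees-of-freedom factor n / (n - k).
    meat = sum((d * e) ** 2 for d, e in zip(dx, resid))
    se_hc1 = math.sqrt(n / (n - k) * meat / sxx ** 2)

    return {"slope": slope, "se_classical": se_classical, "se_hc1": se_hc1}
```

When the residual variance rises or falls with the regressor, the two standard errors diverge, which is the case the robust default is meant to catch.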

Instrumental variables. First-stage F statistics are reported with the Montiel Olea-Pflueger (2013) effective F as the weak-identification criterion, rather than the F-above-ten rule of thumb (Staiger and Stock 1997) or the homoskedastic Stock-Yogo (2005) critical values. The exclusion restriction is stated in plain language in the memo, with the most credible threats named and addressed. When the first-stage F is low, Anderson-Rubin confidence intervals are reported alongside conventional intervals.

Difference-in-differences. Pre-trend plots and event-study coefficients with confidence bands. For staggered treatment, Callaway-Sant'Anna (2021) or de Chaisemartin-D'Haultfoeuille (2020) estimators rather than two-way fixed effects with negative weighting issues (Goodman-Bacon 2021, Sun-Abraham 2021). Rambachan and Roth (2023) sensitivity analysis for parallel-trends violations is run as a battery, not a single check, and the results are reported in the robustness section.
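
The logic of an event-study pre-trend check can be sketched for the simplest design: a single adoption date and a never-treated comparison group. This is deliberately not one of the staggered-adoption estimators named above, and it omits the confidence bands a real exhibit would carry. The function name and the row layout are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def event_study(rows, reference_period=-1):
    """Event-study coefficients from differences in group means.

    rows: iterable of (treated: bool, event_time: int, outcome: float)
    Returns {event_time: coefficient}, normalized to zero at the
    reference period (by convention the period just before treatment).
    """
    by_cell = defaultdict(list)
    for treated, event_time, outcome in rows:
        by_cell[(treated, event_time)].append(outcome)

    periods = sorted({event_time for _, event_time in by_cell})
    # Treated-minus-control gap in each period.
    gap = {
        t: mean(by_cell[(True, t)]) - mean(by_cell[(False, t)])
        for t in periods
    }
    # Each coefficient is the change in that gap relative to the reference.
    return {t: gap[t] - gap[reference_period] for t in periods}
```

Coefficients at negative event times that sit near zero are what a pre-trend plot is looking for. Coefficients that drift before treatment are the case the Rambachan-Roth sensitivity analysis is designed to discipline.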

Regression discontinuity. Bandwidth selection follows Calonico-Cattaneo-Titiunik (2014) and Calonico-Cattaneo-Farrell (2020), with robust bias-corrected inference. The polynomial order is pre-registered in the pre-analysis plan. Density tests at the cutoff (McCrary 2008) and balance tests on pre-treatment covariates ship as standard exhibits.

Balance tables. Every causal design ships with a balance table comparing treatment and control on observable pre-treatment characteristics. Standardized differences are reported alongside p-values, because the test most reviewers actually want is whether the difference is large, not whether the sample is large.
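
A standardized difference divides the gap in group means by a pooled standard deviation, so it does not shrink or grow with sample size the way a p-value does. The sketch below uses the common definition with the average of the two group variances. The 0.1 flagging threshold is a widely used convention chosen here as an assumption, not a figure from this brief, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def standardized_difference(treated, control):
    """(mean_T - mean_C) / sqrt((var_T + var_C) / 2), using sample variances."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def balance_table(covariates, treated, threshold=0.1):
    """Standardized difference per covariate, flagged when it is large.

    covariates: dict mapping covariate name to a list of values
    treated:    list of booleans aligned with those values
    """
    table = {}
    for name, values in covariates.items():
        t = [v for v, d in zip(values, treated) if d]
        c = [v for v, d in zip(values, treated) if not d]
        d_std = standardized_difference(t, c)
        table[name] = {"std_diff": d_std, "flag": abs(d_std) > threshold}
    return table
```

The p-value column of a full balance table is omitted here to keep the sketch short.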

Placebo tests and randomization inference. Where the design supports it, the headline test is rerun with placebo treatments (units that should not be affected, dates that precede the policy, outcomes that should not move). Randomization inference is run as a permutation test where the assignment mechanism allows it. A causal estimate that does not survive a placebo battery is downgraded or removed.
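
Randomization inference as a permutation test can be sketched in a few lines. The version below assumes complete randomization (a fixed number of treated units, each assignment equally likely) and uses the absolute difference in means as the test statistic. Both are assumptions for illustration, as are the function name and the default seed. A design with blocked or clustered assignment would need to permute within blocks or at the cluster level.

```python
import random
from statistics import mean


def permutation_p_value(outcomes, treated, draws=2000, seed=20260425):
    """Two-sided permutation p-value for a difference in means.

    Returns (observed_difference, p_value). The seed is set explicitly
    so that the result reproduces exactly.
    """
    rng = random.Random(seed)

    def diff(assignment):
        t = [y for y, d in zip(outcomes, assignment) if d]
        c = [y for y, d in zip(outcomes, assignment) if not d]
        return mean(t) - mean(c)

    observed = diff(treated)
    shuffled = list(treated)
    extreme = 0
    for _ in range(draws):
        rng.shuffle(shuffled)  # reassign treatment, keeping group sizes fixed
        if abs(diff(shuffled)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (draws + 1)
```

The same function serves as a placebo check: run on an outcome that should not move, it should return a large p-value.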

Bounds analysis. Every causal claim ships with a bounds analysis appropriate to the design. Manski (1990) worst-case bounds where the assumptions are minimal. Rambachan-Roth (2023) for parallel-trends sensitivity in DiD. Oster (2019) for selection on unobservables in OLS. Bounds are not robustness theater. They are the honest answer to the question 'what is the range of the true effect, given what we are willing to assume.'
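
The worst-case bounds are simple enough to compute by hand, which is part of their appeal. The sketch below covers the textbook case: a binary treatment and an outcome known to lie in [y_min, y_max], with no assumption about how units selected into treatment. Each unobserved potential outcome is replaced by the lowest and highest value it could take. The function name is hypothetical.

```python
from statistics import mean


def manski_ate_bounds(outcomes, treated, y_min, y_max):
    """Worst-case bounds on the average treatment effect.

    outcomes: observed outcomes, each within [y_min, y_max]
    treated:  booleans aligned with outcomes
    Returns (lower, upper).
    """
    y1 = [y for y, d in zip(outcomes, treated) if d]
    y0 = [y for y, d in zip(outcomes, treated) if not d]
    p = len(y1) / len(outcomes)  # share of units treated

    # E[Y(1)] is observed for the treated and unknown for the controls.
    ey1_lo = mean(y1) * p + y_min * (1 - p)
    ey1_hi = mean(y1) * p + y_max * (1 - p)
    # E[Y(0)] is observed for the controls and unknown for the treated.
    ey0_lo = mean(y0) * (1 - p) + y_min * p
    ey0_hi = mean(y0) * (1 - p) + y_max * p

    return ey1_lo - ey0_hi, ey1_hi - ey0_lo
```

The width of the interval always equals y_max minus y_min, so without further assumptions the bounds always contain zero. Tightening them is exactly what each additional identifying assumption buys, and what the memo has to defend.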

The em-dash discipline

A small rule, made visible because it signals a larger one. No Deluair output contains an em-dash: not memos, not slide decks, not client emails, not public briefs. Commas, colons, semicolons, parentheses, and short separate sentences carry the same load with no loss of clarity. The rule is enforced by an automated test in the consultancy site repository (test_no_em_dashes_in_content) and by manual review at the writing and audit gates.

Why does a typographic rule warrant a section in a methodology essay? The em-dash is the single most reliable visible marker of unedited prose, and (since 2023) of prose generated by a large language model and pasted in without revision. Models trained on web text emit em-dashes at roughly five to ten times the rate of careful human editorial copy, because their training corpora over-represent the punctuation. Removing the em-dash forces the writer (or the model output, after revision) to choose a real connective: a comma, a colon, a semicolon, a parenthesis, or a sentence break. Each of those choices carries a meaning the em-dash hides.

The discipline is also a useful tell for the client. A memo that has eliminated em-dashes throughout is a memo that someone read. A memo that has not is a memo that someone might not have. The visible signal is small. The thing it signals is the editorial care behind every other choice in the document.

We extend the same discipline to en-dashes used as range separators in prose (we use 'to' instead, as in '2018 to 2024'), to acronyms that are not defined at first use, to passive voice, to the phrase 'recent literature suggests,' and to charts captioned only with their variable names. Each of these is a small failure that, in aggregate, is what makes most analytical writing forgettable. The methodology bar covers the small failures because the small failures are what readers remember.
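
A check in the spirit of test_no_em_dashes_in_content might look like the following. Only the test name comes from this brief. The helper `find_dash_violations`, its return shape, and the en-dash range pattern are assumptions, and the real test in the site repository may work differently.

```python
import re

EM_DASH = "\u2014"
# An en-dash between two digits, as in a year range written without "to".
EN_DASH_RANGE = re.compile(r"\d\s*\u2013\s*\d")


def find_dash_violations(text):
    """Return (line_number, kind) for each em-dash or en-dash range found."""
    violations = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if EM_DASH in line:
            violations.append((line_no, "em-dash"))
        if EN_DASH_RANGE.search(line):
            violations.append((line_no, "en-dash range"))
    return violations


def test_no_em_dashes_in_content():
    # Hypothetical sample; a real test would read the site's content files.
    sample = "Exports rose from 2018 to 2024, then stalled."
    assert find_dash_violations(sample) == []
```

The other items in the list above (undefined acronyms, passive voice, stock phrases) are harder to detect with a pattern and would still fall to manual review.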

What this looks like, in the artifacts a client receives

The discipline above produces a specific bundle of artifacts at every engagement close. The memo, typically twelve to twenty-five pages, written in the AER house style we use for our own papers, with executive summary, methods, results, robustness, limitations, and policy implications. The dashboard, built on the same Plotly and FastAPI stack that powers our public observatories, for the metrics the client will keep tracking once our involvement ends.

The replication package, with code, data with provenance metadata, pinned package versions, random seeds documented, the full pre-analysis plan, the data quality report, and the audit log. The readout deck, walked through in a default ninety-minute call, with two follow-up sessions scheduled at thirty and ninety days so the analysis stays useful as the client begins to act on it.

Together, the bundle is the answer to the question that started this brief: what makes a Deluair brief defensible. Not the prose. The workflow that produced the prose, the data discipline that constrained the claims, the audit that simulated the most aggressive competent reader the work will ever face, and the replication package that lets the result outlive the engagement.

Anchored in Delphi and EconAI

The workflow above is not aspirational. It is implemented. Delphi is our open-source research stack: it manages the phase gates, the pre-analysis plan, the data collection logs, the replication package builder, and the audit checklist. We use it for our own academic work through to journal submission and for every client engagement we run. The same gate that decides whether a paper is ready for journal submission decides whether a memo is ready for a client readout.

EconAI is our causal-inference toolkit. It implements the identification standards in this essay: OLS with HC1 by default, panel FE with cluster-robust errors, IV with Montiel Olea-Pflueger effective F statistics, DiD with Callaway-Sant'Anna and de Chaisemartin-D'Haultfoeuille estimators, RDD with Calonico-Cattaneo-Titiunik bandwidths, double machine learning, causal forest, synthetic control with placebo permutation, shift-share, randomization inference, and bounds (Manski, Rambachan-Roth, Oster). Forty-seven Python modules, thirteen thousand five hundred lines, used in five published papers across wine, wood, beef, and food demand systems. Every estimator ships with the diagnostic tests that the methodology bar requires.

Together, Delphi and EconAI are why the bar holds. Without the tooling, the bar collapses into a wish list. With the tooling, the bar is enforced by the same scripts that produce the exhibits. A figure that does not have a caption is flagged automatically. A regression that does not report its clustering level is flagged automatically. A table cell that does not map to a regression output in the package is flagged automatically. The discipline is the workflow.

Apply the bar to your own published work

If you have a published study, an investor letter, a regulatory filing, or a board memo that you would like put through this audit, we will run it. The methodology audit is a fixed-scope engagement: two weeks, one analyst plus one senior reviewer, a written report covering identification, data, econometrics, results, and contribution against the bar described in this brief. Where the work clears the bar, the report says so in writing, which is a useful artifact in its own right. Where it does not, the report names the specific tests we would re-run and the specific claims we would weaken or remove.

We have run this audit for asset managers checking their internal research before it ships to LPs, for trade associations checking impact studies before they go to a regulator, and for in-house economics teams checking their own work before it goes to a board. The audit is non-adversarial. It exists because every team would rather find the gap before a hostile reader does.

Reach out at /engage to scope a methodology audit on a piece of your own published work. Two weeks. Fixed scope. Written report. The same bar that gates our own journal submissions.

Cite this brief

@misc{hossen2026methodologybar,
  author = {Hossen, Md Deluair},
  title  = {What makes a Deluair brief defensible: the methodology bar},
  year   = {2026},
  url    = {https://deluair.com/consultancy/insights/methodology-bar},
  note   = {Deluair Consultancy briefs}
}

Hossen, M. D. (2026). What makes a Deluair brief defensible: the methodology bar. Deluair Consultancy briefs. https://deluair.com/consultancy/insights/methodology-bar

Hossen, Md Deluair. "What makes a Deluair brief defensible: the methodology bar." Deluair Consultancy briefs, 2026-04-25. https://deluair.com/consultancy/insights/methodology-bar.
