
Adjusting for What You Cannot See: High-Dimensional Confounding in Causal Inference

When potential confounders outnumber observations—common in genomics, EHR data, and social media studies—standard causal adjustment fails. Cha et al. and Kong develop debiased estimators that provide valid causal inference in the high-dimensional regime where classical methods break down.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The fundamental challenge of observational causal inference is confounding: variables that influence both the treatment and the outcome, creating spurious associations that mimic causal effects. The standard solution—adjusting for confounders through regression, matching, or weighting—works when confounders are few and measured. It fails when confounders are high-dimensional: hundreds or thousands of potential confounders, many of which may be irrelevant but cannot be safely ignored.

This high-dimensional setting is increasingly common:

  • Genomics: Thousands of gene expression measurements, any subset of which might confound the treatment-outcome relationship
  • Electronic health records: Thousands of diagnosis codes, procedures, medications, and lab values
  • Social media studies: Thousands of behavioral features (posting frequency, network metrics, content topics)
  • Economics: Hundreds of regional, demographic, and economic indicators in cross-sectional studies

When the number of confounders exceeds the sample size, ordinary least squares has no unique solution, so the treatment coefficient cannot be identified from the fit. Regularized regression (Lasso, Ridge) can fit the data, but it produces biased causal estimates: the penalty shrinks coefficients toward zero and can drop true confounders, and that bias carries over to the estimated treatment effect.
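
A small simulation can make both failure modes concrete. The sketch below is illustrative only: the data-generating process, the true effect of 1.0, the dimensions, and the penalty `alpha=0.3` are all assumptions chosen for the example, not values from the cited papers.

```python
import numpy as np
from sklearn.linear_model import Lasso

def simulate(n=200, p=500, theta=1.0, seed=0):
    """Toy data with p > n; only the first 5 covariates are true confounders."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    coef = np.zeros(p)
    coef[:5] = 1.0
    d = X @ coef + rng.standard_normal(n)              # treatment depends on confounders
    y = theta * d + X @ coef + rng.standard_normal(n)  # outcome depends on treatment and confounders
    return X, d, y

X, d, y = simulate()
Z = np.column_stack([d, X])  # treatment in column 0, then all covariates

# With more columns than rows the design matrix is rank deficient,
# so the least-squares solution is not unique.
rank = np.linalg.matrix_rank(Z)
print(f"columns: {Z.shape[1]}, rank: {rank}")

# Naive approach: penalize everything and read off the treatment coefficient.
naive = Lasso(alpha=0.3).fit(Z, y)
theta_naive = naive.coef_[0]
print(f"true effect: 1.0, naive Lasso estimate: {theta_naive:.2f}")
```

In this toy design the naive estimate should land noticeably away from 1.0, because the penalty shrinks or drops the confounders and the treatment coefficient absorbs part of their effect. The direction and size of the error depend on the design and the penalty level.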

Cha et al. and Kong develop debiased estimators that correct for the regularization bias, providing valid causal inference in the high-dimensional regime.

The Debiasing Strategy

The debiasing approach proceeds in two stages:

Stage 1: Regularized estimation. Fit high-dimensional models for the outcome (outcome ~ treatment + confounders) and the treatment assignment (treatment ~ confounders) using Lasso or similar regularized methods. These models are biased—they underestimate some coefficients and set others incorrectly to zero—but they provide reasonable approximations of the nuisance functions.

Stage 2: Bias correction. Construct a correction term that accounts for the regularization bias in the treatment effect estimate. The correction uses the residuals from Stage 1—the parts of the outcome and treatment that the regularized models could not explain—to remove the systematic bias.
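
One common way to implement this two-stage idea is the residual-on-residual ("partialling-out") estimator with cross-fitting from the double machine learning literature. The sketch below uses that technique as a stand-in; it is not claimed to be the exact estimator of Cha et al. or Kong. Note that this variant fits the outcome on the confounders only, not on treatment plus confounders. The simulated data, the true effect of 1.0, and the helper name `debiased_effect` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

def simulate(n=200, p=500, theta=1.0, seed=0):
    """Toy data with p > n; only the first 5 covariates are true confounders."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    coef = np.zeros(p)
    coef[:5] = 1.0
    d = X @ coef + rng.standard_normal(n)
    y = theta * d + X @ coef + rng.standard_normal(n)
    return X, d, y

def debiased_effect(X, d, y, n_folds=5, seed=0):
    """Hypothetical helper: cross-fitted residual-on-residual estimate."""
    res_y = np.zeros(len(y))
    res_d = np.zeros(len(d))
    folds = KFold(n_folds, shuffle=True, random_state=seed)
    for train, test in folds.split(X):
        # Stage 1: regularized nuisance fits on the training folds only.
        m_y = LassoCV(cv=3).fit(X[train], y[train])
        m_d = LassoCV(cv=3).fit(X[train], d[train])
        # Out-of-fold residuals: what the Lasso models could not explain.
        res_y[test] = y[test] - m_y.predict(X[test])
        res_d[test] = d[test] - m_d.predict(X[test])
    # Stage 2: regress outcome residuals on treatment residuals.
    theta = (res_d @ res_y) / (res_d @ res_d)
    # Plug-in standard error from the estimated influence function.
    eps = res_y - theta * res_d
    se = np.sqrt(np.mean(res_d**2 * eps**2) / len(y)) / np.mean(res_d**2)
    return theta, se

X, d, y = simulate()
theta_hat, se = debiased_effect(X, d, y)
print(f"estimate: {theta_hat:.2f}, "
      f"95% CI: [{theta_hat - 1.96 * se:.2f}, {theta_hat + 1.96 * se:.2f}]")
```

Cross-fitting (predicting each observation from models trained on the other folds) is a design choice that keeps overfitting in Stage 1 from leaking into the Stage 2 estimate.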

The result is an estimator that converges to the true causal effect at the standard √n rate, even though the individual nuisance models converge more slowly because of high dimensionality. This "rate double robustness" is the key theoretical property: the estimator's bias depends on the product of the two nuisance estimation errors, so the causal estimate attains the parametric rate as long as that product shrinks faster than 1/√n.
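
The rate claim can be written out in the textbook partially linear model. That model is an assumption made here for illustration and is not necessarily the exact setup of Cha et al. or Kong; the symbols θ, g, m, ℓ, ε, and v are introduced only for this sketch, and the expansion holds under cross-fitting and standard regularity conditions.

```latex
\begin{align*}
Y &= \theta D + g(X) + \varepsilon, & \mathbb{E}[\varepsilon \mid D, X] &= 0, \\
D &= m(X) + v, & \mathbb{E}[v \mid X] &= 0, \\
\ell(X) &= \mathbb{E}[Y \mid X] = \theta\, m(X) + g(X).
\end{align*}

% Residual-on-residual estimator built from the fitted nuisances
\[
\hat\theta
= \frac{\sum_{i=1}^{n} \bigl(D_i - \hat m(X_i)\bigr)\bigl(Y_i - \hat\ell(X_i)\bigr)}
       {\sum_{i=1}^{n} \bigl(D_i - \hat m(X_i)\bigr)^{2}}
\]

% Leading term is asymptotically normal; the remainder is a product of nuisance errors
\[
\sqrt{n}\,(\hat\theta - \theta)
= \frac{1}{\mathbb{E}[v^{2}]}\,\frac{1}{\sqrt{n}} \sum_{i=1}^{n} v_i \varepsilon_i
+ O_P\!\Bigl(\sqrt{n}\,\lVert \hat m - m \rVert
  \bigl(\lVert \hat m - m \rVert + \lVert \hat\ell - \ell \rVert\bigr)\Bigr)
+ o_P(1)
\]
```

The remainder vanishes when both nuisance errors are of order o(n^{-1/4}). For a Lasso fit with s nonzero coefficients among p covariates, the error is roughly of order √(s log p / n), so a sparsity condition of the form s log p = o(√n) is what typically delivers the √n rate.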

Binary Outcomes: The GLM Extension

Kong extends the framework to binary outcomes—the setting most common in medicine (disease yes/no), economics (employment yes/no), and social science (behavior yes/no). Binary outcomes require generalized linear models (logistic regression, probit), where the debiasing strategy must account for the nonlinear link function.

The technical contribution is a debiased estimator for the average treatment effect in high-dimensional generalized linear models that:

  • Handles general link functions (not just logistic)
  • Achieves √n convergence under standard sparsity assumptions
  • Provides asymptotically valid confidence intervals
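
For a binary treatment and a binary outcome, one familiar debiased construction is the cross-fitted augmented inverse probability weighting (AIPW) estimator with L1-penalized logistic nuisance models. The sketch below uses AIPW as a stand-in and is not claimed to be Kong's estimator. The simulated data, the fixed penalty `C=0.03`, and the propensity clipping bounds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (all values hypothetical): p > n, two true confounders.
rng = np.random.default_rng(0)
n, p = 1000, 1200
X = rng.standard_normal((n, p))
s = X[:, 0] + X[:, 1]
d = rng.binomial(1, expit(s))                   # binary treatment
y = rng.binomial(1, expit(-0.5 + 1.0 * d + s))  # binary outcome
ate_true = np.mean(expit(0.5 + s) - expit(-0.5 + s))

def l1_logit():
    # Fixed penalty for brevity; in practice C would be tuned.
    return LogisticRegression(penalty="l1", solver="liblinear", C=0.03,
                              random_state=0)

psi = np.zeros(n)  # per-observation AIPW scores
for train, test in KFold(5, shuffle=True, random_state=0).split(X):
    # Stage 1: penalized logistic fits for propensity and outcome.
    e_model = l1_logit().fit(X[train], d[train])
    mu_model = l1_logit().fit(np.column_stack([d[train], X[train]]), y[train])
    # Clip propensities away from 0 and 1 to stabilize the weights.
    e = np.clip(e_model.predict_proba(X[test])[:, 1], 0.05, 0.95)
    ones, zeros = np.ones(len(test)), np.zeros(len(test))
    mu1 = mu_model.predict_proba(np.column_stack([ones, X[test]]))[:, 1]
    mu0 = mu_model.predict_proba(np.column_stack([zeros, X[test]]))[:, 1]
    dt, yt = d[test], y[test]
    # Stage 2: outcome-model contrast plus weighted-residual correction.
    psi[test] = (mu1 - mu0
                 + dt * (yt - mu1) / e
                 - (1 - dt) * (yt - mu0) / (1 - e))

ate_hat = psi.mean()
se = psi.std(ddof=1) / np.sqrt(n)
print(f"true ATE: {ate_true:.3f}, AIPW estimate: {ate_hat:.3f} (se {se:.3f})")
```

The treatment effect is reported on the probability scale (a risk difference), which sidesteps interpreting a shrunken coefficient on the logit scale.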

Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| Standard regression fails for high-dimensional causal inference | Bias from regularization is well-documented | ✅ Well-established |
| Debiased estimators restore valid causal inference | Cha et al. and Kong prove √n convergence | ✅ Proven |
| The approach handles binary outcomes via GLM extension | Kong: debiased estimator for general link functions | ✅ Proven |
| Sparsity assumptions are necessary | Required for regularized estimation to succeed | ✅ Standard assumption |
| Real-world confounders satisfy sparsity | Approximate sparsity is plausible; exact sparsity is strong | ⚠️ Approximately valid in many settings |

Open Questions

  • Model misspecification: Debiased estimators assume the outcome and treatment models are correctly specified (up to sparsity). What happens when both models are misspecified?
  • Heterogeneous effects: The focus is on average treatment effects. Extending debiased estimation to conditional (heterogeneous) treatment effects in high dimensions adds substantial complexity.
  • Practical tuning: Lasso requires choosing a regularization parameter λ. In the causal context, the optimal λ for prediction differs from the optimal λ for causal estimation. How should λ be selected for causal purposes?
  • Multiple treatments: Extending from binary treatment (treated/control) to multiple treatments or continuous treatments in high dimensions requires additional methodological development.

What This Means for Your Research

For applied researchers in biomedicine, economics, and social science who work with high-dimensional observational data, debiased estimation provides the methodological foundation for credible causal claims. The key message: regularized regression alone is insufficient for causal inference, and the debiasing step is essential for removing the systematic bias that regularization introduces.

For statisticians, the debiasing framework is an active and productive research area where theoretical advances have immediate practical impact. The extension to non-standard settings (GLMs, survival analysis, longitudinal data, interference) provides ample opportunity for contribution.

References (2)

[1] Cha, S., Song, J., Lee, K. (2025). High-dimensional confounding adjustment in causal inference. Statistical Papers.
[2] Kong, J. (2025). Causal Inference in High-Dimensional Generalized Linear Models with Binary Outcomes. Semantic Scholar.
