From Association to Causation in Multi-Omics: Instrumental Factor Models for Biological Discovery

Multi-omics data (genomics + proteomics + metabolomics) reveals thousands of biological associations, but associations are not causes. Mishra et al. develop instrumental factor models that use genetic variants as natural experiments to distinguish causal mechanisms from confounded correlations in high-dimensional biological data.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Modern biological research generates data at multiple molecular levels simultaneously: genomics (DNA sequence variants), transcriptomics (gene expression), proteomics (protein abundance), metabolomics (metabolite concentrations), and epigenomics (DNA methylation, histone modifications). Each level contains thousands of variables, and the interactions between levels produce a web of associations so dense that distinguishing genuine causal mechanisms from confounded correlations becomes nearly impossible through association analysis alone.

The stakes of getting this distinction right are high. A protein that is associated with a disease may be a biomarker (useful for diagnosis but not a drug target) or a cause of the disease (a viable drug target). Billions of dollars in drug development depend on correctly identifying causes, and the failure rate of clinical trials (>90% in some therapeutic areas) is partly attributable to pursuing targets that were associated with disease but did not cause it.

Mishra et al. develop instrumental factor models that use the natural experimental structure embedded in multi-omics data, specifically genetic variants, to distinguish causal from confounded relationships.

The Instrumental Variable Strategy

The fundamental insight is that genetic variants provide natural instruments for causal inference. A genetic variant (SNP) that affects protein expression provides an experiment that nature has already run: individuals who carry the variant have different protein levels not because of their health behaviors, environment, or disease status but because of the random genetic shuffle at conception.

If a genetic variant:

  • Affects protein X expression (relevance condition)
  • Has no effect on the outcome except through protein X (exclusion restriction)
  • Is not confounded with the outcome (independence condition, approximately satisfied because alleles are allocated at random at conception, which is the premise of Mendelian randomization)

...then the variant serves as an instrument that identifies the causal effect of protein X on the outcome, even in the presence of unmeasured confounders that affect both protein X and the outcome.
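As a minimal sketch of the instrument logic above, the simulation below uses one variant and one exposure. Every number in it (allele frequency 0.3, variant effect 0.8, true causal effect 0.5) is an assumption chosen for illustration, not a value from Mishra et al. It compares an ordinary regression slope, which absorbs the unmeasured confounder, with the Wald ratio cov(G, Y) / cov(G, X), the simplest instrumental variable estimator.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, true_effect=0.5, seed=0):
    """Simulate variant G, exposure X (protein level), and outcome Y.

    U is an unmeasured confounder that raises both X and Y.
    All coefficients are illustrative assumptions.
    """
    rng = random.Random(seed)
    g, x, y = [], [], []
    for _ in range(n):
        # Allele count 0, 1, or 2, drawn independently of everything else.
        gi = int(rng.random() < 0.3) + int(rng.random() < 0.3)
        u = rng.gauss(0, 1)
        xi = 0.8 * gi + u + rng.gauss(0, 1)          # relevance: G shifts X
        yi = true_effect * xi + u + rng.gauss(0, 1)  # exclusion: G reaches Y only via X
        g.append(gi)
        x.append(xi)
        y.append(yi)
    return g, x, y


g, x, y = simulate()
ols_slope = covariance(x, y) / covariance(x, x)   # confounded by U
wald_ratio = covariance(g, y) / covariance(g, x)  # instrumental variable estimate
print(f"OLS slope:  {ols_slope:.2f}")
print(f"Wald ratio: {wald_ratio:.2f} (simulated true effect: 0.5)")
```

Under these assumptions the regression slope should land well above 0.5, because the confounder inflates it, while the Wald ratio should land close to the simulated effect.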

The Factor Model Extension

Standard Mendelian randomization uses one genetic variant to instrument one exposure. Multi-omics data presents a different challenge: thousands of exposures (proteins, metabolites) simultaneously influenced by thousands of genetic variants. The relationships are not one-to-one but many-to-many: a single genetic variant may affect multiple proteins (pleiotropy), and a single protein may be affected by multiple variants (polygenic architecture).

Mishra et al.'s factor model handles this complexity by decomposing the high-dimensional multi-omics data into a smaller number of latent factors: biologically interpretable dimensions that capture the coordinated variation across molecular levels. Genetic variants instrument these latent factors rather than individual molecular species, addressing the pleiotropy problem (the instrument affects the factor, not individual molecules) and the dimensionality problem (fewer factors than molecules).

The causal analysis then estimates the effect of each latent factor on the outcome, providing a pathway-level causal inference that is more robust and biologically interpretable than individual-molecule analysis.
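The idea of instrumenting a factor rather than a molecule can be sketched as follows. This is not the authors' estimator, which the abstract-level summary here does not specify. It is a toy simulation that assumes a single latent factor, uses the first principal component as a stand-in for a fitted factor, and applies a Wald ratio at the factor level. The loadings, effect sizes, and helper names are all hypothetical; the loadings are chosen to have unit norm so the factor score is on roughly the same scale as the latent factor.

```python
import random

LOADINGS = [0.1, 0.3, 0.4, 0.5, 0.7]  # hypothetical loadings; squares sum to 1


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, true_effect=0.5, seed=1):
    """Variant G shifts a latent factor F; five molecules are noisy readouts of F."""
    rng = random.Random(seed)
    g, molecules, y = [], [], []
    for _ in range(n):
        gi = int(rng.random() < 0.3) + int(rng.random() < 0.3)
        u = rng.gauss(0, 1)                    # unmeasured confounder
        f = 0.8 * gi + u + rng.gauss(0, 1)     # latent factor (pathway activity)
        molecules.append([lam * f + rng.gauss(0, 0.5) for lam in LOADINGS])
        y.append(true_effect * f + u + rng.gauss(0, 1))
        g.append(gi)
    return g, molecules, y


def first_factor_scores(rows, iterations=100):
    """Score each sample on the first principal component via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    w = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return [sum(wj * cj for wj, cj in zip(w, r)) for r in centered]


g, molecules, y = simulate()
scores = first_factor_scores(molecules)
naive_slope = covariance(scores, y) / covariance(scores, scores)  # confounded
factor_iv = covariance(g, y) / covariance(g, scores)              # instrumented
print(f"Naive factor slope: {naive_slope:.2f}")
print(f"Factor-level IV:    {factor_iv:.2f} (simulated true effect: 0.5)")
```

In real data the sign and scale of a latent factor are arbitrary, so a factor-level effect is only interpretable relative to how the factor was normalized.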

Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| Genetic variants provide valid instruments for multi-omics causal inference | Mendelian randomization framework well-established | ✅ Well-established |
| Factor models address pleiotropy in instrumental variable analysis | Latent factors aggregate pleiotropic effects | ✅ Supported (methodological) |
| Pathway-level causal inference is more robust than single-molecule analysis | Aggregation reduces noise and pleiotropy bias | ✅ Supported (theoretical argument) |
| The approach identifies novel causal mechanisms | Application results reported | ⚠️ Promising; replication needed |
| All genetic instruments satisfy the exclusion restriction | Horizontal pleiotropy violates exclusion; sensitivity analysis needed | ⚠️ Assumption, not fact |
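One widely used sensitivity analysis for the exclusion restriction in Mendelian randomization is MR-Egger regression. Nothing in this summary indicates that Mishra et al. use it, so the sketch below is an illustration only. It regresses per-variant outcome effects on per-variant exposure effects with an intercept: a non-zero intercept suggests directional horizontal pleiotropy. The sketch is unweighted for brevity (standard MR-Egger weights each variant by the precision of its outcome estimate), and the per-variant effects are made-up numbers.

```python
def egger_regression(beta_exposure, beta_outcome):
    """Unweighted least-squares fit of beta_outcome = intercept + slope * beta_exposure."""
    n = len(beta_exposure)
    mean_x = sum(beta_exposure) / n
    mean_y = sum(beta_outcome) / n
    sxx = sum((bx - mean_x) ** 2 for bx in beta_exposure)
    sxy = sum((bx - mean_x) * (by - mean_y)
              for bx, by in zip(beta_exposure, beta_outcome))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical per-variant effects: each variant shifts the exposure by beta_exposure
# and also adds 0.05 to the outcome through a pathway that bypasses the exposure.
beta_exposure = [0.1, 0.2, 0.3, 0.4]
beta_outcome = [0.05 + 0.5 * bx for bx in beta_exposure]

intercept, slope = egger_regression(beta_exposure, beta_outcome)
print(f"Egger intercept (pleiotropy signal): {intercept:.3f}")
print(f"Egger slope (causal estimate):       {slope:.3f}")
```

The slope is only a valid causal estimate under additional assumptions, notably that the pleiotropic effects are independent of the variants' effects on the exposure.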

Open Questions

  • Horizontal pleiotropy: When a genetic variant affects the outcome through pathways other than the instrumented factor, the causal estimate is biased. Can the factor model detect and correct for horizontal pleiotropy?
  • Factor interpretation: Latent factors are mathematically defined but may not correspond to known biological pathways. How do we validate that the factors are biologically meaningful?
  • Sample size requirements: Instrumental variable methods require large samples for adequate statistical power. What sample sizes are needed for multi-omics factor-instrumented causal inference?
  • Dynamic causation: Multi-omics snapshots capture a single time point. Biological causation unfolds over time. Can the factor model be extended to longitudinal multi-omics data?
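On the sample size question, a common first diagnostic is the first-stage F-statistic, which measures how strongly the instrument predicts the exposure. The sketch below computes it for a single instrument from the first-stage R². The function name and the toy data are hypothetical, and how this diagnostic carries over to factor-instrumented models is part of the open question.

```python
def first_stage_f(instrument, exposure):
    """F-statistic for regressing the exposure on one instrument (with intercept)."""
    n = len(instrument)
    mean_g = sum(instrument) / n
    mean_x = sum(exposure) / n
    sgg = sum((g - mean_g) ** 2 for g in instrument)
    sxx = sum((x - mean_x) ** 2 for x in exposure)
    sgx = sum((g - mean_g) * (x - mean_x) for g, x in zip(instrument, exposure))
    r_squared = sgx ** 2 / (sgg * sxx)
    # One instrument: numerator has 1 degree of freedom, denominator has n - 2.
    return r_squared * (n - 2) / (1 - r_squared)


# Toy data: four individuals, variant carried by the last two.
f_stat = first_stage_f([0, 0, 1, 1], [1.0, 2.0, 2.0, 3.0])
print(f"First-stage F: {f_stat:.1f}")
```

For the toy data the first-stage R² is 0.5 with n = 4, so F works out to 2.0. A common rule of thumb treats values below about 10 as a sign of a weak instrument. Because F grows roughly in proportion to n for a fixed R², variants that each explain a small share of variance need large cohorts to clear that bar.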

What This Means for Your Research

For bioinformatics researchers, instrumental factor models provide a principled causal inference framework for multi-omics data that goes beyond the association-based approaches (GWAS, correlation networks) that currently dominate the field.

For drug discovery, the distinction between causal and confounded associations directly determines target validity. Mishra et al.'s approach provides a statistical method for prioritizing targets based on causal evidence rather than correlation strength.

For statisticians, the combination of factor models with instrumental variables in a high-dimensional setting presents interesting methodological challenges (identification conditions, estimation consistency, and inference procedures) that extend the classical IV framework.

References

[1] Mishra, A., Badri, M., Coler, E. et al. (2025). Moving from Association to Causation: Instrumental factor models for causal inference in high-dimensional multi-omics data. medRxiv.
