
Can LLMs Discover Causes? When Language Models Meet Observational Causal Inference

Traditional causal discovery requires large datasets and strong statistical assumptions. LLMs bring a new ingredient: domain knowledge encoded in pre-training. Susanti & Färber test whether LLMs can use observational data for causal discovery, while REX integrates explainable AI with causal structure learning.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Causal discovery, the task of inferring cause-and-effect relationships from data, has traditionally been the domain of specialized statistical algorithms: PC, GES, LiNGAM, and their many variants. These methods work directly on observational data, using conditional independence tests or structural equation models to infer causal graphs. They are mathematically principled but make strong assumptions (faithfulness, causal sufficiency, specific distributional forms) and require substantial data to achieve reliable results.
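At the core of constraint-based methods like PC is a conditional independence test. A minimal sketch of one such test, assuming linear-Gaussian relationships (pure Python with illustrative data, not the production implementation of any of these algorithms): partial correlation, computed by regressing out the conditioning variable and correlating the residuals.

```python
import math

def mean(v):
    return sum(v) / len(v)

def corr(a, b):
    # Pearson correlation
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den

def residuals(y, z):
    # residuals of a simple linear regression of y on z
    mz, my = mean(z), mean(y)
    beta = (sum((zi - mz) * (yi - my) for zi, yi in zip(z, y)) /
            sum((zi - mz) ** 2 for zi in z))
    return [yi - (my + beta * (zi - mz)) for zi, yi in zip(z, y)]

def partial_corr(x, y, z):
    # correlation of x and y after linearly removing z from both
    return corr(residuals(x, z), residuals(y, z))

# Toy common cause Z -> X and Z -> Y: X and Y are strongly correlated
# marginally, but nearly independent once Z is conditioned on --
# exactly the pattern a PC-style independence test screens for.
z = [float(i) for i in range(50)]
x = [2.0 * zi + math.sin(i) for i, zi in enumerate(z)]
y = [3.0 * zi + math.cos(i) for i, zi in enumerate(z)]

print(corr(x, y) > 0.9)                  # strong marginal dependence
print(abs(partial_corr(x, y, z)) < 0.3)  # largely vanishes given Z
```

Real implementations add significance thresholds and handle conditioning sets of arbitrary size; this only shows the single-conditioner case.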

Large language models introduce something these statistical methods lack: domain knowledge. An LLM trained on scientific literature has absorbed extensive knowledge about causal relationships in specific domains. It "knows" that smoking causes cancer, that interest rates affect inflation, that gene mutations drive drug resistance. Can this knowledge be combined with observational data to improve causal discovery beyond what either statistical methods or LLM knowledge alone can achieve?

Susanti & Färber investigate this question directly, testing whether LLMs can leverage observational data for causal discovery. REX (Renero et al.) approaches from the complementary direction, using explainable AI techniques to enhance traditional causal discovery, creating a bridge between the interpretability of LLM reasoning and the rigor of statistical causal inference.

LLMs as Causal Reasoners

Susanti & Färber design a systematic evaluation of LLMs' causal discovery capability under three conditions:

Knowledge-only: The LLM is given variable names and asked to infer causal relationships based purely on its pre-trained knowledge. No data is provided. This tests the LLM's domain knowledge.

Data-only: The LLM is given observational data (correlation matrices, summary statistics, or raw data samples) without meaningful variable names. This tests the LLM's ability to perform statistical causal inference from data.

Knowledge + Data: The LLM receives both meaningful variable names and observational data. This tests the synergy between domain knowledge and statistical evidence.
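The three conditions differ only in what the prompt exposes. A hypothetical sketch of how such prompts could be assembled (the wording, variable names, and statistics below are illustrative, not the authors' actual prompts):

```python
def build_prompt(condition, names, stat_lines):
    """Assemble a causal-discovery prompt for one evaluation condition.

    names:      meaningful variable names (the domain-knowledge signal)
    stat_lines: preformatted summary statistics (the data signal); in a
                real setup these would reference the actual variable names
    """
    task = "List the causal edges among the variables as 'A -> B'."
    if condition == "knowledge_only":
        # meaningful names, no data
        return f"Variables: {', '.join(names)}.\n{task}"
    if condition == "data_only":
        # data, but anonymized names carry no domain knowledge
        anon = [f"V{i}" for i in range(len(names))]
        return (f"Variables: {', '.join(anon)}.\n"
                + "\n".join(stat_lines) + f"\n{task}")
    if condition == "knowledge_and_data":
        # both signals at once
        return (f"Variables: {', '.join(names)}.\n"
                + "\n".join(stat_lines) + f"\n{task}")
    raise ValueError(f"unknown condition: {condition}")

names = ["smoking", "tar_deposits", "lung_cancer"]
stats = ["corr(V0, V1) = 0.85", "corr(V1, V2) = 0.74", "corr(V0, V2) = 0.63"]
print(build_prompt("data_only", names, stats))
```

The point of the anonymization in the data-only condition is to make the LLM's pre-trained associations useless, isolating whatever statistical inference ability it has.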

The findings reveal a consistent pattern:

  • LLMs perform surprisingly well in the knowledge-only condition for well-studied domains, accurately identifying causal relationships that appear frequently in scientific literature
  • LLMs perform poorly in the data-only condition; they are not effective statistical causal inference engines
  • The knowledge + data condition shows only modest improvement over knowledge-only, suggesting that LLMs struggle to integrate statistical evidence with domain knowledge

This finding is both encouraging (LLMs encode useful causal knowledge) and sobering (they cannot replace statistical causal methods for data-driven inference). The practical implication: LLMs are useful for generating prior knowledge about causal structures, which can then be integrated with statistical methods as informative priors or structural constraints.
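One concrete way to use LLM output as an informative prior, sketched under strong simplifications (the scores and edge names are hypothetical; a real system would use a proper likelihood such as BIC): add a bonus for candidate edges the LLM endorsed and a penalty for edges it did not, weighted by a prior-strength parameter.

```python
def graph_score(edges, data_fit, llm_edges, lam=1.0):
    """Toy composite score: data fit plus an LLM-derived edge prior.

    edges:     candidate graph as a list of (cause, effect) pairs
    data_fit:  how well the graph explains the data (e.g. a log-likelihood)
    llm_edges: edges the LLM asserted from domain knowledge
    lam:       prior strength -- needs careful calibration in practice
    """
    agree = sum(1 for e in edges if e in llm_edges)
    disagree = len(edges) - agree
    return data_fit + lam * (agree - disagree)

# Two Markov-equivalent orientations often receive identical data fit;
# the LLM prior can break the tie toward the plausible direction.
llm_edges = {("smoking", "cancer")}
forward = graph_score([("smoking", "cancer")], data_fit=-10.0, llm_edges=llm_edges)
backward = graph_score([("cancer", "smoking")], data_fit=-10.0, llm_edges=llm_edges)
print(forward > backward)  # True
```

The calibration question raised later in this post is exactly the choice of `lam`: too large and a hallucinated edge overrides the data, too small and the prior is inert.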

REX: Explainability Meets Causality

Renero et al.'s REX takes a different approach. Rather than asking LLMs to perform causal discovery, REX uses explainable AI (XAI) techniques (SHAP values, feature importance, partial dependence) to extract causal information from trained machine learning models.

The insight: a well-trained predictive model implicitly captures causal information in its learned structure. If a model accurately predicts Y from X₁, X₂, ..., Xₚ, the model's feature importances and interaction patterns reflect (approximately) the causal influences of each Xᵢ on Y. XAI techniques make these implicit causal patterns explicit and extractable as a causal graph.
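A minimal illustration of the underlying idea, using permutation importance on a toy "fitted" model (all names and numbers are illustrative; REX itself builds on SHAP and related techniques): shuffling a feature the model relies on degrades its predictions, while shuffling an irrelevant feature does not.

```python
import random

def model(x1, x2, x3):
    # stand-in for a trained predictor; x3 plays no role
    return 2.0 * x1 - 1.0 * x2

def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

random.seed(0)
rows = [(random.random(), random.random(), random.random()) for _ in range(200)]
y = [model(*r) for r in rows]

def permutation_importance(idx):
    # shuffle one feature column and measure the increase in error
    col = [r[idx] for r in rows]
    random.shuffle(col)
    perturbed = [tuple(c if j == idx else v for j, v in enumerate(r))
                 for c, r in zip(col, rows)]
    return mse([model(*r) for r in perturbed], y)

imps = [permutation_importance(i) for i in range(3)]
# importance tracks the coefficients; the unused feature scores zero
print(imps[0] > imps[1] > imps[2], imps[2] == 0.0)
```

Note the caveat the post returns to below: a feature can be predictively important without being causally influential (e.g. a proxy for the true cause), so importance scores are candidate causes, not conclusions.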

REX integrates multiple XAI methods to produce robust causal estimates:

  • SHAP values identify which features influence predictions (candidate causes)
  • Partial dependence plots reveal the direction of influence (positive/negative)
  • Feature interaction effects identify causal mediation and moderation

The combination of multiple XAI signals, aggregated through a consensus mechanism, produces causal graphs that are more robust than any single XAI method alone.
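The consensus step can be sketched as a simple vote over edge candidates (a simplified stand-in for REX's actual aggregation; the edge names are illustrative):

```python
from collections import Counter

def consensus_edges(signal_edge_sets, threshold=2):
    # keep an edge only if enough XAI signals independently propose it
    votes = Counter(edge for edges in signal_edge_sets for edge in edges)
    return {edge for edge, n in votes.items() if n >= threshold}

shap_edges        = {("X1", "Y"), ("X2", "Y")}  # from SHAP values
pdp_edges         = {("X1", "Y"), ("X3", "Y")}  # from partial dependence
interaction_edges = {("X1", "Y"), ("X2", "Y")}  # from interaction effects

print(sorted(consensus_edges([shap_edges, pdp_edges, interaction_edges])))
# [('X1', 'Y'), ('X2', 'Y')] -- the one-vote ('X3', 'Y') edge is dropped
```

Requiring agreement across methods is what buys robustness: an artifact of any single XAI technique rarely survives the vote.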

Claims and Evidence

| Claim | Evidence | Verdict |
|---|---|---|
| LLMs encode useful causal domain knowledge | Susanti & Färber: good performance in the knowledge-only condition | ✅ Supported |
| LLMs can perform statistical causal inference from data | Poor performance in the data-only condition | ❌ Not supported |
| XAI techniques extract causal information from ML models | REX demonstrates this on standard causal benchmarks | ✅ Supported |
| LLM + data integration improves over LLM knowledge alone | Modest improvement documented | ⚠️ Limited improvement |
| These methods match dedicated causal discovery algorithms | Performance gaps remain on challenging benchmarks | ⚠️ Complementary, not a replacement |

Open Questions

  • Domain specificity: LLMs' causal knowledge reflects their training data. For novel or under-studied causal relationships (new drug interactions, emerging economic mechanisms), LLM knowledge may be absent or incorrect. How do we identify the boundaries of LLM causal knowledge?
  • Hallucinated causation: LLMs may assert causal relationships that are plausible but incorrect, confusing correlation patterns in their training data with genuine causation. How do we distinguish genuine causal knowledge from hallucinated causation?
  • Integration framework: What is the optimal way to combine LLM causal priors with statistical causal methods? Bayesian frameworks that use LLM outputs as informative priors are promising but require careful calibration of prior strength.
  • Causal XAI validity: Under what conditions do XAI-derived causal estimates match true causal effects? The relationship between predictive feature importance and causal influence is complex and not always positive.
What This Means for Your Research

For causal inference researchers, LLMs and XAI provide complementary information sources for causal discovery. LLMs contribute domain knowledge; XAI contributes data-driven pattern extraction. Neither replaces dedicated causal methods, but both can augment them, particularly in the common scenario where domain knowledge is available but incomplete.

For ML practitioners who use predictive models, REX demonstrates that your trained models contain causal information that can be extracted. This is valuable even when the primary goal was prediction: understanding why the model makes its predictions is both scientifically informative and practically useful for model debugging.

References (2)

[1] Susanti, Y., & Färber, M. (2025). Can LLMs Leverage Observational Data? Towards Data-Driven Causal Discovery with LLMs. arXiv:2504.10936.
[2] Renero, J., Ochoa, I., & Maestre, R. (2025). REX: Causal Discovery based on Machine Learning and Explainability techniques. Pattern Recognition.
