Critical Review | Linguistics & NLP

Can We Read a Transformer's Mind? Linguistic Interpretability in LLMs

Do LLMs actually learn grammar, or do they approximate it? The growing field of linguistic interpretability uses probing classifiers, minimal pairs, and causal analysis to examine what transformers encode about syntax. The findings are both encouraging and cautionary.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

When a large language model correctly handles subject-verb agreement across intervening noun phrases, is it applying something like a grammatical rule, or performing a sophisticated pattern match? This question has become one of the more actively researched areas at the intersection of NLP and linguistics. If transformers genuinely internalize syntactic representations, this has implications for both linguistic theory (what can be learned from distributional data?) and practical NLP (can we make models more reliable by understanding their internal representations?).

The Research Landscape: A Rapidly Growing Field

The scope of this literature is now substantial. Graichen, de-Dios-Flores, and Boleda (2026) present the most comprehensive survey to date: a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models, reporting on over 1,015 model results across a range of syntactic phenomena and interpretability methods. Their analysis reveals a field that is methodologically diverse but converging on some findings.

López-Otal, Gracia, and Bernad (2025) provide a complementary systematic review, focusing specifically on the linguistic interpretability of transformer architectures. They organize methods into three families that have become standard in the field:

Probing classifiers: Train lightweight models on a network's internal representations to test whether specific linguistic properties (part of speech, dependency relations, semantic roles) are linearly decodable. If a simple classifier can extract syntactic information from a hidden layer, that information is represented there in some accessible form.

Behavioral testing: Present models with carefully constructed minimal pairs and measure whether the model assigns higher probability to the grammatical variant. This treats the model as a subject in a linguistic experiment.

Causal/interventional methods: Actively modify internal representations and measure downstream effects. Rather than asking what information is present, these methods ask what information is used.
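To make the first family concrete, here is a minimal sketch of a probing classifier in Python. Everything in it is an assumption for illustration: the "hidden states" are synthetic 8-dimensional vectors in which a binary label shifts two coordinates, and the probe is a small logistic regression trained with plain SGD. It is not the setup of any paper discussed here. The shuffled-label baseline is one common sanity check: a probe that scores well on real labels but near chance on shuffled ones is at least reading something from the vectors rather than memorizing.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

DIM = 8  # size of the toy "hidden state" (an assumption for illustration)


def fake_hidden_state(label):
    """Synthetic stand-in for a hidden state; the label shifts two coordinates."""
    vec = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    shift = 2.0 if label == 1 else -2.0
    vec[0] += shift
    vec[3] -= shift
    return vec


def make_dataset(n):
    """Return n (vector, label) pairs with alternating binary labels."""
    return [(fake_hidden_state(i % 2), i % 2) for i in range(n)]


def shuffled(data):
    """Same vectors, labels randomly permuted (breaks the vector-label link)."""
    labels = [y for _, y in data]
    random.shuffle(labels)
    return [(x, y) for (x, _), y in zip(data, labels)]


def train_linear_probe(data, epochs=20, lr=0.05):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def probe_accuracy(w, b, data):
    """Fraction of examples whose label the linear probe recovers."""
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
               for x, y in data)
    return hits / len(data)


train, test = make_dataset(400), make_dataset(200)
w, b = train_linear_probe(train)
real_acc = probe_accuracy(w, b, test)

# Baseline: train and evaluate on shuffled labels; this should stay near chance.
w_c, b_c = train_linear_probe(shuffled(train))
control_acc = probe_accuracy(w_c, b_c, shuffled(test))

print(f"probe accuracy:   {real_acc:.2f}")
print(f"shuffled control: {control_acc:.2f}")
```

With a real model, `fake_hidden_state` would be replaced by activations extracted from a chosen layer, and the labels by linguistic annotations such as part-of-speech tags.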

The Probing Debate

He, Chen, and Nie (2024) introduce an approach they call "decoding probing," inspired by cognitive neuroscience methods. Using the BLiMP benchmark of minimal pairs, they probe internal linguistic characteristics layer by layer, treating the language model as analogous to a brain and its representations as "neural activations."

Their key insight is methodological: rather than training probes on arbitrary linguistic annotations, they use minimal pairs to create a more naturalistic probing setup. The model's internal states are evaluated based on whether they distinguish grammatical from ungrammatical sentences at each layer. This approach reveals a consistent pattern across models: sensitivity to syntactic distinctions emerges in middle layers and peaks before declining in later layers, suggesting that syntactic processing occurs in intermediate representations rather than at the input or output level.
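The layer-by-layer analysis described above can be sketched as a loop that fits one simple decoder per layer and records its held-out accuracy. The following Python sketch is only an illustration of that procedure, not the authors' implementation. The vectors are synthetic, the nearest-centroid decoder is a stand-in for whatever probe a study actually uses, and the mid-layer hump is built into the hypothetical `SIGNAL` list by hand, so the output demonstrates the bookkeeping rather than any empirical finding.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

NUM_LAYERS = 6
DIM = 4
# Hypothetical per-layer signal strength, hand-picked to have a mid-layer hump.
SIGNAL = [0.0, 0.5, 1.5, 2.5, 1.0, 0.5]


def fake_state(layer, grammatical):
    """Synthetic layer activation; grammaticality shifts coordinate 0."""
    vec = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    vec[0] += SIGNAL[layer] if grammatical else -SIGNAL[layer]
    return vec


def centroid(vectors):
    """Coordinate-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(DIM)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def layer_accuracy(layer, n_train=200, n_test=200):
    """Held-out accuracy of a nearest-centroid decoder at one layer."""
    good_c = centroid([fake_state(layer, True) for _ in range(n_train)])
    bad_c = centroid([fake_state(layer, False) for _ in range(n_train)])
    correct = 0
    for i in range(n_test):
        is_good = i % 2 == 0
        v = fake_state(layer, is_good)
        pred_good = sq_dist(v, good_c) < sq_dist(v, bad_c)
        correct += pred_good == is_good
    return correct / n_test


profile = [layer_accuracy(layer) for layer in range(NUM_LAYERS)]
peak_layer = max(range(NUM_LAYERS), key=lambda layer: profile[layer])

for layer, acc in enumerate(profile):
    print(f"layer {layer}: decoding accuracy {acc:.2f}")
print(f"peak layer: {peak_layer}")
```

In a real study the grammatical and ungrammatical vectors at each layer would come from the two members of each minimal pair, and the shape of `profile` would be the result of interest.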

A Challenge to the Probing Paradigm

Agarwal, Jian, and Manning (2025), cited 5 times, raise a significant challenge. Their paper's title captures the argument: "Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations." They find a weaker-than-expected correlation between a model's probing results (how much syntactic information is linearly decodable from its representations) and its behavioral performance (how well it handles syntactic phenomena in practice).

This suggests that probing may overestimate what models "know" about syntax. A model might encode syntactic information in its representations—in the sense that a probe can extract it—without actually using that information for syntactic processing. The analogy to neuroscience is apt: the fact that syntactic information can be decoded from brain activity does not necessarily mean the brain uses that information for syntactic processing in the way we might assume.

The practical implication is that probing results should be interpreted with caution. Finding syntactic structure in a model's representations is necessary but not sufficient evidence that the model processes syntax in a linguistically meaningful way.
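The comparison at the heart of this argument is a correlation between two per-model scores. The sketch below shows that computation with a hand-written Pearson coefficient. The two score lists are placeholder numbers invented for this example and are not taken from Agarwal et al. or any other paper, and the statistic the authors actually report may differ from Pearson's r.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Placeholder scores for six hypothetical models; NOT data from any paper.
probe_accuracy = [0.81, 0.84, 0.86, 0.88, 0.90, 0.91]
minimal_pair_accuracy = [0.74, 0.69, 0.80, 0.71, 0.78, 0.73]

r = pearson_r(probe_accuracy, minimal_pair_accuracy)
print(f"correlation between probing and behavior: r = {r:.2f}")
```

A value of r near 1 would mean that probing scores track behavioral scores closely. A value near 0 is the pattern that would undercut probing as an explanation of behavior.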

Multi-Word Verb Representations

Kissane, Schilling, and Krauss (2025), also cited 5 times, examine a specific linguistic phenomenon: verb-particle combinations (like "look up," "turn down," "break out"). These multi-word verbs are linguistically interesting for two reasons: their meaning is often non-compositional (the meaning of "look up" is not predictable from "look" + "up"), and their syntactic behavior is complex (the particle can appear in different positions depending on the object).

Their probing study on BERT reveals that lower layers encode primarily lexical properties of verbs and particles, while upper layers encode the syntactic relationships between them. Interestingly, the representations of idiomatic multi-word verbs (where the meaning is non-compositional) show different layer-wise patterns from compositional ones—suggesting that models develop distinct representational strategies for different types of verb-particle combinations.

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
Syntactic information is encoded in transformer hidden layers	Multiple probing studies across model architectures	✅ Supported — consistently replicated
Syntactic sensitivity peaks in middle layers	He et al.'s layer-by-layer minimal pairs probing	✅ Supported
Probing results predict behavioral syntactic performance	Agarwal et al.'s correlation analysis	❌ Refuted — correlation is weaker than expected
Models distinguish compositional from idiomatic multi-word verbs	Kissane et al.'s BERT probing study	✅ Supported — different layer-wise activation patterns

The Interpretation Gap

What emerges from these papers is a nuanced picture. Syntactic information is clearly present in transformer representations—this is now well-established. But the relationship between having this information and using it is less clear. As Agarwal et al. demonstrate, representational capacity and functional use can come apart. This distinction matters both theoretically (for understanding what models learn about language) and practically (for building more robust NLP systems).
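The gap between having information and using it can be shown with a deliberately tiny toy case. In the Python sketch below, a two-dimensional "hidden state" stores a syntactic flag in its second coordinate, but the hypothetical output head `READOUT` gives that coordinate zero weight. A probe reads the flag perfectly, yet ablating it leaves the output untouched. All names and numbers are assumptions made up for this illustration. Real models are nowhere near this clean, but the logic of the interventional test is the same.

```python
# Hypothetical output head: it reads dimension 0 and ignores dimension 1.
READOUT = [1.0, 0.0]


def toy_hidden(content, syntax_flag):
    """Toy hidden state: [content value, syntactic flag encoded as +1 or -1]."""
    return [content, 1.0 if syntax_flag else -1.0]


def toy_output(hidden):
    """Linear readout of the hidden state."""
    return sum(w * h for w, h in zip(READOUT, hidden))


def probe_syntax(hidden):
    """A trivial probe: the flag is decodable from the sign of dimension 1."""
    return hidden[1] > 0


def ablate(hidden, dim):
    """Return a copy of the hidden state with one dimension zeroed out."""
    patched = list(hidden)
    patched[dim] = 0.0
    return patched


h = toy_hidden(content=0.7, syntax_flag=True)

decodable = probe_syntax(h)
effect_of_syntax_dim = toy_output(h) - toy_output(ablate(h, 1))
effect_of_content_dim = toy_output(h) - toy_output(ablate(h, 0))

print(f"flag decodable by probe:        {decodable}")
print(f"output change, flag ablated:    {effect_of_syntax_dim}")
print(f"output change, content ablated: {effect_of_content_dim}")
```

The probe succeeds while the ablation shows no effect, so a probing study alone would report the flag as "encoded" while an interventional study would report it as unused.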

Graichen et al.'s comprehensive review adds a further dimension: the field shows "a healthy variety of methods" but also considerable fragmentation. Different studies use different models, different probing methods, and different definitions of "syntactic knowledge," making cross-study comparison difficult.

Open Questions and Future Directions

From representation to mechanism: The field needs to move beyond asking "is syntactic information present?" to asking "how is it used in processing?" Causal/interventional methods are promising but still methodologically challenging.

Cross-linguistic coverage: The vast majority of interpretability studies use English. Extending to morphologically rich languages could reveal whether transformers develop fundamentally different representational strategies.

Scale effects: Do larger models develop qualitatively different representations, or merely sharper versions of the same patterns? Early evidence suggests both.

Relationship to human processing: The layer-wise emergence of syntactic sensitivity in transformers bears suggestive parallels to staged processing in the human brain. How seriously should we take these parallels?

Evaluation methodology: The field needs standardized evaluation protocols. Graichen et al.'s review of 337 papers reveals significant methodological heterogeneity that limits generalizability.

What This Means for Your Research

For NLP researchers, Agarwal et al.'s finding is practically important: probing is a useful diagnostic, but it should not be equated with functional understanding. If you want to know whether a model reliably processes syntax, behavioral testing is a more direct measure.

For theoretical linguists, the interpretability literature offers a new kind of evidence about what is learnable from distributional data. The consistent finding that syntactic information emerges in intermediate layers is compatible with the view that syntax occupies a middle level of linguistic representation—more abstract than surface form, less abstract than meaning.
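For readers who want to try the behavioral testing recommended above, the core loop is short: score both members of each minimal pair and count how often the grammatical one wins. The sketch below uses an add-one-smoothed bigram model over a six-sentence corpus as a hypothetical stand-in for a language model's sentence log-probability. With a real model, `log_prob` would be replaced by the sum of the model's token log-probabilities. The corpus and the test pairs are invented for this example.

```python
import math
from collections import Counter

# Tiny hand-written corpus; a stand-in for real training data.
CORPUS = [
    "the dog barks", "the dogs bark",
    "the cat sleeps", "the cats sleep",
    "the dog sleeps", "the dogs sleep",
]


def bigrams(sentence):
    """Split a sentence into (previous, current) token pairs with a start symbol."""
    tokens = ["<s>"] + sentence.split()
    return list(zip(tokens, tokens[1:]))


BIGRAM_COUNTS = Counter(bg for s in CORPUS for bg in bigrams(s))
CONTEXT_COUNTS = Counter()
for (prev, _), count in BIGRAM_COUNTS.items():
    CONTEXT_COUNTS[prev] += count
VOCAB_SIZE = len({w for s in CORPUS for w in s.split()})


def log_prob(sentence):
    """Add-one-smoothed bigram log-probability of a sentence."""
    return sum(
        math.log((BIGRAM_COUNTS[bg] + 1) / (CONTEXT_COUNTS[bg[0]] + VOCAB_SIZE))
        for bg in bigrams(sentence)
    )


def minimal_pair_accuracy(pairs):
    """Share of (grammatical, ungrammatical) pairs where the first scores higher."""
    return sum(log_prob(good) > log_prob(bad) for good, bad in pairs) / len(pairs)


# Agreement between adjacent words: the bigram scorer can see it.
LOCAL_PAIRS = [
    ("the dogs bark", "the dogs barks"),
    ("the cat sleeps", "the cat sleep"),
    ("the dog barks", "the dog bark"),
]
# Agreement across an intervening noun phrase: out of a bigram's reach.
ATTRACTOR_PAIRS = [
    ("the dogs near the cat bark", "the dogs near the cat barks"),
]

local_acc = minimal_pair_accuracy(LOCAL_PAIRS)
attractor_acc = minimal_pair_accuracy(ATTRACTOR_PAIRS)
print(f"local agreement accuracy:     {local_acc:.2f}")
print(f"attractor agreement accuracy: {attractor_acc:.2f}")
```

The toy scorer handles adjacent agreement but cannot prefer the grammatical sentence once a noun phrase intervenes, which is exactly the kind of case that separates pattern matching over neighbors from something more structural.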

References

[1] López-Otal, M., Gracia, J., & Bernad, J. (2025). Linguistic Interpretability of Transformer-based Language Models: a systematic review. arXiv:2504.08001.

[2] He, L., Chen, P., & Nie, E. (2024). Decoding Probing: Revealing Internal Linguistic Structures in Neural Language Models Using Minimal Pairs. arXiv:2403.17299.

[3] Graichen, N., de-Dios-Flores, I., & Boleda, G. (2026). The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models. arXiv:2601.19926.

[4] Agarwal, A., Jian, J., & Manning, C.D. (2025). Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations. arXiv:2506.16678.

[5] Kissane, H., Schilling, A., & Krauss, P. (2025). Probing Internal Representations of Multi-Word Verbs in Large Language Models. arXiv:2502.04789.
