Large Language Models and Linguistic Competence: Can Statistical Machines Truly Understand Language?

Do large language models possess genuine linguistic competence, or merely simulate it through statistical pattern matching? Recent benchmarks and probing studies are bringing new empirical precision to this debate.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Whether large language models (LLMs) possess genuine linguistic competence or merely approximate it through distributional statistics has become one of the most active debates in computational and theoretical linguistics. Chomsky's distinction between competence (tacit knowledge of linguistic rules) and performance (actual language use) provides the traditional framing, but LLMs strain this dichotomy: they perform remarkably well without any explicitly built-in system of rules. Recent work is moving beyond philosophical argument toward empirical measurement of what LLMs do and do not represent internally.

Why It Matters

The stakes extend beyond academic linguistics. If LLMs genuinely acquire linguistic knowledge, this would suggest that distributional learning from text alone is sufficient to recover the structure of human language, a conclusion with profound implications for theories of language acquisition and the nature of linguistic knowledge. If they merely simulate competence through surface-level pattern matching, then their impressive performance masks fundamental limitations that will surface in safety-critical applications, education technology, and legal contexts where genuine understanding matters.

The practical dimension is equally pressing. As LLMs are deployed in translation, content generation, legal analysis, and medical communication, understanding whether they grasp linguistic structure or merely correlate with it determines how much we can trust their outputs in contexts that require genuine comprehension.

The Science

Benchmarking Linguistic Competence

Waldis et al. (2024) introduce Holmes, a benchmark designed to assess the linguistic competence of language models through classifier-based probing of their internal representations. Unlike standard NLP benchmarks, which test task performance, Holmes examines whether models encode distinct linguistic phenomena in their internal states. The benchmark covers phenomena including part-of-speech categories, syntactic dependencies, semantic roles, and discourse relations. The pattern highlighted here is that different layers of a language model encode different linguistic levels, with syntax tending to concentrate in middle layers and semantics in later layers. This layered encoding suggests that something more structured than flat pattern matching is occurring, but whether it constitutes competence in the Chomskyan sense remains debatable.
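To make the probing idea concrete, here is a minimal sketch under stated assumptions. Synthetic vectors stand in for a model's hidden states, and a nearest-centroid classifier stands in for the probe; neither is taken from Holmes. The layer names, signal strengths, and helper functions are all hypothetical.

```python
import random

def make_layer(n, dim, signal, rng):
    """Synthetic stand-in for one layer's hidden states (an assumption, not model output).
    The binary label is written into dimension 0 with strength `signal`."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if label == 1 else -signal
        data.append((vec, label))
    return data

def fit_centroids(train):
    """'Train' the probe: compute one mean vector per class."""
    dim = len(train[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for vec, label in train:
        counts[label] += 1
        for i, x in enumerate(vec):
            sums[label][i] += x
    return {c: [s / max(counts[c], 1) for s in sums[c]] for c in sums}

def predict(centroids, vec):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist2(centroids[c], vec))

def probe_accuracy(signal, seed=0, n=200, dim=8):
    """Fit the probe on one sample and score it on a held-out sample."""
    rng = random.Random(seed)
    train = make_layer(n, dim, signal, rng)
    test = make_layer(n, dim, signal, rng)
    centroids = fit_centroids(train)
    correct = sum(predict(centroids, vec) == label for vec, label in test)
    return correct / len(test)

# Hypothetical signal strengths: the property is easiest to decode from the "middle" layer.
layer_signal = {"early": 0.0, "middle": 3.0, "late": 0.5}
accuracies = {name: probe_accuracy(s) for name, s in layer_signal.items()}
for name, acc in accuracies.items():
    print(f"{name}: probe accuracy = {acc:.2f}")
```

In a real study the vectors would come from a model's layers and the labels from annotated data. High probe accuracy shows only that the property is decodable from a layer, not that the model relies on it.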

Interpreting Internal Mechanisms

Jing et al. (2025) take a complementary approach with LinguaLens, using sparse auto-encoders to interpret how LLMs internally process linguistic phenomena such as reference disambiguation and metaphor recognition. Their method identifies interpretable features within the model's hidden states that correspond to specific linguistic operations. The results suggest that LLMs develop specialized internal circuits for different linguistic tasks. Metaphor processing, for instance, involves feature combinations distinct from those used in literal language processing. This internal differentiation is noteworthy because it emerges without explicit linguistic training, suggesting that distributional learning does induce some form of structured linguistic representation.
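For readers unfamiliar with sparse auto-encoders, the sketch below shows only the forward pass and loss of a generic one: a ReLU encoder into an overcomplete feature space, a linear decoder, and a reconstruction error plus an L1 sparsity penalty. The weights are hand-set for illustration, and the names `W_ENC`, `W_DEC`, and `l1_coeff` are assumptions. This is not the LinguaLens architecture or its training setup.

```python
def relu(x):
    return max(0.0, x)

def encode(x, w_enc, b_enc):
    """Map a hidden state to non-negative feature activations."""
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]

def decode(features, w_dec):
    """Reconstruct the hidden state as a weighted sum of decoder directions."""
    dim = len(w_dec[0])
    return [sum(f * row[j] for f, row in zip(features, w_dec)) for j in range(dim)]

def sae_loss(x, recon, features, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty that encourages sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1

# Hand-set toy weights (hypothetical): 2-dimensional input, 3 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

hidden_state = [2.0, 0.0]
features = encode(hidden_state, W_ENC, B_ENC)
recon = decode(features, W_DEC)
loss = sae_loss(hidden_state, recon, features, l1_coeff=0.1)
print("features:", features)
print("reconstruction:", recon)
print("loss:", loss)
```

Only one of the three features is active for this input. That sparsity is what makes individual features candidates for interpretation.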

Syntactic Processing: Strengths and Limits

Alhilal (2025) provides a focused examination of LLMs' handling of complex syntactic phenomena, including relative clauses, wh-movement, and center-embedding. The study reveals a characteristic pattern: LLMs handle common syntactic constructions with near-human accuracy but degrade significantly on deeply nested structures, garden-path sentences, and constructions that require long-distance dependency tracking. Center-embedded clauses beyond two levels of nesting produce systematic errors, a finding consistent with the hypothesis that LLMs rely on approximate heuristics rather than recursive syntactic rules.
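Nesting depth is easy to control when building test sentences. The generator below is a hypothetical illustration of how center-embedded stimuli at increasing depth might be constructed; the vocabulary and the function name `center_embedded` are assumptions, not materials from the study.

```python
NOUNS = ["rat", "cat", "dog", "boy"]
# What each embedded noun did to the noun one level above it (toy vocabulary).
EMBEDDED_VERBS = {"cat": "chased", "dog": "bit", "boy": "owned"}
MAIN_VERB = "died"

def center_embedded(depth):
    """Build a sentence with `depth` levels of center-embedding.

    All subject noun phrases come first; the verbs then resolve from the
    innermost clause outward, ending with the main verb.
    """
    if not 0 <= depth < len(NOUNS):
        raise ValueError(f"depth must be between 0 and {len(NOUNS) - 1}")
    subjects = NOUNS[: depth + 1]
    noun_phrases = ["the " + noun for noun in subjects]
    inner_verbs = [EMBEDDED_VERBS[noun] for noun in reversed(subjects[1:])]
    sentence = " ".join(noun_phrases + inner_verbs + [MAIN_VERB])
    return sentence[0].upper() + sentence[1:] + "."

for depth in range(4):
    print(depth, center_embedded(depth))
```

Each extra level adds one noun phrase before any verb appears, so the dependency between the first noun and the main verb is stretched further.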

Human-Machine Linguistic Profiling

Zanotto and Aroyehun (2025) approach the question from the output side, comparing the linguistic profiles of human-written and LLM-generated text across multiple dimensions. Their analysis reveals that while LLM outputs are increasingly indistinguishable from human text at the surface level, systematic differences persist in syntactic diversity, lexical richness patterns, and discourse-level coherence structures. LLM text tends toward more uniform syntactic structures and narrower vocabulary distributions, suggesting that statistical optimization may converge on a linguistic register that is fluent but less varied than human production.
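Two of the simplest profiling measures are lexical variety and variation in sentence length. The sketch below computes a type-token ratio and the standard deviation of sentence length for two invented snippets. The texts, the regex tokenizer, and the function name `profile` are assumptions for illustration and do not reproduce the feature set used in the study.

```python
import re
import statistics

def tokenize(text):
    """Lowercase word tokenizer (a deliberately crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def profile(text):
    """Return type-token ratio and the standard deviation of sentence length in tokens."""
    tokens = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]
    return {
        "ttr": len(set(tokens)) / len(tokens),
        "sentence_length_sd": statistics.pstdev(lengths),
    }

# Invented examples: one varied snippet, one repetitive snippet.
VARIED = ("The storm broke at dawn. Nobody, least of all the harbor crew, "
          "had expected it. Boats strained.")
UNIFORM = "The storm was strong. The storm was loud. The storm was long."

varied_profile = profile(VARIED)
uniform_profile = profile(UNIFORM)
print("varied:", varied_profile)
print("uniform:", uniform_profile)
```

Type-token ratio is sensitive to text length, so real comparisons need length-matched samples or a length-robust measure.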

Competence Assessment Framework

Linguistic Level	LLM Performance	Evidence Basis	Interpretation
Morphology	Strong	Well-documented	Distributional patterns sufficient
Local syntax	Strong	Benchmark + probing	Likely pattern-based but effective
Long-distance dependencies	Moderate	Systematic degradation	Approximate heuristics, not rules
Semantic composition	Mixed	Task-dependent	Some aspects captured, others not
Pragmatic inference	Weak	Emerging benchmarks	Significant gaps remain
Discourse coherence	Moderate	Profiling studies	Surface fluency masks structural limits

What To Watch

The field is converging on a nuanced position: LLMs acquire structured representations that go beyond simple n-gram statistics, but these representations are not equivalent to human linguistic competence. The next frontier is causality. Current probing methods show what information is encoded; work using causal intervention techniques aims to show what information models actually use during processing. In addition, benchmarks for languages other than English, such as AraLingBench for Arabic, will test whether the competence patterns observed in English generalize across typologically diverse languages or instead reflect English-specific distributional properties.
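The difference between encoded and used information can be shown with a toy ablation. In the sketch below, the hidden state carries two features, but the hypothetical downstream computation `toy_forward` reads only one of them. Zeroing each feature in turn and measuring the change in output separates the feature that matters causally from the one that is merely present. Everything here is an illustrative assumption, not a method from the cited papers.

```python
def toy_forward(hidden):
    """Hypothetical downstream computation: reads feature 1 and ignores feature 0."""
    return 3.0 * hidden[1]

def ablate(hidden, index, value=0.0):
    """Return a copy of the hidden state with one feature overwritten."""
    patched = list(hidden)
    patched[index] = value
    return patched

def ablation_effect(hidden, index):
    """Change in output caused by zeroing a single feature."""
    return toy_forward(hidden) - toy_forward(ablate(hidden, index))

hidden = [1.5, 2.0]  # both features are present, so a probe could decode either one
effects = {i: ablation_effect(hidden, i) for i in range(len(hidden))}
for i, effect in effects.items():
    print(f"feature {i}: ablation effect = {effect}")
```

A probe would find feature 0 in the hidden state, yet removing it changes nothing in the output. This is the gap that intervention-based methods are meant to expose.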

References

[1] Waldis, A., Perlitz, Y., & Choshen, L. (2024). Holmes: A Benchmark to Assess the Linguistic Competence of Language Models. Transactions of the ACL, 12.

[2] Jing, Y., Yao, Z., & Ran, L. (2025). LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder. Proc. EMNLP 2025.

[3] Alhilal, M. (2025). Understanding Syntax in Large Language Models: Successes and Limitations. IJESA, 4(1).

[4] Zanotto, S.E. & Aroyehun, S. (2025). Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models.
