Computational Linguistics in the LLM Era: What Neural Models Reveal About Language

Large language models have disrupted computational linguistics, but their implications for linguistic theory remain debated. Recent work uses psycholinguistic paradigms, information-theoretic frameworks, and nativist arguments to probe what LLMs actually learn about language.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The relationship between computational linguistics and large language models is more complicated than either enthusiasts or skeptics acknowledge. LLMs are trained on text, and they process text with remarkable facility. But the question of whether they learn language—the structured system of knowledge that linguists study—or merely learn text statistics—the distributional patterns of written language—remains genuinely open.

This question matters beyond linguistics departments. If LLMs learn something like genuine linguistic knowledge, they could serve as models of human language acquisition and processing. If they learn only text statistics, their impressive performance tells us about the statistics of text, not about the nature of language. Recent work from computational linguists, psycholinguists, and information theorists is helping to sharpen this distinction.

The Research Landscape: Three Approaches to the Question

Psycholinguistic Probing

Duan, Zhou, and Xiao (2024), with 11 citations, bring methods from experimental psycholinguistics into model interpretability. Their approach treats LLMs as subjects in psycholinguistic experiments originally designed for human participants. The logic is that if an LLM shows human-like patterns on a psycholinguistic task, this is evidence (not proof) that it has acquired some aspect of the linguistic competence the task is designed to measure.

The experiments test three specific aspects of linguistic competence in GPT-2-XL: sound-shape association (the bouba-kiki effect), sound-gender association (phonological gender cues), and implicit causality (the tendency for certain verbs to bias causal attribution toward the subject or object).

Key findings: GPT-2-XL struggles with the sound-shape task but demonstrates human-like abilities in both sound-gender association and implicit causality. The pattern suggests that some aspects of deep linguistic competence are captured by distributional learning (gender associations, causal biases) while others (cross-modal sound-shape mappings) are not.
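To make the implicit-causality test concrete, here is a minimal sketch of how such a bias could be scored. It assumes a scorer that returns the log-probability of a pronoun following a "because" prompt; the `Scorer` signature, the `subject_bias` helper, and the numbers in `_TOY` are all invented for illustration and are not the authors' protocol or real GPT-2-XL output.

```python
import math
from typing import Callable, Dict, Tuple

# Hypothetical scorer signature: log-probability of `token` following `prompt`.
Scorer = Callable[[str, str], float]


def subject_bias(scorer: Scorer, subject: str, verb: str, obj: str,
                 subj_pronoun: str, obj_pronoun: str) -> float:
    """Share of pronoun probability mass that refers back to the subject (0..1)."""
    prompt = f"{subject} {verb} {obj} because"
    p_subj = math.exp(scorer(prompt, subj_pronoun))
    p_obj = math.exp(scorer(prompt, obj_pronoun))
    return p_subj / (p_subj + p_obj)


# Stand-in scorer with invented numbers, NOT real model output.
_TOY: Dict[Tuple[str, str], float] = {
    ("John frightened Mary because", "he"): math.log(0.30),
    ("John frightened Mary because", "she"): math.log(0.10),
    ("John feared Mary because", "he"): math.log(0.08),
    ("John feared Mary because", "she"): math.log(0.32),
}


def toy_scorer(prompt: str, token: str) -> float:
    return _TOY[(prompt, token)]


# A subject-biased verb should score above 0.5, an object-biased verb below it.
frightened = subject_bias(toy_scorer, "John", "frightened", "Mary", "he", "she")
feared = subject_bias(toy_scorer, "John", "feared", "Mary", "he", "she")
print(f"frightened: {frightened:.2f}, feared: {feared:.2f}")
```

Swapping `toy_scorer` for a function backed by a real language model would turn this into an actual measurement; the comparison against human verb norms is the part that carries the psycholinguistic claim.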

Through targeted neuron ablation and activation manipulation, the researchers identify specific "language competence neurons": neurons whose ablation destroys specific linguistic abilities. The central finding is that when the model displays a linguistic ability, specific neurons correspond to that competence; when the ability is absent, so are the specialized neurons. The authors read this as evidence of a direct link between individual neurons and specific linguistic competencies, at least in GPT-2-XL.
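The idea behind neuron ablation can be sketched on a toy network: zero one hidden unit at a time and measure how much the output moves. The network, weights, and inputs below are invented for illustration; the paper works on GPT-2-XL, and this sketch only shows the general technique, not the authors' procedure.

```python
from typing import List, Optional


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def forward(x: List[float], w_in: List[List[float]], w_out: List[float],
            ablate: Optional[int] = None) -> float:
    """One hidden ReLU layer; `ablate` zeroes that hidden unit's activation."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if ablate is not None:
        hidden[ablate] = 0.0
    return sum(w * h for w, h in zip(w_out, hidden))


def ablation_effects(inputs: List[List[float]], w_in: List[List[float]],
                     w_out: List[float]) -> List[float]:
    """Mean absolute change in output when each hidden unit is zeroed."""
    effects = []
    for unit in range(len(w_in)):
        total = 0.0
        for x in inputs:
            total += abs(forward(x, w_in, w_out)
                         - forward(x, w_in, w_out, ablate=unit))
        effects.append(total / len(inputs))
    return effects


# Invented weights: unit 0 drives the output, unit 1 is disconnected from it,
# unit 2 contributes only weakly.
W_IN = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
W_OUT = [2.0, 0.0, 0.5]
INPUTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

effects = ablation_effects(INPUTS, W_IN, W_OUT)
for unit, effect in enumerate(effects):
    print(f"unit {unit}: mean output change {effect:.3f}")
```

In this toy case unit 0 stands out as the one whose removal changes behavior most. In a real model the output would be task accuracy rather than a single scalar, and a large ablation effect is suggestive rather than conclusive, since other units may compensate.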

Information-Theoretic Interpretability

Conklin and Smith (2024), with 3 citations, propose a framework that treats a model's internal representations as a kind of language—subject to the same information-theoretic analysis tools that linguists apply to natural language. The insight is that neural representations encode information, and the way that information is structured can be analyzed using concepts like entropy, mutual information, and channel capacity.

Their framework addresses a longstanding complaint about probing studies: that probing classifiers may find structure in representations that the model does not actually use. By measuring the information content of representations rather than the decodability of specific features, they provide a more principled basis for claims about what models encode.
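For readers less familiar with the information-theoretic quantities involved, the textbook definitions of entropy and mutual information can be computed directly from discrete samples. The sketch below uses simple plug-in estimates on invented data (discretized representation "codes" paired with a grammatical-number label); the post does not describe which estimator Conklin and Smith use, so this should not be read as their method.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(xs: Sequence[Hashable]) -> float:
    """Plug-in estimate of Shannon entropy H(X) in bits."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Invented data: four equally frequent codes, each tied to one number value.
codes = ["a", "a", "b", "b", "c", "c", "d", "d"]
number = ["sg", "sg", "sg", "sg", "pl", "pl", "pl", "pl"]

h_codes = entropy(codes)                  # four equiprobable codes: 2 bits
mi = mutual_information(codes, number)    # codes fully determine number: 1 bit
print(f"H(codes) = {h_codes:.2f} bits, I(codes; number) = {mi:.2f} bits")
```

Here the codes carry 2 bits in total, of which 1 bit is about grammatical number. Real representations are continuous and high-dimensional, so estimating these quantities in practice requires discretization or other estimators that plug-in counting does not cover.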

Applied to transformer models, the framework reveals two distinct phases of training:

An initial phase of in-distribution learning that reduces task loss.
A second phase in which representations become robust to noise. Generalization performance improves during this second phase, which links generalization to noise robustness.

Larger models also end up compressing their representations more than smaller ones, suggesting that scale enables more efficient encoding of linguistic structure.
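One simple way to picture "robust to noise" is to ask how often a representation still decodes to the same class after Gaussian noise is added. The sketch below does this for a one-dimensional toy representation with nearest-centroid decoding. The centroids, noise level, and the `noise_stability` helper are assumptions made for illustration, not the measure used in the paper.

```python
import random
from typing import Dict


def nearest(value: float, centroids: Dict[str, float]) -> str:
    """Label of the centroid closest to `value`."""
    return min(centroids, key=lambda k: abs(value - centroids[k]))


def noise_stability(centroids: Dict[str, float], sigma: float,
                    trials: int = 2000, seed: int = 0) -> float:
    """Fraction of noisy samples that still decode to their own class."""
    rng = random.Random(seed)
    labels = sorted(centroids)
    kept = 0
    for i in range(trials):
        label = labels[i % len(labels)]
        noisy = centroids[label] + rng.gauss(0.0, sigma)
        if nearest(noisy, centroids) == label:
            kept += 1
    return kept / trials


# Invented 1-D representations of a singular/plural distinction.
well_separated = {"sg": -5.0, "pl": 5.0}
barely_separated = {"sg": -0.25, "pl": 0.25}

robust = noise_stability(well_separated, sigma=1.0)
fragile = noise_stability(barely_separated, sigma=1.0)
print(f"well separated: {robust:.3f}, barely separated: {fragile:.3f}")
```

Both toy representations decode perfectly without noise, so they would look identical on in-distribution loss; only the perturbation test tells them apart, which is the intuition behind treating noise robustness as a signal of generalization.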

These measures can also predict which models will generalize best based on their representations—offering a practical tool for model evaluation that does not require running downstream benchmarks.

The Poverty of the Stimulus Revisited

Yang, Bisazza, and Schneider (2026) revisit one of the most debated arguments in linguistics, the Poverty of the Stimulus (PoS), in the context of neural language models. The PoS argument, associated with Chomsky, holds that the linguistic input children receive is insufficient to explain the grammatical generalizations they acquire, motivating the hypothesis that some linguistic knowledge is innate.

Their paper provides a unified assessment: if LLMs can acquire the same generalizations from text alone (without innate linguistic knowledge), this would challenge the PoS argument. If they cannot, it would support it.

The results are mixed and therefore informative. LLMs succeed on many of the phenomena traditionally cited in PoS arguments—subject-auxiliary inversion, binding constraints, island effects—but fail on others, particularly those involving long-distance dependencies and recursive structure. The authors argue that this pattern is not a simple victory for either side: it suggests that some aspects of grammar are learnable from distributional data (weakening the PoS argument for those phenomena) while others may genuinely require inductive biases that LLMs lack (supporting the PoS argument for those phenomena).
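A common way to test whether a model has acquired a grammatical generalization is the minimal-pair format: the model should assign higher probability to the grammatical member of a pair. The post does not say which protocol the paper uses, so the sketch below is only an illustration of that format, with an add-one-smoothed bigram model standing in for a purely distributional learner. The corpus, the pairs, and the `BigramLM` class are all invented.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


class BigramLM:
    """Add-one-smoothed bigram model: a deliberately local learner."""

    def __init__(self, corpus: Iterable[str]) -> None:
        self.context: Counter = Counter()
        self.bigram: Counter = Counter()
        vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            vocab.update(tokens)
            for prev, nxt in zip(tokens, tokens[1:]):
                self.context[prev] += 1
                self.bigram[(prev, nxt)] += 1
        self.vocab_size = len(vocab)

    def logprob(self, sentence: str) -> float:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((self.bigram[(p, n)] + 1)
                     / (self.context[p] + self.vocab_size))
            for p, n in zip(tokens, tokens[1:])
        )


def minimal_pair_accuracy(lm: BigramLM, pairs: List[Tuple[str, str]]) -> float:
    """Share of (grammatical, ungrammatical) pairs ranked correctly."""
    correct = sum(lm.logprob(good) > lm.logprob(bad) for good, bad in pairs)
    return correct / len(pairs)


CORPUS = [
    "the dog is happy",
    "the dogs are happy",
    "the cats are happy",
    "the dog near the house is happy",
]
lm = BigramLM(CORPUS)

# Agreement with an adjacent subject versus agreement across an intervening noun.
local_pairs = [("the dog is happy", "the dog are happy"),
               ("the dogs are happy", "the dogs is happy")]
attractor_pairs = [("the dog near the cats is happy",
                    "the dog near the cats are happy")]

local_acc = minimal_pair_accuracy(lm, local_pairs)
attractor_acc = minimal_pair_accuracy(lm, attractor_pairs)
print(f"local agreement: {local_acc:.1f}, across attractor: {attractor_acc:.1f}")
```

The bigram model gets adjacent agreement right and is misled by the intervening plural noun, because it only ever sees the neighboring word. That failure reflects the limits of a bigram model, not of LLMs, but it shows how the same evaluation can separate what is recoverable from local statistics from what needs structure-sensitive generalization.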

The Practical Landscape

Sattorova, Ulugbek, and ugli (2025) provide an applied perspective on how the LLM era has changed the practice of computational linguistics. They observe that the field has moved from rule-based and statistical approaches that required linguistic expertise to neural approaches that require engineering expertise. This shift has practical consequences: NLP systems are more capable but less interpretable, and the role of linguists in NLP teams is less clear than it was a decade ago.

Their assessment is that the shift from syntax-focused to semantics-focused NLP has created both opportunities and gaps. Current systems handle syntactic tasks well but still struggle with tasks requiring genuine semantic understanding—negation, quantification, metaphor, and pragmatic inference.

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
LLMs show human-like patterns on psycholinguistic tasks	Duan et al.'s sound-gender association and implicit causality experiments on GPT-2-XL	✅ Supported, for those two tasks
LLMs acquire cross-modal sound-shape association	Duan et al.'s bouba-kiki experiment on GPT-2-XL	⚠️ Uncertain: weak performance suggests limitations
Generalization is tied to noise-robust representations	Conklin & Smith's information-theoretic analysis of training	✅ Supported: generalization improves in the noise-robustness phase
LLMs resolve the Poverty of the Stimulus debate	Yang et al.'s unified assessment	❌ Refuted: results are mixed; some phenomena are learnable, others are not

Open Questions and Future Directions

What counts as "linguistic knowledge"? If a model can pass a psycholinguistic test but does so through different mechanisms than humans, does it have the same knowledge? The functional equivalence question remains open.

Multimodal grounding: Current LLMs learn from text alone. How much of the gap between LLM and human linguistic knowledge is attributable to the absence of sensory grounding?

Cross-linguistic validity: Most interpretability work uses English. Languages with different typological properties may reveal different patterns of what is learnable from distributional data.

The role of linguists in NLP: As Sattorova et al. observe, the field is becoming more engineering-driven. Is this a problem for the science, or a natural evolution?

Scaling and emergence: Some linguistic capabilities appear to "emerge" at certain model scales. Understanding why certain capabilities have scale thresholds could illuminate what makes those capabilities difficult.

What This Means for Your Research

For computational linguists, LLMs are tools for testing linguistic theories—not replacements for them. The mixed results from the PoS debate show that LLMs can adjudicate between theoretical positions in productive ways.

For NLP engineers, the psycholinguistic approach offers principled evaluation methods beyond standard benchmarks. If your system fails on semantic anomaly detection, this points to specific architectural limitations worth addressing.

References

[1] Duan, X., Zhou, X., & Xiao, B. (2024). Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability. arXiv:2409.15827.

[2] Conklin, H. & Smith, K. (2024). Representations as Language: An Information-Theoretic Framework for Interpretability. arXiv:2406.02449.

[3] Yang, X., Bisazza, A., & Schneider, N. (2026). A Unified Assessment of the Poverty of the Stimulus Argument for Neural Language Models. [Preprint].

[4] Sattorova, Z., Ulugbek, Y., & ugli, V. (2025). From Syntax to Semantics: AI-assisted Computational Linguistics in the Era of Large Computational Language Models. Proc. ICCIES 2025, IEEE.
