
Digital Humanities and Computational Text Analysis: NLP Meets the Archive


By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Why It Matters

The marriage of natural language processing and historical scholarship is transforming how we read the past. Where a lone scholar once spent years close-reading a single archive, large language models and corpus-level analytics now make it possible to interrogate millions of pages simultaneously, surfacing patterns of discourse, sentiment, and network formation that no human eye could perceive unaided. This shift from "close reading" to "distant reading" does not replace traditional hermeneutics; rather, it augments it with statistical breadth.

The stakes are especially high for non-Latin-script traditions. Historical Chinese, Arabic, and Persian texts demand specialized tokenizers, named-entity recognizers, and part-of-speech taggers that mainstream NLP pipelines were never designed for. Recent 2024-2025 work demonstrates that modern LLMs can rival or exceed bespoke rule-based tools on these challenging corpora, opening archives that have remained computationally inaccessible.

As digital humanities matures, the field faces an accountability question: when an algorithm identifies a pattern across 10,000 documents, how do historians validate the finding? Reproducibility, bias auditing, and human-in-the-loop verification are becoming methodological imperatives.

The Science

LLMs vs. Traditional NLP on Historical Texts

A 2024 comparative study benchmarked GPT-class models against classical NLP tools for word segmentation, POS tagging, and NER on Chinese texts from 1900-1950. LLMs outperformed traditional pipelines on ambiguous segmentations and low-frequency named entities, though they occasionally hallucinated entity boundaries in documents with heavy classical-vernacular code-switching.
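Benchmarks like this are usually scored with boundary-level precision, recall, and F1 over character spans. A minimal sketch of that metric (the token lists below are invented examples, not data from the paper):

```python
# Boundary-level F1 for word segmentation: each token becomes a
# (start, end) character span; spans shared by gold and prediction
# count as true positives.

def boundaries(tokens):
    """Convert a token list to a set of (start, end) character spans."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def segmentation_f1(gold, predicted):
    """F1 over character spans of two segmentations of the same string."""
    g, p = boundaries(gold), boundaries(predicted)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(g)
    return 2 * precision * recall / (precision + recall)

# Hypothetical case: the model merged two words the gold standard splits.
gold = ["中华", "民国", "成立"]
pred = ["中华民国", "成立"]
print(round(segmentation_f1(gold, pred), 3))  # → 0.4
```

The span-set formulation is what makes "ambiguous segmentations" measurable: a merged or split word shifts every affected boundary, so partial credit falls out naturally.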

Chronological Corpus Processing

Pawłowski and Walkowiak (2024) developed a pipeline for sequential text corpora that preserves temporal metadata through every processing stage. Their approach treats documents not as isolated bags-of-words but as points in a chronological stream, enabling diachronic topic modeling that tracks how political vocabularies shifted across decades.
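The core idea of the chronological stream can be sketched in a few lines: carry each document's year through processing, bucket by decade, and read off a term's relative frequency over time. The corpus and tracked term below are invented for illustration, not taken from the paper:

```python
# Diachronic term tracking: documents keep their year metadata,
# and frequencies are aggregated per decade rather than pooled.
from collections import defaultdict

corpus = [
    (1912, "republic citizens vote reform"),
    (1919, "reform students protest republic"),
    (1935, "party state mobilization"),
    (1948, "party revolution state struggle"),
]

def term_trajectory(docs, term):
    """Relative frequency of `term` per decade, in chronological order."""
    counts, totals = defaultdict(int), defaultdict(int)
    for year, text in docs:
        decade = (year // 10) * 10
        tokens = text.split()
        totals[decade] += len(tokens)
        counts[decade] += tokens.count(term)
    return {d: counts[d] / totals[d] for d in sorted(totals)}

print(term_trajectory(corpus, "party"))
```

A full diachronic topic model replaces the single-term count with per-decade topic distributions, but the metadata-preserving bucketing step is the same.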

Scale and Accessibility

Nockels, Gooding, and Terras (2024) surveyed the implications of handwritten text recognition (HTR) for large-scale historical access. They found that HTR accuracy now exceeds 95% on many scripts, but warned that uneven digitization creates "shadow archives" where well-funded collections dominate computational scholarship while Global South materials remain invisible.
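HTR accuracy figures like "95%" are conventionally reported as character error rate (CER): edit distance between the transcription and a gold reference, normalized by reference length. A minimal sketch, with invented example strings:

```python
# Character error rate via classic dynamic-programming Levenshtein
# distance; CER <= 0.05 corresponds to the "95% accuracy" threshold.

def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Edit distance normalized by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

ref = "the quick brown fox"
hyp = "the quiek brown fox"  # one misread character
print(cer(ref, hyp))  # one substitution over 19 reference chars, ~0.053
```

Note that CER is script-sensitive in exactly the way the authors warn about: a 5% error rate on a well-digitized Latin hand and on a degraded manuscript are not comparable claims.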

Distant Reading and Interpretation

Khan, Minhas, and Kaloi (2025) examined how distant reading, topic modeling, and NLP alter literary interpretation across Modernist, Postmodernist, and Contemporary texts. They argue that computational methods surface structural patterns (e.g., shifting pronoun usage, thematic clustering) that complement but never replace contextual close reading.
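One of the structural patterns named above, shifting pronoun usage, reduces to a simple frequency profile per text. A minimal sketch with invented snippets (the pronoun sets and sample sentences are illustrative assumptions, not the authors' data):

```python
# Pronoun-rate profiling: share of first- vs. third-person pronouns
# among all word tokens in a text.
import re
from collections import Counter

FIRST = {"i", "we", "me", "us", "my", "our"}
THIRD = {"he", "she", "they", "him", "her", "them"}

def pronoun_profile(text):
    """Fraction of tokens that are first- or third-person pronouns."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {
        "first": sum(counts[w] for w in FIRST) / total,
        "third": sum(counts[w] for w in THIRD) / total,
    }

interior_monologue = "I remember how we walked and I thought of my own past"
reported_narrative = "they told her that he had left and she believed them"
print(pronoun_profile(interior_monologue))
print(pronoun_profile(reported_narrative))
```

Run over thousands of texts and grouped by period, a profile this crude is exactly the kind of signal distant reading surfaces, and exactly the kind that still needs a close reader to interpret.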

Computational Text Analysis: Tool Comparison

| Approach | Strengths | Limitations | Best For |
|---|---|---|---|
| Rule-based NLP | Transparent, reproducible | Language-specific, brittle | Well-documented scripts |
| Fine-tuned LLMs | Contextual, multilingual | Hallucination risk, costly | Rare/historical languages |
| Topic Modeling (LDA) | Unsupervised, scalable | Requires tuning k, ignores syntax | Large-corpus exploration |
| Word Embeddings | Captures semantic drift | Needs large training data | Diachronic lexical change |
| HTR + OCR Pipelines | Enables digitization at scale | Accuracy varies by script quality | Manuscript archives |

What To Watch

The next frontier is multimodal historical analysis, integrating text with maps, images, and material culture databases into unified computational frameworks. Expect 2026 to bring the first large-scale benchmarks for historical multilingual LLMs, purpose-built on pre-modern corpora rather than fine-tuned from modern web text. The epistemological debate, whether algorithms can "understand" historical context or merely pattern-match, will intensify as these tools become standard in tenure-track research.

References (4)

A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950. (2024).
Pawłowski, A., & Walkowiak, T. (2024). NLP for Digital Humanities: Processing Chronological Text Corpora. Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, 105-112.
Nockels, J., Gooding, P., & Terras, M. (2024). The implications of handwritten text recognition for accessing the past at scale. Journal of Documentation, 80(7), 148-167.
Khan, A. A., Minhas, N., & Kaloi, M. A. (2025). From Text to Tech: Exploring the Impact of Digital Humanities on Literary Interpretation. Review Journal of Social Psychology & Social Works, 3(2), 620-636.
