Medieval Manuscript Digitization and AI Transcription: Unlocking Centuries of Hidden Text

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Why It Matters

Europe's libraries and archives hold millions of medieval and early modern manuscripts that have never been transcribed, much less analyzed. These documents, ranging from monastic chronicles and tax records to personal letters and scientific treatises, contain vast stores of untapped historical knowledge. For centuries, reading them required years of paleographic training: the ability to decipher handwriting styles that changed across periods, regions, and scribal schools.

Handwritten text recognition (HTR), powered by deep learning, now makes it possible to transcribe these manuscripts at industrial scale. Platforms like Transkribus and models like TrOCR report accuracy above 95% on script types they have been trained on, turning what was once a bottleneck measured in scholar-years into a process measured in GPU-hours. The implications are far-reaching: entire corpora that were accessible only to a handful of specialists are becoming searchable text databases.

Yet challenges remain. Damaged manuscripts, mixed scripts, marginalia, abbreviations, and non-standard orthography all push current models to their limits. The field is advancing rapidly, but the gap between what AI can transcribe and what historians need to understand remains significant.

The Science

Automated Medieval Transcription

Matos et al. (2025) developed iForal, a modular three-stage system for automated transcription of Portuguese medieval manuscripts. The pipeline uses YOLOv8 for layout detection, Mask R-CNN for text line segmentation, and CRNN-based engines (Kraken/Calamari) for character recognition. The paper, which has 3 citations, reports a best character error rate (CER) of 8.1%, demonstrating that specialized HTR is feasible for historical scripts whose handwriting is too complex for general-purpose OCR.
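To make the CER figure concrete, here is a minimal sketch of how character error rate is commonly computed: the edit (Levenshtein) distance between the reference transcription and the model output, divided by the length of the reference. The function names and the sample Latin line are illustrative assumptions, not taken from the iForal paper, whose exact evaluation script may differ (for example in how it normalizes whitespace or abbreviations).

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances for the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character of ref
                curr[j - 1] + 1,     # insert a character of hyp
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented sample line (not from any cited dataset): two n/m confusions.
reference = "in nomine domini amen"
hypothesis = "in nomime domini amem"
print(f"edits = {levenshtein(reference, hypothesis)}, CER = {cer(reference, hypothesis):.3f}")
```

Read this way, a CER of 8.1% means roughly eight character edits are needed per hundred reference characters, which is why a human correction pass is still part of most workflows.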

Scale and Access

Nockels, Gooding, and Terras (2024), with 10 citations, surveyed the broader implications of HTR for information access, arguing that the technology is creating a paradigm shift comparable to the original digitization wave of the 2000s. They warn that uneven access to HTR tools and training data risks creating a "two-speed" digital humanities where well-resourced institutions race ahead while smaller archives fall further behind.

Transformer-Based HTR

The study "Handwritten Text Recognition of Historical Manuscripts Using Transformer-Based Models" applied TrOCR, a transformer-based model, to historical handwritten text recognition and reported state-of-the-art performance on archival documents. It shows that pre-trained vision-language transformers can be fine-tuned with relatively small amounts of manually transcribed ground truth, sharply reducing the startup cost for new manuscript collections.

Metadata-Rich Transcription

Loss, Guernaccini, and Carassai (2025) experimented with HTR transcription of the Memoriali series, a collection of Bolognese notarial records spanning 1265-1452. Their innovation was to integrate named entity tagging directly into the transcription pipeline, producing not just text but structured metadata (persons, places, dates) ready for database import. This bridges the gap between raw transcription and historical analysis.
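As a rough illustration of what "transcription plus structured metadata" can look like, the sketch below takes a line whose entities are marked with inline tags and splits it into plain text and a list of entity records with character offsets. The tag names (`person`, `place`, `date`), the `Entity` record, and the sample line are all hypothetical, chosen for illustration; the Memoriali project's actual tagging scheme and import format are not described here and may differ.

```python
import re
from dataclasses import dataclass

# Hypothetical inline tag syntax: <person>...</person>, <place>...</place>, <date>...</date>
TAG_PATTERN = re.compile(r"<(person|place|date)>(.*?)</\1>")


@dataclass
class Entity:
    kind: str   # "person", "place", or "date"
    text: str   # the tagged string as transcribed
    start: int  # character offset in the plain (untagged) text
    end: int    # end offset (exclusive) in the plain text


def extract_entities(tagged: str) -> tuple[str, list[Entity]]:
    """Strip the inline tags and return (plain_text, entities)."""
    parts: list[str] = []
    entities: list[Entity] = []
    cursor = 0     # position in the tagged input
    plain_len = 0  # length of plain text emitted so far
    for match in TAG_PATTERN.finditer(tagged):
        before = tagged[cursor:match.start()]
        parts.append(before)
        plain_len += len(before)
        value = match.group(2)
        entities.append(Entity(match.group(1), value, plain_len, plain_len + len(value)))
        parts.append(value)
        plain_len += len(value)
        cursor = match.end()
    parts.append(tagged[cursor:])
    return "".join(parts), entities


# Invented sample line, not taken from the Memoriali records.
line = ("<person>Petrus de Bononia</person> vendidit domum in "
        "<place>Bononia</place> die <date>12 martii 1280</date>")
plain, found = extract_entities(line)
print(plain)
for entity in found:
    print(entity.kind, repr(entity.text), entity.start, entity.end)
```

Keeping offsets into the plain text means each database row can point back to the exact span of the transcription it came from, which is useful when a historian needs to check a doubtful reading.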

HTR Technology Comparison

Technology	Architecture	Strengths	Limitations	Training Data Need
Transkribus	CNN + LSTM	Mature platform, community models	Subscription cost, training overhead	Medium (50-100 pages)
TrOCR	Vision Transformer	Pre-trained, adaptable	Compute-intensive fine-tuning	Low (10-50 pages)
Kraken/eScriptorium	Open-source CNN	Free, customizable	Less polished UX	Medium
Google Cloud Vision	Commercial API	Easy integration	Poor on historical scripts	None (pre-trained)
Custom CNN+CTC	Task-specific	Maximum flexibility	Requires ML expertise	High (100+ pages)

What To Watch

The convergence of HTR with large language models (LLMs) is the next frontier. Instead of recognizing characters independently, future systems will use LLMs to resolve ambiguities in damaged or poorly written text by predicting likely words from context, much as a trained paleographer reads. Expect 2026 to bring the first large-scale "digital editions" produced primarily by AI, with human scholars shifting from transcribers to editors and validators. Multilingual and multi-script models that can handle code-switching between Latin, vernacular, and Greek within a single manuscript page are also on the horizon.
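The idea of resolving ambiguous readings from context can be sketched with a toy example. Here the recognizer proposes several candidate readings per word, each with a visual confidence, and a language model rescores whole sequences. A tiny hand-written bigram count table stands in for the LLM; it is an assumption made purely so the example runs on its own. The candidate lists, counts, and function names are all invented for illustration, and a real system would query an actual language model for these probabilities.

```python
import math
from itertools import product

# Toy stand-in for a language model: invented bigram counts.
BIGRAM_COUNTS = {
    ("in", "nomine"): 50,
    ("nomine", "domini"): 40,
    ("domini", "amen"): 30,
}


def lm_log_score(words, counts, smoothing=1.0):
    """Sum of log smoothed bigram counts over adjacent word pairs."""
    return sum(math.log(counts.get(pair, 0) + smoothing)
               for pair in zip(words, words[1:]))


def resolve(candidates, counts, lm_weight=1.0):
    """Pick the word sequence maximizing visual log-confidence plus
    the weighted language-model score (exhaustive search, fine for short lines)."""
    best_words, best_score = None, -math.inf
    for combo in product(*candidates):
        words = [word for word, _ in combo]
        visual = sum(math.log(conf) for _, conf in combo)
        score = visual + lm_weight * lm_log_score(words, counts)
        if score > best_score:
            best_words, best_score = words, score
    return best_words


# Hypothetical recognizer output: (reading, visual confidence) per word position.
candidates = [
    [("in", 0.9)],
    [("nomime", 0.55), ("nomine", 0.45)],  # n/m confusion: the wrong reading looks likelier
    [("domini", 0.8), ("domim", 0.2)],
    [("amen", 0.6), ("amem", 0.4)],
]

print("visual only: ", " ".join(resolve(candidates, BIGRAM_COUNTS, lm_weight=0.0)))
print("with context:", " ".join(resolve(candidates, BIGRAM_COUNTS, lm_weight=1.0)))
```

With the language-model weight set to zero the visually likeliest but wrong reading "nomime" wins; with context included, the sequence flips to "nomine". The same mechanism can also overwrite a genuine scribal variant with a more common form, which is one reason human validation remains necessary.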

References (4)

Matos, A., Almeida, P., Correia, P., & Pacheco, O. (2025). iForal: Automated Handwritten Text Transcription for Historical Medieval Manuscripts. Journal of Imaging, 11(2), 36.

Nockels, J., Gooding, P., & Terras, M. (2024). The implications of handwritten text recognition for accessing the past at scale. Journal of Documentation, 80(7), 148-167.

Handwritten Text Recognition of Historical Manuscripts Using Transformer-Based Models.

Loss, E., Guernaccini, F., & Carassai, M. (2025). From Manuscript to Metadata: experiments on Handwritten Text Recognition, Tagging and Importation for the Memoriali series (1265-1452). JLIS.it, 16(2), 59-85.
