The Semantic Gap in Heritage Digitization: When Objects Become Data

When a museum digitizes an object, it translates a physical thing into a data record, and that translation is never neutral. Recent research on the 'semantic gap' in heritage digitization reveals how metadata schemas, classification systems, and AI tools shape what we can find, study, and understand.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

A ceramic bowl in a museum display case is experienced through its form, color, weight, texture, and the patina of age. The same bowl in a digital catalog is experienced through a photograph, a set of metadata fields (material: ceramic; period: Joseon dynasty; dimensions: 12 cm × 8 cm), and a classification number. Something is gained in the translation (searchability, accessibility, comparability) and something is lost. The question of what is lost, and whether it matters, is at the center of current debates about heritage digitization.

The Semantic Gap

Spuriņa (2025) examines the "semantic gap" in museum digitization through a comparative analysis of national museum databases in Latvia, Estonia, and Finland. The semantic gap, as defined in the information retrieval literature, is the distance between what humans understand when they look at an object and what a machine-readable data structure can represent.

The comparative analysis reveals several dimensions of this gap:

Descriptive standardization vs. cultural specificity. International metadata standards (Dublin Core, CIDOC-CRM) provide interoperability across collections, but their generic categories may not capture culturally specific attributes. A Latvian folk textile that carries meanings tied to specific regions, ceremonies, and kinship systems is reduced to "textile, wool, 19th century" in a standardized schema.

Multilingual description. Baltic museum databases use national languages, but cross-border discoverability requires translation. Machine translation can help but introduces its own errors, particularly for specialized heritage terminology that may not appear in general translation models.

Visual vs. textual information. Digital photographs capture visual properties but not material properties (weight, texture, smell) or contextual properties (how the object was displayed, used, or valued). The semantic gap is not just between human and machine understanding; it is also between visual and non-visual aspects of heritage objects.
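The first dimension, descriptive standardization vs. cultural specificity, can be made concrete with a small sketch. The code below is illustrative only: the field names, the three-field "generic" schema, and the placeholder values are assumptions made for this example, not the structure of Dublin Core, CIDOC-CRM, or any database studied in the paper. It projects a richer local record onto the generic fields and reports which populated fields had nowhere to go.

```python
from dataclasses import dataclass, asdict

# Hypothetical generic schema: the three attributes quoted in the text's
# "textile, wool, 19th century" example. Not a real standard's field list.
GENERIC_FIELDS = ("object_type", "material", "period")


@dataclass
class LocalRecord:
    """A locally curated record; every field name here is illustrative."""
    object_type: str
    material: str
    period: str
    region: str = ""
    ceremony: str = ""
    kinship_note: str = ""


def to_generic(record):
    """Project a local record onto the generic schema.

    Returns the flattened record plus the names of non-empty fields
    that have no counterpart in the generic schema.
    """
    full = asdict(record)
    flat = {name: full[name] for name in GENERIC_FIELDS}
    dropped = [name for name, value in full.items()
               if name not in GENERIC_FIELDS and value]
    return flat, dropped


# Placeholder values stand in for real curatorial descriptions.
textile = LocalRecord(
    object_type="textile",
    material="wool",
    period="19th century",
    region="(placeholder region)",
    ceremony="(placeholder ceremony)",
    kinship_note="(placeholder kinship note)",
)
flat, dropped = to_generic(textile)
print(flat)
print("dropped:", dropped)
```

Listing the dropped fields explicitly is one simple way to make the loss visible at export time instead of letting it happen silently.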

LLM-Assisted Digital Libraries

El Ganadi, Gagliardelli, and Ruozzi (2025), with 1 citation, present the Digital Maktaba project, a metadata-driven framework for Arabic digital libraries that uses large language models to assist with cataloging and metadata generation.

The project addresses a specific challenge: Arabic manuscript collections often have incomplete or inconsistent metadata, reflecting varying cataloging practices across institutions and centuries. LLMs can help by:

  • Generating standardized metadata from free-text descriptions.
  • Translating metadata between Arabic, English, and other languages.
  • Identifying entities (authors, places, dates) in manuscript colophons and marginalia.

The results are mixed: LLMs perform well for standard bibliographic metadata (title, author, date) but struggle with specialized heritage metadata (script type, binding style, provenance chain). This pattern is expected: LLMs are trained primarily on modern text and have limited exposure to the specialized vocabulary of heritage cataloging.

The project's value lies in its framework rather than its current performance: it demonstrates how LLMs can be integrated into heritage workflows as assistive tools (augmenting human catalogers) rather than replacement tools (automating cataloging entirely).
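The assistive pattern described above can be sketched as a triage step. This is a minimal illustration under stated assumptions, not the Digital Maktaba implementation: the `Suggestion` type, the confidence score, the 0.8 threshold, and the list of "standard" fields are all hypothetical, and no actual LLM is called. The idea is only that model suggestions become drafts or review items, and specialized fields always reach a human cataloger.

```python
from dataclasses import dataclass

# Assumption: "standard" fields follow the examples named in the text
# (title, author, date). A real project would define its own list.
STANDARD_FIELDS = {"title", "author", "date"}


@dataclass
class Suggestion:
    field_name: str
    value: str
    confidence: float  # hypothetical model-reported score in [0, 1]


def triage(suggestions, threshold=0.8):
    """Split suggestions into pre-filled drafts and items for a cataloger.

    Only high-confidence standard fields are pre-filled. Everything else,
    including every specialized field, goes to human review.
    """
    drafts, review = [], []
    for s in suggestions:
        if s.field_name in STANDARD_FIELDS and s.confidence >= threshold:
            drafts.append(s)
        else:
            review.append(s)
    return drafts, review


# Placeholder suggestions standing in for model output.
suggestions = [
    Suggestion("title", "(placeholder title)", 0.95),
    Suggestion("author", "(placeholder author)", 0.60),
    Suggestion("script_type", "(placeholder script)", 0.90),
]
drafts, review = triage(suggestions)
print("pre-filled drafts:", [s.field_name for s in drafts])
print("needs cataloger:", [s.field_name for s in review])
```

Routing `script_type` to review even at high confidence reflects the finding that specialized heritage vocabulary is where the models struggle.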

AIGC for Intangible Heritage

Xinyu, Lei, and Chen (2024), with 1 citation, examine how AI-Generated Content (AIGC) technologies, including text generation, image synthesis, and video creation, can be applied to the digital display of intangible cultural heritage (ICH). The focus is on Chinese ICH categories: traditional performances, crafts, rituals, and oral traditions.

The paper identifies a paradox: AIGC can create visually compelling representations of ICH practices (animated demonstrations of a craft technique, synthetic video of a historical performance), but these representations may lack the cultural authenticity that makes ICH valuable. A generated video of a traditional dance may be visually accurate but miss the subtle embodied knowledge that distinguishes a master from a student.

The authors propose a framework that uses AIGC not to replace human practitioners but to create supplementary educational materials (contextual explanations, interactive demonstrations, multilingual descriptions) that support rather than substitute for direct cultural transmission.

AI-Enhanced Heritage Learning

Srivastava, Gavhane, and Itnal (2026) survey the emerging field of AI-enhanced cultural heritage learning platforms. As museums and cultural institutions digitize their collections, the challenge shifts from making collections accessible to making them educationally effective. AI tools can personalize heritage learning experiences (recommending related objects, generating contextual narratives, adapting difficulty level) in ways that static catalog displays cannot.

The survey identifies both promise and risk: promise in personalization and engagement, risk in algorithmic filtering that may narrowly guide users toward popular items while obscuring less-known but culturally important objects. The "filter bubble" problem familiar from social media applies to heritage platforms as well.
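The mechanism behind the "filter bubble" concern can be shown with a toy simulation. It is a deliberately stylized sketch, not evidence for the claim: the item names and view counts are invented, and it assumes that a platform shows only the most-viewed items and that every shown item gains exactly one view per round.

```python
def simulate(views, rounds, top_k=1):
    """Toy popularity feedback loop.

    Each round, the top_k most-viewed items are shown (ties broken by name),
    and each shown item gains one view. Returns the final view counts.
    """
    views = dict(views)  # copy so the caller's data is left untouched
    for _ in range(rounds):
        shown = sorted(views, key=lambda name: (-views[name], name))[:top_k]
        for name in shown:
            views[name] += 1
    return views


# Invented starting counts: one slightly more popular item, two others.
start = {"famous_painting": 5, "regional_textile": 4, "field_recording": 4}
end = simulate(start, rounds=10)
print(end)
```

Under these assumptions the item that starts one view ahead ends at 15 views while the other two stay at 4: a small initial advantage becomes permanent, and the less-viewed objects never surface at all.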

Critical Analysis: Claims and Evidence

Claim | Evidence | Verdict
Standardized metadata schemas reduce cultural specificity | Spuriņa's comparative analysis of Baltic museum databases | ✅ Supported
LLMs can assist heritage metadata generation | El Ganadi et al.'s Digital Maktaba framework | ⚠️ Uncertain: works for standard fields, struggles with specialized heritage vocabulary
AIGC can create compelling ICH representations | Xinyu et al.'s framework for Chinese ICH | ⚠️ Uncertain: visual quality possible, cultural authenticity unverified
AI personalization risks creating heritage "filter bubbles" | Srivastava et al.'s survey analysis | ⚠️ Uncertain: plausible but not yet empirically demonstrated

Open Questions and Future Directions

  • Culturally adaptive metadata: Can metadata schemas be designed to accommodate cultural specificity without sacrificing interoperability? Multi-ontological systems that support parallel classification schemes (Western, Indigenous, Islamic) are technically feasible but organizationally complex.
  • Heritage data sovereignty: When cultural objects are digitized and stored on external platforms, who controls the data? The question is particularly acute for Indigenous and postcolonial collections.
  • Validation of AI-generated heritage content: How do we ensure that AIGC representations of cultural practices are accurate? Community validation processes are essential but not yet standardized.
  • Long-term preservation: Digital formats evolve; a VR experience or AI model built today may be unplayable in a decade. How do we ensure that digital heritage outlasts its platform?

What This Means for Your Research

For heritage professionals, the semantic gap analysis provides a vocabulary for articulating what is lost in digitization, and for designing metadata schemas that minimize the loss.

For computer scientists, heritage applications push AI tools in productive ways: specialized vocabularies, multilingual content, and cultural sensitivity requirements that general-purpose models do not address.

References

[1] Spuriņa, M. (2025). Digitization of Museum Objects and the Semantic Gap. Heritage, 8(9), 369.
[2] El Ganadi, A., Gagliardelli, L., & Ruozzi, F. (2025). Digital Maktaba project: Toward a metadata-driven, LLM-assisted framework for Arabic digital libraries. International Journal on Digital Libraries.
[3] Xinyu, H., Lei, Z., & Chen, B. (2024). The Advanced Path of Research on Digital Display Empowered by AIGC Technology for Intangible Cultural Heritage. Journal of Theory and Practice of Management.
[4] Srivastava, S., Gavhane, S., & Itnal, V. (2026). AI-Enhanced Cultural Heritage Learning Platforms. ShodhKosh.
