hECA v2.0: Building an AI-Ready Atlas of 10.8 Million Human Cells

The human Ensemble Cell Atlas v2.0 assembles 10.8 million scRNA-seq cells and 1.45 million scATAC-seq profiles across 42 human tissues under a unified annotation framework, designed explicitly for training AI foundation models like scMulan.

By ORAA Research

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The single-cell genomics field has generated extraordinary volumes of data over the past decade, but this abundance has created its own problem: datasets are scattered across hundreds of studies, annotated with inconsistent vocabularies, processed with different pipelines, and formatted in incompatible ways. For training large AI models on cellular biology, this fragmentation is not merely inconvenient—it is a fundamental obstacle. The human Ensemble Cell Atlas (hECA) v2.0 (Xi et al., 2025), published in Scientific Data, attempts to address this by assembling, standardizing, and uniformly annotating the largest AI-ready single-cell resource to date.

The Data Integration Challenge

Single-cell RNA sequencing (scRNA-seq) measures gene expression in individual cells, producing count matrices that vary in depth, quality, gene nomenclature, and cell type labeling depending on the generating laboratory. When different groups label cells as "T cells," "CD4+ T cells," "T helper cells," or "Th1 lymphocytes," computational integration struggles with what is fundamentally a semantic problem.

Previous atlas efforts—the Human Cell Atlas (HCA), CellxGene, and the first hECA version—made substantial progress in aggregating data but varied in the degree of annotation harmonization. Pan et al. (2024) in Genome Biology constructed a single-cell multi-omics encyclopedia spanning five omics modalities, demonstrating the value of cross-modal integration. However, the specific requirements of AI model pre-training—consistent tokenization of cell types, standardized feature spaces, quality-controlled expression matrices—demand additional curation beyond what general-purpose atlases provide.

What hECA v2.0 Contains

hECA v2.0 aggregates 10,831,024 human cells from scRNA-seq data and adds a new modality: 1,450,511 cells from single-cell ATAC sequencing (scATAC-seq), which profiles chromatin accessibility rather than gene expression. Together, these two modalities cover 42 human organs and tissues.

The data underwent several standardization steps:

Gene expression matrices were reprocessed to use a unified gene symbol set, eliminating discrepancies from different genome builds and annotation versions.
Chromatin accessibility profiles were aligned to a common peak set, enabling cross-dataset comparison of regulatory regions.
Cell type annotations were manually re-annotated using the Unified Hierarchical Annotation Framework (uHAF), which imposes a controlled vocabulary organized in a hierarchy from broad categories (e.g., "immune cell") to specific types (e.g., "CD8+ effector memory T cell, TEMRA").
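The first of these steps, mapping gene names onto a unified symbol set, can be illustrated with a minimal sketch. The alias table ALIAS_TO_SYMBOL and the helper harmonize_counts are hypothetical names introduced here for illustration; the actual mapping and tooling used by the hECA team are not described in this post.

```python
from collections import defaultdict

# Hypothetical alias table for illustration only; a real pipeline would load
# a complete mapping from a gene nomenclature resource.
ALIAS_TO_SYMBOL = {
    "MLL": "KMT2A",
    "PDL1": "CD274",
}


def harmonize_counts(counts, unified_symbols):
    """Map one cell's {gene_name: count} dict onto a unified symbol set.

    Aliases are renamed, counts for names that collapse onto the same symbol
    are summed, and genes outside the unified set are dropped.
    """
    merged = defaultdict(int)
    for name, count in counts.items():
        symbol = ALIAS_TO_SYMBOL.get(name, name)
        if symbol in unified_symbols:
            merged[symbol] += count
    # Emit every unified symbol so that all cells share one feature space.
    return {symbol: merged.get(symbol, 0) for symbol in sorted(unified_symbols)}


unified = {"CD274", "CD3E", "KMT2A"}
cell = {"MLL": 2, "KMT2A": 1, "PDL1": 4, "UNKNOWN1": 7}
harmonized = harmonize_counts(cell, unified)
print(harmonized)
```

Emitting a zero for every missing symbol is a deliberate choice in this sketch: it keeps the feature order identical across cells from different studies, which is what makes downstream comparison possible.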

The manual re-annotation aspect is particularly noteworthy. Automated annotation tools (CellTypist, scType) can propagate errors from reference datasets; the hECA team performed expert review to correct misannotations and enforce consistency across studies.
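To make the idea of a hierarchical controlled vocabulary concrete, here is a small sketch. The PARENT and SYNONYMS tables are hypothetical fragments invented for illustration and are not taken from uHAF itself; the point is only to show how inconsistent study labels can be resolved to one term and compared at a shared level of the hierarchy.

```python
# Hypothetical fragment of a hierarchical vocabulary: each term maps to its parent.
PARENT = {
    "immune cell": None,
    "T cell": "immune cell",
    "CD4+ T cell": "T cell",
    "CD8+ T cell": "T cell",
}

# Hypothetical synonym table from study-specific labels to controlled terms.
SYNONYMS = {
    "t cells": "T cell",
    "cd4+ t cells": "CD4+ T cell",
    "t helper cells": "CD4+ T cell",
    "th1 lymphocytes": "CD4+ T cell",
}


def to_controlled_term(raw_label):
    """Return the controlled term for a raw label, or None if it is unknown."""
    return SYNONYMS.get(raw_label.strip().lower())


def lineage(term):
    """Return the path from the root of the hierarchy down to the given term."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return list(reversed(path))


def common_level(term_a, term_b):
    """Return the most specific term shared by both lineages, or None."""
    shared = None
    for a, b in zip(lineage(term_a), lineage(term_b)):
        if a != b:
            break
        shared = a
    return shared


print(to_controlled_term("Th1 lymphocytes"))
print(lineage("CD4+ T cell"))
print(common_level("CD4+ T cell", "CD8+ T cell"))
```

Labels that return None are the ones that would need expert attention; the sketch makes no attempt to guess them.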

The AI-Ready Design Philosophy

The explicit targeting of AI model pre-training distinguishes hECA v2.0 from prior atlas efforts. The authors note that their dataset served as the pre-training corpus for scMulan, a large generative cellular AI model. This creates a direct feedback loop: the atlas is designed to support the model, and the model's performance validates the atlas's quality.

For AI applications, several design choices matter:

Consistent tokenization. Transformer-based foundation models for single-cell data (scGPT, Geneformer, scBERT) typically tokenize gene expression by discretizing continuous values into bins or by ranking genes within each cell. A unified feature space across the atlas ensures that these tokens have consistent meaning regardless of the source dataset.
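The two tokenization styles mentioned here can be sketched in a few lines. This is a simplified illustration, assuming equal-width bins and a plain descending sort; it is not the exact procedure of any of the models named above, and the function names bin_tokens and rank_tokens are hypothetical.

```python
import math


def bin_tokens(expression, n_bins):
    """Discretize positive values into equal-width bins 1..n_bins; zeros map to 0."""
    positive = [v for v in expression if v > 0]
    if not positive:
        return [0] * len(expression)
    top = max(positive)
    tokens = []
    for v in expression:
        if v <= 0:
            tokens.append(0)
        else:
            # Scale into (0, n_bins] and round up, so the maximum lands in bin n_bins.
            tokens.append(min(n_bins, max(1, math.ceil(v / top * n_bins))))
    return tokens


def rank_tokens(expression, gene_names):
    """Return the names of expressed genes, ordered from highest to lowest value."""
    expressed = [(v, g) for v, g in zip(expression, gene_names) if v > 0]
    # Sort by descending value; ties are broken alphabetically for determinism.
    expressed.sort(key=lambda pair: (-pair[0], pair[1]))
    return [g for _, g in expressed]


genes = ["CD3E", "CD4", "GAPDH", "MS4A1"]
expression = [1.0, 0.0, 10.0, 5.0]
print(bin_tokens(expression, n_bins=5))
print(rank_tokens(expression, genes))
```

Both schemes only produce comparable tokens across datasets if the gene list is the same everywhere, which is why the unified feature space matters.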

Balanced tissue representation. Training data imbalance—where blood and brain tissues are overrepresented while rarer tissues like adrenal glands are scarce—can bias model behavior. hECA v2.0 documents the tissue distribution, enabling informed sampling strategies during training.
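One way a documented tissue distribution could inform sampling is to temper the natural frequencies with an exponent. The sketch below is an assumption about how a user might do this, not a method described by the hECA authors; the cell counts are made up for illustration and sampling_weights is a hypothetical helper.

```python
def sampling_weights(cells_per_tissue, alpha=0.5):
    """Return per-tissue sampling probabilities proportional to count ** alpha.

    alpha = 1 reproduces the natural distribution; alpha = 0 samples all
    tissues uniformly; values in between soften the imbalance.
    """
    scaled = {tissue: n ** alpha for tissue, n in cells_per_tissue.items()}
    total = sum(scaled.values())
    return {tissue: s / total for tissue, s in scaled.items()}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
counts = {"blood": 1_000_000, "adrenal gland": 10_000}

natural = sampling_weights(counts, alpha=1.0)
tempered = sampling_weights(counts, alpha=0.5)
print(natural)
print(tempered)
```

With these made-up counts, the rarer tissue moves from roughly 1% of sampled cells under the natural distribution to roughly 9% with alpha = 0.5. The right exponent is an empirical question and would need to be tuned for a given model.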

Dual-modality pairing. Having both transcriptomic and epigenomic (chromatin accessibility) data from human tissues enables models that learn relationships between gene regulation and expression, a richer biological representation than expression alone.

Critical Assessment

Coverage remains uneven. While 42 organs are represented, the depth of coverage varies substantially. Some tissues may have hundreds of thousands of cells while others have tens of thousands. This imbalance reflects the field's historical research priorities rather than biological importance.

Annotation quality depends on human expertise. The manual re-annotation is a strength for accuracy but a limitation for scalability. As new datasets are generated at accelerating rates, maintaining annotation quality through expert review becomes progressively more challenging. Automated methods with human-in-the-loop verification may be necessary for future versions.

Batch effects persist. Despite standardization, technical variation between datasets generated by different laboratories, using different protocols, and on different platforms cannot be fully eliminated. The atlas documents known batch structures, but downstream users must still apply batch correction methods.
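As a rough illustration of what batch correction means, the sketch below subtracts each batch's mean from one gene's values. This is a deliberately crude baseline and not the approach of the atlas or of any particular tool; center_per_batch and the batch labels are hypothetical. It can also erase real biology when cell-type composition differs between batches, which is one reason dedicated methods exist.

```python
from collections import defaultdict


def center_per_batch(values, batches):
    """Subtract each batch's mean from the values of the cells in that batch."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, batch in zip(values, batches):
        sums[batch] += value
        counts[batch] += 1
    means = {batch: sums[batch] / counts[batch] for batch in sums}
    return [value - means[batch] for value, batch in zip(values, batches)]


# One gene measured in four cells from two hypothetical laboratories.
values = [1.0, 3.0, 11.0, 13.0]
batches = ["lab_A", "lab_A", "lab_B", "lab_B"]
centered = center_per_batch(values, batches)
print(centered)
```

After centering, the constant offset between the two laboratories is gone while the within-batch differences are preserved.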

The atlas is a snapshot. The Human Cell Atlas and similar projects continue to generate data. hECA v2.0 captures a specific moment in time; its utility depends on how frequently it is updated and whether the annotation framework remains consistent across versions.

Comparison with alternatives is needed. CellxGene Census from the Chan Zuckerberg Initiative also provides large-scale standardized single-cell data. A systematic comparison of coverage, annotation quality, and utility for AI model training between these resources would be valuable for the community.

Implications for Cellular AI Models

The emergence of AI-ready atlases signals a maturation of the single-cell field from data generation to data engineering. Just as ImageNet's curation enabled the deep learning explosion in computer vision, standardized cell atlases may play an analogous role for biological foundation models.

However, the analogy has limits. Images have a natural structure (pixel grids) that maps cleanly to neural network architectures. Single-cell data is sparse, noisy, and high-dimensional in ways that challenge standard architectures. The success of atlas-pretrained models will depend not only on data quality but on architectural innovations tailored to biological data characteristics.

Open Questions

How should AI model pre-training handle the inherent imbalance in tissue representation within atlas datasets?
Can automated annotation methods achieve sufficient quality to scale atlas construction beyond what expert manual review permits?
What is the optimal combination of omics modalities for pre-training biological foundation models—is dual-omics sufficient, or do protein, spatial, and perturbation data add critical information?
How should versioning and updating of AI-ready atlases be managed to ensure reproducibility of models trained on earlier versions?

Closing Reflection

hECA v2.0 represents a thoughtful effort to transform the single-cell data deluge into a structured resource suitable for training AI models. Its value lies not merely in scale (10.8 million cells is large but not overwhelming) but in the careful standardization and annotation that make those cells computationally comparable. As biological AI models grow in ambition, the quality of their training data will increasingly determine their ceiling. Atlas-engineering efforts like hECA v2.0 are laying that foundation.

References

Xi, X., Chen, Y., Wu, X., Hao, M., Li, J., Bian, H., et al. (2025). hECA v2.0: an AI-ready ensemble cell atlas of single-cell RNA and ATAC sequencing data. Scientific Data, 13(1).

Pan, L., et al. (2024). Single Cell Atlas: a single-cell multi-omics human cell encyclopedia. Genome Biology.
