Medical Vision-Language Models: From CT Scans to Clinical Reports

Merlin, a CT vision-language model trained on more than 15,000 CT scans with paired radiology reports, has drawn 118 citations by demonstrating that foundation models can interpret abdominal CT at clinically useful accuracy. The explainability gap and demographic biases in training data, however, remain unresolved.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The volume of medical imaging studies performed globally—estimated at 3.6 billion annually—far exceeds the capacity of radiologists to interpret them in a timely manner. Abdominal CT scans alone account for hundreds of millions of studies per year, each producing hundreds of slices that must be examined for abnormalities in the liver, kidneys, pancreas, bowel, vasculature, and musculoskeletal system. The promise of medical vision-language models (VLMs) is to bridge this gap: systems that can view a CT scan and generate a clinically useful report, flagging abnormalities, quantifying findings, and suggesting differential diagnoses in natural language.

The promise is closer to reality than many clinicians realize—but further from clinical deployment than many AI researchers admit.

Merlin: The CT Foundation Model

Blankemeier et al. (2024) present Merlin, a vision-language foundation model designed specifically for abdominal CT interpretation, published in Nature. The model is trained on more than 15,000 CT volumes paired with their corresponding radiology reports, one of the larger medical VLM training sets reported to date.

Merlin's architecture adapts the CLIP (Contrastive Language-Image Pre-training) paradigm for volumetric medical data. Unlike natural images, CT scans are three-dimensional volumes (typically 200–500 axial slices), requiring 3D convolutional encoders that can capture spatial relationships across slices—a tubular structure crossing multiple slices may be a blood vessel, a bile duct, or a tumor, and the distinction often depends on its 3D morphology.
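To make the CLIP-style objective concrete, here is a minimal sketch of a symmetric contrastive loss over paired scan and report embeddings. It is not Merlin's training code: the embeddings are hand-written stand-ins for encoder outputs, the function names are hypothetical, and the temperature of 0.07 is simply an assumed starting value.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: scan-to-report plus report-to-scan."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the text side
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy stand-ins for encoder outputs (assumed values, two scan-report pairs).
scan_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_reports = [[0.9, 0.1], [0.1, 0.9]]
swapped_reports = [matched_reports[1], matched_reports[0]]

loss_matched = clip_style_loss(scan_embs, matched_reports)
loss_swapped = clip_style_loss(scan_embs, swapped_reports)
print(loss_matched < loss_swapped)
```

The loss is low when each scan embedding sits closest to its own report embedding and high when the pairing is swapped, which is the signal that pulls matched scans and reports together during pre-training.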

The model demonstrates several clinically relevant capabilities:

Multi-organ abnormality detection: Identifying pathology across liver, kidneys, spleen, pancreas, and adrenal glands simultaneously—a task that requires the model to attend to different anatomical regions with different diagnostic thresholds.
Report generation: Producing free-text radiology reports that include pertinent positive and negative findings, measurements, and differential diagnoses.
Zero-shot and few-shot generalization: Performing reasonably on CT findings it was not explicitly trained on, guided by text descriptions.

The key performance result: Merlin improves on a prior model (RadFM) in report generation across standard metrics (RadGraph-F1, BERT Score, ROUGE-2, BLEU), though the authors describe this as 'an early demonstration' and note that the model tends to under-report positive findings.
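As a rough illustration of how an overlap metric such as ROUGE-2 reacts to under-reporting, the sketch below computes bigram precision, recall, and F1 for a generated report against a reference. It assumes plain whitespace tokenization and invented report text; published evaluations typically rely on established implementations with their own tokenization rules.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent token pairs using simple lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(candidate, reference):
    """Bigram-overlap precision, recall, and F1 between two texts."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # clipped bigram matches
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example: the generated report omits the renal cyst finding.
reference = "no focal liver lesion small left renal cyst"
candidate = "no focal liver lesion"

scores = rouge2(candidate, reference)
print(scores)
```

Everything the candidate says appears in the reference, so precision is perfect, but the omitted finding lowers recall. That is the pattern under-reporting of positive findings would produce on an overlap metric.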

The State of the Field: A Systematic Review

Ryu et al. (2025) provide a systematic review of vision-language foundation models for medical imaging in Biomedical Engineering Letters. Their survey covers the full spectrum of medical VLM architectures, training strategies, and clinical applications, identifying several trends:

Trend 1: Domain-specific models outperform general-purpose ones. Medical VLMs trained on medical data consistently outperform general-purpose VLMs (GPT-4V, Gemini) adapted for medical tasks. The gap is substantial for specialized imaging modalities (pathology, ophthalmoscopy, dermatoscopy) and smaller but still present for more common modalities (chest X-ray, CT).

Trend 2: Data quality matters more than data quantity. Models trained on 10,000 high-quality image-report pairs often outperform those trained on 100,000 noisy pairs. The curation of training data—ensuring that reports accurately describe the images, that diagnostic labels are correct, and that demographic representation is adequate—is labor-intensive but critical.

Trend 3: Evaluation remains inconsistent. Different papers use different metrics (BLEU, ROUGE, CheXpert F1, clinical concordance), different test sets, and different evaluation protocols, making cross-study comparison difficult. Ryu et al. call for standardized benchmarks analogous to ImageNet but specific to medical imaging.

3D Medical VLMs: Beyond Single Slices

Wu et al. (2025) address a specific limitation of many medical VLMs: they process 2D slices rather than 3D volumes. Their work, published in npj Artificial Intelligence, introduces a foundation model that natively processes 3D medical images (CT, MRI, and PET scans) without the information loss inherent in 2D projection or slice-by-slice processing.

The 3D approach matters for pathologies that are defined by their spatial extent: a liver mass is characterized by its three-dimensional shape, enhancement pattern across contrast phases, and relationship to adjacent vascular structures. A 2D model seeing one slice may classify a round lesion as a cyst; a 3D model seeing the full volume may recognize it as a metastasis with irregular margins and arterial-phase enhancement—a critical diagnostic distinction.
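The single-slice ambiguity can be shown with toy geometry. The sketch below builds two synthetic binary volumes, a sphere (standing in for a round lesion) and a tube running along the slice axis (standing in for a vessel), and compares them. The shapes, sizes, and function names are assumptions chosen for illustration; nothing here is a diagnostic model.

```python
def sphere(size, radius):
    """Binary voxel grid indexed [z][y][x] containing a centered sphere."""
    c = size // 2
    return [[[1 if (x - c) ** 2 + (y - c) ** 2 + (z - c) ** 2 <= radius ** 2 else 0
              for x in range(size)] for y in range(size)] for z in range(size)]


def tube(size, radius):
    """Binary voxel grid containing a cylinder running along the z (slice) axis."""
    c = size // 2
    return [[[1 if (x - c) ** 2 + (y - c) ** 2 <= radius ** 2 else 0
              for x in range(size)] for y in range(size)] for z in range(size)]


def slice_areas(volume):
    """Number of foreground voxels in each axial slice."""
    return [sum(sum(row) for row in axial) for axial in volume]


lesion = sphere(15, 4)
vessel = tube(15, 4)
mid = 15 // 2

# The central axial slices are identical discs...
print(lesion[mid] == vessel[mid])
# ...but the per-slice areas across the volume tell the two shapes apart.
print(slice_areas(lesion))
print(slice_areas(vessel))
```

A model that sees only the central slice receives the same input for both shapes, while the profile across slices separates them immediately. Real lesions are far less tidy, but the information argument is the same.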

Wu et al. (2025) review 23 studies on 3D VLFMs for medical imaging, synthesizing evidence that tasks requiring spatial volumetric context—liver lesion characterization, lymph node assessment, pulmonary nodule evaluation—benefit from 3D-aware architectures over 2D patch-based approaches.

The Explainability Imperative

Nie et al. (2025) tackle what may be the greatest barrier to clinical adoption: explainability. Their approach introduces concept-enhanced vision-language pre-training, a technique in which the model learns to ground its predictions in human-interpretable medical concepts (anatomical structures, pathological patterns, clinical findings) rather than in opaque feature vectors.

The idea is that a clinician presented with "the model predicts hepatocellular carcinoma because it detects arterial-phase hyperenhancement, washout on portal venous phase, and a capsule appearance" will trust that prediction more than one presented with "the model predicts hepatocellular carcinoma (confidence: 92%)." The first explanation maps onto established diagnostic criteria (LI-RADS); the second is a black box.

Nie et al.'s concept-enhanced approach achieves diagnostic accuracy comparable to non-explainable models while producing concept-level explanations that radiologists rate as "clinically meaningful" in user studies. The trade-off: the concept vocabulary is fixed at training time, meaning the model cannot explain predictions involving findings outside its concept dictionary.
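The general idea of routing a prediction through named concepts, and the fixed-vocabulary limitation that comes with it, can be sketched with a simple linear concept-bottleneck classifier. This is not Nie et al.'s method: the concept list reuses the hepatocellular carcinoma example above, and the weights, bias, threshold, and input scores are all hypothetical.

```python
import math

# Fixed concept vocabulary, decided at "training time" (assumed for illustration).
CONCEPTS = [
    "arterial-phase hyperenhancement",
    "portal venous washout",
    "capsule appearance",
]
WEIGHTS = {concept: 2.0 for concept in CONCEPTS}  # hypothetical weights
BIAS = -3.0  # hypothetical bias


def predict_with_explanation(concept_scores, threshold=0.5):
    """Return (probability, supporting concepts, findings outside the vocabulary)."""
    logit = BIAS + sum(WEIGHTS[c] * concept_scores.get(c, 0.0) for c in CONCEPTS)
    probability = 1.0 / (1.0 + math.exp(-logit))
    evidence = [c for c in CONCEPTS if concept_scores.get(c, 0.0) >= threshold]
    # Anything not in the vocabulary cannot influence or explain the prediction.
    unexplained = sorted(set(concept_scores) - set(CONCEPTS))
    return probability, evidence, unexplained


# Hypothetical concept detector outputs for one study.
scores = {
    "arterial-phase hyperenhancement": 0.9,
    "portal venous washout": 0.8,
    "capsule appearance": 0.7,
    "fat in mass": 0.9,  # a finding the vocabulary does not cover
}

probability, evidence, unexplained = predict_with_explanation(scores)
print(round(probability, 3), evidence, unexplained)
```

The explanation is simply the list of concepts that cleared the threshold, which is what makes it readable. The "fat in mass" entry is silently ignored by the classifier, mirroring the trade-off described above.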

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
Medical VLMs achieve improved CT report generation over prior models	Merlin outperforms RadFM on RadGraph-F1, BERT Score, ROUGE-2, BLEU; authors note under-reporting of positive findings (Blankemeier et al.)	✅ Supported (early demonstration)
Domain-specific VLMs outperform general-purpose VLMs	Consistent finding across multiple studies (Ryu et al.)	✅ Supported
3D-aware architectures benefit volumetric diagnostic tasks	Review of 23 studies synthesizes evidence for liver lesion, lymph node, and nodule evaluation (Wu et al., review)	✅ Supported (review-based)
Explainable VLMs maintain accuracy while providing interpretability	Concept-enhanced approach matches non-explainable baselines (Nie et al.)	✅ Supported (early results)
Medical VLMs are ready for autonomous clinical deployment	No prospective clinical trial; regulatory pathway undefined	❌ Not supported

The Demographic Bias Concern

A recurring issue across medical VLM research: training datasets are overwhelmingly drawn from North American and European academic medical centers, with demographic compositions that do not reflect global patient populations. Diseases that disproportionately affect underrepresented populations—hepatocellular carcinoma in East Asia, tuberculosis in sub-Saharan Africa, rheumatic heart disease in South Asia—may receive systematically lower diagnostic accuracy.

Merlin's training data comes from a single US academic institution. Whether its performance generalizes to CT scans acquired on different scanner models, with different contrast protocols, in patients with different body habitus distributions, is an empirical question that has not been systematically evaluated.
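One way such a generalization check might be set up is to compute sensitivity separately for each acquisition site or patient subgroup and look at the gap. The sketch below does this on invented records; the group names and outcomes are made up for illustration and do not describe any real evaluation of Merlin.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """records: iterable of (group, has_finding, model_flagged) tuples."""
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for group, has_finding, model_flagged in records:
        if has_finding:
            positives[group] += 1
            if model_flagged:
                true_positives[group] += 1
    return {group: true_positives[group] / positives[group] for group in positives}


# Invented evaluation records for two hypothetical sites.
records = [
    ("site_A", True, True), ("site_A", True, True),
    ("site_A", True, False), ("site_A", False, False),
    ("site_B", True, True), ("site_B", True, False),
    ("site_B", True, False), ("site_B", False, True),
]

per_group = sensitivity_by_group(records)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, round(gap, 3))
```

A pooled sensitivity would hide the difference between the two sites, so reporting the per-group values and the gap is the minimum needed to see whether performance transfers.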

Open Questions and Future Directions

What regulatory framework applies to AI-generated radiology reports? Current FDA guidance covers AI as a diagnostic aid but does not address AI-generated reports intended to replace (rather than supplement) radiologist interpretation.

Can federated learning diversify training data? Training VLMs on data from institutions across continents without sharing patient data could address demographic bias. The effectiveness of federated learning for large VLMs is still being evaluated.
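The core aggregation step of federated averaging (FedAvg) is small enough to sketch: each institution trains locally and sends only its parameters, and the server combines them weighted by local sample count. The two-parameter "models" and sample counts below are assumptions for illustration; a real VLM would have vastly more parameters plus communication and privacy machinery not shown here.

```python
def federated_average(client_updates):
    """Combine client parameters, weighting each client by its sample count.

    client_updates: list of (num_samples, parameters) pairs, where parameters
    is a list of floats of the same length for every client.
    """
    total_samples = sum(num_samples for num_samples, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(num_samples * params[i] for num_samples, params in client_updates)
        / total_samples
        for i in range(dim)
    ]


# Hypothetical updates from two hospitals; no patient data leaves either site.
updates = [
    (100, [1.0, 0.0]),  # smaller site
    (300, [0.0, 2.0]),  # larger site
]

global_params = federated_average(updates)
print(global_params)
```

Because the weighting follows sample counts, the larger site dominates the average. That is worth keeping in mind when the goal is demographic balance rather than raw volume.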

How should VLMs handle uncertainty? A radiologist encountering an ambiguous finding writes "cannot exclude malignancy" and recommends follow-up. VLMs must learn to express diagnostic uncertainty in clinically appropriate ways rather than producing binary classifications.
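A very simple way to think about this is a mapping from a model's estimated probability to hedged report language. The sketch below is only that: the thresholds and phrasings are hypothetical, are not clinically validated, and assume the probability is already well calibrated.

```python
def phrase_finding(finding, probability, low=0.2, high=0.8):
    """Map an estimated probability to hedged report language.

    The low/high thresholds are illustrative assumptions, not clinical guidance.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high:
        return f"Findings consistent with {finding}."
    if probability <= low:
        return f"No evidence of {finding}."
    return f"Cannot exclude {finding}; recommend follow-up imaging."


for p in (0.95, 0.5, 0.05):
    print(p, phrase_finding("malignancy", p))
```

The middle band is where the radiologist's "cannot exclude" language lives, and it disappears entirely if the system is forced to emit a binary label.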

What is the medicolegal liability for AI-generated reports? If a VLM-generated report misses a cancer, who is liable—the AI developer, the hospital, or the supervising radiologist?

Can VLMs integrate longitudinal data? Comparing current and prior imaging studies is a routine part of radiology. VLMs that can process temporal sequences of images and detect interval changes would be substantially more clinically useful than single-study models.

Implications for Radiology

Medical vision-language models are approaching the point where they can perform routine interpretation tasks at clinically useful accuracy levels. The technology is mature enough to warrant prospective clinical trials, and several are now planned or underway.

The transformation of radiology practice will not be sudden. It will be incremental: AI handling the high-volume, low-complexity studies (normal chest CTs, straightforward abdominal scans) while radiologists focus on complex cases, interventional procedures, and clinical consultation. The radiologists who engage with these tools early—learning their strengths, understanding their failure modes, contributing to their training datasets—will be better positioned than those who view them as either a threat or a curiosity.

References (4)

[1] Blankemeier, L., Cohen, J., Kumar, A. et al. (2024). Merlin: A computed tomography vision-language foundation model and dataset. Nature, 637, 943–951.

[2] Ryu, J., Kang, H., Chu, Y. et al. (2025). Vision-language foundation models for medical imaging: A review of current practices and innovations. Biomedical Engineering Letters, 15(5), 809–830.

[3] Wu, J., Wang, Y., Zhong, Z. et al. (2025). Vision-language foundation model for 3D medical imaging. npj Artificial Intelligence, 3, 15.

[4] Nie, Y., He, S., Bie, Y. et al. (2025). An explainable biomedical foundation model via large-scale concept-enhanced vision-language pre-training. arXiv preprint.
