
Vision-Language Foundation Models in Precision Oncology

A Nature paper on vision-language foundation models for cancer diagnosis signals that multimodal medical AI has crossed from research curiosity to clinical necessity.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Among the most notable AI papers of early 2025 to appear in Nature is Xiang et al.'s vision-language foundation model for precision oncology, which demonstrates that a single multimodal model, trained to jointly understand medical images and clinical text, can match or exceed specialist performance across multiple cancer types.

The trajectory from research demonstration to clinical infrastructure is accelerating.

The Architectural Shift: From Single-Modal to Joint Understanding

The medical AI of 2020–2023 was overwhelmingly unimodal. A radiology model analyzed X-rays. A pathology model examined tissue slides. A clinical NLP model processed physician notes. Each operated in isolation, unable to synthesize the multimodal information that defines real clinical reasoning, where a radiologist interprets a scan in the context of lab results, patient history, and the referring physician's clinical question.

Vision-language foundation models dissolve these boundaries. By pre-training on massive paired datasets of medical images and their associated clinical text (radiology reports, pathology descriptions, surgical notes), these models learn representations that bridge visual and linguistic modalities. The result is a system that can answer questions like "Is the mass in the upper right lobe consistent with the patient's history of adenocarcinoma?" by jointly reasoning over the CT scan and the clinical narrative.
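
To make the mechanism concrete, here is a minimal sketch of the symmetric contrastive (CLIP-style) objective that most image-text pre-training builds on. The encoders, embedding dimension, and temperature are illustrative assumptions, not details taken from any paper reviewed here.

```python
# Minimal sketch of CLIP-style contrastive pre-training on paired
# image-text data. Encoders and dimensions are illustrative placeholders,
# not the architecture of any specific paper discussed in this post.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (image, report) embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each image should match its own report, and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage: embeddings could come from any image/text encoder pair,
# e.g. ViT features of pathology tiles and BERT features of report text.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_loss(img, txt).item())
```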

Xiang et al.'s contribution is distinguished by scale and clinical validation. Their model was pretrained on large-scale pathology image and text datasets using unified masked modelling on unlabelled, unpaired data spanning multiple cancer types and imaging modalities. Crucially, the validation was performed on held-out clinical cohorts with pathologically confirmed diagnoses, the gold standard that separates genuine clinical AI from benchmark-chasing.
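
Note that a masked-modelling objective, unlike contrastive pairing, requires no paired data at all. The toy sketch below shows the general shape of such an objective (hide random tokens, reconstruct them); the tiny transformer and mask ratio are my own illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of a masked-modelling objective on unpaired token sequences,
# in the spirit of the unified masked pre-training described above.
# Architecture and mask ratio are assumptions for illustration only.
import torch
import torch.nn as nn

class MaskedModel(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(dim, dim)  # reconstruct the original token embeddings

    def forward(self, tokens: torch.Tensor, mask_ratio: float = 0.4):
        B, N, D = tokens.shape
        mask = torch.rand(B, N, device=tokens.device) < mask_ratio
        # Replace a random subset of tokens with a learned [MASK] embedding.
        corrupted = torch.where(mask.unsqueeze(-1),
                                self.mask_token.expand(B, N, D), tokens)
        recon = self.head(self.encoder(corrupted))
        # Loss only on masked positions: predict what was hidden.
        return ((recon - tokens)[mask] ** 2).mean()

# The same objective applies whether `tokens` come from image patches or
# text, which is what makes this kind of pre-training "unified".
model = MaskedModel()
loss = model(torch.randn(2, 64, 256))
loss.backward()
```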

Beyond Cancer: Ophthalmology and 3D Imaging

The vision-language paradigm is proliferating across medical specialties at remarkable speed.

EyeCLIP (Shi et al.) adapts the approach to ophthalmology, where the challenge is not merely detecting disease but detecting rare disease. Fundus photography and optical coherence tomography generate images where common conditions (diabetic retinopathy, glaucoma) dominate training data while rare conditions (Stargardt disease, retinal dystrophies) are severely underrepresented. EyeCLIP addresses this through vision-language pre-training that transfers knowledge from textual descriptions of rare conditions to visual recognition, even when few training images exist.
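
A rough sketch of how this kind of zero-shot transfer works: each candidate condition is written as a text prompt, and an image is scored by embedding similarity against every prompt, so a rare disease needs only a description, not a training set. The embeddings below are random placeholders, not EyeCLIP's actual interface.

```python
# Sketch of zero-shot retinal disease recognition with a vision-language
# model: conditions are scored by similarity to textual descriptions,
# so no rare-disease training images are required. The encoder outputs
# are hypothetical stand-ins, not EyeCLIP's actual API.
import torch
import torch.nn.functional as F

conditions = [
    "fundus photograph showing diabetic retinopathy",
    "fundus photograph showing glaucomatous optic disc cupping",
    "fundus photograph showing Stargardt disease flecks",  # rare class
]

def zero_shot_scores(image_emb: torch.Tensor,
                     text_embs: torch.Tensor) -> torch.Tensor:
    """Softmax over cosine similarity between one image and each prompt."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    return (image_emb @ text_embs.t()).softmax(dim=-1)

# Placeholder embeddings; a real pipeline would use the model's encoders.
img = torch.randn(1, 512)
txt = torch.randn(len(conditions), 512)
probs = zero_shot_scores(img, txt)
print(dict(zip(conditions, probs.squeeze(0).tolist())))
```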

Wu et al. extend the paradigm to three-dimensional medical imaging: CT, MRI, and PET scans that existing 2D-focused VLMs cannot natively handle. Their 3D vision-language model processes volumetric data directly, avoiding the information loss inherent in projecting 3D scans to 2D slices. The clinical implications are substantial: many diagnostic findings (pulmonary nodule growth patterns, brain tumor margins, cardiac chamber volumes) are inherently three-dimensional.
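
One common way to handle volumes natively, sketched below under my own assumptions about patch size and dimensions, is to tokenize the scan with non-overlapping 3D patches before a transformer, so through-plane structure is preserved rather than discarded slice by slice. This shows the general technique, not necessarily Wu et al.'s exact design.

```python
# Sketch of tokenizing a CT/MRI volume directly with 3D patches rather
# than slicing it into 2D images. Patch size and dims are illustrative.
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Splits a (B, C, D, H, W) volume into non-overlapping 3D patches
    and projects each to a token, preserving through-plane context."""
    def __init__(self, patch: int = 16, in_ch: int = 1, dim: int = 384):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        x = self.proj(volume)                 # (B, dim, D/p, H/p, W/p)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, dim) tokens

# A 64-slice scan at 256x256 resolution yields 4*16*16 = 1024 tokens.
tokens = PatchEmbed3D()(torch.randn(1, 1, 64, 256, 256))
print(tokens.shape)  # torch.Size([1, 1024, 384])
```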

The Explainability Imperative

A foundation model that diagnoses cancer accurately but inexplicably will not be adopted by clinicians. This is not a hypothetical concern; it is the primary barrier to clinical deployment of AI across virtually every medical specialty.

Nie et al. tackle this directly with their concept-enhanced vision-language pre-training approach. Rather than learning opaque visual features, their model is trained to associate images with interpretable clinical concepts: specific pathological patterns, anatomical landmarks, and diagnostic criteria that clinicians use in their own reasoning. When the model predicts malignancy, it can articulate which visual features contributed to the prediction in terms a pathologist understands.
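
A minimal sketch of the concept-bottleneck idea behind this style of explainability: the diagnosis is computed only from named, human-readable concept scores, so the explanation falls out of the architecture rather than being bolted on afterwards. The concept list and layer sizes are invented for illustration and are not Nie et al.'s implementation.

```python
# Sketch of a concept-bottleneck head: the model first predicts named
# clinical concepts, then diagnoses *through* those concepts, so every
# prediction decomposes into human-readable evidence. Concept names and
# sizes are illustrative assumptions.
import torch
import torch.nn as nn

CONCEPTS = ["nuclear pleomorphism", "abnormal mitoses", "gland distortion"]

class ConceptBottleneck(nn.Module):
    def __init__(self, feat_dim: int = 512, n_classes: int = 2):
        super().__init__()
        self.concept_head = nn.Linear(feat_dim, len(CONCEPTS))
        self.classifier = nn.Linear(len(CONCEPTS), n_classes)

    def forward(self, features: torch.Tensor):
        concept_scores = torch.sigmoid(self.concept_head(features))
        logits = self.classifier(concept_scores)  # diagnosis uses only concepts
        return logits, concept_scores

model = ConceptBottleneck()
logits, concepts = model(torch.randn(1, 512))
for name, score in zip(CONCEPTS, concepts.squeeze(0).tolist()):
    print(f"{name}: {score:.2f}")  # the evidence a pathologist can audit
```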

Van Veldhuizen et al.'s comprehensive review frames the broader landscape of foundation models in medical imaging, examining how FMs are changing image analysis by learning from large collections of unlabeled data. The review situates concept-grounded approaches like Nie et al.'s within the broader spectrum of explainability strategies, from post-hoc attribution methods to architectures designed for inherent interpretability.

The Uncomfortable Questions

Does Performance Generalize Across Populations?

Xiang et al.'s oncology model was validated on specific clinical cohorts. But cancer presents differently across populations: in prevalence, morphology, and clinical context. A model trained predominantly on data from academic medical centers in high-income countries may fail when deployed in low-resource settings where disease presentation, imaging equipment quality, and clinical workflows differ substantially.

No paper in this cohort adequately addresses this generalization challenge. It remains the elephant in the room of medical foundation models.

Who Bears Liability?

When a vision-language model misses a cancer diagnosis, who is responsible? The clinician who relied on it? The hospital that deployed it? The developers who trained it? The regulatory framework for AI-assisted diagnosis remains fragmented across jurisdictions, and foundation models, which are adapted rather than purpose-built for specific clinical tasks, fit poorly into existing regulatory categories designed for single-purpose medical devices.

What Happens to Clinical Skill?

If clinicians increasingly rely on AI for initial interpretation, will the next generation of radiologists and pathologists develop the deep visual expertise that currently defines their profession? The automation paradox suggests that as AI handles routine cases, human experts may lose proficiency precisely when they are most needed: on the rare, ambiguous cases that AI handles poorly.

Claims and Evidence

Claim | Evidence | Verdict
VLMs match specialist performance in cancer diagnosis | Xiang et al. demonstrate parity on validated clinical cohorts | ✅ Supported (specific cohorts)
VLMs generalize across populations and settings | No cross-population validation published | ⚠️ Unsubstantiated
Explainability is required for clinical adoption | Survey evidence from clinicians consistently confirms this | ✅ Strongly supported
Concept-grounded models are more interpretable | Nie et al. show concept alignment improves explanation quality | ✅ Supported (early evidence)
3D VLMs outperform 2D slice-based approaches | Wu et al. demonstrate improvement on volumetric tasks | ✅ Supported

Open Questions

  • Foundation model regulation: Should medical VLMs be regulated as medical devices, software, or a new category? The FDA's evolving framework has not yet provided clear guidance for foundation models adapted to multiple clinical tasks.
  • Data sovereignty: Medical VLMs require massive training datasets. Who owns the clinical data? How do we balance the public health benefits of AI development against patient privacy rights?
  • Calibration: A model that is 95% accurate but 99% confident is more dangerous than one that is 90% accurate and correctly calibrated. How well calibrated are medical VLMs, and does calibration transfer across domains? (A minimal calibration check is sketched just after this list.)
  • Update mechanisms: Medical knowledge evolves. How do we update deployed foundation models with new clinical evidence without catastrophic forgetting of established knowledge?
  • Integration pathways: The gap between a published model and a tool integrated into clinical workflows (PACS, EHR, CDSS) is enormous. What infrastructure is needed to bridge it?
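
Calibration, at least, is directly measurable. The sketch below computes expected calibration error (ECE), the standard diagnostic for the overconfidence risk raised in the calibration bullet; the simulated data mimics a model that reports roughly 95% confidence while being right only 90% of the time.

```python
# Sketch of expected calibration error (ECE): bin predictions by
# confidence and compare each bin's mean confidence to its actual
# accuracy. Data below is simulated for illustration.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """confidences: predicted max-probabilities; correct: 0/1 outcomes."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in bin
    return ece

# A model that says "~0.95" but is right 90% of the time shows a clear gap.
conf = np.random.uniform(0.9, 1.0, size=1000)
hits = np.random.binomial(1, 0.9, size=1000)
print(f"ECE: {expected_calibration_error(conf, hits):.3f}")
```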

What This Means for Your Research

If you work in medical AI, the vision-language foundation model paradigm is now the dominant approach, and for good reason. The ability to jointly reason over images and text mirrors clinical cognition in a way that unimodal approaches cannot. But three cautions are warranted.

First, validation on diverse populations is non-negotiable. A model validated only on data from tertiary academic centers is not ready for deployment, regardless of benchmark performance.

Second, explainability is not optional. The concept-grounded approach (Nie et al.) represents the most clinically credible path forward, but requires substantial domain expertise to implement correctly.

Third, the oncology model is impressive but limited in scope: one model, on one set of cancer types, validated on specific cohorts. The gap between this achievement and a universally deployable medical AI remains vast.

The researchers who advance this field will be those who resist the temptation to optimize for benchmarks and instead optimize for the messy, complicated, ethically fraught reality of clinical medicine.

References

[1] Xiang, J., Wang, X., Zhang, X. et al. (2025). A vision–language foundation model for precision oncology. Nature.
[2] Shi, D., Zhang, W., Yang, J. et al. (2025). A multimodal visual–language foundation model for computational ophthalmology. npj Digital Medicine.
[3] Wu, J., Wang, Y., Zhong, Z. et al. (2025). Vision-language foundation model for 3D medical imaging. Nature Machine Intelligence.
[4] Nie, Y., He, S., Bie, Y. et al. (2025). An explainable biomedical foundation model via large-scale concept-enhanced vision-language pre-training.
[5] van Veldhuizen, V., Botha, V., Lu, C. et al. (2025). Foundation models in medical imaging: a review and outlook. arXiv:2506.09095.
