Federated Learning in Healthcare and Finance: From Theory to Deployment Reality

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Federated learning was supposed to solve the central tension of modern machine learning: you need large, diverse datasets to train good models, but the data you need is locked inside hospitals and banks that cannot, and should not, share it. The theory is elegant: train models locally, share only model updates (gradients or weights) rather than raw data, aggregate centrally. The practice, as a growing body of deployment-focused research reveals, is considerably messier. Data heterogeneity, communication overhead, and the gap between benchmark performance and real-world utility remain persistent challenges that no single framework has resolved.
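To make the "train locally, aggregate centrally" loop concrete, here is a minimal sketch of FedAvg-style weighted averaging on a toy linear model. The function names (`local_update`, `fedavg`, `run_rounds`), the learning rate, and the toy data are illustrative assumptions, not taken from any of the papers discussed here.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Example = Tuple[Sequence[float], float]  # (features, target)


def local_update(weights: Vector, data: Sequence[Example],
                 lr: float = 0.1, epochs: int = 5) -> Vector:
    """Train a linear model y ~ w.x with SGD, touching only this client's data."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def fedavg(client_weights: List[Vector], client_sizes: List[int]) -> Vector:
    """Average client models, weighting each by its local sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def run_rounds(clients: List[Sequence[Example]], dim: int, rounds: int = 20) -> Vector:
    """Server loop: broadcast the model, train locally, aggregate the results."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        local_models = [local_update(global_w, data) for data in clients]
        global_w = fedavg(local_models, [len(d) for d in clients])
    return global_w
```

Only the model parameters cross the institutional boundary in this sketch; each client's `data` is read solely inside `local_update`.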

The Research Landscape

Medical Imaging: The Leading Deployment Domain

Federated learning has found its most natural application in medical imaging, where patient data is both highly sensitive and distributed across institutions with different equipment, protocols, and patient populations.

Babar et al. (2024) provide the most empirically grounded analysis of data heterogeneity's impact on FL performance. Using the COVIDx CXR-3 chest X-ray dataset partitioned across multiple nodes to simulate institutional differences, they demonstrate that non-IID (non-independent and identically distributed) data partitions degrade FedAvg performance meaningfully compared to centralized training. This is not a surprising finding, but the systematic quantification across heterogeneity conditions provides a reliable baseline that later work builds upon.
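One common way to simulate institutional differences from a single dataset is a Dirichlet label-skew partition, where a smaller concentration parameter produces more skewed clients. The sketch below assumes that scheme purely for illustration; the summary above does not say which partitioning method Babar et al. used, and `dirichlet_partition` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict
from typing import Dict, List


def dirichlet_partition(labels: List[int], num_clients: int,
                        alpha: float, seed: int = 0) -> List[List[int]]:
    """Split sample indices across clients with per-class Dirichlet proportions.

    Small alpha (e.g. 0.1) gives strongly non-IID clients; large alpha
    (e.g. 100) approaches an IID split.
    """
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet sample is a vector of Gamma draws normalized to sum to 1.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        if total == 0.0:  # guard against numerical underflow for tiny alpha
            draws, total = [1.0] * num_clients, float(num_clients)
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += draws[c] / total
            if c == num_clients - 1:
                end = len(indices)  # last client takes whatever remains
            else:
                end = min(max(int(round(cum * len(indices))), start), len(indices))
            clients[c].extend(indices[start:end])
            start = end
    return clients
```

Every index is assigned to exactly one client, so results across different `alpha` values compare the same underlying data under different degrees of skew.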

Mastoi et al. (2025) combine FL with explainable AI for brain tumor classification from MRI scans. Their framework achieves competitive classification accuracy while providing interpretable attention maps that show which image regions drive predictions. The explainability component addresses a practical deployment barrier: clinicians are reluctant to trust black-box predictions in diagnostic contexts, and FL adds an additional layer of opacity since the model was trained on data the clinician never saw.

Gupta et al. (2025) benchmark three prominent FL frameworks—NVIDIA FLARE, Flower, and Owkin Substra—for medical imaging deployment. This comparative evaluation addresses a practical question that academic papers often neglect: which framework should an institution actually use? Their findings suggest that NVIDIA FLARE offers the most mature infrastructure for production deployment, Flower provides the greatest flexibility for research, and Substra prioritizes regulatory compliance.

Rahmaniar et al. (2025) survey the broader landscape of FL in medical imaging, identifying three phases of the field's evolution: initial proof-of-concept studies (2019-2021), benchmark development (2022-2023), and deployment-focused work (2024-2025). They argue that the field is transitioning from demonstrating that FL can work to understanding when and how it should be deployed.

Financial Fraud Detection: The Second Wave

Finance represents the second major deployment domain for FL, driven by a specific problem structure: fraudulent transactions are rare events distributed across institutions, and no single bank sees enough fraud to train a robust detector alone.

Abadi et al. (2024) present Starlit, a privacy-preserving FL framework specifically designed for financial fraud detection. Starlit addresses a limitation of standard FL: even gradient sharing can leak information about individual transactions. Their framework adds differential privacy and secure aggregation to standard FL, reducing the information leakage risk while maintaining detection performance close to centralized baselines.
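The two ingredients named above can be sketched generically: clipping plus Gaussian noise for differential privacy, and pairwise masks that cancel in the sum for secure aggregation. This is a toy illustration under simplifying assumptions, not Starlit's actual protocol. Production secure aggregation typically uses modular integer arithmetic and cryptographic key agreement rather than a shared seed, and the noise scale must be calibrated to a privacy budget, which is omitted here. All names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def clip_and_noise(update: Vector, clip_norm: float,
                   noise_std: float, rng: random.Random) -> Vector:
    """Bound the update's L2 norm, then add Gaussian noise to each coordinate."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [u * scale + rng.gauss(0.0, noise_std) for u in update]


def pairwise_masks(num_clients: int, dim: int, seed: int) -> List[Vector]:
    """Build one mask per client such that all masks sum to (about) zero.

    For each pair (i, j), client i adds a random vector and client j
    subtracts the same vector, so the pair's contribution cancels.
    """
    rng = random.Random(seed)  # stand-in for pairwise shared secrets
    masks = [[0.0] * dim for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            shared = [rng.uniform(-1e3, 1e3) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] += shared[k]
                masks[j][k] -= shared[k]
    return masks


def secure_sum(updates: List[Vector], masks: List[Vector]) -> Vector:
    """Server view: it receives only masked updates and sums them."""
    masked = [[u + m for u, m in zip(upd, msk)] for upd, msk in zip(updates, masks)]
    return [sum(col) for col in zip(*masked)]
```

The server recovers the aggregate without seeing any single institution's update in the clear, while clipping and noise limit how much any one transaction can influence that aggregate.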

Aljunaid et al. (2025), the most highly cited paper in this cohort, combine FL with explainable AI for banking fraud detection. Their approach achieves high detection accuracy while providing feature-importance explanations that satisfy regulatory requirements for model interpretability. The high citation count reflects a convergence of practical needs: banks need fraud detection that works, regulators need explanations, and privacy law prohibits data sharing.

Kasyap et al. (2024) address the personalization problem in FL for finance. Standard FL produces a single global model, but different banks have different fraud patterns, customer bases, and risk profiles. Their personalized FL approach allows each participating institution to maintain a locally adapted model while still benefiting from collaborative training, improving detection of institution-specific fraud patterns beyond what the global model achieves.
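One simple way to obtain a locally adapted model is to interpolate between the shared global model and a model trained only on local data, choosing the mixing weight on a local validation set. The sketch below assumes that approach for illustration; it is not claimed to be the method of Kasyap et al., and the names `mse` and `personalize` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Example = Tuple[Sequence[float], float]  # (features, target)


def mse(weights: Vector, data: Sequence[Example]) -> float:
    """Mean squared error of a linear model on the given examples."""
    errors = [(sum(w * x for w, x in zip(weights, feats)) - y) ** 2
              for feats, y in data]
    return sum(errors) / len(errors)


def personalize(global_w: Vector, local_w: Vector,
                val_data: Sequence[Example], grid: int = 11) -> Tuple[float, Vector]:
    """Pick alpha in [0, 1] so that alpha*local + (1-alpha)*global fits best locally."""
    best_alpha, best_w = 0.0, list(global_w)
    best_loss = mse(best_w, val_data)
    for step in range(1, grid):
        alpha = step / (grid - 1)
        mixed = [alpha * l + (1 - alpha) * g for l, g in zip(local_w, global_w)]
        loss = mse(mixed, val_data)
        if loss < best_loss:
            best_alpha, best_w, best_loss = alpha, mixed, loss
    return best_alpha, best_w
```

An institution whose fraud patterns resemble the consortium average ends up near `alpha = 0`, while one with distinctive patterns drifts toward its own local model.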

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
Data heterogeneity degrades FL performance meaningfully	Babar et al. systematic evaluation	Supported — consistent finding across conditions
Explainable FL improves clinical trust and adoption	Mastoi et al. brain tumor study	Partially supported — explainability demonstrated, adoption impact assumed
NVIDIA FLARE is most deployment-ready for healthcare	Gupta et al. framework benchmark	Supported for current state — but frameworks evolve rapidly
Privacy-preserving FL maintains detection close to centralized	Abadi et al. Starlit	Supported — but specific to their experimental setup
Personalized FL outperforms global FL for institution-specific fraud	Kasyap et al. personalization study	Supported — 5-12% improvement on local fraud patterns
FL has matured from proof-of-concept to deployment	Rahmaniar et al. survey	Partially supported — deployment-focused work exists, but large-scale production deployments remain limited

Open Questions and Future Directions

The heterogeneity gap. Babar et al. show 8-15% degradation with non-IID data. Can this gap be closed, or is it a fundamental cost of privacy preservation? Approaches like FedProx and SCAFFOLD reduce but do not eliminate it.
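FedProx limits client drift by adding a proximal term, (mu/2)·||w − w_global||², to each client's local objective, which contributes mu·(w − w_global) to the local gradient. The sketch below applies that term to a toy linear model; the model, learning rate, and function name are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Example = Tuple[Sequence[float], float]  # (features, target)


def fedprox_local_update(global_w: Vector, data: Sequence[Example], mu: float,
                         lr: float = 0.05, epochs: int = 200) -> Vector:
    """Local SGD on squared error plus (mu/2)*||w - global_w||^2.

    With mu = 0 this reduces to plain local SGD; larger mu keeps the local
    model closer to the global one, which limits drift on non-IID data.
    """
    w = list(global_w)
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [
                wi - lr * (err * xi + mu * (wi - gi))
                for wi, xi, gi in zip(w, x, global_w)
            ]
    return w
```

In the toy case of a single example with x = 1 and y = 4, starting from a global weight of 0, plain SGD converges toward 4, while `mu = 1.0` settles near 2, halfway between the local optimum and the global model.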

Regulatory alignment. Healthcare (HIPAA, GDPR) and finance (PSD2, AI Act) have different regulatory frameworks. FL frameworks that work for one domain may not satisfy the requirements of another. Cross-domain frameworks are needed.

Communication efficiency at scale. Most FL studies involve 5-20 participating nodes. Real hospital networks or banking consortia may involve hundreds. Communication overhead scales with participant count, and current compression techniques may be insufficient.
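Top-k sparsification is one widely used compression technique: each client sends only the k largest-magnitude coordinates of its update and keeps the remainder as a residual to fold into the next round. The sketch below is a minimal illustration with hypothetical function names, not a description of what any cited framework does.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def top_k_sparsify(update: Vector, k: int) -> Tuple[Dict[int, float], Vector]:
    """Keep the k largest-magnitude entries; return them plus the unsent residual."""
    keep = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)[:k]
    sparse = {i: update[i] for i in sorted(keep)}
    residual = [0.0 if i in sparse else u for i, u in enumerate(update)]
    return sparse, residual


def densify(sparse: Dict[int, float], dim: int) -> Vector:
    """Server side: expand the sparse message back into a full-length vector."""
    dense = [0.0] * dim
    for i, value in sparse.items():
        dense[i] = value
    return dense
```

The payload shrinks from `dim` floats to k index-value pairs per client per round, at the cost of delayed information in the residual.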

Adversarial robustness. FL is vulnerable to poisoning attacks where a malicious participant submits corrupted gradients. In healthcare and finance, the consequences of a poisoned model are severe. Robust aggregation methods exist but add computational and communication overhead.
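Coordinate-wise median is one of the simplest robust aggregation rules: a minority of extreme updates cannot pull the median arbitrarily far, unlike the mean. The sketch below is illustrative only; the function names are hypothetical, and real deployments may prefer other rules such as trimmed means.

```python
import statistics
from typing import List

Vector = List[float]


def mean_aggregate(updates: List[Vector]) -> Vector:
    """Plain averaging: a single extreme update can dominate the result."""
    return [sum(col) / len(col) for col in zip(*updates)]


def median_aggregate(updates: List[Vector]) -> Vector:
    """Coordinate-wise median: tolerant of a minority of corrupted updates."""
    return [statistics.median(col) for col in zip(*updates)]
```

With three honest updates near `[1.0, 1.0]` and one poisoned update of `[100.0, -100.0]`, the mean is dragged far from the honest cluster while the median stays close to it. The extra cost mentioned above shows up as per-coordinate sorting on the server instead of a single running sum.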

Incentive design. Why should institutions participate in FL? The benefits are asymmetric—institutions with less data benefit more from collaboration. Without proper incentive mechanisms, the institutions with the best data may decline to participate.

What This Means for Practitioners

The message from this body of work is cautiously optimistic. FL works for both healthcare imaging and financial fraud detection, but deploying it requires careful attention to data heterogeneity, framework selection, and regulatory compliance. For healthcare institutions evaluating FL, Gupta et al.'s framework comparison provides a practical starting point. For financial institutions, the combination of personalized FL (Kasyap et al.) with privacy-preserving aggregation (Abadi et al.) represents the current best practice.

The gap between academic benchmarks and production deployment remains the field's central challenge. Closing it requires not just algorithmic advances but engineering maturity in areas like monitoring, debugging, and model governance.

References (8)

[1] Babar, M., Qureshi, B., & Koubaa, A. (2024). Investigating the impact of data heterogeneity on the performance of federated learning algorithm using medical imaging. PLoS ONE, 19.

[2] Mastoi, Q., Latif, S., & Brohi, S. (2025). Explainable AI in medical imaging: an interpretable and collaborative federated learning model for brain tumor classification. Frontiers in Oncology.

[3] Gupta, R., Chowdhury, A., & Nalawade, S. (2025). Benchmarking Federated Learning Frameworks for Medical Imaging Deployment: A Comparative Study of NVIDIA FLARE, Flower, and Owkin Substra. arXiv preprint.

[4] Rahmaniar, W., Deng, Z., & Yang, Y. (2025). Future of the Medical World: Collaborative Medical Imaging AI With Federated Learning. IEEE Consumer Electronics Magazine.

[5] Abadi, A., Doyle, B., & Gini, F. (2024). Starlit: Privacy-Preserving Federated Learning to Enhance Financial Fraud Detection. IEEE FLTA.

[6] Aljunaid, S., Almheiri, S., & Dawood, H. (2025). Secure and Transparent Banking: Explainable AI-Driven Federated Learning Model for Financial Fraud Detection. JRFM, 18(4), 179.

[7] Kasyap, H., Atmaca, U., & Maple, C. (2024). Privacy-preserving personalised federated learning financial fraud detection. IET Conference Proceedings.

[8] Hu, J., Yang, Z., Wang, P., Zhao, G., Huang, H., Zong, Z., et al. (2025). Federated Learning for Medical Image Analysis: Privacy-Preserving Paradigms and Clinical Challenges. Transactions on Artificial Intelligence.
