# M3FM (Multimodal, Multidomain, Multilingual Medical Foundation Model)

## Overview

- Type: Medical vision-language foundation model
- Architecture: MultiMedCLIP (CLIP-style) + MultiMedLM (medical LLM)
- Modalities: Chest X-ray, CT, radiology reports (English + Chinese)
- Primary use: Zero-shot medical report generation and disease diagnosis across domains and languages
## Purpose & Design Philosophy

M3FM is a medical foundation model designed for zero-shot radiology report generation and diagnosis across imaging modalities (CXR, CT) and languages (English, Chinese). It first learns a shared vision-language embedding space through contrastive learning (MultiMedCLIP), then builds a multilingual medical LLM (MultiMedLM) on that space to generate reports and support diagnosis without labeled data in the target domain or language.

Key innovation: a single model handles multiple imaging modalities and languages through CLIP-style alignment plus a medical LLM, enabling deployment where labeled data is scarce.
## Architecture Highlights

- Two-stage training:
    - Stage 1: MultiMedCLIP aligns images (CXR, CT) with English text, and English with Chinese text pairs
    - Stage 2: MultiMedLM is trained on multilingual corpora for report generation
- Vision encoder: CNN/ViT encoding CXR and CT into visual embeddings
- Text encoder/decoder: Transformer-based, covering English and Chinese reports
- Alignment: CLIP-like contrastive loss creates a shared embedding space (see the loss sketch below)
- Inference: zero-shot report generation via visual → text decoding through the aligned space
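The alignment stage reduces to a symmetric contrastive (InfoNCE) objective over paired embeddings. A minimal sketch, assuming generic two-tower outputs (the temperature and tensor shapes are illustrative, not M3FM's published hyperparameters):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two towers;
    matching pairs sit on the diagonal of the similarity matrix.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)          # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)      # text -> image
    return (loss_i2t + loss_t2i) / 2
```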
## Integration Strategy

### For Neuro-Omics KB

M3FM provides patterns for integrating medical imaging with clinical text.

Key lessons for brain imaging + clinical text:

- Two-tower alignment: separate brain imaging encoder + clinical text encoder with contrastive loss
- Zero-shot transfer: applicable to new cohorts (e.g., Cha Hospital) without labeled data
- Multilingual support: extend brain-behavior models to non-English populations
- Report generation: automate clinical summaries from neuroimaging
Potential adaptation:
Brain MRI/fMRI → Vision encoder (SwiFT/BrainLM) → |
| Contrastive alignment
Clinical notes → Text encoder (medical LLM) → |
↓
Shared latent space
↓
Report generation LLM
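A minimal sketch of how that adaptation could be wired, assuming `vision_backbone` and `text_backbone` are stand-ins for pretrained models such as SwiFT/BrainLM and a medical LLM (the class name and dimensions are hypothetical):

```python
import torch.nn as nn

class BrainTextTwoTower(nn.Module):
    """Projects brain scans and clinical notes into a shared latent space."""

    def __init__(self, vision_backbone, text_backbone,
                 vision_dim=1024, text_dim=768, shared_dim=512):
        super().__init__()
        self.vision_backbone = vision_backbone  # e.g. a frozen brain FM
        self.text_backbone = text_backbone      # e.g. a frozen medical LLM
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, scans, notes):
        # Backbones are assumed to return pooled (batch, dim) features.
        v = self.vision_proj(self.vision_backbone(scans))
        t = self.text_proj(self.text_backbone(notes))
        return v, t  # feed into clip_contrastive_loss above
```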
### For ARPA-H Brain-Omics Model (BOM)
M3FM demonstrates clinical translation patterns:
```text
Brain embeddings ───┐
Clinical text ──────┼── Two-tower contrastive alignment
Gene annotations ───┘
                    ↓
          Shared embedding space
                    ↓
    Medical LLM for report generation
```
Transfer insights:

- Zero-shot diagnosis (sketched below): critical for rare neurological disorders with limited training data
- Cross-domain generalization: M3FM's CXR→CT transfer informs MRI→fMRI→CT transfers
- Multilingual clinical AI: extend neuro-omics models to global cohorts
- Few-shot learning: strong performance with minimal downstream labels
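In a CLIP-aligned space, zero-shot diagnosis reduces to scoring an image embedding against text prompts for each candidate label. A sketch under that assumption (the prompt template and `text_encoder` interface are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_diagnose(image_emb, text_encoder, labels, temperature=0.07):
    """Rank candidate diagnoses by similarity in the shared space.

    image_emb: (dim,) embedding of one scan from the shared space.
    text_encoder: callable mapping a list of strings to (n, dim) embeddings.
    """
    prompts = [f"scan showing findings consistent with {c}" for c in labels]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # (n, dim)
    image_emb = F.normalize(image_emb, dim=-1)              # (dim,)
    probs = (image_emb @ text_emb.t() / temperature).softmax(dim=-1)
    return dict(zip(labels, probs.tolist()))
```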
## Embedding Extraction Workflow
If adapting M3FM for brain imaging:
```python
# 1. Train CLIP-style alignment on brain scans + clinical notes
# 2. Load a pretrained brain FM (SwiFT, BrainLM) as the vision encoder
# 3. Load a medical LLM (Me-LLaMA, etc.) as the text encoder
# 4. Run contrastive training on paired brain-text data
# 5. Extract embeddings from the shared space for downstream tasks
```
For neuro-omics:

- Vision encoder: SwiFT (fMRI) or BrainLM (3D volumes)
- Text encoder: medical LLM pretrained on neurology literature
- Alignment data: brain scans + radiology reports from UKB, HCP
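A sketch of step 5 of the workflow above, reusing the hypothetical two-tower model from the earlier sketch to collect shared-space embeddings for downstream probes:

```python
import torch

@torch.no_grad()
def extract_embeddings(model, dataloader, device="cuda"):
    """Collect shared-space embeddings for an entire cohort.

    Assumes each batch yields (scans, notes) and that model(scans, notes)
    returns projected (vision, text) embeddings, as in the sketch above.
    """
    model.eval()
    vision_embs, text_embs = [], []
    for scans, notes in dataloader:
        v, t = model(scans.to(device), notes)
        vision_embs.append(v.cpu())
        text_embs.append(t.cpu())
    return torch.cat(vision_embs), torch.cat(text_embs)
```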
## Strengths & Limitations

### Strengths
- Genuine zero-shot: Generates reports without labeled downstream data
- Cross-domain + cross-language: Single model handles CXR, CT, English, Chinese
- Clinical validation: Evaluated on 9 downstream datasets (COVID-19, TB, etc.)
- Practical: Leverages machine translation to bootstrap multilingual capabilities
### Limitations
- Machine translation artifacts: Reliance on MT for Chinese may introduce biases
- Modality coverage: Only CXR and CT—no MRI, ultrasound, pathology
- Compute intensive: Requires substantial resources for two-stage training
- Evaluation gaps: Standard metrics may not capture clinical safety
## When to Use M3FM

✅ Use as a reference when:

- Building brain imaging + clinical text models
- Designing zero-shot transfer for new cohorts
- Implementing CLIP-style alignment for neuro-omics
- Supporting multilingual neuroimaging research

⚠️ Do not use directly for:

- Neuroimaging (trained on CXR/CT, not brain scans)
- Production clinical diagnosis (requires validation)
- Non-imaging modalities (no genetics support)

⚠️ Consider alternatives:

- BAGEL/MoT: for unified understanding + generation
- TITAN: for high-resolution pathology imaging
- Me-LLaMA: for a medical LLM without imaging
## Reference Materials

### Knowledge Base Resources
Curated materials in this KB:
- Paper Summary (PDF Notes): M3FM (2025)
- Code walkthrough: M3FM walkthrough
- Model card (YAML): kb/model_cards/m3fm.yaml
- Paper card (YAML): kb/paper_cards/m3fm_2025.yaml
Integration recipes:

- Multimodal Architectures
- Design Patterns — Two-tower contrastive section
- Integration Strategy
### Original Sources
Source code repositories:
- Local copy: external_repos/M3FM/
- Official GitHub: ai-in-health/M3FM
Original paper:

- Title: "M3FM: A Multimodal, Multidomain, Multilingual Medical Foundation Model for Zero-Shot Clinical Diagnosis"
- Authors: Liu, Fenglin; Li, Zheng; Yin, Qingyu; Huang, Jinfa; Luo, Jiebo; Thakur, Anshul; Branson, Kim; Schwab, Patrick; Yin, Bing; Wu, Xian; Zheng, Yefeng; Clifton, David A.
- Published: npj Digital Medicine, 2025
- Link: Nature: s41746-024-01339-7
- DOI: 10.1038/s41746-024-01339-7
- PDF Notes: m3fm_2025.pdf
## Next Steps in Our Pipeline
- CLIP adaptation: Implement brain imaging + clinical text contrastive learning
- Zero-shot evaluation: Test on new cohorts (Cha Hospital) without fine-tuning
- Multilingual extension: Adapt to Korean clinical notes for Cha pediatric cohort
- Report generation: Automate neuroimaging report synthesis from embeddings
- Diagnostic support: Combine M3FM patterns with gene-brain fusion for clinical predictions
## Engineering Notes
- M3FM's two-stage training separates alignment from generation—applicable to neuro-omics
- Contrastive learning requires paired data—use UKB radiology reports + imaging (see the dataset sketch below)
- Machine translation can bootstrap multilingual capabilities before human-labeled data is available
- Zero-shot evaluation is critical for rare neurological disorders with <100 cases
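On the paired-data note above, a minimal `Dataset` sketch for scan-report pairs (the JSONL manifest layout and field names are assumptions, not the UKB schema):

```python
import json
from torch.utils.data import Dataset

class PairedScanReportDataset(Dataset):
    """Yields (scan_tensor, report_text) pairs from a JSONL manifest
    with one {"scan": path, "report": text} record per line."""

    def __init__(self, manifest_path, load_scan):
        with open(manifest_path) as f:
            self.records = [json.loads(line) for line in f]
        self.load_scan = load_scan  # e.g. a NIfTI loader returning a tensor

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        return self.load_scan(rec["scan"]), rec["report"]
```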