
🏥 Multimodal & Clinical Foundation Models

Unified multimodal architectures informing gene-brain-behavior integration


📋 Overview

This section covers multimodal and clinical foundation models that integrate multiple modalities beyond genetics and neuroimaging, including medical imaging, text, video, and clinical data. These models represent the state-of-the-art in unified multimodal AI for healthcare and general-purpose vision-language understanding.

🎯 Model Registry

General Multimodal Models

| Model    | Architecture                      | Key Innovation                              | Parameters            | Documentation    |
| -------- | --------------------------------- | ------------------------------------------- | --------------------- | ---------------- |
| Flamingo | Perceiver + gated cross-attention | Few-shot multimodal VLM via frozen encoders | 3B / 4B / 9B / 80B    | Code Walkthrough |
| BAGEL    | MoT decoder + SigLIP + VAE        | Unified understanding + generation          | 7B active / 14B total | Code Walkthrough |
| MoT      | Sparse transformer                | Modality-aware FFNs (~55% FLOPs)            | Scales to 7B+         | Code Walkthrough |

Medical Multimodal Models

| Model    | Architecture                 | Clinical Focus                    | Languages | Documentation    |
| -------- | ---------------------------- | --------------------------------- | --------- | ---------------- |
| M3FM     | CLIP + medical LLM           | CXR + CT report generation        | EN + CN   | Code Walkthrough |
| Me-LLaMA | Continually pretrained LLaMA | Medical knowledge integration     | English   | Code Walkthrough |
| TITAN    | Vision transformer           | Whole-slide pathology (gigapixel) | English   | Code Walkthrough |

Medical Data Catalog

| Resource    | Coverage              | Use Case                         | Documentation    |
| ----------- | ---------------------- | -------------------------------- | ---------------- |
| FMS-Medical | 100+ medical datasets  | Dataset discovery + benchmarking | Code Walkthrough |

Why Multimodal Models Matter for Neuro-Omics

While the Neuro-Omics KB focuses primarily on genetics and brain foundation models, understanding multimodal integration patterns is critical for:

  1. Integration Strategy Design
      • BAGEL and MoT demonstrate successful architectures for combining diverse modalities
      • Medical models show how to handle domain-specific data with limited labels

  2. Zero-Shot Transfer Learning
      • Medical models excel at cross-domain and cross-language generalization
      • These patterns inform how to transfer gene-brain models to new cohorts

  3. Clinical Translation
      • Medical VLMs provide templates for integrating brain imaging with clinical text
      • Pathology models show how to scale vision transformers to gigapixel inputs

  4. LLM Integration
      • Me-LLaMA demonstrates medical knowledge injection into general LLMs
      • This approach extends to neuro-omics applications (e.g., genetics literature + brain phenotypes)

Integration with ARPA-H Brain-Omics Model (BOM)

The BOM vision includes multimodal integration beyond gene-brain fusion:

Gene embeddings → |
                  | → Brain-Omics Model (BOM) → Clinical predictions
Brain embeddings →|                             ↓
                  |                         Multimodal LLM
Clinical text    →|                         (reasoning + reports)
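
The diagram above is conceptual. As one possible concrete reading, gene and brain embeddings could be fused with cross-attention before a clinical prediction head, with the pooled representation also exposed to a multimodal LLM for report generation. The PyTorch sketch below is a hypothetical illustration only; the module name, dimensions, pooling, and attention scheme are assumptions, not a specified BOM design.

```python
import torch
from torch import nn

class BOMFusion(nn.Module):
    """Hypothetical fusion block in the spirit of the diagram above:
    brain tokens attend over gene tokens, and the fused representation
    feeds a clinical prediction head."""

    def __init__(self, gene_dim=512, brain_dim=512, hidden=512, n_outcomes=10):
        super().__init__()
        self.gene_proj = nn.Linear(gene_dim, hidden)
        self.brain_proj = nn.Linear(brain_dim, hidden)
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(hidden), nn.Linear(hidden, n_outcomes))

    def forward(self, gene_tokens, brain_tokens):
        # gene_tokens: (batch, n_genes, gene_dim); brain_tokens: (batch, n_regions, brain_dim)
        g = self.gene_proj(gene_tokens)
        b = self.brain_proj(brain_tokens)
        fused, _ = self.cross_attn(query=b, key=g, value=g)  # brain attends to genes
        pooled = fused.mean(dim=1)                           # simple mean pooling
        # Clinical logits plus a pooled embedding that a multimodal LLM could consume
        return self.head(pooled), pooled
```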

Multimodal models inform the BOM design in three ways:

1. Architecture Patterns

  • BAGEL/MoT: Show how to build unified models with understanding + generation
  • M3FM: Demonstrates two-tower CLIP-style alignment for medical domains (see the sketch after this list)
  • TITAN: Provides hierarchical vision transformer patterns for multi-scale data
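
As a reference for the M3FM-style pattern noted above, two-tower CLIP-style alignment can be sketched in a few lines of PyTorch: image and text embeddings are projected into a shared space and trained with a symmetric contrastive loss. This is a minimal, hypothetical sketch; the class name, dimensions, and temperature initialization are assumptions rather than M3FM's actual implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TwoTowerAligner(nn.Module):
    """Minimal CLIP-style two-tower alignment (illustrative only)."""

    def __init__(self, image_dim=1024, text_dim=768, shared_dim=512):
        super().__init__()
        # Linear projections into a shared embedding space
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        # Learnable temperature, initialized to log(1/0.07) as in CLIP
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(len(img), device=img.device)
        # Symmetric contrastive loss over image->text and text->image directions
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
```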

2. Training Strategies

  • Zero-shot capabilities: Critical for rare diseases and new cohorts
  • Multilingual support: Extends models to diverse global populations
  • Continual pretraining: Me-LLaMA shows how to inject domain knowledge post-hoc (a minimal sketch follows this list)
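
To make the continual-pretraining idea concrete, the sketch below resumes causal-language-modeling training of a general LLM on a domain corpus using Hugging Face transformers. The base model name, corpus file, and hyperparameters are placeholders; this is not Me-LLaMA's actual recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"          # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder domain corpus: one biomedical/clinical document per line
corpus = load_dataset("text", data_files={"train": "medical_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-medical-cpt",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           learning_rate=2e-5,
                           num_train_epochs=1),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # continue next-token prediction on the domain corpus
```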

3. Clinical Workflows

  • Report generation: Automated clinical summaries from multimodal inputs
  • Diagnosis support: Combining embeddings for downstream classification (see the sketch after this list)
  • Few-shot adaptation: Rapid deployment with minimal labeled data
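
As a minimal illustration of the diagnosis-support and few-shot-adaptation points above, frozen embeddings from separate encoders can simply be concatenated and fed to a lightweight classifier. The arrays below are random stand-ins for real gene, brain, and clinical-text embeddings; all dimensions and labels are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical precomputed embeddings for the same n subjects
rng = np.random.default_rng(0)
n = 200
gene_emb  = rng.normal(size=(n, 128))   # stand-in for gene-encoder embeddings
brain_emb = rng.normal(size=(n, 256))   # stand-in for brain-encoder embeddings
text_emb  = rng.normal(size=(n, 384))   # stand-in for clinical-text embeddings
labels    = rng.integers(0, 2, size=n)  # stand-in diagnostic labels

# Late fusion: concatenate frozen embeddings and fit a lightweight classifier
features = np.concatenate([gene_emb, brain_emb, text_emb], axis=1)
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, features, labels, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {scores.mean():.3f}")
```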

Model Selection Guide

Choose multimodal models based on your integration goals:

For Architecture Design

  • If building unified understanding + generation:
      • Start with BAGEL or MoT architectures (see the sketch after this list)
      • These show how to handle multiple modalities in one model
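
For intuition about the MoT-style "modality-aware FFN" idea mentioned above, the sketch below routes each token to a feed-forward network chosen by its modality tag while the rest of the block stays shared. This is an illustrative simplification, not the published MoT implementation (which also decouples other non-embedding parameters by modality); the names and dimensions are assumptions.

```python
import torch
from torch import nn

class ModalityAwareFFN(nn.Module):
    """Illustrative MoT-style block: one feed-forward network per modality,
    with tokens routed by an integer modality id (0 = text, 1 = image, ...)."""

    def __init__(self, d_model=512, d_ff=2048, n_modalities=2):
        super().__init__()
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_modalities)
        )

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_model); modality_ids: (batch, seq) integer tags
        out = torch.zeros_like(x)
        for m, ffn in enumerate(self.ffns):
            mask = modality_ids == m
            if mask.any():
                out[mask] = ffn(x[mask])  # each modality only uses its own FFN
        return out
```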

For Medical Applications

  • If working with medical imaging + text:
      • Use M3FM for CLIP-style alignment
      • Consider TITAN for pathology/high-resolution imaging

  • If integrating medical knowledge with LLMs:
      • Study Me-LLaMA for continual pretraining approaches
      • See FMS-Medical for dataset selection

For Zero-Shot Transfer

  • If targeting low-resource settings:
      • All medical models demonstrate strong zero-shot capabilities
      • M3FM is particularly strong for cross-language transfer

Next Steps

  1. Read model pages
      • Each model page includes architecture details, integration strategies, and reference materials

  2. Review integration patterns
      • See Design Patterns for fusion architectures
      • Check Multimodal Architectures for detailed integration guides

  3. Explore code walkthroughs
      • Practical implementation details in Code Walkthroughs

  4. Study paper summaries
      • Full paper notes available in the Research Papers section (see site navigation)

Reference Materials