# Multimodal Architecture Patterns for Brain-Omics Models
This document catalogs architectural patterns from multimodal foundation models that inform the design of ARPA-H-style Brain-Omics Model (BOM) systems. These models demonstrate how to fuse heterogeneous modalities (vision, language, time series, structured data) at scale—lessons directly applicable to gene–brain–behavior–language integration.
## Overview
Purpose: Extract design principles from state-of-the-art multimodal FMs to guide Neuro-Omics KB integration strategies as they escalate from late fusion → two-tower contrastive → unified multimodal architectures.
Scope: Medical/clinical multimodal FMs, unified vision-language-speech models, and sparse multimodal transformers.
## 1. BAGEL — Unified Multimodal Foundation Model

### Architecture Summary
Model: BAGEL (Emerging Properties in Unified Multimodal Pretraining)
Paper: arXiv:2505.14683 | Card: kb/paper_cards/bagel_2025.yaml
- Backbone: Qwen2.5 decoder-only transformer (7B active, 14B total with MoT experts)
- Modalities: Text, images, video, web data
- Architecture: Mixture-of-Transformer-Experts (MoT) with separate experts for understanding vs. generation
- Visual encoding: SigLIP2-style ViT encoder for understanding
- Visual generation: FLUX VAE + rectified-flow diffusion conditioned on transformer states
- Training: Trillions of interleaved multimodal tokens with reasoning-oriented curation
### Key Design Patterns
✅ Unified decoder-only architecture: Single transformer processes all modalities as token sequences
✅ Mixture-of-experts (MoT): Separate experts for understanding (comprehension) vs. generation tasks
✅ Interleaved data: Reasoning-oriented multimodal corpus with natural task diversity
✅ Emergent capabilities: Complex reasoning, free-form manipulation, 3D understanding from unified pretraining
### Implications for Brain-Omics Models
Direct applications:

- Gene-brain-language unification: Treat genetics (nucleotide tokens), brain (parcel tokens), and behavior (structured tokens) as additional modalities alongside text (a token-packing sketch follows the escalation path below)
- MoT for neuro-omics: Separate experts for discriminative (gene-brain association) vs. generative (report generation, counterfactual prediction) tasks
- Interleaved corpus design: Create a multimodal corpus pairing genetic variants + brain scans + cognitive assessments + clinical narratives

Escalation path:

1. Late fusion baselines (current)
2. Two-tower contrastive (gene encoder ↔ brain encoder)
3. MoT-style unified architecture in which genetics/brain/behavior tokens share a decoder with modality-specific experts
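The unification step in the first bullet can be prototyped before any expert machinery exists. The following is a minimal PyTorch sketch of packing genetics, brain-parcel, behavior, and text inputs into one interleaved token sequence with modality embeddings for a shared decoder; all dimensions, modality IDs, and class names are illustrative assumptions, not BAGEL's implementation.

```python
# Minimal sketch: project per-modality features into a shared token space and
# interleave them with text tokens for a unified decoder. All names/dims are
# illustrative assumptions, not BAGEL's code.
import torch
import torch.nn as nn

D = 512
GENE, BRAIN, BEHAV, TEXT = 0, 1, 2, 3  # hypothetical modality IDs


class TokenPacker(nn.Module):
    def __init__(self, d_gene=64, d_parcel=128, d_behav=32, vocab=32000):
        super().__init__()
        # Per-modality projectors into the shared token space.
        self.gene_proj = nn.Linear(d_gene, D)
        self.brain_proj = nn.Linear(d_parcel, D)
        self.behav_proj = nn.Linear(d_behav, D)
        self.text_emb = nn.Embedding(vocab, D)
        self.modality_emb = nn.Embedding(4, D)

    def forward(self, gene_feats, parcel_feats, behav_feats, text_ids):
        # gene_feats: (B, N_g, d_gene), parcel_feats: (B, N_p, d_parcel),
        # behav_feats: (B, N_b, d_behav), text_ids: (B, T) token ids.
        parts, mods = [], []
        for feats, proj, mod in [(gene_feats, self.gene_proj, GENE),
                                 (parcel_feats, self.brain_proj, BRAIN),
                                 (behav_feats, self.behav_proj, BEHAV)]:
            parts.append(proj(feats))
            mods.append(torch.full(feats.shape[:2], mod, dtype=torch.long))
        parts.append(self.text_emb(text_ids))
        mods.append(torch.full(text_ids.shape, TEXT, dtype=torch.long))
        tokens = torch.cat(parts, dim=1)        # (B, N_total, D) interleaved sequence
        modality = torch.cat(mods, dim=1)       # (B, N_total) modality IDs
        return tokens + self.modality_emb(modality), modality
```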
Reference materials:

- BAGEL walkthrough
- BAGEL paper card
## 2. MoT — Mixture-of-Transformers

### Architecture Summary
Model: Mixture-of-Transformers (A Sparse and Scalable Architecture for Multi-Modal Foundation Models)
Paper: arXiv:2411.04996 | Card: kb/paper_cards/mot_2025.yaml
- Backbone: Sparse multimodal transformer with modality-aware FFNs/attention
- Modalities: Text, images, speech
- Sparsity mechanism: Separate FFN/attention projections per modality; shared global self-attention
- Settings: Chameleon-style autoregressive + Transfusion-style diffusion
- Efficiency: ~55.8% of the dense baseline's FLOPs with similar or better performance
### Key Design Patterns
✅ Modality-aware sparsity: Decouple non-embedding parameters by modality
✅ Shared global attention: All tokens interact via self-attention (no routing)
✅ Drop-in replacement: Compatible with existing dense transformer architectures
✅ Stable scaling: Maintains performance across model sizes (1B → 7B → 30B)
### Implications for Brain-Omics Models
Direct applications:

- Per-modality FFNs: Separate feed-forward networks for genetics, brain MRI, fMRI, EEG, and behavior tokens
- Shared attention: Global self-attention over all modalities captures cross-modal dependencies
- Compute efficiency: Critical for scaling to large cohorts (UK Biobank N=500k+)

Integration with Neuro-Omics KB (see the sketch below):

- Implement modality-specific FFNs (genetics_ffn, brain_ffn, behavior_ffn)
- Retain shared attention over the concatenated gene+brain+behavior tokens
- Compare against learned MoE routing (modality-deterministic routing is simpler and more interpretable)
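A minimal sketch of such a block, assuming pre-norm conventions and three modalities; the genetics_ffn/brain_ffn/behavior_ffn naming mirrors the bullet above, and all dimensions are placeholders rather than the MoT paper's configuration.

```python
# Minimal MoT-style block sketch: shared global self-attention over the
# concatenated gene+brain+behavior tokens, with one FFN expert per modality.
import torch
import torch.nn as nn


class ModalitySparseBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, n_modalities=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)
        # genetics_ffn, brain_ffn, behavior_ffn: one expert per modality, no learned routing.
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_modalities)
        ])

    def forward(self, x, modality):
        # x: (B, T, d_model); modality: (B, T) IDs (0=genetics, 1=brain, 2=behavior).
        h = self.norm_attn(x)
        attn_out, _ = self.attn(h, h, h)          # all modalities mix in shared attention
        x = x + attn_out
        out = torch.zeros_like(x)
        for m, ffn in enumerate(self.ffns):
            mask = (modality == m).unsqueeze(-1)  # route each token to its modality's FFN
            # For clarity every expert runs on all tokens and is masked; a real
            # implementation would gather tokens per modality to save compute.
            out = out + ffn(self.norm_ffn(x)) * mask
        return x + out
```

Because the expert is chosen deterministically from the modality ID, there is no learned router to train or interpret, which is exactly the property the MoE-routing comparison above is meant to isolate.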
Reference materials:

- MoT walkthrough
- MoT paper card
## 3. M3FM — Multilingual Medical Model

### Architecture Summary
Model: M3FM (Multilingual Chest X-ray Report Generator)
Repo: ai-in-health/M3FM | Card: kb/model_cards/m3fm.yaml
- Backbone: Multilingual CLIP encoder + relational-memory Transformer decoder
- Modalities: Chest X-ray images, bilingual text (English/Chinese)
- Architecture: Two-tower (vision encoder + language decoder) with relational memory
- Decoder: Language selection via BOS token (1=English, 2=Chinese)
- Training: COV-CTR COVID-era CXR dataset with multilingual reports
### Key Design Patterns
✅ Two-tower fusion: Vision encoder outputs → cross-attention in language decoder
✅ Language-aware generation: Single decoder handles multiple languages via BOS conditioning
✅ Relational memory: Augmented attention for capturing long-range report dependencies
✅ Medical domain adaptation: CLIP text embeddings projected for medical terminology
### Implications for Brain-Omics Models
Direct applications:

- Brain-omics-to-language: Project brain/genetics embeddings into a CLIP-like space → generate clinical narratives
- Bilingual reporting: Extend to English/Korean for Cha Hospital developmental cohorts
- Relational memory for clinical context: Track longitudinal patient history across visits

Integration strategy:

- Use an M3FM-style two-tower for brain scan → clinical report generation (sketched below)
- Adapt relational memory for multi-visit longitudinal modeling
- Explore gene embedding → language generation (explain genetic risk in natural language)
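The two-tower generation pattern can be sketched as a frozen imaging encoder whose outputs are projected into a language decoder's cross-attention memory. The sketch below is a hedged illustration: the language-flag embedding, the plain Transformer decoder, and all dimensions are stand-ins rather than M3FM's actual relational-memory components.

```python
# Minimal two-tower sketch: frozen brain-scan encoder tokens feed a language
# decoder via cross-attention; a language embedding conditions generation.
import torch
import torch.nn as nn


class BrainReportGenerator(nn.Module):
    def __init__(self, d_enc=768, d_model=512, vocab=32000, n_langs=2):
        super().__init__()
        self.proj = nn.Linear(d_enc, d_model)            # encoder space -> decoder space
        self.tok_emb = nn.Embedding(vocab, d_model)
        self.lang_emb = nn.Embedding(n_langs, d_model)   # language-aware conditioning
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, brain_tokens, report_ids, lang_id):
        # brain_tokens: (B, N, d_enc) from a frozen imaging encoder
        # report_ids: (B, T) target report tokens; lang_id: (B,) language flags
        memory = self.proj(brain_tokens)
        tgt = self.tok_emb(report_ids) + self.lang_emb(lang_id).unsqueeze(1)
        mask = nn.Transformer.generate_square_subsequent_mask(report_ids.size(1))
        h = self.decoder(tgt, memory, tgt_mask=mask)     # cross-attend to brain tokens
        return self.lm_head(h)                           # next-token logits for the report
```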
Reference materials:

- M3FM walkthrough
- M3FM model card
## 4. Me-LLaMA — Medical LLM

### Architecture Summary
Model: Me-LLaMA (Medical LLaMA)
Repo: BIDS-Xu-Lab/Me-LLaMA | Card: kb/model_cards/me_llama.yaml
- Backbone: LLaMA-2/3 (13B/70B) with continual pretraining + LoRA instruction tuning
- Modality: Medical text (biomedical literature, clinical notes, guidelines)
- Pretraining ratio: 15:1:4 (biomedical : clinical : general)
- Training: 129B medical tokens + 214K instruction samples
- Evaluation: 12+ medical QA/NLP tasks with prompt templates
### Key Design Patterns
✅ Continual pretraining: Adapt general LLM to medical domain with curated corpus
✅ LoRA instruction tuning: Parameter-efficient adaptation for clinical reasoning
✅ Prompt engineering: Task-specific prompts for different clinical NLP tasks
✅ Evaluation harness: Structured benchmarking across medical NLP tasks
### Implications for Brain-Omics Models
Direct applications:

- Neuro-omics LLM: Continually pretrain LLaMA on neuroscience literature + genetics papers + clinical neurology notes
- Instruction tuning for clinical tasks: Adapt for cognitive assessment interpretation, genetic counseling, neuroimaging report generation
- Prompt templates: Create standardized prompts for gene-brain-behavior reasoning

As semantic bridge in BOM (sketched below):

- A Me-LLaMA-style medical LLM serves as the semantic hub for the Brain-Omics Model
- Project genetics/brain/EEG embeddings into the LLM token space for cross-modal reasoning
- Enable natural language queries over multimodal neuro-omics data
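One common way to realize the bridge is to project non-text embeddings into the LLM's hidden space and prepend them as soft prompt tokens before a natural-language query. The sketch below assumes a Hugging Face causal LLM; the model ID is a placeholder standing in for a Me-LLaMA-style checkpoint, the projector dimensions are invented, and this soft-prompt mechanism is a generic bridging pattern, not something Me-LLaMA itself implements (Me-LLaMA is text-only).

```python
# Minimal semantic-bridge sketch: project gene/brain embeddings into the LLM
# hidden size and prepend them as soft tokens before the text query.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-13b-hf"  # placeholder; swap in a medical LLM checkpoint
llm = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tok = AutoTokenizer.from_pretrained(MODEL_ID)
d_llm = llm.config.hidden_size

gene_proj = nn.Linear(256, d_llm)    # assumed 256-d gene embedding -> LLM space
brain_proj = nn.Linear(512, d_llm)   # assumed 512-d brain embedding -> LLM space


def query(gene_emb, brain_emb, question):
    # gene_emb: (1, 256), brain_emb: (1, 512); question: natural-language string
    soft_tokens = torch.stack([gene_proj(gene_emb), brain_proj(brain_emb)], dim=1)  # (1, 2, d_llm)
    ids = tok(question, return_tensors="pt").input_ids
    text_emb = llm.get_input_embeddings()(ids)
    inputs_embeds = torch.cat([soft_tokens, text_emb], dim=1)
    # Recent transformers releases accept inputs_embeds for decoder-only generation.
    out = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=128)
    return tok.decode(out[0], skip_special_tokens=True)
```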
Reference materials:

- Me-LLaMA walkthrough
- Me-LLaMA model card
## 5. TITAN — Whole-Slide Image FM

### Architecture Summary
Model: TITAN (Transformer for Integrative Tissue Analysis)
Repo: mahmoodlab/TITAN | Card: kb/model_cards/titan.yaml
- Backbone: Slide-level transformer with multi-scale patch aggregation
- Modality: Whole-slide histopathology images
- Architecture: Hierarchical attention over gigapixel images (millions of patches)
- Applications: Cancer diagnosis, survival prediction, treatment response
### Key Design Patterns
✅ Multi-scale patch processing: Handle gigapixel images via hierarchical aggregation
✅ Attention-based pooling: Learn to aggregate informative regions
✅ Slide-level embeddings: Compress millions of patches → fixed-size vectors
✅ Task-specific heads: Shared encoder for multiple downstream tasks
### Implications for Brain-Omics Models
Direct applications:

- Brain MRI analogy: Whole-brain 3D volumes → hierarchical patch aggregation (similar to TITAN's slide processing)
- Multi-scale fusion: Combine region-level (parcels) and voxel-level (fine-grained) brain features
- Histology + genetics: If histopathology data are available (e.g., brain tissue banks), TITAN-style processing + genetics fusion

Integration with Neuro-Omics KB (see the sketch below):

- Adapt TITAN's multi-scale attention for 3D MRI volumes
- Use TITAN-style patch aggregation for whole-brain sMRI + fMRI fusion
- Explore cross-modal attention: pathology patches ↔ genetic variants
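A minimal, single-scale sketch of this adaptation: patchify a 3D volume, encode the patches with a small transformer, and pool them into one scan-level embedding with a learned attention query. Patch size, depth, and dimensions are assumptions; TITAN's multi-scale hierarchy and pretrained patch encoder are not reproduced here.

```python
# Minimal sketch: 3D patchification + attention-based pooling to a scan vector.
import torch
import torch.nn as nn


class VolumeAggregator(nn.Module):
    def __init__(self, patch=16, d_model=256, n_heads=4):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(patch ** 3, d_model)                 # flatten each 3D patch
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pool_query = nn.Parameter(torch.randn(1, 1, d_model))  # learned pooling token
        self.pool_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, vol):
        # vol: (B, D, H, W) with D, H, W divisible by the patch size
        B, p = vol.size(0), self.patch
        patches = (vol.unfold(1, p, p).unfold(2, p, p).unfold(3, p, p)
                      .reshape(B, -1, p ** 3))                      # (B, N_patches, p^3)
        tokens = self.encoder(self.embed(patches))
        q = self.pool_query.expand(B, -1, -1)
        scan_vec, _ = self.pool_attn(q, tokens, tokens)             # attention-based pooling
        return scan_vec.squeeze(1)                                  # (B, d_model) scan embedding
```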
Reference materials:

- TITAN walkthrough
- TITAN model card
## 6. FMS-Medical Catalog

### Resource Summary
Catalog: Awesome Foundation Models for Advancing Healthcare
Repo: YutingHe-list/Awesome-Foundation-Models
- Scope: 200+ medical foundation models across modalities (text, vision, multimodal, protein, genomics, clinical time series)
- Organization: Bilingual (English/Chinese) with taxonomy by modality and task
- Usage: Reference catalog for discovering relevant medical FMs
### Key Resources
✅ Medical vision FMs: CXR, CT, MRI, histopathology encoders
✅ Medical LLMs: Clinical text understanding and generation models
✅ Genomics/proteomics FMs: Sequence models for molecular biology
✅ Multimodal FMs: Vision-language models for radiology, pathology reports
### Implications for Brain-Omics Models
Discovery and benchmarking:

- Identify relevant medical imaging FMs for brain scan processing
- Find medical LLMs for clinical narrative generation
- Discover multimodal architectures to adapt for gene-brain-behavior fusion

Reference for ARPA-H integration:

- Survey multimodal medical FMs to inform BOM architecture choices
- Benchmark against medical FM baselines (e.g., CXR report generation → adapt for neuroimaging)

Reference materials:

- FMS-Medical walkthrough
- FMS-Medical catalog YAML
## Integration Roadmap for Neuro-Omics KB

### Phase 1: Late Fusion Baselines (Current)
- Models: Separate encoders (Caduceus, BrainLM, FreeSurfer ROIs)
- Fusion: Concatenate embeddings → LR/GBDT prediction
- Evaluation: CCA + permutation, AUROC/AUPRC, DeLong tests
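A minimal sketch of this baseline with scikit-learn, using random placeholder arrays in place of real Caduceus/BrainLM embeddings and phenotype labels; the CCA + permutation piece is covered by the linked recipe and omitted here.

```python
# Minimal late-fusion sketch: concatenate precomputed gene and brain embeddings,
# fit a logistic-regression head, and report AUROC. Data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
gene_emb = rng.normal(size=(1000, 128))    # placeholder gene-encoder embeddings
brain_emb = rng.normal(size=(1000, 256))   # placeholder brain-encoder embeddings
y = rng.integers(0, 2, size=1000)          # placeholder binary phenotype labels

X = np.concatenate([gene_emb, brain_emb], axis=1)      # late fusion by concatenation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```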
### Phase 2: Two-Tower Contrastive
- Architecture: Frozen gene encoder ↔ frozen brain encoder with learnable projectors
- Loss: InfoNCE or similar contrastive objective
- Inspiration: CLIP-style alignment (M3FM two-tower paradigm)
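A minimal sketch of the Phase 2 objective: frozen encoder outputs pass through learnable linear projectors and are aligned with a symmetric InfoNCE (CLIP-style) loss, where row i of each batch comes from the same participant. Embedding and projection dimensions are assumptions.

```python
# Minimal InfoNCE sketch for gene-brain alignment with learnable projectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

gene_proj = nn.Linear(128, 64)   # learnable projector on frozen gene-encoder output
brain_proj = nn.Linear(256, 64)  # learnable projector on frozen brain-encoder output


def info_nce(gene_emb, brain_emb, temperature=0.07):
    # gene_emb: (B, 128), brain_emb: (B, 256); row i of each comes from the same subject
    g = F.normalize(gene_proj(gene_emb), dim=-1)
    b = F.normalize(brain_proj(brain_emb), dim=-1)
    logits = g @ b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(g.size(0))                    # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```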
### Phase 3: MoT-Style Sparse Integration
- Architecture: Shared self-attention over gene+brain+behavior tokens
- Sparsity: Modality-specific FFNs (genetics_ffn, brain_ffn, behavior_ffn)
- Inspiration: MoT paper (arXiv:2411.04996)
### Phase 4: Unified Brain-Omics Model (BOM)
- Architecture: BAGEL-style decoder-only with MoT experts
- Modalities: Genetics (nucleotide tokens) + brain (parcel/voxel tokens) + behavior (structured tokens) + language (text tokens)
- Semantic bridge: Me-LLaMA-style medical LLM as central hub
- Training: Interleaved multimodal corpus (genetic variants + brain scans + cognitive assessments + clinical notes)
## Next Steps
- Complete Phase 1 baselines (CCA + prediction on UKB gene-brain data)
- Pilot two-tower contrastive (gene-brain alignment with frozen encoders)
- Explore MoT-style sparsity (modality-specific FFNs vs. full early fusion)
- Design ARPA-H BOM architecture (unified multimodal transformer with neuro-omics tokens)
- Curate interleaved corpus (multimodal neuro-omics data for unified pretraining)
## Reference Index
Walkthrough documents:

- BAGEL walkthrough
- MoT walkthrough
- M3FM walkthrough
- Me-LLaMA walkthrough
- TITAN walkthrough
- FMS-Medical walkthrough
Paper/model cards:
- kb/paper_cards/bagel_2025.yaml
- kb/paper_cards/mot_2025.yaml
- kb/model_cards/m3fm.yaml
- kb/model_cards/me_llama.yaml
- kb/model_cards/titan.yaml
- kb/datasets/fms_medical_catalog.yaml
Integration recipes:

- Integration Strategy
- Design Patterns
- CCA + Permutation