Multimodal Architecture Patterns for Brain-Omics Models

This document catalogs architectural patterns from multimodal foundation models that inform the design of ARPA-H-style Brain-Omics Model (BOM) systems. These models demonstrate how to fuse heterogeneous modalities (vision, language, time series, structured data) at scale—lessons directly applicable to gene–brain–behavior–language integration.

Overview

Purpose: Extract design principles from state-of-the-art multimodal FMs to guide Neuro-Omics KB integration strategies as they escalate from late fusion → two-tower contrastive → unified multimodal architectures.

Scope: Medical/clinical multimodal FMs, unified vision-language-speech models, and sparse multimodal transformers.


1. BAGEL — Unified Multimodal Foundation Model

Architecture Summary

Model: BAGEL (Emerging Properties in Unified Multimodal Pretraining)
Paper: arXiv:2505.14683 | Card: kb/paper_cards/bagel_2025.yaml

  • Backbone: Qwen2.5 decoder-only transformer (7B active, 14B total with MoT experts)
  • Modalities: Text, images, video, web data
  • Architecture: Mixture-of-Transformer-Experts (MoT) with separate experts for understanding vs. generation
  • Visual encoding: SigLIP2-style ViT encoder for understanding
  • Visual generation: FLUX VAE + rectified-flow diffusion conditioned on transformer states
  • Training: Trillions of interleaved multimodal tokens with reasoning-oriented curation

Key Design Patterns

Unified decoder-only architecture: Single transformer processes all modalities as token sequences
Mixture-of-experts (MoT): Separate experts for understanding (comprehension) vs. generation tasks
Interleaved data: Reasoning-oriented multimodal corpus with natural task diversity
Emergent capabilities: Complex reasoning, free-form manipulation, 3D understanding from unified pretraining

Implications for Brain-Omics Models

Direct applications:

  • Gene-brain-language unification: Treat genetics (nucleotide tokens), brain (parcel tokens), and behavior (structured tokens) as additional modalities alongside text
  • MoT for neuro-omics: Separate experts for discriminative (gene-brain association) vs. generative (report generation, counterfactual prediction) tasks
  • Interleaved corpus design: Create a multimodal corpus pairing genetic variants + brain scans + cognitive assessments + clinical narratives

Escalation path:

  1. Late fusion baselines (current)
  2. Two-tower contrastive (gene encoder ↔ brain encoder)
  3. MoT-style unified architecture where genetics/brain/behavior tokens share a decoder with modality-specific experts (see the sketch below)
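
To make the unified-decoder pattern concrete, here is a minimal PyTorch sketch under stated assumptions: the class name UnifiedNeuroOmicsDecoder, the modality adapters (gene_embed, brain_proj, behavior_proj), and all dimensions are hypothetical illustrations of the pattern, not BAGEL's implementation.

```python
# Minimal sketch (illustrative, not BAGEL's code): modality-specific adapters embed
# genetics, brain, and behavior inputs, which are concatenated into one token
# sequence and processed by a single shared transformer stack.
import torch
import torch.nn as nn

class UnifiedNeuroOmicsDecoder(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=4,
                 n_nucleotide_tokens=16, n_parcels=400, n_behavior_feats=32):
        super().__init__()
        self.gene_embed = nn.Embedding(n_nucleotide_tokens, d_model)  # nucleotide token ids
        self.brain_proj = nn.Linear(n_parcels, d_model)               # per-timepoint parcel vector
        self.behavior_proj = nn.Linear(n_behavior_feats, d_model)     # structured assessment vector
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)        # stand-in for the decoder-only stack

    def forward(self, gene_ids, brain_seq, behavior_vec):
        # gene_ids: (B, Lg) long; brain_seq: (B, Lb, n_parcels); behavior_vec: (B, n_behavior_feats)
        tokens = torch.cat([
            self.gene_embed(gene_ids),
            self.brain_proj(brain_seq),
            self.behavior_proj(behavior_vec).unsqueeze(1),
        ], dim=1)
        return self.backbone(tokens)  # (B, Lg + Lb + 1, d_model) fused token states
```

For generation-style training, a causal attention mask and positional encodings would be added; the point here is that one shared transformer consumes all modalities as a single interleaved token sequence.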

Reference materials:

  • BAGEL walkthrough
  • BAGEL paper card


2. MoT — Mixture-of-Transformers

Architecture Summary

Model: Mixture-of-Transformers (Sparse and Scalable for Multi-Modal FMs)
Paper: arXiv:2411.04996 | Card: kb/paper_cards/mot_2025.yaml

  • Backbone: Sparse multimodal transformer with modality-aware FFNs/attention
  • Modalities: Text, images, speech
  • Sparsity mechanism: Separate FFN/attention projections per modality; shared global self-attention
  • Settings: Chameleon-style autoregressive + Transfusion-style diffusion
  • Efficiency: ~55.8% of the dense baseline's FLOPs with similar or better performance

Key Design Patterns

Modality-aware sparsity: Decouple non-embedding parameters by modality
Shared global attention: All tokens interact via self-attention (no routing)
Drop-in replacement: Compatible with existing dense transformer architectures
Stable scaling: Maintains performance across model sizes (1B → 7B → 30B)

Implications for Brain-Omics Models

Direct applications:

  • Per-modality FFNs: Separate feed-forward networks for genetics, brain MRI, fMRI, EEG, behavior tokens
  • Shared attention: Global self-attention over all modalities captures cross-modal dependencies
  • Compute efficiency: Critical for scaling to large cohorts (UK Biobank N=500k+)

Integration with Neuro-Omics KB:

  • Implement modality-specific projectors (genetics_ffn, brain_ffn, behavior_ffn); see the sketch below
  • Retain shared attention over concatenated gene+brain+behavior tokens
  • Compare against learned MoE routing (fixed per-modality FFNs are simpler and more interpretable)
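
A minimal sketch of this arrangement follows, assuming three modality streams routed deterministically by a modality_ids tensor; it illustrates the pattern rather than reproducing the MoT authors' code.

```python
# Minimal sketch (illustrative, not the MoT release): shared global self-attention
# over concatenated gene+brain+behavior tokens, with one FFN per modality instead
# of learned MoE routing.
import torch
import torch.nn as nn

MODALITIES = ("genetics", "brain", "behavior")  # hypothetical modality vocabulary

class ModalityAwareBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # genetics_ffn / brain_ffn / behavior_ffn as a ModuleDict keyed by modality.
        self.ffns = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for m in MODALITIES
        })

    def forward(self, x, modality_ids):
        # x: (B, L, d_model); modality_ids: (B, L) integers indexing MODALITIES.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # all tokens attend globally
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for idx, name in enumerate(MODALITIES):
            mask = (modality_ids == idx).unsqueeze(-1).to(h.dtype)
            out = out + self.ffns[name](h) * mask              # deterministic, interpretable routing
        return x + out
```

The loop applies every FFN to every token and masks the results, which keeps the sketch short; an efficient version would gather tokens per modality before each FFN, recovering the FLOP savings the MoT paper reports.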

Reference materials:

  • MoT walkthrough
  • MoT paper card


3. M3FM — Multilingual Medical Model

Architecture Summary

Model: M3FM (Multilingual Chest X-ray Report Generator)
Repo: ai-in-health/M3FM | Card: kb/model_cards/m3fm.yaml

  • Backbone: Multilingual CLIP encoder + relational-memory Transformer decoder
  • Modalities: Chest X-ray images, bilingual text (English/Chinese)
  • Architecture: Two-tower (vision encoder + language decoder) with relational memory
  • Decoder: Language selection via BOS token (1=English, 2=Chinese)
  • Training: COV-CTR COVID-era CXR dataset with multilingual reports

Key Design Patterns

Two-tower fusion: Vision encoder outputs → cross-attention in language decoder
Language-aware generation: Single decoder handles multiple languages via BOS conditioning
Relational memory: Augmented attention for capturing long-range report dependencies
Medical domain adaptation: CLIP text embeddings projected for medical terminology

Implications for Brain-Omics Models

Direct applications:

  • Brain-omics-to-language: Project brain/genetics embeddings into CLIP-like space → generate clinical narratives
  • Bilingual reporting: Extend to English/Korean for Cha Hospital developmental cohorts
  • Relational memory for clinical context: Track longitudinal patient history across visits

Integration strategy:

  • Use an M3FM-style two-tower for brain scan → clinical report generation (see the sketch below)
  • Adapt relational memory for multi-visit longitudinal modeling
  • Explore gene embedding → language generation (explain genetic risk in natural language)
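
A minimal sketch of the two-tower generation step follows, assuming precomputed scan_memory embeddings from a frozen brain or genetics encoder; the class name and dimensions are hypothetical, and M3FM's relational-memory augmentation is omitted.

```python
# Minimal sketch (an adaptation of the two-tower pattern, not M3FM's code): a text
# decoder cross-attends to encoder memory while autoregressively generating a report.
import torch
import torch.nn as nn

class ReportDecoder(nn.Module):
    def __init__(self, vocab_size=8000, d_model=256, n_heads=4, n_layers=3):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, report_ids, scan_memory):
        # report_ids: (B, Lt) token ids; a reserved BOS id per language mirrors
        # M3FM's language-selection convention (1=English, 2=Chinese).
        # scan_memory: (B, Lm, d_model) embeddings from the frozen encoder tower.
        L = report_ids.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=report_ids.device), diagonal=1)
        h = self.decoder(self.token_embed(report_ids), scan_memory, tgt_mask=causal)
        return self.lm_head(h)  # next-token logits over the report vocabulary
```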

Reference materials:

  • M3FM walkthrough
  • M3FM model card


4. Me-LLaMA — Medical LLM

Architecture Summary

Model: Me-LLaMA (Medical LLaMA)
Repo: BIDS-Xu-Lab/Me-LLaMA | Card: kb/model_cards/me_llama.yaml

  • Backbone: LLaMA-2/3 (13B/70B) with continual pretraining + LoRA instruction tuning
  • Modality: Medical text (biomedical literature, clinical notes, guidelines)
  • Pretraining ratio: 15:1:4 (biomedical : clinical : general)
  • Training: 129B medical tokens + 214K instruction samples
  • Evaluation: 12+ medical QA/NLP tasks with prompt templates

Key Design Patterns

Continual pretraining: Adapt general LLM to medical domain with curated corpus
LoRA instruction tuning: Parameter-efficient adaptation for clinical reasoning
Prompt engineering: Modality-specific prompts for different clinical tasks
Evaluation harness: Structured benchmarking across medical NLP tasks

Implications for Brain-Omics Models

Direct applications:

  • Neuro-omics LLM: Continually pretrain LLaMA on neuroscience literature + genetics papers + clinical neurology notes
  • Instruction tuning for clinical tasks: Adapt for cognitive assessment interpretation, genetic counseling, neuroimaging report generation
  • Prompt templates: Create standardized prompts for gene-brain-behavior reasoning

As semantic bridge in BOM:

  • A Me-LLaMA-style medical LLM serves as the semantic hub of the Brain-Omics Model
  • Project genetics/brain/EEG embeddings into the LLM token space for cross-modal reasoning (see the sketch below)
  • Enable natural language queries over multimodal neuro-omics data
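
A minimal sketch of the projection step follows; the class name, dimensions (llm_dim=4096 is an assumption that depends on the backbone), and the soft-prompt framing are illustrative rather than Me-LLaMA's API.

```python
# Minimal sketch (hypothetical): map a pooled modality embedding into several
# "soft tokens" in the LLM's embedding space, prepended to the text prompt so a
# frozen medical LLM can condition on genetics/brain/EEG evidence.
import torch
import torch.nn as nn

class ModalityToLLMBridge(nn.Module):
    def __init__(self, modality_dim, llm_dim=4096, n_soft_tokens=8):
        super().__init__()
        self.n_soft_tokens = n_soft_tokens
        self.proj = nn.Linear(modality_dim, llm_dim * n_soft_tokens)

    def forward(self, modality_embedding):
        # modality_embedding: (B, modality_dim), e.g. a pooled gene- or brain-encoder output
        B = modality_embedding.size(0)
        return self.proj(modality_embedding).view(B, self.n_soft_tokens, -1)

# Usage (hypothetical names): prepend bridge outputs to the LLM's input embeddings,
# e.g. torch.cat([bridge(gene_emb), llm.get_input_embeddings()(prompt_ids)], dim=1).
```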

Reference materials:

  • Me-LLaMA walkthrough
  • Me-LLaMA model card


5. TITAN — Whole-Slide Image FM

Architecture Summary

Model: TITAN (Transformer for Integrative Tissue Analysis)
Repo: mahmoodlab/TITAN | Card: kb/model_cards/titan.yaml

  • Backbone: Slide-level transformer with multi-scale patch aggregation
  • Modality: Whole-slide histopathology images
  • Architecture: Hierarchical attention over gigapixel images (millions of patches)
  • Applications: Cancer diagnosis, survival prediction, treatment response

Key Design Patterns

Multi-scale patch processing: Handle gigapixel images via hierarchical aggregation
Attention-based pooling: Learn to aggregate informative regions
Slide-level embeddings: Compress millions of patches → fixed-size vectors
Task-specific heads: Shared encoder for multiple downstream tasks

Implications for Brain-Omics Models

Direct applications:

  • Brain MRI analogy: Whole-brain 3D volumes → hierarchical patch aggregation (similar to TITAN's slide processing)
  • Multi-scale fusion: Combine region-level (parcels) and voxel-level (fine-grained) brain features
  • Histology + genetics: If histopathology data is available (e.g., brain tissue banks), TITAN-style processing + genetics fusion

Integration with Neuro-Omics KB:

  • Adapt TITAN's multi-scale attention for 3D MRI volumes
  • Use TITAN-style patch aggregation for whole-brain sMRI + fMRI fusion (see the sketch below)
  • Explore cross-modal attention: pathology patches ↔ genetic variants
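
A minimal sketch of attention-based pooling follows, illustrating the aggregation pattern rather than TITAN's actual implementation; names and dimensions are placeholders.

```python
# Minimal sketch (illustrative): gated attention pooling that compresses many
# patch/parcel embeddings into one fixed-size volume-level embedding shared by
# downstream task heads.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, d_in=384, d_attn=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d_in, d_attn), nn.Tanh(), nn.Linear(d_attn, 1))

    def forward(self, patches):
        # patches: (B, N, d_in) patch or parcel embeddings; N may be very large
        weights = torch.softmax(self.score(patches), dim=1)  # (B, N, 1) attention over regions
        return (weights * patches).sum(dim=1)                # (B, d_in) pooled embedding
```

In a hierarchical setup, the same pooling can be applied first within local neighborhoods and then across the whole volume, mirroring TITAN's multi-scale aggregation.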

Reference materials:

  • TITAN walkthrough
  • TITAN model card


6. FMS-Medical Catalog

Resource Summary

Catalog: Awesome Foundation Models for Advancing Healthcare
Repo: YutingHe-list/Awesome-Foundation-Models

  • Scope: 200+ medical foundation models across modalities (text, vision, multimodal, protein, genomics, clinical time series)
  • Organization: Bilingual (English/Chinese) with taxonomy by modality and task
  • Usage: Reference catalog for discovering relevant medical FMs

Key Resources

Medical vision FMs: CXR, CT, MRI, histopathology encoders
Medical LLMs: Clinical text understanding and generation models
Genomics/proteomics FMs: Sequence models for molecular biology
Multimodal FMs: Vision-language models for radiology, pathology reports

Implications for Brain-Omics Models

Discovery and benchmarking:

  • Identify relevant medical imaging FMs for brain scan processing
  • Find medical LLMs for clinical narrative generation
  • Discover multimodal architectures to adapt for gene-brain-behavior fusion

Reference for ARPA-H integration:

  • Survey multimodal medical FMs to inform BOM architecture choices
  • Benchmark against medical FM baselines (e.g., CXR report generation → adapt for neuroimaging)

Reference materials:

  • FMS-Medical walkthrough
  • FMS-Medical catalog YAML


Integration Roadmap for Neuro-Omics KB

Phase 1: Late Fusion Baselines (Current)

  • Models: Separate encoders (Caduceus, BrainLM, FreeSurfer ROIs)
  • Fusion: Concatenate embeddings → LR/GBDT prediction (see the sketch after this list)
  • Evaluation: CCA + permutation, AUROC/AUPRC, DeLong tests
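
A minimal sketch of this recipe, assuming precomputed per-subject embedding matrices (gene_X, brain_X) and binary labels y as NumPy arrays; the permutation test shown shuffles labels as a simplified stand-in for the full CCA + permutation protocol.

```python
# Minimal sketch of a late fusion baseline: concatenate embeddings, fit a logistic
# regression, report AUROC, and compare against a label-permutation null.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def late_fusion_auroc(gene_X, brain_X, y, seed=0):
    X = np.concatenate([gene_X, brain_X], axis=1)  # simple concatenation fusion
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

def permutation_pvalue(gene_X, brain_X, y, n_perm=1000, seed=0):
    # Compare the observed fused AUROC against a null built from permuted labels.
    rng = np.random.default_rng(seed)
    observed = late_fusion_auroc(gene_X, brain_X, y)
    null = [late_fusion_auroc(gene_X, brain_X, rng.permutation(y)) for _ in range(n_perm)]
    return observed, (1 + sum(n >= observed for n in null)) / (1 + n_perm)
```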

Phase 2: Two-Tower Contrastive

  • Architecture: Frozen gene encoder ↔ frozen brain encoder with learnable projectors
  • Loss: InfoNCE or a similar contrastive objective (see the sketch after this list)
  • Inspiration: CLIP-style alignment (M3FM two-tower paradigm)
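
A minimal sketch of the projector-plus-contrastive setup, assuming frozen encoders that already produce pooled per-subject embeddings; dimensions and the temperature value are placeholders.

```python
# Minimal sketch: learnable projection heads over frozen gene/brain embeddings,
# aligned with a symmetric InfoNCE (CLIP-style) objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerAligner(nn.Module):
    def __init__(self, gene_dim=512, brain_dim=1024, d_proj=256, temperature=0.07):
        super().__init__()
        self.gene_proj = nn.Linear(gene_dim, d_proj)
        self.brain_proj = nn.Linear(brain_dim, d_proj)
        self.temperature = temperature

    def forward(self, gene_emb, brain_emb):
        # gene_emb: (B, gene_dim), brain_emb: (B, brain_dim); matched subjects share a batch index.
        g = F.normalize(self.gene_proj(gene_emb), dim=-1)
        b = F.normalize(self.brain_proj(brain_emb), dim=-1)
        logits = g @ b.t() / self.temperature               # (B, B) cosine similarities
        targets = torch.arange(g.size(0), device=g.device)  # positives on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```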

Phase 3: MoT-Style Sparse Integration

  • Architecture: Shared self-attention over gene+brain+behavior tokens
  • Sparsity: Modality-specific FFNs (genetics_ffn, brain_ffn, behavior_ffn)
  • Inspiration: MoT paper (arXiv:2411.04996)

Phase 4: Unified Brain-Omics Model (BOM)

  • Architecture: BAGEL-style decoder-only with MoT experts
  • Modalities: Genetics (nucleotide tokens) + brain (parcel/voxel tokens) + behavior (structured tokens) + language (text tokens)
  • Semantic bridge: Me-LLaMA-style medical LLM as central hub
  • Training: Interleaved multimodal corpus (genetic variants + brain scans + cognitive assessments + clinical notes); an example record schema follows
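
A hypothetical record schema for the interleaved corpus is sketched below; field names and token granularity are assumptions for illustration, not an ARPA-H specification.

```python
# Minimal sketch of one interleaved training record pairing the four Phase 4
# modalities for a single subject.
from dataclasses import dataclass, field
from typing import List

@dataclass
class InterleavedRecord:
    subject_id: str
    gene_tokens: List[int]       # nucleotide / variant token ids
    brain_tokens: List[int]      # parcel or voxel-patch token ids
    behavior_tokens: List[int]   # structured cognitive-assessment tokens
    text: str                    # clinical note or report segment
    # Order in which segments are interleaved in the decoder's input sequence.
    segment_order: List[str] = field(default_factory=lambda: ["gene", "brain", "behavior", "text"])
```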

Next Steps

  1. Complete Phase 1 baselines (CCA + prediction on UKB gene-brain data)
  2. Pilot two-tower contrastive (gene-brain alignment with frozen encoders)
  3. Explore MoT-style sparsity (modality-specific FFNs vs. full early fusion)
  4. Design ARPA-H BOM architecture (unified multimodal transformer with neuro-omics tokens)
  5. Curate interleaved corpus (multimodal neuro-omics data for unified pretraining)

Reference Index

Walkthrough documents:

  • BAGEL walkthrough
  • MoT walkthrough
  • M3FM walkthrough
  • Me-LLaMA walkthrough
  • TITAN walkthrough
  • FMS-Medical walkthrough

Paper/model cards:

  • kb/paper_cards/bagel_2025.yaml
  • kb/paper_cards/mot_2025.yaml
  • kb/model_cards/m3fm.yaml
  • kb/model_cards/me_llama.yaml
  • kb/model_cards/titan.yaml
  • kb/datasets/fms_medical_catalog.yaml

Integration recipes:

  • Integration Strategy
  • Design Patterns
  • CCA + Permutation