🧬🧠 Neuro-Omics Knowledge Base

A comprehensive documentation hub for genetics and brain foundation models and their multimodal integration.

🚀 Team Guide | 📖 KB Overview | 🧬 Genetics Models | 🧠 Brain Models | 🔗 Integration Strategy | 💻 GitHub


🎯 What is this?

A documentation-first knowledge base for researchers working with:

  • 🧬 Genetic foundation models: Caduceus, DNABERT-2, Evo 2, GENERator
  • 🧠 Brain imaging models: BrainLM, Brain-JEPA, BrainMT, Brain Harmony, SwiFT
  • 🏥 Multimodal/Clinical models: BAGEL, MoT, M3FM, Me-LLaMA, TITAN, FMS-Medical
  • 🔗 Integration strategies: Gene-brain-behavior-language analysis

Scope: Documentation, metadata cards, and integration patterns, not model implementation code.


🚀 Quick Start

# 1. Clone and setup
git clone https://github.com/allison-eunse/neuro-omics-kb.git
cd neuro-omics-kb
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. View documentation locally
mkdocs serve  # Visit http://localhost:8000

# 3. Validate metadata cards
python scripts/manage_kb.py validate models

New to foundation models? → Start with:

  1. 📖 KB Overview
  2. 🧬 Genetics Models
  3. 🧠 Brain Models
  4. 🔗 Integration Strategy

💡 Use Cases

→ Genetics research

  • Turn DNA sequences into strand-robust gene embeddings (Caduceus, DNABERT-2, Evo 2, GENERator)
  • Compare variant effect predictors or run LOGO attribution with standardized configs
  • Hand off vetted embeddings to integration pipelines without reimplementing data hygiene
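
One common way to make an embedding strand-robust, when the encoder is not reverse-complement (RC) equivariant by construction, is to average the embeddings of a sequence and of its reverse complement. The sketch below illustrates that idea only. `toy_embed` is a hypothetical k-mer-count stand-in for a real foundation-model encoder, and nothing here is taken from the KB's own extraction pipelines.

```python
import numpy as np

# Complement table for DNA bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def toy_embed(seq: str, k: int = 3) -> np.ndarray:
    """Hypothetical stand-in for an FM encoder: normalized k-mer counts."""
    index = {base: i for i, base in enumerate("ACGT")}
    vec = np.zeros(4 ** k)
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if all(c in index for c in kmer):  # skip k-mers containing N
            pos = 0
            for c in kmer:
                pos = pos * 4 + index[c]
            vec[pos] += 1
    total = vec.sum()
    return vec / total if total > 0 else vec

def strand_robust_embedding(seq: str, embed=toy_embed) -> np.ndarray:
    """Average the forward and reverse-complement embeddings."""
    forward = embed(seq)
    reverse = embed(reverse_complement(seq))
    return (forward + reverse) / 2.0
```

Because the reverse complement of the reverse complement is the original sequence, `strand_robust_embedding(seq)` and `strand_robust_embedding(reverse_complement(seq))` return the same vector, whichever encoder is passed as `embed`.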

Go deeper: Explore Genetics Models

→ Brain imaging

  • Preprocess fMRI/sMRI cohorts, harmonize sites, and extract embeddings (BrainLM, Brain-JEPA, Brain Harmony, BrainMT, SwiFT)
  • Control residualization/motion covariates before fusion experiments
  • Swap projection heads or pooling strategies without touching raw scans

Go deeper: Explore Brain Models

→ Multimodal integration

  • Follow the late-fusion-first playbook (CCA + permutations, LR/GBDT fusion, contrastive escalation)
  • Track embedding/processing provenance through integration cards and decision logs
  • Plug in recipe-ready configs for CCA, prediction baselines, or partial correlations

Go deeper: Explore Integration Strategy

→ Clinical & multimodal FMs

  • Reuse BAGEL, MoT, M3FM, Me-LLaMA, TITAN, and FMS-Medical code walkthroughs as reference builds
  • Understand how vision–language or sparse MoE systems align modalities before adapting to neuro-omics
  • Borrow evaluation scaffolding for bilingual or imaging–text setups

Go deeper: Explore Multimodal Models

→ Reproducible research guardrails

  • Start from vetted configs (configs/experiments/*) with stratified CV and QC baked in
  • Run codified validation steps (scripts/manage_kb.py, codex_gate.py) before sharing outputs
  • Use analysis recipes as living SOPs for cohorts, baselines, and integration checkpoints

Go deeper: Explore Analysis Recipes


📦 What's Inside

📚 Documentation: Code walkthroughs, playbooks, decision logs
docs/
├── code_walkthroughs/ ← 15 guided FM tours
│   ├── Genetics (4): Caduceus, DNABERT-2, Evo 2, GENERator
│   ├── Brain (5): BrainLM, Brain-JEPA, Brain Harmony, BrainMT, SwiFT
│   └── Multimodal (6): BAGEL, MoT, M3FM, Me-LLaMA, TITAN, FMS-Medical
├── integration/ ← Fusion strategies, design patterns, benchmarks
├── data/ ← UKB data map, QC protocols, schemas
└── decisions/ ← Integration plans, validation rationale
Code walkthroughs, schemas, and decision logs share the same terminology across genetics, brain, and multimodal FMs.
🏷️ Metadata Cards: Structured YAML for all assets
kb/
├── model_cards/ ← 20 FM specs (17 FMs + 3 reference)
├── paper_cards/ ← 30 research papers with structured takeaways
├── datasets/ ← 19 dataset schemas (UKB, HCP, Cha, benchmarks)
└── integration_cards/ ← 6 integration recipes (embedding + harmonization)
What's in each folder:

| Folder | Purpose | Contents | Examples |
| --- | --- | --- | --- |
| 📦 model_cards/ | Foundation model specifications | Architecture details, parameters, integration hooks | Caduceus, BrainLM, BAGEL, MoT |
| 📄 paper_cards/ | Research paper summaries | Key takeaways, implementation notes | RC-equivariance, MURD, Ensemble Integration |
| 🗂️ datasets/ | Data schema definitions | Preprocessing requirements, access protocols | UKB fMRI/sMRI, HCP, Genomic Benchmarks |
| 🔗 integration_cards/ | Embedding & fusion recipes | Extraction pipelines, harmonization methods | genetics_joo_mdd_cog_v1, murd_t1_t2, combat_smri |

[Browse all cards on GitHub →](https://github.com/allison-eunse/neuro-omics-kb/tree/main/kb)
🔧 Tools & Scripts: Validation, quality gates, sync
scripts/
├── manage_kb.py ← Validate YAML, query embeddings/harmonization
├── codex_gate.py ← Pre-commit quality sweeps
└── fetch_external_repos.sh ← Sync upstream FM repos
Pair these with `verify_kb.sh` or `mkdocs serve` during review cycles.
⚙️ Experiment Configs: Ready-to-run templates
configs/experiments/
├── 01_cca_gene_smri.yaml ← CCA + permutation baseline
├── 01_cca_gene_fmri_roi_mean.yaml ← Gene × fMRI ROI-mean CCA/SCCA + permutations
├── 02_prediction_baselines.yaml ← Gene vs Brain vs Fusion (LR/GBDT)
├── 03_logo_gene_attribution.yaml ← Leave-one-gene-out ΔAUC
└── dev_* templates ← CHA cohort dev stubs
Each config references the exact embeddings, covariates, and validation plan to keep runs reproducible.

🎯 Foundation Model Registry

🧬 Genetics Models

| Model | Best For | Context | Code Walkthrough |
| --- | --- | --- | --- |
| Caduceus | RC-equivariant gene embeddings | DNA | Code Walkthrough → |
| DNABERT-2 | Cross-species transfer | BPE | Code Walkthrough → |
| Evo 2 | Ultra-long regulatory regions | 1M context | Code Walkthrough → |
| GENERator | Generative modeling | 6-mer LM | Code Walkthrough → |
| HyenaDNA | Long-range sequences | 1M context | Code Walkthrough → |

🧠 Brain Models

| Model | Modality | Best For | Code Walkthrough |
| --- | --- | --- | --- |
| BrainLM | fMRI | Site-robust embeddings | Code Walkthrough → |
| Brain-JEPA | fMRI | Lower-latency option | Code Walkthrough → |
| Brain Harmony | sMRI + fMRI | Multi-modal fusion | Code Walkthrough → |
| BrainMT | sMRI/fMRI | Mamba efficiency | Code Walkthrough → |
| SwiFT | fMRI | Hierarchical spatiotemporal | Code Walkthrough → |

🏥 Multimodal & Clinical Models

| Model | Type | Key Innovation | Code Walkthrough |
| --- | --- | --- | --- |
| BAGEL | Unified FM | MoT experts (understand + generate) | Code Walkthrough → |
| MoT | Sparse | Modality-aware sparsity (~55% FLOPs) | Code Walkthrough → |
| M3FM | Radiology | CXR/CT + bilingual (EN/CN) | Code Walkthrough → |
| Me-LLaMA | Medical LLM | Continual pretrained (129B tok) | Code Walkthrough → |
| TITAN | Pathology | Gigapixel whole-slide | Code Walkthrough → |
| Flamingo | VLM | Visual-language few-shot | Code Walkthrough → |
| FMS-Medical | Catalog | Medical FM survey | Code Walkthrough → |

📖 Explore Multimodal Models Overview • Multimodal Architectures Guide • Design Patterns


🔗 Integration Stack

→ Core Strategy: Integration Strategy
→ Analysis Recipes: CCA + permutation · Prediction baselines · Partial correlations
→ Modality Features: Genomics · sMRI · fMRI
→ Design Patterns: Design patterns · Multimodal architectures

Integration Roadmap:

● Late Fusion (baseline)
      ↓ If fusion wins significantly
● Two-Tower Contrastive
      ↓ If gains plateau
● EI Stacking / Hub Tokens
      ↓ Last resort
● Full Early Fusion (TAPE-style)

Decisions: Integration baseline plan (Nov 2025)


📋 Research Papers

Every paper has three quick links: KB summary (MD) · Annotated PDF · Original publication

Genetics Foundation Models

| Paper | Notes | Source | Focus |
| --- | --- | --- | --- |
| Caduceus | MD · PDF | arXiv:2403.03234 | RC-equivariant BiMamba DNA FM |
| DNABERT-2 | MD · PDF | arXiv:2306.15006 | BPE-tokenized multi-species encoder |
| Evo 2 | MD · PDF | bioRxiv 2025.02.18 | StripedHyena 1M-token model |
| GENERator | MD · PDF | arXiv:2502.07272 | Generative 6-mer DNA LM |

Brain Foundation Models

| Paper | Notes | Source | Focus |
| --- | --- | --- | --- |
| BrainLM | MD · PDF | OpenReview RwI7ZEfR27 | ViT-MAE for UKB fMRI |
| Brain-JEPA | MD · PDF | arXiv:2409.19407 | Joint-embedding prediction |
| Brain Harmony | MD · PDF | arXiv:2509.24693 | sMRI+fMRI fusion with TAPE |
| BrainMT | MD · PDF | LNCS 10.1007/…-2_15 | Hybrid Mamba-Transformer |
| SwiFT | MD · PDF | arXiv:2307.05916 | Swin-style 4D fMRI |

Multimodal & Clinical Foundation Models

| Paper | Notes | Source | Focus |
| --- | --- | --- | --- |
| BAGEL | MD · PDF | arXiv:2505.14683 | Unified MoT decoder |
| MoT | MD · PDF | arXiv:2411.04996 | Modality-aware sparse transformers |
| M3FM | MD · PDF | npj Digital Medicine 2025 | Multilingual medical vision-language |
| Me-LLaMA | MD · PDF | arXiv:2404.05416 | Medical LLM continual-pretraining |
| TITAN | MD · PDF | Nature Medicine 2025 | Gigapixel whole-slide pathology |
| MM FMs Survey | MD · PDF | AI in Medicine 2025 | Clinical MM FM patterns |

Integration & Methods

| Paper | Notes | Source | Focus |
| --- | --- | --- | --- |
| Ensemble Integration | MD · PDF | doi:10.1093/bioadv/vbac065 | Late-fusion rationale |
| Oncology Multimodal | MD · PDF | PubMed 39118787 | Confounds & protocols |
| Yoon BIOKDD 2025 | MD · PDF | bioRxiv 2025.02.18 | LOGO attribution |
| GWAS Diverse Populations | MD · PDF | PubMed 36158455 | Ancestry-aware QC |
| PRS Guide | MD · PDF | PubMed 31607513 | Polygenic risk reporting |

📊 Data & Schemas

| Resource | Description | Link |
| --- | --- | --- |
| UKB Data Map | Field mappings, cohort definitions | View |
| Governance & QC | Quality control protocols, IRB guidelines | View |
| Subject Keys | ID management and anonymization | View |
| Schemas | Data format specifications | View |
| FMS-Medical Catalog | 100+ medical FM references | View |

🗂️ KB Assets

  • Model Cards

    15 model cards: 13 foundation models + 2 ARPA-H planning cards

    Browse on GitHub

  • Paper Cards

    Structured summaries of 20 key papers with integration hooks

    Browse on GitHub

  • Dataset Cards

    Data source specifications for UKB, HCP, and benchmarks

    Browse on GitHub

  • Integration Cards

    Cross-modal fusion patterns and actionable guidance

    Browse on GitHub


⚙️ Experiment Configs

Ready-to-use analysis templates with validation schemas:

| Template | Purpose | Key Features |
| --- | --- | --- |
| 01_cca_gene_smri | CCA + permutation baseline | Cross-modal null distributions, p-values |
| 01_cca_gene_fmri_roi_mean | Gene × fMRI ROI-mean CCA/SCCA | Pooling ablations, ROI-mean baseline, permutation p-values |
| 02_prediction_baselines | Gene vs Brain vs Fusion | LR/GBDT comparison, DeLong tests |
| 03_logo_gene_attribution | LOGO ΔAUC protocol | Leave-one-gene-out attribution |
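
To make the leave-one-gene-out (LOGO) ΔAUC idea concrete, here is a minimal sketch that refits a classifier with one gene's embedding columns removed and reports the drop in cross-validated AUC. It is an illustration under stated assumptions, not the protocol in `03_logo_gene_attribution`: the sign convention (full minus ablated), the plain gradient-descent logistic regression, the fold scheme, and every function and parameter name are hypothetical.

```python
import numpy as np

def auc_score(y_true, scores):
    """Rank-based AUC (Mann-Whitney U); assumes no tied scores."""
    y = np.asarray(y_true).astype(bool)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = y.sum(), (~y).sum()
    return (ranks[y].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_logistic(X, y, lr=0.1, n_iter=500, l2=1e-2):
    """L2-regularized logistic regression by plain gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        grad = Xb.T @ (p - y) / len(y)
        grad[1:] += l2 * w[1:]  # do not penalize the intercept
        w -= lr * grad
    return w

def stratified_folds(y, k, rng):
    """Assign each sample a fold id, keeping class balance per fold."""
    folds = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        folds[idx] = np.arange(len(idx)) % k
    return folds

def cv_auc(X, y, folds):
    """Mean held-out AUC; scaling statistics come from the training fold only."""
    aucs = []
    for f in np.unique(folds):
        train, test = folds != f, folds == f
        mu, sd = X[train].mean(axis=0), X[train].std(axis=0) + 1e-12
        w = fit_logistic((X[train] - mu) / sd, y[train])
        scores = w[0] + ((X[test] - mu) / sd) @ w[1:]
        aucs.append(auc_score(y[test], scores))
    return float(np.mean(aucs))

def logo_delta_auc(X, y, gene_columns, k=5, seed=0):
    """Return the full-model AUC and, per gene, full AUC minus ablated AUC."""
    rng = np.random.default_rng(seed)
    folds = stratified_folds(y, k, rng)  # same folds for every ablation
    full = cv_auc(X, y, folds)
    all_cols = np.arange(X.shape[1])
    deltas = {}
    for gene, cols in gene_columns.items():
        keep = np.setdiff1d(all_cols, cols)
        deltas[gene] = full - cv_auc(X[:, keep], y, folds)
    return full, deltas
```

Here `gene_columns` maps a gene name to the column indices of its embedding block in `X`. Reusing the same folds for the full and ablated models keeps the comparison paired, so a large positive ΔAUC points to a gene whose block carries signal the others do not.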

→ Explore Experiment Configs


🚀 Standard Pipeline

graph LR
    A[Raw Data] --> B[Z-score normalization]
    B --> C[Residualize confounds]
    C --> D[512-D projection]
    D --> E{Analysis Type}
    E -->|Structure| F[CCA + permutations]
    E -->|Prediction| G[LR/GBDT fusion]
    F --> H[Statistical tests]
    G --> H
    H --> I[Results + validation]
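
The first and third boxes of the graph can be sketched in a few lines of NumPy. This is a sketch under assumptions: the graph does not say how the 512-D projection is computed, so PCA via SVD is used here as a stand-in, and the function names are hypothetical. The confound residualization step that sits between the two is omitted from this sketch.

```python
import numpy as np

def zscore(X, train_idx):
    """Standardize columns using mean/std estimated on training rows only."""
    mu = X[train_idx].mean(axis=0)
    sd = X[train_idx].std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # leave constant columns unscaled
    return (X - mu) / sd

def fit_projection(X_train, n_components=512):
    """PCA basis via SVD (an assumed choice for the projection step)."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    k = min(n_components, Vt.shape[0])  # cannot exceed the available rank
    return mu, Vt[:k].T

def project(X, mu, W):
    """Apply a fitted projection to any rows (train or held-out)."""
    return (X - mu) @ W
```

Fitting the scaling and the projection on training rows and then applying them to held-out rows keeps test data out of the preprocessing statistics.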

Always Residualize

Confounds to control:

  • Age, sex, site/scanner
  • Motion (mean FD for fMRI)
  • SES, genetic PCs
  • Batch effects
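
As an illustration of what residualizing means in practice, the sketch below regresses each feature column on the confounds with ordinary least squares and keeps the residuals. It assumes confounds are already numerically encoded (for example, site as dummy columns). The function name and the optional `train_idx` argument are hypothetical, not part of the KB's scripts.

```python
import numpy as np

def residualize(features, confounds, train_idx=None):
    """Remove the linear effect of confounds from each feature column.

    The OLS fit uses only the rows in train_idx (all rows if None) and is
    then applied to every row, so held-out rows do not shape the fit.
    """
    X = np.asarray(features, dtype=float)
    C = np.asarray(confounds, dtype=float)
    if C.ndim == 1:
        C = C[:, None]
    design = np.column_stack([np.ones(len(C)), C])  # intercept + confounds
    rows = np.arange(len(X)) if train_idx is None else np.asarray(train_idx)
    beta, *_ = np.linalg.lstsq(design[rows], X[rows], rcond=None)
    return X - design @ beta
```

When fitted on all rows, the residuals are orthogonal to every confound column, which is a quick sanity check to run before handing embeddings to a fusion step.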

Start with CCA + Permutation

CCA always returns non-zero correlations, even on shuffled data. The permutation test builds a null distribution by re-fitting after within-fold shuffling, giving you p-values that guard against over-interpreting noise. This is critical when sites share confounds.
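
A minimal NumPy sketch of this test follows. It assumes more samples than features and computes the first canonical correlation from QR and SVD factorizations. The `groups` argument (fold or site labels) restricts shuffling to rows within the same group. All names are hypothetical, and regularized or sparse CCA variants are out of scope here.

```python
import numpy as np

def first_canonical_corr(X, Y):
    """Largest canonical correlation via QR + SVD; assumes more rows than columns."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))

def permute_within(groups, rng):
    """Row permutation that only shuffles samples inside the same group."""
    order = np.arange(len(groups))
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        order[members] = rng.permutation(members)
    return order

def cca_permutation_test(X, Y, groups=None, n_perm=500, seed=0):
    """Return the observed correlation, the null distribution, and a p-value."""
    rng = np.random.default_rng(seed)
    if groups is None:
        groups = np.zeros(len(X), dtype=int)  # one group: shuffle all rows
    observed = first_canonical_corr(X, Y)
    # Re-fit CCA on each shuffled pairing to build the null distribution.
    null = np.array([
        first_canonical_corr(X, Y[permute_within(groups, rng)])
        for _ in range(n_perm)
    ])
    # Add-one correction so the p-value is never exactly zero.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, float(p_value)
```

On pure noise the observed correlation is still positive, which is why it is read against `null` rather than against zero.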


🛠️ Typical Workflow

  1. 📖 Explore: Browse model cards and paper summaries
  2. 🔍 Select: Choose appropriate FMs for your modalities
  3. ⚙️ Configure: Clone an experiment config template
  4. ▶️ Run: Extract embeddings and run analysis
  5. ✅ Validate: Use quality gates (manage_kb.py)
  6. 📝 Document: Log results back to KB

Need help? Check the KB Overview or explore Code Walkthroughs