Neuro-Omics Knowledge Base
A comprehensive documentation hub for genetics and brain foundation models and their multimodal integration.
Team Guide | KB Overview | Genetics Models | Brain Models | Integration Strategy | GitHub
What is this?
A documentation-first knowledge base for researchers working with:
- Genetic foundation models: Caduceus, DNABERT-2, Evo 2, GENERator
- Brain imaging models: BrainLM, Brain-JEPA, BrainMT, Brain Harmony, SwiFT
- Multimodal/Clinical models: BAGEL, MoT, M3FM, Me-LLaMA, TITAN, FMS-Medical
- Integration strategies: Gene-brain-behavior-language analysis
Scope: Documentation, metadata cards, and integration patterns, not model implementation code.
Quick Start

    # 1. Clone and set up
    git clone https://github.com/allison-eunse/neuro-omics-kb.git
    cd neuro-omics-kb
    python -m venv .venv && source .venv/bin/activate
    pip install -r requirements.txt

    # 2. View documentation locally
    mkdocs serve  # Visit http://localhost:8000

    # 3. Validate metadata cards
    python scripts/manage_kb.py validate models

New to foundation models? Start with:
- KB Overview
- Genetics Models
- Brain Models
- Integration Strategy
Use Cases
Genetics research
- Turn DNA sequences into strand-robust gene embeddings (Caduceus, DNABERT-2, Evo 2, GENERator)
- Compare variant effect predictors or run LOGO attribution with standardized configs
- Hand off vetted embeddings to integration pipelines without reimplementing data hygiene
Go deeper: Explore Genetics Models
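
One generic way to make a gene embedding strand-robust, independent of any particular model, is to average the embedding of a sequence with the embedding of its reverse complement. The sketch below assumes a hypothetical `embed` callable; `toy_embed` (simple base counts) stands in for a real foundation model and is not the API of Caduceus or any other model listed here.

```python
from typing import Callable, List

# Complement table; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def toy_embed(seq: str) -> List[float]:
    """Hypothetical stand-in for a model: counts of A, C, G, T."""
    return [float(seq.count(base)) for base in "ACGT"]


def strand_robust_embedding(
    seq: str, embed: Callable[[str], List[float]] = toy_embed
) -> List[float]:
    """Average the forward-strand and reverse-complement embeddings."""
    forward = embed(seq)
    reverse = embed(reverse_complement(seq))
    return [(f + r) / 2.0 for f, r in zip(forward, reverse)]


# A sequence and its reverse complement map to the same vector.
example = strand_robust_embedding("AACG")
```

Averaging is a post-hoc approach that works with any encoder; models that are reverse-complement equivariant by design may not need it.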
Brain imaging
- Preprocess fMRI/sMRI cohorts, harmonize sites, and extract embeddings (BrainLM, Brain-JEPA, Brain Harmony, BrainMT, SwiFT)
- Control residualization/motion covariates before fusion experiments
- Swap projection heads or pooling strategies without touching raw scans
Go deeper: Explore Brain Models
Multimodal integration
- Follow the late-fusion-first playbook (CCA + permutations, LR/GBDT fusion, contrastive escalation)
- Track embedding/processing provenance through integration cards and decision logs
- Plug in recipe-ready configs for CCA, prediction baselines, or partial correlations
Go deeper: Explore Integration Strategy
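
To illustrate the partial-correlation baseline mentioned above, the following sketch computes a first-order partial correlation between a gene-derived score and a brain-derived score while controlling for a single covariate. The function names and toy data are illustrative assumptions, not the KB's recipe code; real analyses usually control for several covariates at once.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x: Sequence[float], y: Sequence[float], z: Sequence[float]) -> float:
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data (hypothetical): both scores track age, plus unrelated noise.
age = [1, 2, 3, 4, 5, 6, 7, 8]
gene_score = [2, 1, 4, 3, 6, 5, 8, 7]
brain_score = [2, 3, 2, 3, 6, 7, 6, 7]

raw = pearson(gene_score, brain_score)
adjusted = partial_corr(gene_score, brain_score, age)
```

In this toy example the raw correlation is high only because both scores depend on age; the partial correlation is close to zero once age is controlled.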
Clinical & multimodal FMs
- Reuse BAGEL, MoT, M3FM, Me-LLaMA, TITAN, and FMS-Medical code walkthroughs as reference builds
- Understand how vision-language or sparse MoE systems align modalities before adapting to neuro-omics
- Borrow evaluation scaffolding for bilingual or imaging-text setups
Go deeper: Explore Multimodal Models
Reproducible research guardrails
- Start from vetted configs (`configs/experiments/*`) with stratified CV and QC baked in
- Run codified validation steps (`scripts/manage_kb.py`, `codex_gate.py`) before sharing outputs
- Use analysis recipes as living SOPs for cohorts, baselines, and integration checkpoints
Go deeper: Explore Analysis Recipes
What's Inside
Documentation: Code walkthroughs, playbooks, decision logs

    docs/
    ├── code_walkthroughs/   # 15 guided FM tours
    │   ├── Genetics (4): Caduceus, DNABERT-2, Evo 2, GENERator
    │   ├── Brain (5): BrainLM, Brain-JEPA, Brain Harmony, BrainMT, SwiFT
    │   └── Multimodal (6): BAGEL, MoT, M3FM, Me-LLaMA, TITAN, FMS-Medical
    ├── integration/         # Fusion strategies, design patterns, benchmarks
    ├── data/                # UKB data map, QC protocols, schemas
    └── decisions/           # Integration plans, validation rationale

Metadata Cards: Structured YAML for all assets

    kb/
    ├── model_cards/         # 20 FM specs (17 FMs + 3 reference)
    ├── paper_cards/         # 30 research papers with structured takeaways
    ├── datasets/            # 19 dataset schemas (UKB, HCP, Cha, benchmarks)
    └── integration_cards/   # 6 integration recipes (embedding + harmonization)

| Purpose | Key Contents | Examples |
|---|---|---|
| Foundation model specifications | Architecture details, parameters, integration hooks | Caduceus, BrainLM, BAGEL, MoT |
| Research paper summaries | Key takeaways, implementation notes | RC-equivariance, MURD, Ensemble Integration |
| Data schema definitions | Preprocessing requirements, access protocols | UKB fMRI/sMRI, HCP, Genomic Benchmarks |
| Embedding & fusion recipes | Extraction pipelines, harmonization methods | genetics_joo_mdd_cog_v1, murd_t1_t2, combat_smri |
Tools & Scripts: Validation, quality gates, sync

    scripts/
    ├── manage_kb.py              # Validate YAML, query embeddings/harmonization
    ├── codex_gate.py             # Pre-commit quality sweeps
    └── fetch_external_repos.sh   # Sync upstream FM repos

Experiment Configs: Ready-to-run templates

    configs/experiments/
    ├── 01_cca_gene_smri.yaml             # CCA + permutation baseline
    ├── 01_cca_gene_fmri_roi_mean.yaml    # Gene × fMRI ROI-mean CCA/SCCA + permutations
    ├── 02_prediction_baselines.yaml      # Gene vs Brain vs Fusion (LR/GBDT)
    ├── 03_logo_gene_attribution.yaml     # Leave-one-gene-out ΔAUC
    └── dev_* templates                   # CHA cohort dev stubs

Foundation Model Registry
Genetics Models
| Model | Best For | Context | Code Walkthrough |
|---|---|---|---|
| Caduceus | RC-equivariant gene embeddings | DNA | Code Walkthrough |
| DNABERT-2 | Cross-species transfer | BPE | Code Walkthrough |
| Evo 2 | Ultra-long regulatory regions | 1M context | Code Walkthrough |
| GENERator | Generative modeling | 6-mer LM | Code Walkthrough |
| HyenaDNA | Long-range sequences | 1M context | Code Walkthrough |
Brain Models
| Model | Modality | Best For | Code Walkthrough |
|---|---|---|---|
| BrainLM | fMRI | Site-robust embeddings | Code Walkthrough |
| Brain-JEPA | fMRI | Lower-latency option | Code Walkthrough |
| Brain Harmony | sMRI + fMRI | Multi-modal fusion | Code Walkthrough |
| BrainMT | sMRI/fMRI | Mamba efficiency | Code Walkthrough |
| SwiFT | fMRI | Hierarchical spatiotemporal | Code Walkthrough |
Multimodal & Clinical Models
| Model | Type | Key Innovation | Code Walkthrough |
|---|---|---|---|
| BAGEL | Unified FM | MoT experts (understand + generate) | Code Walkthrough |
| MoT | Sparse | Modality-aware sparsity (~55% FLOPs) | Code Walkthrough |
| M3FM | Radiology | CXR/CT + bilingual (EN/CN) | Code Walkthrough |
| Me-LLaMA | Medical LLM | Continually pretrained (129B tokens) | Code Walkthrough |
| TITAN | Pathology | Gigapixel whole-slide | Code Walkthrough |
| Flamingo | VLM | Visual-language few-shot | Code Walkthrough |
| FMS-Medical | Catalog | Medical FM survey | Code Walkthrough |
Explore: Multimodal Models Overview · Multimodal Architectures Guide · Design Patterns
Integration Stack
- Core Strategy: Integration Strategy
- Analysis Recipes: CCA + permutation · Prediction baselines · Partial correlations
- Modality Features: Genomics · sMRI · fMRI
- Design Patterns: Design patterns · Multimodal architectures
Integration Roadmap:
- If fusion wins significantly: Two-Tower Contrastive
- If gains plateau: EI Stacking / Hub Tokens
- Last resort: Full Early Fusion (TAPE-style)
Decisions: Integration baseline plan (Nov 2025)
Research Papers
Every paper has three quick links: KB summary (MD) · Annotated PDF · Original publication
Genetics Foundation Models
| Paper | Notes | Source | Focus |
|---|---|---|---|
| Caduceus | MD · PDF | arXiv:2403.03234 | RC-equivariant BiMamba DNA FM |
| DNABERT-2 | MD · PDF | arXiv:2306.15006 | BPE-tokenized multi-species encoder |
| Evo 2 | MD · PDF | bioRxiv 2025.02.18 | StripedHyena 1M-token model |
| GENERator | MD · PDF | arXiv:2502.07272 | Generative 6-mer DNA LM |
Brain Foundation Models
| Paper | Notes | Source | Focus |
|---|---|---|---|
| BrainLM | MD · PDF | OpenReview RwI7ZEfR27 | ViT-MAE for UKB fMRI |
| Brain-JEPA | MD · PDF | arXiv:2409.19407 | Joint-embedding prediction |
| Brain Harmony | MD · PDF | arXiv:2509.24693 | sMRI+fMRI fusion with TAPE |
| BrainMT | MD · PDF | LNCS 10.1007/…-2_15 | Hybrid Mamba-Transformer |
| SwiFT | MD · PDF | arXiv:2307.05916 | Swin-style 4D fMRI |
Multimodal & Clinical Foundation Models
| Paper | Notes | Source | Focus |
|---|---|---|---|
| BAGEL | MD · PDF | arXiv:2505.14683 | Unified MoT decoder |
| MoT | MD · PDF | arXiv:2411.04996 | Modality-aware sparse transformers |
| M3FM | MD · PDF | npj Digital Medicine 2025 | Multilingual medical vision-language |
| Me-LLaMA | MD · PDF | arXiv:2404.05416 | Medical LLM continual-pretraining |
| TITAN | MD · PDF | Nature Medicine 2025 | Gigapixel whole-slide pathology |
| MM FMs Survey | MD · PDF | AI in Medicine 2025 | Clinical MM FM patterns |
Integration & Methods
| Paper | Notes | Source | Focus |
|---|---|---|---|
| Ensemble Integration | MD · PDF | doi:10.1093/bioadv/vbac065 | Late-fusion rationale |
| Oncology Multimodal | MD · PDF | PubMed 39118787 | Confounds & protocols |
| Yoon BIOKDD 2025 | MD · PDF | bioRxiv 2025.02.18 | LOGO attribution |
| GWAS Diverse Populations | MD · PDF | PubMed 36158455 | Ancestry-aware QC |
| PRS Guide | MD · PDF | PubMed 31607513 | Polygenic risk reporting |
Data & Schemas
| Resource | Description | Link |
|---|---|---|
| UKB Data Map | Field mappings, cohort definitions | View |
| Governance & QC | Quality control protocols, IRB guidelines | View |
| Subject Keys | ID management and anonymization | View |
| Schemas | Data format specifications | View |
| FMS-Medical Catalog | 100+ medical FM references | View |
KB Assets
- Model Cards: 15 model cards (13 foundation models + 2 ARPA-H planning cards)
- Paper Cards: Structured summaries of 20 key papers with integration hooks
- Dataset Cards: Data source specifications for UKB, HCP, and benchmarks
- Integration Cards: Cross-modal fusion patterns and actionable guidance
Experiment Configs
Ready-to-use analysis templates with validation schemas:
| Template | Purpose | Key Features |
|---|---|---|
| 01_cca_gene_smri | CCA + permutation baseline | Cross-modal null distributions, p-values |
| 01_cca_gene_fmri_roi_mean | Gene × fMRI ROI-mean CCA/SCCA | Pooling ablations, ROI-mean baseline, permutation p-values |
| 02_prediction_baselines | Gene vs Brain vs Fusion | LR/GBDT comparison, DeLong tests |
| 03_logo_gene_attribution | LOGO ΔAUC protocol | Leave-one-gene-out attribution |
Explore Experiment Configs
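
The leave-one-gene-out (LOGO) ΔAUC idea can be sketched in a few lines: score subjects with all genes, drop one gene at a time, and record how much the AUC falls. The code below is a simplified illustration under stated assumptions. A plain average of per-gene scores stands in for refitting a classifier, and `logo_delta_auc` and the toy data are hypothetical names, not the contents of `03_logo_gene_attribution`.

```python
from typing import Dict, List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def logo_delta_auc(
    labels: Sequence[int], gene_scores: Dict[str, List[float]]
) -> Dict[str, float]:
    """Return AUC(all genes) minus AUC(all genes except g) for each gene g."""
    n = len(labels)

    def fused(genes: List[str]) -> List[float]:
        # Assumption: averaging stands in for a refit prediction model.
        return [sum(gene_scores[g][i] for g in genes) / len(genes) for i in range(n)]

    genes = list(gene_scores)
    full_auc = auc(labels, fused(genes))
    return {
        g: full_auc - auc(labels, fused([h for h in genes if h != g]))
        for g in genes
    }


# Toy data (hypothetical): GENE_A separates the classes, GENE_B is flat.
labels = [0, 0, 1, 1]
scores = {"GENE_A": [0.1, 0.2, 0.8, 0.9], "GENE_B": [0.5, 0.5, 0.5, 0.5]}
delta = logo_delta_auc(labels, scores)
```

A large positive ΔAUC marks a gene whose removal hurts prediction; a value near zero marks a gene the model does not rely on.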
Standard Pipeline

    graph LR
        A[Raw Data] --> B[Z-score normalization]
        B --> C[Residualize confounds]
        C --> D[512-D projection]
        D --> E{Analysis Type}
        E -->|Structure| F[CCA + permutations]
        E -->|Prediction| G[LR/GBDT fusion]
        F --> H[Statistical tests]
        G --> H
        H --> I[Results + validation]

Always Residualize
Confounds to control:
- Age, sex, site/scanner
- Motion (mean FD for fMRI)
- SES, genetic PCs
- Batch effects
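
Residualization here means regressing each feature on the confounds and keeping only the residuals. The following is a minimal standard-library sketch, assuming ordinary least squares with an intercept; `residualize` and `solve_linear` are illustrative names, not functions from this repository. In cross-validated analyses the regression would normally be fit on training folds only.

```python
from typing import List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("confounds are collinear; drop or merge a column")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def residualize(
    feature: Sequence[float], confounds: Sequence[Sequence[float]]
) -> List[float]:
    """Remove the linear effect of the confounds (one row per subject)."""
    design = [[1.0] + list(row) for row in confounds]  # prepend intercept
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, feature)) for i in range(p)]
    beta = solve_linear(xtx, xty)
    return [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(design, feature)]


# Toy data (hypothetical): a feature fully explained by age and sex.
confound_rows = [[20, 0], [30, 1], [40, 0], [50, 1], [60, 1], [70, 0]]
feature = [2.0 + 0.1 * age - 1.5 * sex for age, sex in confound_rows]
residuals = residualize(feature, confound_rows)
```

Because the toy feature is an exact linear function of age and sex, its residuals are numerically zero; real features keep whatever variance the confounds cannot explain.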
Start with CCA + Permutation
CCA always returns non-zero correlations, even on shuffled data. The permutation test builds a null distribution by re-fitting after within-fold shuffling, which yields p-values that guard against over-interpreting noise. This is critical when sites share confounds.
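
The permutation logic described above can be sketched as follows. To stay dependency-free, the example uses one-dimensional views, where the first canonical correlation equals the absolute Pearson correlation; for real multi-dimensional embeddings, `stat` would be replaced by a function that refits CCA on each shuffle. The `groups` argument is an assumption about how within-fold (or within-site) shuffling might be implemented, and all names are illustrative.

```python
import math
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def abs_pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """|Pearson r|; equals the first canonical correlation for 1-D views."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


def permutation_pvalue(
    x: Sequence[float],
    y: Sequence[float],
    stat: Callable[[Sequence[float], Sequence[float]], float] = abs_pearson,
    groups: Optional[Sequence[int]] = None,
    n_perm: int = 999,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (observed statistic, permutation p-value)."""
    rng = random.Random(seed)
    labels = list(groups) if groups is not None else [0] * len(y)
    by_group: Dict[int, List[int]] = {}
    for i, g in enumerate(labels):
        by_group.setdefault(g, []).append(i)

    observed = stat(x, y)
    exceed = 0
    for _ in range(n_perm):
        shuffled = list(y)
        for idx in by_group.values():
            perm = idx[:]
            rng.shuffle(perm)  # shuffle only within each group
            for src, dst in zip(idx, perm):
                shuffled[dst] = y[src]
        if stat(x, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (exceed + 1) / (n_perm + 1)


# Toy data (hypothetical): y is an exact linear function of x.
x_view = [float(v) for v in range(20)]
y_view = [2.0 * v + 1.0 for v in x_view]
observed, p_value = permutation_pvalue(x_view, y_view, n_perm=199)
```

With `n_perm` shuffles the smallest attainable p-value is `1 / (n_perm + 1)`, so the number of permutations sets the resolution of the test.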
Typical Workflow
- Explore: Browse model cards and paper summaries
- Select: Choose appropriate FMs for your modalities
- Configure: Clone an experiment config template
- Run: Extract embeddings and run the analysis
- Validate: Use quality gates (`manage_kb.py`)
- Document: Log results back to the KB
Need help? Check the KB Overview or explore Code Walkthroughs