🧬🧠 Team User Guide

For: Lab members working on Brain–Genetics FM integration

Quick Start for New Team Members

  1. Read this guide (you're here!)
  2. Browse the Integration Strategy
  3. Check experiment configs

🎯 What This Repo Does

This is your documentation-first knowledge base: the map and spec for the Brain–Genetics program.

Repository Structure

kb/
├── model_cards/        ← 20 FM specs (17 FMs + 3 reference)
├── paper_cards/        ← 30 research papers with structured takeaways
├── datasets/           ← 19 dataset schemas (UKB, HCP, Cha, benchmarks)
└── integration_cards/  ← 6 integration recipes

What's Documented

| Category | What It Is | Examples |
|---|---|---|
| Genetics FMs | DNA sequence foundation models for gene-level embeddings | Caduceus, DNABERT-2, Evo 2, HyenaDNA, GENERator |
| Brain FMs | Neuroimaging models for fMRI/sMRI subject embeddings | BrainLM, Brain-JEPA, BrainMT, Brain Harmony, SwiFT |
| Multimodal FMs | Clinical & unified multimodal architecture references | BAGEL, MoT, M3FM, Me-LLaMA, TITAN, Flamingo, FMS-Medical |
| Research Papers | Curated paper summaries with implementation notes | RC-equivariance, Ensemble Integration, MURD, Yoon BioKDD'25 |
| Datasets | Data schema specs and preprocessing protocols | UKB (fMRI, sMRI, WES), HCP, Cha developmental, benchmarks |
| Integration & Strategy | Embedding recipes, harmonization, fusion playbooks | genetics_joo_mdd_cog_v1, murd_t1_t2, CCA + permutation |

The Playbook

Strategy: Late fusion → Two-tower contrastive → MoT/unified BOM

| Phase | When | What |
|---|---|---|
| Stage 1 | Now | Per-modality FMs + 512-D embeddings + late fusion |
| Stage 2 | If fusion wins | Two-tower contrastive / EI stacking |
| Stage 3 | Long-term | MoT/BAGEL unified architectures |

📋 Canonical Embedding Recipes

All recipes defined in kb/integration_cards/embedding_strategies.yaml

Query any recipe: python scripts/manage_kb.py ops strategy <recipe_id>

| Recipe ID | Type | Output | Pipeline |
|---|---|---|---|
| genetics_gene_fm_pca512_v1 | genetics | 512-D | Caduceus/DNABERT-2/Evo2 + RC-averaging |
| genetics_joo_mdd_cog_v1 | genetics | 512-D | Prof. Joo's 38 MDD genes ⭐ |
| smri_free_surfer_pca512_v1 | brain | 512-D | FreeSurfer ROIs → residualize → PCA |
| rsfmri_swift_segments_v1 | brain | 512-D | SwiFT segments → mean pool → PCA |
| rsfmri_brainlm_segments_v1 | brain | 512-D | BrainLM CLS tokens → mean pool |
| fusion_concat_gene_brain_1024_v1 | fusion | 1024-D | Concat(Gene₅₁₂ + Brain₅₁₂) |

⭐ = Recommended starting point

Query a recipe:

python scripts/manage_kb.py ops strategy genetics_joo_mdd_cog_v1
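
As an illustration of what the fusion_concat_gene_brain_1024_v1 recipe amounts to, here is a minimal sketch of late fusion by concatenation. It assumes both embedding matrices are NumPy arrays with one row per subject, already aligned in the same subject order; the function name and the random toy data are illustrative and not taken from the repo.

```python
import numpy as np


def late_fusion_concat(gene_emb, brain_emb):
    """Concatenate per-subject gene and brain embeddings along the feature axis."""
    gene_emb = np.asarray(gene_emb, dtype=float)
    brain_emb = np.asarray(brain_emb, dtype=float)
    if gene_emb.ndim != 2 or brain_emb.ndim != 2:
        raise ValueError("expected 2-D arrays of shape (n_subjects, n_features)")
    if gene_emb.shape[0] != brain_emb.shape[0]:
        raise ValueError("both modalities must cover the same subjects, in the same order")
    return np.concatenate([gene_emb, brain_emb], axis=1)


# Toy data standing in for real embeddings (20 subjects, 512-D per modality).
rng = np.random.default_rng(0)
gene = rng.normal(size=(20, 512))
brain = rng.normal(size=(20, 512))

fused = late_fusion_concat(gene, brain)
print(fused.shape)  # (20, 1024)
```

The row-count check matters more than it looks: concatenating misaligned subjects produces a valid-looking 1024-D matrix with meaningless rows.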

πŸ—ΊοΈ How to Navigate

β†’ "I need to understand a specific FM"

Example: Understanding Caduceus

  1. Overview: Caduceus model docs
  2. Step-by-step: Caduceus Code Walkthrough
  3. Code: external_repos/caduceus/
  4. Metadata: kb/model_cards/caduceus.yaml

→ "I want to run CCA / prediction baselines"

  1. Read the playbook: docs/integration/integration_strategy.md
  2. Check the recipe: kb/integration_cards/embedding_strategies.yaml
  3. Use the config: configs/experiments/01_cca_gene_smri.yaml or 02_prediction_baselines.yaml

→ "How do I preprocess [modality]?"

  • Genetics: docs/integration/modality_features/genomics.md
  • sMRI: docs/integration/modality_features/smri.md
  • fMRI: docs/integration/modality_features/fmri.md

→ "Which harmonization method?"

python scripts/manage_kb.py ops harmonization murd_t1_t2

Or read: docs/integration/integration_strategy.md (Harmonization section)


🚀 Jan-Feb Action Plan

Meeting Goals: Jan-Feb Wrap-Up

  • Test with 20-participant toy sample
  • Use new NVIDIA Spark GPU (128GB)
  • Offline genetics embeddings (pending)
  • Brain features (fMRI parcellation pending)
  • Complete Stage-1 baselines

Week 1-2: Small Sample Testing (20 participants)

Goal: Test pipeline on toy sample using new NVIDIA Spark GPU (128GB)

# 1. Download 20-participant sample
# - Brain features (fMRI parcellation)
# - Genomics embeddings (offline, pre-trained)

# 2. Test embedding extraction
python scripts/manage_kb.py ops strategy genetics_joo_mdd_cog_v1

# 3. Run on NVIDIA Spark GPU
# 4. Verify pipelines work end-to-end

What to test:

  • Brain feature download works
  • Genomics embeddings load correctly
  • CCA runs without errors
  • Prediction baselines produce AUROCs
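
Before running CCA on the toy sample, it can help to check the loaded embeddings directly. The sketch below assumes embeddings are held as a plain mapping from subject ID to a list of floats; that layout, the subject IDs, and both function names are hypothetical, so adapt them to however the real files are loaded.

```python
import math


def check_embeddings(embeddings, expected_dim=512):
    """Return a list of problems found in a {subject_id: vector} mapping."""
    problems = []
    for subject_id, vector in embeddings.items():
        if len(vector) != expected_dim:
            problems.append(f"{subject_id}: expected {expected_dim} dims, got {len(vector)}")
        elif any(math.isnan(v) or math.isinf(v) for v in vector):
            problems.append(f"{subject_id}: contains NaN or inf")
    return problems


def shared_subjects(gene_embeddings, brain_embeddings):
    """Subject IDs present in both modalities, sorted for a stable row order."""
    return sorted(set(gene_embeddings) & set(brain_embeddings))


# Toy inputs with one deliberate problem per modality.
gene = {"sub-01": [0.1] * 512, "sub-02": [0.2] * 512, "sub-03": [float("nan")] * 512}
brain = {"sub-01": [0.3] * 512, "sub-02": [0.4] * 100}

print(check_embeddings(gene))
print(check_embeddings(brain))
print(shared_subjects(gene, brain))
```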

Week 3-4: Run Stage-1 Experiments

Goal: Gene ↔ Brain correlation + prediction baselines

Use these configs:

  1. configs/experiments/01_cca_gene_smri.yaml
     • Gene ↔ sMRI CCA + 1,000 permutations
     • Check if ρ₁–ρ₃ are significant (p < 0.05)

  2. configs/experiments/02_prediction_baselines.yaml
     • Gene-only → MDD
     • Brain-only → MDD
     • Fusion (Gene+Brain) → MDD
     • DeLong test: Is Fusion > max(Gene, Brain)?

  3. Document results in kb/results/

Week 5-8: Decide on Escalation

Decision criteria:

| Result | Signal | Next Action |
|---|---|---|
| Fusion > max(Gene, Brain), p < 0.05 | Strong | Consider two-tower contrastive |
| Fusion ≈ best single modality | Weak | Focus on improving per-modality models |
| CCA strong (ρ₁ > 0.3, p < 0.001) | Strong | Supports two-tower alignment |
| CCA weak (ρ₁ < 0.2 or p > 0.05) | None | Keep late fusion, check preprocessing |

Templates available:

  • Two-tower patterns: docs/integration/design_patterns.md
  • MoT/BAGEL patterns: docs/integration/multimodal_architectures.md


What You Can Do Now (Before Data)

✅ Available Now

  1. Read model code walkthroughs: understand how each FM works
  2. Study embedding recipes: know what preprocessing to apply
  3. Review experiment configs: understand the analysis pipeline
  4. Validate YAML cards: python scripts/manage_kb.py validate models
  5. Clone external repos: familiarize yourself with the FM codebases

🟡 Waiting For

  • UKB data access approval (fMRI/sMRI features)
  • Genetics embeddings (offline pre-trained)
  • Cha Hospital developmental cohort (future)

📚 Onboarding New Team Members

Recommended reading order:

  1. This guide (TEAM_GUIDE.md)
  2. README.md: high-level overview
  3. docs/integration/integration_strategy.md: THE PLAYBOOK
  4. configs/experiments/01_cca_gene_smri.yaml: see what we're running
  5. Pick one FM code walkthrough to read in detail


🔬 Stage-1 Experiments

Experiment 1: CCA (Gene ↔ Brain Association)

Config: configs/experiments/01_cca_gene_smri.yaml

What it does:

  • Tests if gene embeddings share structure with brain embeddings
  • 1,000 permutations to assess significance
  • Reports ρ₁–ρ₃ (canonical correlations) with p-values

Success criteria:

  • ρ₁ > 0.2 with p < 0.05 → significant association
  • Gene/ROI loadings interpretable
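
To make the permutation logic concrete, here is a minimal NumPy sketch of classical CCA with a permutation p-value for ρ₁. It is not the implementation behind 01_cca_gene_smri.yaml, which is not shown here; the function names and the low-dimensional toy data are assumptions. This unregularized form also needs many more subjects than features (with fewer subjects than features, ρ₁ is trivially 1), so 512-D embeddings would have to be reduced or regularized first.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between X and Y (rows are subjects), largest first."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    qx, _ = np.linalg.qr(X)  # orthonormal basis for the centered columns of X
    qy, _ = np.linalg.qr(Y)
    singular_values = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return np.clip(singular_values, 0.0, 1.0)


def permutation_test_rho1(X, Y, n_perm=1000, seed=0):
    """Observed rho_1 and its permutation p-value (rows of Y are shuffled)."""
    rng = np.random.default_rng(seed)
    observed = canonical_correlations(X, Y)[0]
    hits = 0
    for _ in range(n_perm):
        shuffled = Y[rng.permutation(Y.shape[0])]
        if canonical_correlations(X, shuffled)[0] >= observed:
            hits += 1
    # The +1 terms count the observed statistic as one permutation.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 200 subjects sharing one latent factor across both views.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 1))
gene = latent + rng.normal(size=(200, 5))
brain = latent + rng.normal(size=(200, 4))

rho1, p_value = permutation_test_rho1(gene, brain)
print(f"rho1={rho1:.3f}, p={p_value:.4f}")
```

With 1,000 permutations the smallest attainable p-value is 1/1001, which is worth remembering when comparing against a p < 0.001 threshold.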

Experiment 2: Prediction (Gene vs Brain vs Fusion)

Config: configs/experiments/02_prediction_baselines.yaml

What it does:

  • Compares 3 baselines for MDD prediction:
  • Gene-only (512-D)
  • Brain-only (512-D)
  • Fusion (1024-D concatenation)
  • Uses LR + LightGBM + CatBoost
  • DeLong test to compare AUROCs

Success criteria:

  • If Fusion > max(Gene, Brain) p < 0.05 → integration adds value
  • Document which modality is stronger
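
For readers unfamiliar with the DeLong test, the sketch below compares two correlated AUROCs computed on the same subjects. It assumes binary labels and returns a two-sided p-value; the function names and toy scores are illustrative, and the repo's config may rely on a different implementation or a one-sided variant.

```python
import math

import numpy as np


def _placements(pos_scores, neg_scores):
    """Per-positive and per-negative placement values (ties count as 0.5)."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    psi = (diff > 0).astype(float) + 0.5 * (diff == 0)
    return psi.mean(axis=1), psi.mean(axis=0)


def delong_test(y_true, scores_a, scores_b):
    """Return (auc_a, auc_b, z, two_sided_p) for two models scored on the same subjects."""
    y = np.asarray(y_true).astype(bool)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    m, n = int(y.sum()), int((~y).sum())
    if m < 2 or n < 2:
        raise ValueError("need at least two positives and two negatives")
    v10_a, v01_a = _placements(a[y], a[~y])
    v10_b, v01_b = _placements(b[y], b[~y])
    auc_a, auc_b = float(v10_a.mean()), float(v10_b.mean())
    # Variance of the AUC difference from the paired placement values.
    var = np.var(v10_a - v10_b, ddof=1) / m + np.var(v01_a - v01_b, ddof=1) / n
    if var <= 0:
        raise ValueError("zero variance: the two score vectors rank subjects identically")
    z = (auc_a - auc_b) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))
    return auc_a, auc_b, z, p


# Toy scores: the "fusion" model separates classes better than the "gene" model.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=300)
fusion_scores = 1.5 * labels + rng.normal(size=300)
gene_scores = 0.3 * labels + rng.normal(size=300)

auc_fusion, auc_gene, z, p = delong_test(labels, fusion_scores, gene_scores)
print(f"AUC fusion={auc_fusion:.3f}, gene={auc_gene:.3f}, z={z:.2f}, p={p:.2g}")
```

Because the test is paired, both score vectors must come from the same held-out subjects; comparing AUROCs from different splits would need a different test.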

Experiment 3: LOGO Attribution

Config: configs/experiments/03_logo_gene_attribution.yaml

What it does:

  • Leave-one-gene-out ΔAUC
  • Identifies which genes contribute most to prediction
  • Wilcoxon test + FDR correction

Success criteria:

  • Find significant genes (p < 0.05 FDR-corrected)
  • Compare with literature (SOD2, HOXA10, etc.)
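
The FDR step can be sketched with the Benjamini-Hochberg procedure, assuming that is the correction intended (the config itself is not shown here). The gene labels and p-values below are placeholders for illustration only, not results.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value stays monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Placeholder per-gene p-values (for example, from a Wilcoxon test on delta-AUC across folds).
raw_p = {"GENE_A": 0.01, "GENE_B": 0.04, "GENE_C": 0.03, "GENE_D": 0.005}

adjusted = dict(zip(raw_p, benjamini_hochberg(list(raw_p.values()))))
significant = [gene for gene, q in adjusted.items() if q < 0.05]
print(adjusted)
print(significant)
```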

Escalation Decision Tree

Start: Run Stage-1 (CCA + Prediction + LOGO)
  │
  ├─ Fusion > single-modality (p < 0.05)?
  │  │
  │  ├─ YES → CCA also significant?
  │  │  │
  │  │  ├─ YES → Consider two-tower contrastive
  │  │  │        (frozen FMs + small projectors)
  │  │  │
  │  │  └─ NO → Keep late fusion, improve single-modality
  │  │
  │  └─ NO → Focus on better per-modality embeddings
  │           Try harmonization (ComBat, MURD)
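
The tree can also be written as a small function, which is convenient when logging the decision next to the results. This is a sketch: the function name and arguments are hypothetical, and the 0.05 threshold is the one used in the tree.

```python
def escalation_decision(fusion_beats_best_single, delong_p, cca_p, alpha=0.05):
    """Map Stage-1 outcomes to the next action in the escalation tree."""
    fusion_wins = fusion_beats_best_single and delong_p < alpha
    if not fusion_wins:
        return "Focus on better per-modality embeddings; try harmonization (ComBat, MURD)"
    if cca_p < alpha:
        return "Consider two-tower contrastive (frozen FMs + small projectors)"
    return "Keep late fusion, improve single-modality"


# Example: fusion AUROC is higher with DeLong p = 0.01, and CCA p = 0.001.
print(escalation_decision(True, delong_p=0.01, cca_p=0.001))
```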

📊 Data Status

Note: Data Documentation vs Availability

This KB documents how to use data, not when data is ready.
Actual data availability is project-specific and tracked elsewhere.

| Dataset | Docs | Status | Access | Notes |
|---|---|---|---|---|
| hg38 reference | ✓ | Ready | Public | Reference genome |
| Genomic benchmarks | ✓ | Ready | Public | Standard benchmarks |
| UKB fMRI/sMRI | ✓ | Pending | Restricted | Features can be downloaded |
| Genetics embeddings | ✓ | Pending | Internal | Offline pre-trained embeddings |
| Cha Hospital dev | ✓ | Future | Restricted | Developmental research |

Utilities

# Validate all YAML cards
python scripts/manage_kb.py validate models
python scripts/manage_kb.py validate datasets

# Query embedding recipe
python scripts/manage_kb.py ops strategy genetics_joo_mdd_cog_v1

# Query harmonization method
python scripts/manage_kb.py ops harmonization combat_smri

# View docs locally
mkdocs serve # Visit http://localhost:8000

# Online docs: https://allison-eunse.github.io/neuro-omics-kb/

❓ FAQ

Which genetics FM should I use?

Answer: Start with the recommended pipeline (genetics_joo_mdd_cog_v1):

  • 38 MDD genes from Yoon et al.
  • RC-averaged embeddings
  • Pre-validated gene set

Then compare with other FMs if needed (Caduceus, DNABERT-2, Evo2)
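
For anyone new to RC-averaging, the idea is to embed a sequence and its reverse complement and average the two vectors, so the result does not depend on strand. The sketch below uses a toy base-frequency embedder as a hypothetical stand-in for a real genetics FM; `toy_embed` and `rc_averaged_embedding` are illustrative names, not repo functions.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A<->T, C<->G, N unchanged)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def toy_embed(seq):
    """Hypothetical stand-in for an FM: base frequencies in A, C, G, T order."""
    seq = seq.upper()
    return [seq.count(base) / len(seq) for base in "ACGT"]


def rc_averaged_embedding(seq, embed=toy_embed):
    """Average the embeddings of the forward strand and its reverse complement."""
    forward = embed(seq)
    reverse = embed(reverse_complement(seq))
    return [(f + r) / 2 for f, r in zip(forward, reverse)]


seq = "AACG"
print(reverse_complement(seq))     # CGTT
print(rc_averaged_embedding(seq))  # [0.25, 0.25, 0.25, 0.25]
```

Embedding the sequence or its reverse complement gives the same averaged vector, which is the property the recipe relies on.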

Should I use sMRI or fMRI features?

Answer: Both are documented:

  • sMRI: FreeSurfer ROIs (~176 features) → Good for structural analysis
  • fMRI: Parcellation data → Follow fMRI-gene analysis recipes in integration docs

Start with whichever is available first.
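
As a rough picture of the "residualize → PCA" steps in the sMRI recipe, the sketch below regresses covariates out of each ROI feature and projects the residuals onto principal components. The covariates (age, sex), the function names, and the random toy data are assumptions; see docs/integration/modality_features/smri.md for the actual protocol. PCA cannot return more components than the rank of the data, so the requested 512 is capped here.

```python
import numpy as np


def residualize(features, covariates):
    """Remove the linear effect of the covariates (plus intercept) from each feature."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    beta, *_ = np.linalg.lstsq(design, features, rcond=None)
    return features - design @ beta


def pca_project(features, n_components):
    """Project onto the leading principal components, capped at the data's rank."""
    centered = features - features.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    rank = int(np.sum(s > 1e-10 * s[0]))
    k = min(n_components, rank)
    return u[:, :k] * s[:k]


# Toy data: 50 subjects, 176 ROI features with an age effect mixed in.
rng = np.random.default_rng(0)
n_subjects, n_rois = 50, 176
age = rng.uniform(40, 70, size=n_subjects)
sex = rng.integers(0, 2, size=n_subjects)
covariates = np.column_stack([age, sex])
roi = rng.normal(size=(n_subjects, n_rois)) + 0.05 * age[:, None]

residuals = residualize(roi, covariates)
embedding = pca_project(residuals, n_components=512)
print(embedding.shape)
```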

Do I need to build a new FM?

No! Stage-1 uses:

  • Existing genetics FMs (pre-trained embeddings)
  • Existing brain FMs (SwiFT, BrainLM, etc.)
  • Late fusion = just concatenate embeddings

Only escalate to two-tower/unified FM if Stage-1 shows clear fusion benefit.

What about Cha Hospital / developmental data?

Future work. The KB has:

  • Dataset card template: kb/datasets/cha_dev_longitudinal.yaml
  • Embedding recipes: cha_dev_smri_pca64_v1, cha_dev_eeg_fm_v1, cha_dev_behaviour_latent_v1
  • Experiment templates: configs/experiments/dev_01_brain_only_baseline.yaml

    Focus on UKB first (Jan-Feb wrap-up), then extend to developmental.


Key Principle

This KB answers:

  • ✅ "How do I extract embeddings?"
  • ✅ "Which FM should I use?"
  • ✅ "How do I run CCA?"
  • ✅ "When should I escalate to two-tower?"


Questions?

  • Model choice: Check docs/models/<category>/index.md
  • Integration strategy: Read docs/integration/integration_strategy.md
  • Embedding recipes: Query python scripts/manage_kb.py ops strategy <id>
  • Everything else: Ask Allison or check online docs

Bottom Line: This repo is your map + spec. Run Stage-1 experiments (CCA + prediction baselines) end-to-end, then decide on escalation based on results.

Jan-Feb Goal: Complete Stage-1 with offline genetics embeddings + brain features β†’ document results β†’ decide next steps.