🧬🧠 Team User Guide
For: Lab members working on Brain-Genetics FM integration
Quick Start for New Team Members
- Read this guide (you're here!)
- Browse the Integration Strategy
- Check experiment configs
🎯 What This Repo Does
This is your documentation-first knowledge base: the map and spec for the Brain-Genetics program.
Repository Structure
kb/
├── model_cards/        # 20 FM specs (17 FMs + 3 reference)
├── paper_cards/        # 30 research papers with structured takeaways
├── datasets/           # 19 dataset schemas (UKB, HCP, Cha, benchmarks)
└── integration_cards/  # 6 integration recipes
What's Documented
| Category | What It Is | Examples |
|---|---|---|
| Genetics FMs | DNA sequence foundation models for gene-level embeddings | Caduceus, DNABERT-2, Evo 2, HyenaDNA, GENERator |
| Brain FMs | Neuroimaging models for fMRI/sMRI subject embeddings | BrainLM, Brain-JEPA, BrainMT, Brain Harmony, SwiFT |
| Multimodal FMs | Clinical & unified multimodal architecture references | BAGEL, MoT, M3FM, Me-LLaMA, TITAN, Flamingo, FMS-Medical |
| Research Papers | Curated paper summaries with implementation notes | RC-equivariance, Ensemble Integration, MURD, Yoon BioKDD'25 |
| Datasets | Data schema specs and preprocessing protocols | UKB (fMRI, sMRI, WES), HCP, Cha developmental, benchmarks |
| Integration & Strategy | Embedding recipes, harmonization, fusion playbooks | genetics_joo_mdd_cog_v1, murd_t1_t2, CCA + permutation |
The Playbook
Strategy: Late fusion → Two-tower contrastive → MoT/unified BOM
| Phase | When | What |
|---|---|---|
| Stage 1 | Now | Per-modality FMs + 512-D embeddings + late fusion |
| Stage 2 | If fusion wins | Two-tower contrastive / EI stacking |
| Stage 3 | Long-term | MoT/BAGEL unified architectures |
Canonical Embedding Recipes
All recipes defined in kb/integration_cards/embedding_strategies.yaml
Query any recipe: python scripts/manage_kb.py ops strategy <recipe_id>
| Recipe ID | Type | Output | Pipeline |
|---|---|---|---|
| genetics_gene_fm_pca512_v1 | genetics | 512-D | Caduceus/DNABERT-2/Evo2 + RC-averaging |
| genetics_joo_mdd_cog_v1 | genetics | 512-D | Prof. Joo's 38 MDD genes ⭐ |
| smri_free_surfer_pca512_v1 | brain | 512-D | FreeSurfer ROIs → residualize → PCA |
| rsfmri_swift_segments_v1 | brain | 512-D | SwiFT segments → mean pool → PCA |
| rsfmri_brainlm_segments_v1 | brain | 512-D | BrainLM CLS tokens → mean pool |
| fusion_concat_gene_brain_1024_v1 | fusion | 1024-D | Concat(Gene₅₁₂ + Brain₅₁₂) |
⭐ = Recommended starting point
Query a recipe, for example the recommended one: python scripts/manage_kb.py ops strategy genetics_joo_mdd_cog_v1
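The genetics recipes rely on RC-averaging, which can be sketched as follows. This is a minimal illustration, not the repo's implementation: `toy_embed` is a hypothetical stand-in for a genetics FM (it only counts 2-mers), and the function names are assumptions made for this example. The idea shown is that averaging the embedding of a sequence with the embedding of its reverse complement gives the same vector for either strand.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
KMERS = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 dinucleotides


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def toy_embed(seq: str) -> list[float]:
    """Hypothetical stand-in for a genetics FM: normalized 2-mer counts."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [counts[k] / total for k in KMERS]


def rc_average(seq: str, embed=toy_embed) -> list[float]:
    """Average the embeddings of a sequence and its reverse complement."""
    fwd = embed(seq)
    rev = embed(reverse_complement(seq))
    return [(a + b) / 2 for a, b in zip(fwd, rev)]


seq = "ACGTTGCAAG"
emb = rc_average(seq)
# Strand invariance: the forward strand and its reverse complement agree.
assert emb == rc_average(reverse_complement(seq))
```

With a real FM, `embed` would be replaced by the model's forward pass; the averaging step itself stays the same.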
🗺️ How to Navigate
→ "I need to understand a specific FM"
Example: Understanding Caduceus
- Overview: Caduceus model docs
- Step-by-step: Caduceus Code Walkthrough
- Code: external_repos/caduceus/
- Metadata: kb/model_cards/caduceus.yaml
→ "I want to run CCA / prediction baselines"
- Read the playbook: docs/integration/integration_strategy.md
- Check the recipe: kb/integration_cards/embedding_strategies.yaml
- Use the config: configs/experiments/01_cca_gene_smri.yaml or configs/experiments/02_prediction_baselines.yaml
→ "How do I preprocess [modality]?"
- Genetics: docs/integration/modality_features/genomics.md
- sMRI: docs/integration/modality_features/smri.md
- fMRI: docs/integration/modality_features/fmri.md
→ "Which harmonization method?"
Query a method: python scripts/manage_kb.py ops harmonization combat_smri
Or read: docs/integration/integration_strategy.md (Harmonization section)
Jan-Feb Action Plan
Meeting Goals: Jan-Feb Wrap-Up
- Test with 20-participant toy sample
- Use new NVIDIA Spark GPU (128GB)
- Offline genetics embeddings (pending)
- Brain features (fMRI parcellation pending)
- Complete Stage-1 baselines
Week 1-2: Small Sample Testing (20 participants)
Goal: Test pipeline on toy sample using new NVIDIA Spark GPU (128GB)
# 1. Prepare the inputs
#    - Brain features (fMRI parcellation)
#    - Genomics embeddings (offline, pre-trained)
# 2. Test embedding extraction
python scripts/manage_kb.py ops strategy genetics_joo_mdd_cog_v1
# 3. Run on NVIDIA Spark GPU
# 4. Verify pipelines work end-to-end
What to test:
- Brain feature download works
- Genomics embeddings load correctly
- CCA runs without errors
- Prediction baselines produce AUROCs
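A smoke check for the "embeddings load correctly" items could look like the sketch below. It assumes embeddings are held in plain dictionaries mapping a subject ID to a list of floats; that layout, the `sub-XX` IDs, and the function name `check_embeddings` are all hypothetical and chosen only for illustration.

```python
import math


def check_embeddings(gene: dict, brain: dict, dim: int = 512) -> list[str]:
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    for name, table in (("gene", gene), ("brain", brain)):
        for subject_id, vec in table.items():
            if len(vec) != dim:
                problems.append(
                    f"{name}:{subject_id} has {len(vec)} dims, expected {dim}"
                )
            elif any(math.isnan(v) or math.isinf(v) for v in vec):
                problems.append(f"{name}:{subject_id} contains NaN/inf")
    # Subjects present in only one modality cannot be used for CCA or fusion.
    unpaired = set(gene) ^ set(brain)
    if unpaired:
        problems.append(f"subjects missing a modality: {sorted(unpaired)}")
    return problems


# Toy sample: 20 hypothetical participants with constant placeholder vectors.
gene = {f"sub-{i:02d}": [0.1] * 512 for i in range(20)}
brain = {f"sub-{i:02d}": [0.2] * 512 for i in range(20)}
print(check_embeddings(gene, brain))  # prints []
```

Running a check like this before CCA keeps shape and pairing errors from surfacing later as harder-to-read numerical failures.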
Week 3-4: Run Stage-1 Experiments
Goal: Gene ↔ Brain correlation + prediction baselines
Use these configs:
1. configs/experiments/01_cca_gene_smri.yaml
   - Gene ↔ sMRI CCA + 1,000 permutations
   - Check whether the leading canonical correlations (ρ₁ onward) are significant (p < 0.05)
2. configs/experiments/02_prediction_baselines.yaml
   - Gene-only → MDD
   - Brain-only → MDD
   - Fusion (Gene+Brain) → MDD
   - DeLong test: Is Fusion > max(Gene, Brain)?
3. Document results in kb/results/
Week 5-8: Decide on Escalation
Decision criteria:
| Result | Signal | Next Action |
|---|---|---|
| Fusion > max(Gene, Brain), p < 0.05 | Strong | Consider two-tower contrastive |
| Fusion ≈ best single modality | Weak | Focus on improving per-modality models |
| CCA strong (ρ₁ > 0.3, p < 0.001) | Strong | Supports two-tower alignment |
| CCA weak (ρ₁ < 0.2 or p > 0.05) | None | Keep late fusion, check preprocessing |
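The two CCA rows of the table can be written as a small helper, sketched here. The thresholds come from the table; the function name `cca_signal` is hypothetical, and the "intermediate" label for results that fall between the two rows is an assumption, since the table does not cover that case.

```python
def cca_signal(rho1: float, p_value: float) -> str:
    """Map the first canonical correlation and its p-value to a signal label."""
    if rho1 > 0.3 and p_value < 0.001:
        return "strong"  # supports two-tower alignment
    if rho1 < 0.2 or p_value > 0.05:
        return "none"  # keep late fusion, check preprocessing
    # Assumption: the table leaves this band undefined, so flag it explicitly.
    return "intermediate"


print(cca_signal(0.35, 0.0005))  # prints strong
print(cca_signal(0.15, 0.01))    # prints none
print(cca_signal(0.25, 0.01))    # prints intermediate
```

Results in the intermediate band are worth discussing case by case instead of forcing them into either row.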
Templates available:
- Two-tower patterns: docs/integration/design_patterns.md
- MoT/BAGEL patterns: docs/integration/multimodal_architectures.md
What You Can Do Now (Before Data)
✅ Available Now
- Read model code walkthroughs: understand how each FM works
- Study embedding recipes: know what preprocessing to apply
- Review experiment configs: understand the analysis pipeline
- Validate YAML cards: python scripts/manage_kb.py validate models
- Clone external repos: get familiar with the FM codebases
🟡 Waiting For
- UKB data access approval (fMRI/sMRI features)
- Genetics embeddings (offline pre-trained)
- Cha Hospital developmental cohort (future)
Onboarding New Team Members
Recommended reading order:
1. This guide (TEAM_GUIDE.md)
2. README.md: high-level overview
3. docs/integration/integration_strategy.md: THE PLAYBOOK
4. configs/experiments/01_cca_gene_smri.yaml: see what we're running
5. Pick one FM code walkthrough to read in detail
🔬 Stage-1 Experiments
Experiment 1: CCA (Gene ↔ Brain Association)
Config: configs/experiments/01_cca_gene_smri.yaml
What it does:
- Tests if gene embeddings share structure with brain embeddings
- 1,000 permutations to assess significance
- Reports the leading canonical correlations (ρ₁ onward) with p-values
Success criteria:
- ρ₁ > 0.2 with p < 0.05 → significant association
- Gene/ROI loadings interpretable
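The permutation logic behind the significance check can be illustrated with a short sketch. To stay dependency-free it uses a single Pearson correlation as a stand-in for the first canonical correlation, so it is not CCA itself; the function names and toy data are hypothetical. The part that carries over is the procedure: shuffle one side to break the subject pairing, recompute the statistic, and count how often the shuffled value reaches the observed one.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Return (observed |r|, permutation p-value) for paired samples x and y."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the subject pairing
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed statistic as one of the permutations,
    # so the p-value can never be exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data for 20 hypothetical participants with a strong linear relation.
x = [float(i) for i in range(20)]
y = [2.0 * v + (1.0 if i % 2 else -1.0) for i, v in enumerate(x)]
rho, p = permutation_p_value(x, y)
```

With 1,000 permutations the smallest attainable p-value is 1/1001, which is worth keeping in mind when a threshold such as p < 0.001 is applied.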
Experiment 2: Prediction (Gene vs Brain vs Fusion)
Config: configs/experiments/02_prediction_baselines.yaml
What it does:
- Compares 3 baselines for MDD prediction:
- Gene-only (512-D)
- Brain-only (512-D)
- Fusion (1024-D concatenation)
- Uses LR + LightGBM + CatBoost
- DeLong test to compare AUROCs
Success criteria:
- If Fusion > max(Gene, Brain) with p < 0.05 → integration adds value
- Document which modality is stronger
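The comparison of the three baselines can be sketched as below. This computes AUROCs and the gain of fusion over the best single modality only; it does not implement the DeLong test, which would be needed to attach a p-value to that gain. The function names and the toy scores are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a case outranks a control (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fusion_gain(labels, gene_scores, brain_scores, fusion_scores):
    """Return per-model AUROCs, the stronger single modality, and the fusion gain."""
    aucs = {
        "gene": auroc(labels, gene_scores),
        "brain": auroc(labels, brain_scores),
        "fusion": auroc(labels, fusion_scores),
    }
    best_single = max(("gene", "brain"), key=aucs.get)
    return aucs, best_single, aucs["fusion"] - aucs[best_single]


# Toy predictions for six hypothetical participants (1 = MDD, 0 = control).
labels = [1, 1, 1, 0, 0, 0]
gene_scores = [0.9, 0.4, 0.6, 0.5, 0.3, 0.2]
brain_scores = [0.7, 0.8, 0.3, 0.6, 0.4, 0.1]
fusion_scores = [0.9, 0.8, 0.6, 0.5, 0.4, 0.2]
aucs, best_single, gain = fusion_gain(labels, gene_scores, brain_scores, fusion_scores)
```

Recording `best_single` alongside the gain covers the "document which modality is stronger" criterion in the same step.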
Experiment 3: LOGO Attribution
Config: configs/experiments/03_logo_gene_attribution.yaml
What it does:
- Leave-one-gene-out ΔAUC
- Identifies which genes contribute most to prediction
- Wilcoxon test + FDR correction
Success criteria:
- Find significant genes (p < 0.05 FDR-corrected)
- Compare with literature (SOD2, HOXA10, etc.)
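The leave-one-gene-out idea can be sketched on toy data. Several simplifications are assumed here: each gene contributes one scalar feature per subject, the "model" is just the sum of those features, and the gene names are placeholders. A real run would re-evaluate a trained classifier per fold and then apply the Wilcoxon test and FDR correction, both of which are omitted.

```python
def auroc(labels, scores):
    """AUROC as the probability that a case outranks a control (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def logo_delta_auc(labels, gene_features):
    """Return {gene: AUC(all genes) - AUC(all genes except this one)}.

    gene_features maps a gene name to one scalar feature per subject.
    """
    n_subjects = len(labels)

    def score(genes):
        # Hypothetical scorer: sum the features of the included genes.
        return [sum(gene_features[g][i] for g in genes) for i in range(n_subjects)]

    genes = list(gene_features)
    full_auc = auroc(labels, score(genes))
    return {
        g: full_auc - auroc(labels, score([h for h in genes if h != g]))
        for g in genes
    }


# Toy data: GENE_A separates cases from controls, GENE_B does not.
labels = [1, 1, 1, 0, 0, 0]
gene_features = {
    "GENE_A": [3, 2, 3, 0, 1, 0],
    "GENE_B": [0, 1, 0, 1, 0, 1],
}
delta = logo_delta_auc(labels, gene_features)
```

A large positive ΔAUC means that removing the gene hurts prediction, which is the quantity the significance test would then be run on.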
Escalation Decision Tree
│
├─ Fusion > single-modality (p < 0.05)?
│  │
│  ├─ YES → CCA also significant?
│  │  │
│  │  ├─ YES → Consider two-tower contrastive
│  │  │        (frozen FMs + small projectors)
│  │  │
│  │  └─ NO → Keep late fusion, improve single-modality
│  │
│  └─ NO → Focus on better per-modality embeddings
│          Try harmonization (ComBat, MURD)
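The tree can also be expressed as a function, sketched below. The branch outcomes are taken from the tree; the function name, its arguments, and the choice to treat "CCA also significant" as a p-value below the same alpha are assumptions made for this example.

```python
def escalation_decision(
    fusion_beats_single: bool,
    fusion_p: float,
    cca_p: float,
    alpha: float = 0.05,
) -> str:
    """Walk the escalation tree and return the recommended next step."""
    if fusion_beats_single and fusion_p < alpha:
        # Assumption: "CCA also significant" means its p-value is below alpha.
        if cca_p < alpha:
            return "Consider two-tower contrastive (frozen FMs + small projectors)"
        return "Keep late fusion, improve single-modality"
    return "Focus on better per-modality embeddings; try harmonization (ComBat, MURD)"


print(escalation_decision(True, 0.01, 0.001))
# prints Consider two-tower contrastive (frozen FMs + small projectors)
```

Keeping the rule in one function makes it straightforward to log the decision next to the results that produced it.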
Data Status
Note: Data Documentation vs Availability
This KB documents how to use data, not when data is ready.
Actual data availability is project-specific and tracked elsewhere.
| Dataset | Docs | Status | Access | Notes |
|---|---|---|---|---|
| hg38 reference | ✅ | Ready | Public | Reference genome |
| Genomic benchmarks | ✅ | Ready | Public | Standard benchmarks |
| UKB fMRI/sMRI | ✅ | Pending | Restricted | Features can be downloaded |
| Genetics embeddings | ✅ | Pending | Internal | Offline pre-trained embeddings |
| Cha Hospital dev | ✅ | Future | Restricted | Developmental research |
Utilities
# Validate YAML cards
python scripts/manage_kb.py validate models
python scripts/manage_kb.py validate datasets
# Query embedding recipe
python scripts/manage_kb.py ops strategy genetics_joo_mdd_cog_v1
# Query harmonization method
python scripts/manage_kb.py ops harmonization combat_smri
# View docs locally
mkdocs serve # Visit http://localhost:8000
# Online docs
https://allison-eunse.github.io/neuro-omics-kb/
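The commands above can be batched from Python with a thin wrapper, sketched here. It uses only the subcommands shown in this guide and assumes it is run from the repository root; the helper names are hypothetical, and the assumption that `manage_kb.py` signals failure through a non-zero exit code is not confirmed by this guide.

```python
import subprocess
import sys


def kb_command(*args: str) -> list[str]:
    """Build the argument list for scripts/manage_kb.py (run from the repo root)."""
    return [sys.executable, "scripts/manage_kb.py", *args]


def run_kb(*args: str) -> subprocess.CompletedProcess:
    """Run one manage_kb.py command and capture its output without raising."""
    return subprocess.run(
        kb_command(*args), capture_output=True, text=True, check=False
    )


# Only the subcommands that appear in this guide.
STAGE1_CHECKS = [
    ("validate", "models"),
    ("validate", "datasets"),
    ("ops", "strategy", "genetics_joo_mdd_cog_v1"),
    ("ops", "harmonization", "combat_smri"),
]


def run_all_checks() -> dict:
    """Return {command string: exit code} for every check in STAGE1_CHECKS."""
    return {" ".join(args): run_kb(*args).returncode for args in STAGE1_CHECKS}
```

Calling `run_all_checks()` from the repository root returns one exit code per command, which is enough for a quick pre-experiment sanity pass.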
❓ FAQ
Which genetics FM should I use?
Answer: Start with the recommended pipeline (genetics_joo_mdd_cog_v1):
- 38 MDD genes from Yoon et al.
- RC-averaged embeddings
- Pre-validated gene set
Then compare with other FMs if needed (Caduceus, DNABERT-2, Evo2)
Should I use sMRI or fMRI features?
Answer: Both are documented:
- sMRI: FreeSurfer ROIs (~176 features), good for structural analysis
- fMRI: Parcellation data, following the fMRI-gene analysis recipes in the integration docs
Start with whichever is available first.
Do I need to build a new FM?
No! Stage-1 uses:
- Existing genetics FMs (pre-trained embeddings)
- Existing brain FMs (SwiFT, BrainLM, etc.)
- Late fusion = just concatenate embeddings
Only escalate to two-tower/unified FM if Stage-1 shows clear fusion benefit.
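Late fusion by concatenation can be sketched in a few lines. The dictionary layout (subject ID mapped to a list of floats), the `sub-XX` IDs, and the function name are hypothetical; the only point illustrated is that two 512-D vectors per subject become one 1024-D vector, and that subjects missing either modality are dropped.

```python
def late_fusion(gene: dict, brain: dict) -> dict:
    """Concatenate gene and brain embeddings for subjects present in both."""
    shared = sorted(set(gene) & set(brain))
    return {sid: list(gene[sid]) + list(brain[sid]) for sid in shared}


# Toy embeddings: sub-03 has no brain embedding and is therefore dropped.
gene = {f"sub-{i:02d}": [0.1] * 512 for i in range(1, 4)}
brain = {f"sub-{i:02d}": [0.2] * 512 for i in range(1, 3)}
fused = late_fusion(gene, brain)
print(len(fused), len(fused["sub-01"]))  # prints 2 1024
```

Gene features occupy the first 512 positions and brain features the last 512, so downstream attribution can still be traced back to a modality.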
What about Cha Hospital / developmental data?
Future work. The KB has:
- Dataset card template: kb/datasets/cha_dev_longitudinal.yaml
- Embedding recipes: cha_dev_smri_pca64_v1, cha_dev_eeg_fm_v1, cha_dev_behaviour_latent_v1
- Experiment templates: configs/experiments/dev_01_brain_only_baseline.yaml
Focus on UKB first (Jan-Feb wrap-up), then extend to developmental.
Key Principle
This KB answers:
- ✅ "How do I extract embeddings?"
- ✅ "Which FM should I use?"
- ✅ "How do I run CCA?"
- ✅ "When should I escalate to two-tower?"
Questions?
- Model choice: Check docs/models/<category>/index.md
- Integration strategy: Read docs/integration/integration_strategy.md
- Embedding recipes: Query python scripts/manage_kb.py ops strategy <id>
- Everything else: Ask Allison or check online docs
Bottom Line: This repo is your map + spec. Run Stage-1 experiments (CCA + prediction baselines) end-to-end, then decide on escalation based on results.
Jan-Feb Goal: Complete Stage-1 with offline genetics embeddings + brain features → document results → decide next steps.