π Integration Strategy¶
Late fusion first, then escalate to contrastive and unified architectures
π― Overall Philosophy¶
Late fusion / integration first, then scale if we see gains.
π§¬π§ Why This Applies to Genes Γ Brain¶
- Heterogeneous semantics: Nucleic-acid sequence vs morphology/dynamics β maximize modality specificity before fusion
- Different confounds: Ancestry/batch vs site/motion/TR β deconfound independently
π Baselines¶
- Preprocess per modality
- Z-score features.
- Residualize against: age, sex, site/scanner, motion (FD), SES (if available), genetic PCs (PC1βPC10).
- Dimensionality
- Project to 512 dims per modality (PCA or tiny MLP).
- CCA + permutation
- CCA on train folds; 1,000 shuffles; report Ο1βΟ3 with p-values.
- Prediction
- LR (balanced) and LightGBM/CatBoost on Gene, Brain, Fusion; same CV folds; AUROC/AUPRC; DeLong/bootstrap for Fusion vs single-modality.
π Embedding Strategy Registry¶
Recipes live under kb/integration_cards/embedding_strategies.yaml
Print them via:
Available strategies:
- sMRI (smri_free_surfer_pca512_v1). FreeSurfer ROI table (~176 features) β fold-wise z-score β residualize age/sex/site/ICV β PCAβ512. Future variants: FM encoders, diffusion MRI. Sources: docs/integration/modality_features/smri.md, FreeSurfer refs.
- rs-fMRI baseline (rsfmri_swift_segments_v1). SwiFT exports per 20-frame segment β mean pool tokens β run mean β subject mean β residualize age/sex/site/motion β PCAβ512. Variants exist for BrainLM (rsfmri_brainlm_segments_v1), Brain-JEPA (rsfmri_brainjepa_roi_v1), and BrainMT (rsfmri_brainmt_segments_v1); each references the corresponding walkthrough/code.
- Genetics (genetics_gene_fm_pca512_v1). RC-averaged gene FMs (Caduceus/DNABERT-2/Evo2/HyenaDNA/GENERaTOR) β exon β gene pooling β concatenate curated gene set β residualize age/sex/ancestry PCs/batch β PCAβ512. For nonβRC-equivariant encoders, follow RCCR/RC-averaging guidance in the genomics modality spec.
- Fusion (fusion_concat_gene_brain_1024_v1). Concatenate the 512-D genetics vector with the chosen 512-D brain vector; z-score each block independently before concatenation.
Experiments now reference these IDs (see configs/experiments/*.yaml) to keep per-subject embeddings traceable.
π§ Harmonization & Site Effects¶
Cataloged in kb/integration_cards/harmonization_methods.yaml
Query via:
Available harmonization methods:
- Default (none_baseline). Feature-level z-score + covariate residualization; always record site/motion covariates.
- Statistical (combat_smri). ROI-wise ComBat before PCA for sMRI (Fortin et al., 2018). Run the 02_harmonization_ablation_smri config to benchmark vs. the baseline.
- Deep (murd_t1_t2). Apply MURD (Liu & Yap 2024) to T1/T2 volumes before FreeSurfer extraction; compare vs. ComBat and baseline to judge if image-space harmonization improves CCA/prediction.^[See MURD paper summary: docs/generated/kb_curated/papers-md/murd_2024.md and card kb/paper_cards/murd_2024.yaml.]
- Representation unlearning (site_unlearning_module). Optional adversarial head that removes site labels from embedding space (Dinsdale et al., 2021); treat as experimental until harmonization ablations justify it.^[See site-unlearning paper summary: docs/generated/kb_curated/papers-md/dinsdale_site_unlearning_2021.md and card kb/paper_cards/dinsdale_site_unlearning_2021.yaml.]
Record harmonization IDs (and preprocessing pipeline IDs such as rsfmri_preprocessing_pipelines.hcp_like_minimal) alongside embedding strategy IDs in every run.
π Escalation Criteria¶
If Fusion > max(Gene, Brain) with p < 0.05 (DeLong/bootstrap), consider: - Two-tower contrastive alignment (frozen encoders; small projectors). - EI stacking over per-modality models. - Harmony-style hub tokens/TAPE if TR/site heterogeneity limits fMRI.
π Interpretability¶
- LOGO ΞAUC with Wilcoxon + FDR for gene attribution
- CCA loadings: Partial correlations of axes with outcomes (covariate-adjusted)
Tabular FM (TabPFN) baseline¶
- TabPFN (Nature 2024) is tracked under
kb/model_cards/tabpfn.yaml. It is a predictor, not an fMRI encoder. - Use TabPFN as a strong small-N tabular baseline on:
- Raw sMRI ROI tables (
smri_free_surfer_raw_176). - Genetics summary tables (
genetics_pgs_20traits). - Early-fusion tabular features (ROI + PGS).
- Compare TabPFN vs. LR/LightGBM in
configs/experiments/03_prediction_baselines_tabular.yamlto quantify how much structured embeddings help beyond a tabular FM.
Risks and mitigations
- Leakage: do scaling/residualization within train folds; apply transforms to test.
- Site imbalance: use group/site-aware CV when feasible.
- Overfitting at high dims: prefer 256β512; regularize LR; early stopping for GBDT.