Analysis Recipe: CCA + Permutation
Goal
Quantify cross-modal associations (e.g., gene embeddings vs sMRI features) while controlling confounds and validating significance via permutation testing.
Reference implementations (your repos)
- gene-brain-CCA/scripts/run_cca.py: CCA/SCCA + permutation testing (saves U/V/W + perm_rs.npy + results.json).
- nesap-genomics/embedding/README.md: standardized embedding export formats + flags.
- scripts/check_embedding_alignment.py (this repo): validate Yoon/GENNIElab DNABERT‑2 export roots (iids/labels/covariates aligned; chunk shapes consistent).
Inputs
- Residualized & z-scored feature matrices X (modalities A) and Y (modalities B) per fold.
- Covariate design matrices (age, sex, site/scanner, motion FD, SES, genetic PCs).
- Train/test splits (identical across modalities).
Preprocessing Checklist
- Fit scaler + residualization models on train split only; apply to both train/test within the fold.
- Optional PCA/MLP projector (often to ≤512-D) per modality; store fit parameters.
- Log confound regression coefficients for reproducibility.
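The checklist above can be sketched as follows. This is a minimal illustration, not the code from the referenced repos: it assumes plain OLS confound regression with an intercept, and the function names fit_residualizer and apply_residualizer are hypothetical. All parameters (regression coefficients, mean, standard deviation) are estimated on the train split and then reused unchanged on the test split.

```python
import numpy as np

def fit_residualizer(X_train, C_train):
    """Fit confound regression and z-score statistics on the TRAIN split only.

    X_train: (n_subjects, n_features) feature matrix.
    C_train: (n_subjects, n_covariates) covariate design matrix.
    """
    D = np.column_stack([np.ones(len(C_train)), C_train])  # add intercept column
    beta, *_ = np.linalg.lstsq(D, X_train, rcond=None)      # per-feature OLS fit
    resid = X_train - D @ beta
    mean = resid.mean(axis=0)
    std = resid.std(axis=0, ddof=1)
    std[std == 0] = 1.0                                      # guard constant features
    # Keep beta so the confound coefficients can be logged for reproducibility.
    return {"beta": beta, "mean": mean, "std": std}

def apply_residualizer(params, X, C):
    """Apply train-fitted parameters to any split (train or test) of the same fold."""
    D = np.column_stack([np.ones(len(C)), C])
    return (X - D @ params["beta"] - params["mean"]) / params["std"]
```

Calling apply_residualizer with the same params on both splits keeps test-set information out of the fitted parameters, which is the point of the first checklist item.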
Procedure
- Fit CCA on train data: cca = CCA(n_components=k), with shrinkage/regularization if needed.
- Use cca.fit_transform(X_train, Y_train) to obtain both canonical variates (U, V). Don't call cca.transform(Y) expecting Y-side scores: assuming scikit-learn's CCA, a single-argument transform treats its input as X-side data, so pass both matrices to get both score sets.
- Transform: Obtain canonical variates for both train and test sets (same projection space per fold).
- Record metrics: Canonical correlations (ρ₁…ρ_k), variance explained, loadings.
- Permutation test: Shuffle subject order in modality B (within the train split) and refit CCA B times (B ≥ 1,000 permutations); store perm_rs[b, i] for permutation b and canonical component i.
- p-values: p = (count(ρ_null ≥ ρ_obs) + 1) / (B + 1), where B is the number of permutations.
- Confidence intervals (optional): Bootstrap subjects within folds.
- Partial correlations to outcomes: Regress canonical scores and clinical targets on covariates; correlate residuals or use covariate-adjusted regression.
Logging
- Save U, V, W_x, W_y, canonical correlations, perm_rs, and p-values to artifacts/generated/cca/<experiment_id>/.
- Store config (modalities, projectors, covariates, seeds) alongside results.
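One possible way to write these artifacts is sketched below. The function name save_cca_run and the file name config.json are assumptions made for illustration; perm_rs.npy and results.json follow the names mentioned for run_cca.py, but the exact on-disk layout of that script is not confirmed here.

```python
import json
from pathlib import Path

import numpy as np

def save_cca_run(out_dir, arrays, rhos, p_values, config):
    """Write arrays as .npy files and scalar results/config as JSON.

    arrays: dict mapping a name (e.g. "U", "V", "W_x", "W_y", "perm_rs") to an array.
    config: JSON-serializable dict (modalities, projectors, covariates, seeds).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, arr in arrays.items():
        np.save(out / f"{name}.npy", np.asarray(arr))
    results = {
        "canonical_correlations": np.asarray(rhos).tolist(),
        "p_values": np.asarray(p_values).tolist(),
    }
    (out / "results.json").write_text(json.dumps(results, indent=2))
    (out / "config.json").write_text(json.dumps(config, indent=2))
    return out
```

Here out_dir would be the experiment directory under artifacts/generated/cca/. Keeping the config next to the arrays means a run can be re-created from its directory alone.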
Reporting Template
- Table of top 3 ρ with permutation p-values and 95% CI.
- Heatmap or bar chart of feature loadings (with sparse thresholding if needed).
- Partial correlation table linking canonical scores to clinical outcomes (effect size, p, FDR q).
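The partial correlation table can be filled in with something like the sketch below. It assumes the residual-correlation approach (regress both the canonical score and the clinical target on the covariates, then correlate the residuals) and Benjamini-Hochberg for the FDR q column; the function names are hypothetical, and the p-values passed to bh_fdr are assumed to come from whatever test you use per outcome.

```python
import numpy as np

def residualize(y, C):
    """Remove the linear effect of covariates C (plus intercept) from vector y."""
    D = np.column_stack([np.ones(len(C)), C])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    return y - D @ beta

def partial_corr(score, target, C):
    """Correlation between canonical score and target after covariate adjustment."""
    return float(np.corrcoef(residualize(score, C), residualize(target, C))[0, 1])

def bh_fdr(pvals):
    """Benjamini-Hochberg q-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q
```

The effect-size column is the value returned by partial_corr. Apply bh_fdr across all score-outcome pairs reported in the same table so that the q-values reflect the full set of tests.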
References
- EI & oncology multimodal review for integration motivation.
- Hotelling (1936) CCA; Witten et al. (2009) sparse CCA; permutation-testing guidelines.