Analysis Recipe: CCA + Permutation

Goal

Quantify cross-modal associations (e.g., gene embeddings vs sMRI features) while controlling confounds and validating significance via permutation testing.

Reference implementations (your repos)

gene-brain-CCA/scripts/run_cca.py: CCA/SCCA + permutation testing (saves U/V/W + perm_rs.npy + results.json).
nesap-genomics/embedding/README.md: standardized embedding export formats + flags.
scripts/check_embedding_alignment.py (this repo): validate Yoon/GENNIElab DNABERT‑2 export roots (iids/labels/covariates aligned; chunk shapes consistent).

Inputs

Residualized and z-scored feature matrices X (modality A) and Y (modality B) per fold, with rows in the same subject order.
Covariate design matrices (age, sex, site/scanner, motion FD, SES, genetic PCs).
Train/test splits (identical across modalities).
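
One way to keep splits identical across modalities is to define folds on subject IDs and then look up rows per modality, so differing row orders cannot misalign X and Y. The sketch below is a minimal illustration under assumptions: `make_folds`, `rows_for`, and the `sub-XXX` ID format are hypothetical names, not taken from the referenced repos.

```python
import numpy as np

def make_folds(iids, n_folds, seed):
    """Split subject IDs into (train_ids, test_ids) pairs, one pair per fold."""
    iids = np.asarray(iids)
    order = np.random.default_rng(seed).permutation(len(iids))
    chunks = np.array_split(order, n_folds)
    folds = []
    for k in range(n_folds):
        test_idx = np.sort(chunks[k])
        train_idx = np.sort(np.concatenate([c for j, c in enumerate(chunks) if j != k]))
        folds.append((iids[train_idx], iids[test_idx]))
    return folds

def rows_for(matrix_iids, wanted_iids):
    """Row positions of wanted_iids inside a matrix whose rows follow matrix_iids."""
    lookup = {iid: row for row, iid in enumerate(matrix_iids)}
    return np.array([lookup[iid] for iid in wanted_iids])

# Synthetic stand-in: two modalities that store the same subjects in different orders.
rng = np.random.default_rng(0)
iids = np.array([f"sub-{i:03d}" for i in range(20)])  # hypothetical ID format
x_ids = iids
shuffle = rng.permutation(20)
y_ids = iids[shuffle]
X = np.arange(20, dtype=float)[:, None]   # column 0 encodes the subject number
Y = shuffle.astype(float)[:, None]        # same encoding, shuffled row order

folds = make_folds(iids, n_folds=5, seed=0)
train_ids, test_ids = folds[0]
X_train = X[rows_for(x_ids, train_ids)]
Y_train = Y[rows_for(y_ids, train_ids)]   # now row-aligned with X_train
```

Indexing by ID rather than by row position means the same fold definition can be reused for any modality exported with an `iids` list.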

Preprocessing Checklist

Fit scaler + residualization models on train split only; apply to both train/test within the fold.
Optional PCA/MLP projector (often to ≤512-D) per modality; store fit parameters.
Log confound regression coefficients for reproducibility.
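
The train-only fitting rule above can be sketched with plain NumPy. This is an illustration under assumptions, not the code in the referenced repos: `fit_residualizer` and `apply_residualizer` are hypothetical helpers, confounds are removed by ordinary least squares with an intercept, and the data are synthetic.

```python
import numpy as np

def fit_residualizer(features, covariates):
    """Fit OLS confound regression and a z-scorer on the TRAIN split only."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    beta, *_ = np.linalg.lstsq(design, features, rcond=None)
    resid = features - design @ beta
    std = resid.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # guard against constant columns
    return {"beta": beta, "mean": resid.mean(axis=0), "std": std}

def apply_residualizer(params, features, covariates):
    """Apply the train-fitted regression and scaling to any split."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    resid = features - design @ params["beta"]
    return (resid - params["mean"]) / params["std"]

# Synthetic stand-in data: 3 covariates (e.g., age, sex, motion FD) and 6 features.
rng = np.random.default_rng(0)
n = 200
covars = rng.normal(size=(n, 3))
features = covars @ rng.normal(size=(3, 6)) + rng.normal(size=(n, 6))
train, test = np.arange(0, 150), np.arange(150, n)

params = fit_residualizer(features[train], covars[train])
X_train = apply_residualizer(params, features[train], covars[train])
X_test = apply_residualizer(params, features[test], covars[test])

# params["beta"] holds the confound regression coefficients (intercept in row 0);
# these are the values the checklist asks to log.
```

Because the test split reuses `params` from the train split, no test-set statistics leak into the preprocessing.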

Procedure

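Fit CCA on train data: cca = CCA(n_components=k). If this is scikit-learn's sklearn.cross_decomposition.CCA, note that it exposes no shrinkage parameter; when regularization is needed (e.g., features outnumber subjects), reduce dimensionality first or use a regularized/sparse variant (SCCA).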
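Use cca.fit_transform(X_train, Y_train) to obtain both canonical variates (U, V). Don't call cca.transform(Y) expecting Y-side scores: the first argument is always treated as X. Pass both matrices, cca.transform(X, Y), to get (U, V).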
Transform: Project both the train and test sets with the train-fitted model, so all canonical variates within a fold share one projection space.
Record metrics: Canonical correlations (ρ₁…ρ_k), variance explained, loadings.
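Permutation test: Shuffle subject order in Y (modality B) within the train split and refit CCA for each of B permutations (B ≥ 1,000; B here counts permutations, not the modality); store perm_rs[b, i], the i-th canonical correlation from permutation b.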
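p-values: For each component i, p_i = (count_b(perm_rs[b, i] ≥ ρ_i) + 1) / (B + 1), where B is the number of permutations. The +1 terms count the observed statistic as one draw from the null, so p is never exactly 0.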
Confidence intervals (optional): Bootstrap subjects within folds.
Partial correlations to outcomes: Regress canonical scores and clinical targets on covariates; correlate residuals or use covariate-adjusted regression.

Logging

Save U, V, W_x, W_y, canonical correlations, perm_rs, and p-values to artifacts/generated/cca/<experiment_id>/.
Store config (modalities, projectors, covariates, seeds) alongside results.
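
A minimal sketch of writing these artifacts with NumPy and JSON follows. The `perm_rs.npy` and `results.json` names come from the reference script description above; the other file names, the `save_cca_run` helper, the config values, and the `demo_run` experiment ID are assumptions for illustration. The example writes under a temporary root so it can be run anywhere.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def save_cca_run(out_dir, arrays, results, config):
    """Write each array as <name>.npy plus results.json and config.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, arr in arrays.items():
        np.save(out_dir / f"{name}.npy", np.asarray(arr))
    (out_dir / "results.json").write_text(json.dumps(results, indent=2))
    (out_dir / "config.json").write_text(json.dumps(config, indent=2))
    return out_dir

# Placeholder arrays standing in for real CCA outputs.
rng = np.random.default_rng(0)
arrays = {
    "U": rng.normal(size=(150, 2)),
    "V": rng.normal(size=(150, 2)),
    "W_x": rng.normal(size=(5, 2)),
    "W_y": rng.normal(size=(4, 2)),
    "perm_rs": rng.uniform(size=(99, 2)),
}
results = {"canonical_correlations": [0.81, 0.12], "p_values": [0.01, 0.47]}  # made-up numbers
config = {
    "modalities": ["gene_embeddings", "smri"],   # hypothetical labels
    "projectors": {"gene_embeddings": "pca_512", "smri": None},
    "covariates": ["age", "sex", "site"],
    "seeds": {"split": 0, "permutation": 1},
}

root = Path(tempfile.mkdtemp())  # stand-in for the repository root
run_dir = save_cca_run(root / "artifacts" / "generated" / "cca" / "demo_run",
                       arrays, results, config)
```

Keeping `config.json` next to the arrays lets a run be reproduced from the output directory alone.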

Reporting Template

Table of top 3 ρ with permutation p-values and 95% CI.
Heatmap or bar chart of feature loadings (with sparse thresholding if needed).
Partial correlation table linking canonical scores to clinical outcomes (effect size, p, FDR q).
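
The partial correlation table can be produced along these lines. This is a sketch under assumptions: both variables are residualized on the covariates by ordinary least squares, the p-value uses a t-test with n - 2 - (number of covariates) degrees of freedom, and q-values use the Benjamini-Hochberg procedure. `residualize`, `partial_corr`, and `bh_fdr` are hypothetical helpers, and the data are synthetic.

```python
import numpy as np
from scipy import stats

def residualize(y, covariates):
    """Remove the linear effect of covariates (plus intercept) from y."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

def partial_corr(score, outcome, covariates):
    """Correlation of covariate-adjusted residuals with a two-sided t-test p-value."""
    r = np.corrcoef(residualize(score, covariates), residualize(outcome, covariates))[0, 1]
    df = len(score) - 2 - covariates.shape[1]
    t = r * np.sqrt(df / (1.0 - r**2))
    return r, 2.0 * stats.t.sf(abs(t), df)

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotone q-values
    q = np.empty(m)
    q[order] = np.clip(ranked, 0.0, 1.0)
    return q

# Synthetic stand-in: one outcome is tied to the score only through a covariate,
# the other is linked to the score directly.
rng = np.random.default_rng(1)
n = 500
covs = rng.normal(size=(n, 2))
score = 2 * covs[:, 0] + rng.normal(size=n)
confounded = 2 * covs[:, 0] + rng.normal(size=n)
linked = 0.5 * score + rng.normal(size=n)

raw_r = np.corrcoef(score, confounded)[0, 1]
results = [partial_corr(score, y, covs) for y in (confounded, linked)]
q_values = bh_fdr([p for _, p in results])
```

In this toy example the raw correlation with `confounded` is high, but its partial correlation should sit near zero once the covariate is removed, which is the pattern the table is meant to expose.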

References

EI & oncology multimodal review for integration motivation.
Hotelling (1936) CCA; Witten et al. (2009) sparse CCA; permutation-testing guidelines.