Analysis Recipe: CCA + Permutation

Goal

Quantify cross-modal associations (e.g., gene embeddings vs sMRI features) while controlling confounds and validating significance via permutation testing.

Reference implementations (your repos)

  • gene-brain-CCA/scripts/run_cca.py: CCA/SCCA + permutation testing (saves U/V/W + perm_rs.npy + results.json).
  • nesap-genomics/embedding/README.md: standardized embedding export formats + flags.
  • scripts/check_embedding_alignment.py (this repo): validate Yoon/GENNIElab DNABERT‑2 export roots (iids/labels/covariates aligned; chunk shapes consistent).

Inputs

  • Residualized and z-scored feature matrices X (modality A) and Y (modality B), one pair per fold, with rows in the same subject order.
  • Covariate design matrices (age, sex, site/scanner, motion FD, SES, genetic PCs).
  • Train/test splits (identical across modalities).

Preprocessing Checklist

  1. Fit the scaler and residualization models on the train split only; apply the fitted transforms to both the train and test splits within the fold.
  2. Optional PCA/MLP projector (often to ≤512-D) per modality; store fit parameters.
  3. Log confound regression coefficients for reproducibility.
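
A minimal sketch of steps 1 and 3, assuming ordinary least-squares confound regression followed by z-scoring of the residuals. The function names fit_residualizer and apply_residualizer are hypothetical and are not taken from the referenced repos; the returned beta holds the confound regression coefficients that step 3 asks to log.

```python
import numpy as np

def fit_residualizer(X_train, C_train):
    """Fit confound regression and z-scoring on the TRAIN split only."""
    # Design matrix: intercept plus covariates (age, sex, site, ...).
    D = np.column_stack([np.ones(len(C_train)), C_train])
    beta, *_ = np.linalg.lstsq(D, X_train, rcond=None)
    resid = X_train - D @ beta
    mu = resid.mean(axis=0)
    sd = resid.std(axis=0)
    sd = np.where(sd > 0, sd, 1.0)  # guard against constant features
    return {"beta": beta, "mu": mu, "sd": sd}

def apply_residualizer(params, X, C):
    """Apply the train-fitted parameters to any split (train or test)."""
    D = np.column_stack([np.ones(len(C)), C])
    resid = X - D @ params["beta"]
    return (resid - params["mu"]) / params["sd"]
```

Calling apply_residualizer with the same params on the test split keeps test subjects out of every fitted quantity, which is the point of step 1.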

Procedure

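  1. Fit CCA on train data: cca = CCA(n_components=k). If this is scikit-learn's CCA, note that it exposes no shrinkage parameter; when regularization is needed, reduce dimensionality first (Preprocessing Checklist, step 2) or use a regularized/sparse variant (SCCA).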
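  2. Use cca.fit_transform(X_train, Y_train) to obtain both canonical variates (U, V). Do not call cca.transform(Y) expecting Y-side scores: with a single argument, transform treats its input as X-side data.
  3. Transform: Obtain canonical variates for both train and test sets from the model fitted on train, passing both matrices, e.g. cca.transform(X_test, Y_test), so that the two splits share one projection space per fold.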
  4. Record metrics: Canonical correlations (ρ₁…ρ_k), variance explained, loadings.
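  5. Permutation test: Shuffle subject order in modality B only (within the train split), leaving modality A fixed, and refit CCA for each of n_perm permutations (n_perm ≥ 1,000); store perm_rs[b, i] for permutation b and component i.
  6. p-values: For each component i, p = (count(ρ_null ≥ ρ_obs) + 1) / (n_perm + 1), where ρ_null runs over perm_rs[:, i].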
  7. Confidence intervals (optional): Bootstrap subjects within folds.
  8. Partial correlations to outcomes: Regress canonical scores and clinical targets on covariates; correlate residuals or use covariate-adjusted regression.

Logging

  • Save U, V, W_x, W_y, canonical correlations, perm_rs, and p-values to artifacts/generated/cca/<experiment_id>/.
  • Store config (modalities, projectors, covariates, seeds) alongside results.
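
One possible layout for these artifacts, sketched with NumPy and the standard library. The function name save_cca_artifacts and the file name config.json are assumptions; only perm_rs.npy and results.json are named by the reference script above.

```python
import json
from pathlib import Path

import numpy as np

def save_cca_artifacts(out_dir, arrays, config, summary):
    """Write arrays as .npy files and config/summary as JSON in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # arrays: e.g. {"U": U, "V": V, "W_x": W_x, "W_y": W_y, "perm_rs": perm_rs}
    for name, arr in arrays.items():
        np.save(out / f"{name}.npy", np.asarray(arr))
    # config: modalities, projectors, covariates, seeds (JSON-serializable).
    (out / "config.json").write_text(json.dumps(config, indent=2))
    # summary: canonical correlations and p-values as plain lists.
    (out / "results.json").write_text(json.dumps(summary, indent=2))
    return out
```

NumPy arrays are not JSON-serializable, so values placed in summary would need converting first, e.g. p_values.tolist().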

Reporting Template

  • Table of top 3 ρ with permutation p-values and 95% CI.
  • Heatmap or bar chart of feature loadings (with sparse thresholding if needed).
  • Partial correlation table linking canonical scores to clinical outcomes (effect size, p, FDR q).
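
The effect size and FDR q columns of the partial correlation table could be produced along these lines. This is a sketch assuming linear covariate adjustment and the Benjamini-Hochberg procedure; partial_corr and bh_fdr are hypothetical helper names, and the p-values passed to bh_fdr are assumed to come from whatever test the analysis uses.

```python
import numpy as np

def partial_corr(score, target, C):
    """Correlate score and target after regressing covariates C out of both."""
    D = np.column_stack([np.ones(len(C)), C])  # intercept + covariates

    def resid(v):
        beta, *_ = np.linalg.lstsq(D, v, rcond=None)
        return v - D @ beta

    return np.corrcoef(resid(score), resid(target))[0, 1]

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q
```

For example, bh_fdr([0.01, 0.04, 0.03, 0.005]) returns [0.02, 0.04, 0.04, 0.02].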

References

  • EI & oncology multimodal review for integration motivation.
  • Hotelling (1936) CCA; Witten et al. (2009) sparse CCA; permutation-testing guidelines.