CCA + Permutation¶

This is the Stage‑1 diagnostic for cross‑modal structure (genes ↔ brain) before you invest in heavier fusion.

Reference implementations (your sibling repos)

gene-brain-CCA/scripts/run_cca.py: CCA/SCCA + permutation p‑values, saves canonical variates + weights + perm_rs.npy.
gene-brain-CCA/scripts/align_resid_pca.py: subject alignment + residualize(age, sex) + z‑score + PCA (portable baseline).
gene-brain-CCA/scripts/build_x_fmri_fc.py: ROI time series → FC upper‑triangle features (Fisher‑z) (optional classical baseline).
nesap-genomics/embedding/README.md: standardized embedding export formats (embeddings_{k}_layer_last.npy, pooling flags, etc.).
scripts/check_embedding_alignment.py (this repo): reusable validator for Yoon/GENNIElab-style DNABERT‑2 embedding roots (111 genes × 49 chunks, aligned to iids.npy).

Inputs

X_gene, X_brain: residualized and standardized matrices (N × d_gene, N × d_brain).
In practice you often project to ≤512 dims per modality (PCA/MLP) for stability, but it’s not required if the raw feature dims are already moderate (e.g., fMRI ROI mean 180).
Covariates: used upstream during residualization.
Metadata: record embedding_strategies.<id>, harmonization_methods.<id>, and (for fMRI) rsfmri_preprocessing_pipelines.<id> to ensure results are traceable.

Context in integration plan

This recipe is part of the diagnostic / exploration layer of the integration stack.
Run it after per-modality sanity checks but before heavier fusion models; it tells you whether there is cross-modal structure worth chasing.
Treat it as a companion to the late-fusion-first baselines rather than a replacement for prediction experiments.

Protocol

0. Build per-subject feature matrices (X_gene, X_brain)¶

Genetics (UKB; Yoon/GENNIElab DNABERT‑2 exports):
Asset you have now:
- iids.npy, labels.npy, covariates_age.npy, covariates_sex.npy are all aligned and have N = 28,932 rows.
- Embeddings are 111 genes × 768‑d per gene, sharded into 49 chunks per gene as <EMBED_ROOT>/<GENE>/embeddings_{k}_layer_last.npy.
- Practical mapping rule: row r ↔ iids[r], and the same row aligns with labels/age/sex.
Subject‑level representations (all derived from X_3d ∈ R^{N×G×F}):
- Per-gene scalar reduction (current Stage‑1 baseline in gene-brain-CCA):
- reduce each gene’s 768‑d embedding to 1 scalar (mean/max/median over F) → X_ng ∈ R^{N×G} (here G=111)
- if you run PCA with target ≥G, it caps at 111 components: (k=\min(512, G)=111) (so “gene PCA = 111”)
- Wide concat (expressive, higher compute): X_wide = X_3d.reshape(N, G·F) (then usually PCA → 256–512).
- Gene-mean pool (global baseline): X_mean[n,f] = (1/G) Σ_g X_3d[n,g,f] → R^{N×F}.
- Gene-max / top‑k mean (localized baseline): X_max[n,f] = max_g X_3d[n,g,f] (or top‑k mean).
- Tip: standardize within each gene block before pooling across genes, otherwise high-variance genes dominate max.
Brain (sMRI): FreeSurfer ROI tables (~176 features) → residualize/z‑score → PCA.
Brain (fMRI; ROI mean baseline you have now):
You built subject-aligned matrices from UKB ROI time series (HCP MMP1 parcellation) and saved:
- fmri_eids_180.npy: (40,792,)
- fmri_X_180.npy: (40,792, 180) (time‑mean ROI features)
180 vs 360 inconsistency handling:
- If only 360 exists: reshape (360,) → (2,180) and average halves → 180 (fallback to avoid dropping subjects).
- Keep a separate native 360‑d variant for analyses that need it (smaller N).
Operational tip (server): keep stable canonical symlinks (e.g., fmri_X_180.npy) pointing at the latest build output so downstream code never changes.
Brain (optional fMRI classical FC baseline): ROI time series → FC matrix → Fisher‑z → vectorize upper triangle (length (R(R-1)/2)) → residualize/z‑score → PCA (see gene-brain-CCA/scripts/build_x_fmri_fc.py).

0.5 Build the subject intersection (required before CCA/SCCA)¶

Before any CCA/SCCA is meaningful, build the intersection between: - genetics iids (N=28,932) and - fMRI fmri_eids_180 (N=40,792), then reorder both modalities into the same EID order: - X_gene: (N_overlap × p_gene) - Y_fmri: (N_overlap × 180)

This is the last plumbing step that makes “row i” mean the same subject in both modalities.

1. Alignment + preprocessing (do this before CCA)¶

Align rows by subject ID (intersection; handle duplicates deterministically).
Residualize both modalities on the same covariates (typical: age, sex, site/scanner; add ICV for sMRI; add motion for fMRI; add ancestry PCs/batch for genetics).
Z‑score features and project (PCA/MLP) fit on train split only if you will evaluate anything out-of-sample.

2. Fold discipline (leakage control)¶

Use K stratified folds (make them site-/scanner-aware when possible).
For each fold:
Fit preprocessing (residualization, scaler, PCA/projector) on the train split and apply to train+test.
Fit CCA on (X_gene_train, X_brain_train).
Transform train and held-out fold to canonical scores so downstream metrics share the same projection space.

If you are only producing descriptive “joint embeddings” with no downstream evaluation, you can fit on the full aligned dataset—but do not report generalization claims from that.

3. Fit CCA (implementation gotcha)¶

In sklearn, use cca.fit_transform(X, Y) to obtain both canonical variates (U) and (V).
cca.transform(Y) is not the Y-side transform; it treats Y as X and returns X-side scores (see note in gene-brain-CCA/scripts/run_cca.py).
Compute canonical correlations as ( \rho_i = \mathrm{corr}(U_i, V_i) ) for (i=1..k).
Save weights/loadings for interpretation (genes/ROIs that drive each component).

4. Permutation test¶

Set B = 1,000 (or at least 500 for quick scans).
For each b ∈ {1 … B} inside the training split:
Permute subject order in one modality (typically shuffle rows of modality B) after preprocessing so covariates are fixed.
Refit CCA on the permuted pair (the null must reflect “best possible correlation under broken correspondence”).
Store canonical correlations ( \rho_i^{(b)} ) (at least for (i=1..3)).
Compute one-sided empirical p-values (per component):
p_i = ( # {ρ_i^(b) ≥ ρ_i_obs} + 1 ) / (B + 1 ).

Optional (recommended if you compare multiple components): - Max-statistic null (FWER across components): store max_i ρ_i^(b) per permutation and compute p-values against that null. - Or apply BH-FDR to the per-component permutation p-values.

5. Reporting & interpretation¶

Report ρ1–ρ3 with permutation p-values (per fold and averaged).
Store the full null distributions (perm_rs[b, i]) so you can re-plot/re-test later.
Optionally bootstrap the canonical correlations for 95 % CIs.
Surface top loadings / feature contributions for both modalities to explain shared signal.

Recommended ablation grid (ties to your pooling + CCA/SCCA story)¶

Run with identical folds and identical covariate handling: - Wide + CCA - Wide + SCCA - Gene-mean pooled + CCA - Gene-max (or top‑k mean) pooled + CCA

Optional (if you have time): run pooled + SCCA too. This cleanly separates “localization via pooling” from “localization via sparse coupling”.

Why pair CCA with permutations?

CCA will always produce non-zero canonical correlations—even when there is no shared structure—because it can overfit high-dimensional spaces.
The permutation loop builds a modality-shuffled null distribution so we can report p-values (or FDR-adjusted thresholds) and avoid over-interpreting noise.
This statistical check is lightweight enough for “quick tests” while still respecting site/ancestry confounds.

Pitfalls - Never fit CCA on all data. - Keep the same permutation protocol across folds for comparability. - Keep n_components ≤ min(d_gene, d_brain) and prefer n_components ≪ N for stability (or use SCCA/regularized CCA when (N\ll p)).

Optional next step: Stage‑2 prediction from CCA/SCCA embeddings

Use the canonical variates (U) and/or (V) (or their concatenation) as features for downstream prediction (e.g., MDD).
Reuse the exact same folds as Stage‑1 so results are comparable to your gene-only / brain-only baselines.
Reference implementation: gene-brain-CCA/scripts/stage2_predict.py (LogReg/MLP on gene-only vs brain-only vs joint variates).