# Partial Correlations

## Overview

Partial correlation analysis controls for confounding variables when evaluating relationships between model representations and outcomes. This is essential in neurogenomic studies, where age, sex, site, and other covariates can introduce spurious correlations.
## Background

Partial correlation measures the relationship between two variables while controlling for one or more confounding variables.

Given:

- X: feature of interest (e.g., a brain embedding dimension)
- Y: outcome (e.g., a cognitive score)
- Z: confounders (e.g., age, sex, scanner site)

the partial correlation between X and Y controlling for Z is the Pearson correlation between the residuals of:

1. X regressed on Z
2. Y regressed on Z
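For a single confounder, this residual-based definition reduces to the standard closed form in terms of the pairwise Pearson correlations:

```latex
r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{\left(1 - r_{XZ}^{2}\right)\left(1 - r_{YZ}^{2}\right)}}
```

With several confounders, the regression-residual formulation used throughout this protocol is the more practical route.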
## Why Partial Correlations Matter

### Common Confounders in Neurogenomics

#### Neuroimaging
- Age: Affects brain structure and function
- Sex: Systematic differences in brain anatomy
- Scanner site: Multi-site studies have acquisition differences
- Head motion: Correlates with many clinical variables
- Total intracranial volume (TIV): Affects structural measurements
#### Genomics
- Batch effects: Technical variation between sequencing runs
- Cell composition: Proportion of cell types in bulk data
- Sequencing depth: Coverage differences across samples
- Population stratification: Ancestry-related genetic variation
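Scanner site and sequencing batch are categorical, so they should enter the confound matrix as dummy (one-hot) variables rather than raw integer codes. A minimal sketch with pandas (the column names and data are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 12

# Illustrative covariates: age is continuous, site is categorical
covars = pd.DataFrame({
    "age": rng.normal(45, 10, n),
    "site": ["siteA", "siteB", "siteC"] * 4,
})

# One-hot encode site, dropping one level to avoid collinearity
# with the regression intercept
Z = pd.get_dummies(covars, columns=["site"], drop_first=True, dtype=float)
print(Z.columns.tolist())  # ['age', 'site_siteB', 'site_siteC']
```

Dropping one level per categorical variable avoids collinearity with the intercept of the residualizing regressions.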
## Protocol

### 1. Basic Partial Correlation
```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression


def partial_correlation(X, Y, Z):
    """
    Compute the partial correlation between X and Y controlling for Z.

    Args:
        X: (n_samples,) or (n_samples, 1)
        Y: (n_samples,) or (n_samples, 1)
        Z: (n_samples, n_confounds)

    Returns:
        r: Partial correlation coefficient
        p: P-value
    """
    # Coerce to arrays first, then reshape to 2D columns if needed
    X = np.asarray(X)
    Y = np.asarray(Y)
    Z = np.asarray(Z)
    X = X.reshape(-1, 1) if X.ndim == 1 else X
    Y = Y.reshape(-1, 1) if Y.ndim == 1 else Y
    Z = Z.reshape(-1, 1) if Z.ndim == 1 else Z

    # Regress out confounds from X
    lr_x = LinearRegression()
    lr_x.fit(Z, X)
    X_resid = X - lr_x.predict(Z)

    # Regress out confounds from Y
    lr_y = LinearRegression()
    lr_y.fit(Z, Y)
    Y_resid = Y - lr_y.predict(Z)

    # Correlation of residuals
    r, p = pearsonr(X_resid.ravel(), Y_resid.ravel())
    return r, p


# Example usage
from sklearn.datasets import make_regression

# Simulated data
n_samples = 200
X, Y = make_regression(n_samples=n_samples, n_features=1, noise=10, random_state=42)
X = X.ravel()

# Confounders (age, sex), seeded for reproducibility
rng = np.random.default_rng(0)
age = rng.standard_normal(n_samples)
sex = rng.integers(0, 2, n_samples)
Z = np.column_stack([age, sex])

# Compute partial correlation
r_partial, p_partial = partial_correlation(X, Y, Z)
print(f"Partial correlation: r={r_partial:.3f}, p={p_partial:.4f}")

# Compare to raw correlation (without controlling)
r_raw, p_raw = pearsonr(X, Y)
print(f"Raw correlation: r={r_raw:.3f}, p={p_raw:.4f}")
```
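As a sanity check, for a single confounder the residual method should agree with the closed-form expression r_XY·Z = (r_XY − r_XZ r_YZ) / √((1 − r_XZ²)(1 − r_YZ²)). A self-contained verification on synthetic data (NumPy/SciPy only):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 500
z = rng.standard_normal(n)
x = 0.7 * z + rng.standard_normal(n)   # x and y are both driven by z
y = 0.5 * z + rng.standard_normal(n)

# Residual method: regress x and y on z (with intercept), correlate residuals
Zd = np.column_stack([np.ones(n), z])
x_res = x - Zd @ np.linalg.lstsq(Zd, x, rcond=None)[0]
y_res = y - Zd @ np.linalg.lstsq(Zd, y, rcond=None)[0]
r_resid, _ = pearsonr(x_res, y_res)

# Closed form from the three pairwise correlations
r_xy, _ = pearsonr(x, y)
r_xz, _ = pearsonr(x, z)
r_yz, _ = pearsonr(y, z)
r_formula = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"residual method: {r_resid:.6f}, closed form: {r_formula:.6f}")
```

Because x and y are related only through z here, both estimates should be near zero.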
### 2. Multiple Partial Correlations
When testing many features (e.g., all embedding dimensions):
```python
from statsmodels.stats.multitest import multipletests


def partial_correlation_matrix(X, Y, Z):
    """
    Compute partial correlations for multiple features.

    Args:
        X: (n_samples, n_features) - features to test
        Y: (n_samples,) - outcome
        Z: (n_samples, n_confounds) - confounders

    Returns:
        correlations: (n_features,) - partial correlation coefficients
        p_values: (n_features,) - uncorrected p-values
        p_corrected: (n_features,) - FDR-corrected p-values
    """
    n_features = X.shape[1]
    correlations = np.zeros(n_features)
    p_values = np.zeros(n_features)

    for i in range(n_features):
        r, p = partial_correlation(X[:, i], Y, Z)
        correlations[i] = r
        p_values[i] = p

    # FDR correction (Benjamini-Hochberg)
    _, p_corrected, _, _ = multipletests(p_values, method='fdr_bh')
    return correlations, p_values, p_corrected
```
```python
# Example: test all embedding dimensions
# (`model`, `brain_data`, `load_cognitive_scores`, and `site_indicator`
# are placeholders for your own model and data-loading code)
embeddings = model.encode(brain_data)      # shape: (n_samples, embedding_dim)
cognitive_score = load_cognitive_scores()  # shape: (n_samples,)
confounds = np.column_stack([age, sex, site_indicator])

corrs, p_raw, p_fdr = partial_correlation_matrix(
    embeddings,
    cognitive_score,
    confounds,
)

# Find significant dimensions (FDR < 0.05)
sig_dims = np.where(p_fdr < 0.05)[0]
print(f"Significant dimensions: {sig_dims}")
print(f"Correlations: {corrs[sig_dims]}")
```
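When the feature count is large (e.g., genome-wide tests), the per-feature loop in `partial_correlation_matrix` becomes slow. A sketch of an equivalent vectorized version that residualizes all features in one least-squares solve; note that the t-test degrees of freedom drop by one per confound:

```python
import numpy as np
from scipy import stats


def partial_correlation_matrix_fast(X, Y, Z):
    """Vectorized partial correlation of each column of X with Y, given Z.

    X: (n, p) features; Y: (n,) outcome; Z: (n, k) confounds.
    Returns (r, p_values), each of shape (p,).
    """
    n, k = Z.shape
    Zd = np.column_stack([np.ones(n), Z])           # add intercept
    beta_x, *_ = np.linalg.lstsq(Zd, X, rcond=None)
    beta_y, *_ = np.linalg.lstsq(Zd, Y, rcond=None)
    Xr = X - Zd @ beta_x                            # residualized features
    yr = Y - Zd @ beta_y                            # residualized outcome
    Xr = Xr - Xr.mean(axis=0)                       # numerical safety
    yr = yr - yr.mean()

    # Column-wise Pearson correlation of residuals
    r = (Xr * yr[:, None]).sum(axis=0) / (
        np.linalg.norm(Xr, axis=0) * np.linalg.norm(yr)
    )

    # t-test on r; degrees of freedom lose one per confound
    df = n - 2 - k
    t = r * np.sqrt(df / (1 - r ** 2))
    p_values = 2 * stats.t.sf(np.abs(t), df)
    return r, p_values


# Quick demo on synthetic confounded data
rng = np.random.default_rng(1)
Zc = rng.standard_normal((300, 2))
Xf = Zc @ rng.standard_normal((2, 5)) + rng.standard_normal((300, 5))
yo = Zc @ np.array([0.4, -0.2]) + rng.standard_normal(300)
r, p = partial_correlation_matrix_fast(Xf, yo, Zc)
print(r.round(3), p.round(3))
```

The coefficients match the loop version exactly; only the p-values differ slightly, because this version uses the correct reduced degrees of freedom.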
### 3. Partial Correlation with Standardization

Correlation coefficients are scale-free, so z-scoring does not change the partial correlation itself; standardizing is mainly useful when you also want to report the underlying regression coefficients on a comparable scale:
```python
from sklearn.preprocessing import StandardScaler


def partial_correlation_standardized(X, Y, Z):
    """Partial correlation with z-scored variables."""
    X_std = StandardScaler().fit_transform(np.asarray(X).reshape(-1, 1)).ravel()
    Y_std = StandardScaler().fit_transform(np.asarray(Y).reshape(-1, 1)).ravel()
    Z_std = StandardScaler().fit_transform(Z)
    return partial_correlation(X_std, Y_std, Z_std)
```
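Since partial correlation is invariant to linear rescaling, standardization should leave the coefficient unchanged up to floating-point error. A quick self-contained check (residualization re-implemented inline with NumPy):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
n = 400
Z = rng.standard_normal((n, 2))
x = Z @ np.array([0.5, -0.3]) + rng.standard_normal(n)
y = Z @ np.array([0.2, 0.4]) + 0.3 * x + rng.standard_normal(n)


def pcorr(x, y, Z):
    """Residual-based partial correlation (intercept included)."""
    Zd = np.column_stack([np.ones(len(x)), Z])
    xr = x - Zd @ np.linalg.lstsq(Zd, x, rcond=None)[0]
    yr = y - Zd @ np.linalg.lstsq(Zd, y, rcond=None)[0]
    return pearsonr(xr, yr)[0]


zscore = lambda a: (a - a.mean(axis=0)) / a.std(axis=0)
r_orig = pcorr(x, y, Z)
r_std = pcorr(zscore(x), zscore(y), zscore(Z))
print(f"original scale: {r_orig:.6f}, z-scored: {r_std:.6f}")
```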
### 4. Using the pingouin Library
For more advanced partial correlation analysis:
```python
import pandas as pd
import pingouin as pg

# Build a DataFrame from the variables defined above
df = pd.DataFrame({
    'brain_feature': X,
    'cognitive_score': Y,
    'age': age,
    'sex': sex,
})

# Partial correlation
result = pg.partial_corr(
    data=df,
    x='brain_feature',
    y='cognitive_score',
    covar=['age', 'sex'],
    method='pearson',
)
print(result)
# Output columns include: n, r, CI95%, p-val
```
## Domain-Specific Applications

### Neuroimaging Example: fMRI Connectivity and Behavior
```python
# Load data (the loader functions are placeholders for your own I/O)
connectivity = load_fmri_connectivity()  # (n_subjects, n_connections)
behavior = load_behavior_score()         # (n_subjects,)
age = load_age()                         # (n_subjects,)
sex = load_sex()                         # (n_subjects,)
site = load_site()                       # (n_subjects,)
motion = load_head_motion()              # (n_subjects,)

# Confound matrix; dummy-code site first if there are more than two sites
Z = np.column_stack([age, sex, site, motion])

# Test each connection
n_connections = connectivity.shape[1]
results = []
for i in range(n_connections):
    r, p = partial_correlation(connectivity[:, i], behavior, Z)
    results.append({'connection_id': i, 'r': r, 'p': p})

results_df = pd.DataFrame(results)

# FDR correction
_, results_df['p_fdr'], _, _ = multipletests(
    results_df['p'],
    method='fdr_bh',
)

# Significant connections
sig_connections = results_df[results_df['p_fdr'] < 0.05]
print(f"Found {len(sig_connections)} significant connections")
```
### Genomics Example: Gene Expression and Disease
```python
# Load single-cell data (loader functions and gene_names are placeholders)
gene_expression = load_gene_expression()  # (n_cells, n_genes)
disease_score = load_disease_phenotype()  # (n_cells,)

# Confounders; dummy-code batch if there are more than two batches
batch = load_batch_id()                   # (n_cells,)
sequencing_depth = load_total_counts()    # (n_cells,)
cell_cycle = load_cell_cycle_score()      # (n_cells,)
Z_genomics = np.column_stack([batch, sequencing_depth, cell_cycle])

# Find disease-associated genes (controlling for technical factors)
corrs, p_raw, p_fdr = partial_correlation_matrix(
    gene_expression,
    disease_score,
    Z_genomics,
)

# Top disease-associated genes by absolute partial correlation
top_genes_idx = np.argsort(np.abs(corrs))[-20:]
print(f"Top disease-associated genes: {gene_names[top_genes_idx]}")
```
## Visualization

### 1. Comparison Plot: Raw vs. Partial Correlations
```python
import matplotlib.pyplot as plt

# Compute both raw and partial correlations for each feature
# (here X is a 2D feature matrix of shape (n_samples, n_features))
raw_corrs = []
partial_corrs = []
for i in range(X.shape[1]):
    r_raw, _ = pearsonr(X[:, i], Y)
    r_partial, _ = partial_correlation(X[:, i], Y, Z)
    raw_corrs.append(r_raw)
    partial_corrs.append(r_partial)

raw_corrs = np.array(raw_corrs)
partial_corrs = np.array(partial_corrs)

# Scatter plot
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(raw_corrs, partial_corrs, alpha=0.6)
ax.plot([-1, 1], [-1, 1], 'k--', label='Identity')
ax.set_xlabel('Raw Correlation')
ax.set_ylabel('Partial Correlation\n(controlling for confounds)')
ax.set_title('Effect of Confound Correction')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('raw_vs_partial_correlation.png', dpi=150)
```
### 2. Manhattan Plot for Genome-Wide Analysis
```python
def manhattan_plot(p_values, chromosome_positions=None):
    """Create a Manhattan plot for partial-correlation p-values."""
    fig, ax = plt.subplots(figsize=(14, 4))

    # -log10(p-value) for the y-axis
    neg_log_p = -np.log10(p_values)

    # Alternate colors by chromosome (if integer chromosome IDs are provided)
    if chromosome_positions is not None:
        colors = ['blue', 'orange']
        for chrom in np.unique(chromosome_positions):
            mask = chromosome_positions == chrom
            color = colors[chrom % 2]
            ax.scatter(np.where(mask)[0], neg_log_p[mask],
                       c=color, s=5, alpha=0.7)
    else:
        ax.scatter(range(len(p_values)), neg_log_p, s=5, alpha=0.7)

    # Significance threshold lines
    ax.axhline(-np.log10(0.05), color='red', linestyle='--',
               label='p=0.05')
    ax.axhline(-np.log10(0.05 / len(p_values)), color='green',
               linestyle='--', label='Bonferroni')

    ax.set_xlabel('Feature Index')
    ax.set_ylabel('-log₁₀(p-value)')
    ax.set_title('Partial Correlation Significance')
    ax.legend()
    plt.tight_layout()
    return fig
```
## Interpretation Guidelines

### When to Use Partial Correlations
✅ Use when:

- Known confounders exist (age, sex, site)
- The study pools data across multiple sites
- You are evaluating specific hypotheses while controlling for nuisance variables

❌ Don't use when:

- There are no clear confounders
- The confounders are part of the scientific question
- The control variables are colliders (conditioning on a collider introduces, rather than removes, bias)
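The collider caveat is easy to demonstrate by simulation: if X and Y independently cause C, then "controlling" for C manufactures a spurious association between otherwise independent variables. A self-contained sketch (NumPy/SciPy only):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
y = rng.standard_normal(n)                 # independent of x by construction
c = x + y + 0.5 * rng.standard_normal(n)   # collider: caused by both x and y


def residualize(v, z):
    """Residuals of v after regressing on z (with intercept)."""
    Zd = np.column_stack([np.ones(len(v)), z])
    return v - Zd @ np.linalg.lstsq(Zd, v, rcond=None)[0]


r_raw, _ = pearsonr(x, y)
r_partial, _ = pearsonr(residualize(x, c), residualize(y, c))
print(f"raw r = {r_raw:.3f}; 'controlling' for the collider: r = {r_partial:.3f}")
```

The partial correlation comes out strongly negative even though x and y are independent by construction.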
Effect Size Interpretation
| |r| | Interpretation | |------|----------------| | < 0.1 | Negligible | | 0.1 - 0.3 | Small | | 0.3 - 0.5 | Moderate | | > 0.5 | Large |
### Statistical Power

Approximate sample size for 80% power (two-sided test):
| Effect Size (r) | n (α=0.05) |
|---|---|
| 0.1 | ~780 |
| 0.2 | ~195 |
| 0.3 | ~85 |
| 0.5 | ~30 |
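These sample sizes follow from the standard Fisher z-approximation, n ≈ ((z₁₋α/₂ + z₁₋β) / arctanh(r))² + 3. A short sketch that reproduces the table to within rounding; for partial correlations, add roughly one subject per covariate controlled:

```python
import numpy as np
from scipy.stats import norm


def n_for_power(r, alpha=0.05, power=0.80, n_covariates=0):
    """Approximate n to detect correlation r with a two-sided test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n = ((z_alpha + z_beta) / np.arctanh(r)) ** 2 + 3
    return int(np.ceil(n)) + n_covariates   # extra df for each covariate


for r in (0.1, 0.2, 0.3, 0.5):
    print(f"r = {r}: n ≈ {n_for_power(r)}")
```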
## ITU AI4H Alignment
This protocol aligns with:
- DEL3 Section 5.3: Confound control in validation
- DEL10.8 Section 4.2: Covariate adjustment in neurology benchmarks
- DEL0.1: Statistical terminology standards
## Best Practices
- Pre-register confounders: Decide which confounders to control before analysis
- Report both raw and partial: Show effect of confound correction
- Visualize confound effects: Plot raw vs. partial correlations
- Use appropriate corrections: FDR or Bonferroni for multiple comparisons
- Check assumptions: Linearity, homoscedasticity, normality
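When the linearity or normality checks fail, one common fallback is a rank-based (Spearman-type) partial correlation: rank-transform X, Y, and Z, then residualize as usual. A self-contained sketch (the `partial_spearman` helper is illustrative, not part of the protocol above):

```python
import numpy as np
from scipy.stats import pearsonr, rankdata


def partial_spearman(x, y, Z):
    """Spearman-type partial correlation: rank-transform, then residualize."""
    x_r = rankdata(x)
    y_r = rankdata(y)
    Z_r = np.column_stack([rankdata(Z[:, j]) for j in range(Z.shape[1])])

    Zd = np.column_stack([np.ones(len(x_r)), Z_r])
    xr = x_r - Zd @ np.linalg.lstsq(Zd, x_r, rcond=None)[0]
    yr = y_r - Zd @ np.linalg.lstsq(Zd, y_r, rcond=None)[0]
    return pearsonr(xr, yr)


# Example: a monotone but non-linear effect survives the rank transform
rng = np.random.default_rng(3)
z = rng.standard_normal((500, 1))
x = z[:, 0] + rng.standard_normal(500)
y = np.exp(x) + z[:, 0]            # non-linear in x, confounded by z
r, p = partial_spearman(x, y, z)
print(f"partial Spearman r = {r:.3f}")
```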
## References
- Fisher, R. A. (1924). The distribution of the partial correlation coefficient. Metron, 3, 329-332.
- Baba, K., et al. (2004). Partial correlation and conditional correlation as measures of conditional independence. Australian & New Zealand Journal of Statistics, 46(4), 657-664.
- Vallat, R. (2018). Pingouin: statistics in Python. JOSS, 3(31), 1026.