
Ensemble Integration (EI)

Source: Li et al. (2022), Bioinformatics Advances
Type: Late fusion integration pattern
Best for: Heterogeneous feature spaces, small-to-medium datasets, interpretable multimodal fusion


Problem It Solves

Challenge: How to integrate heterogeneous biomedical data modalities (genetics, brain imaging, clinical data) that have very different structures, scales, and semantics without losing modality-specific signals.

Solution: Ensemble Integration (EI) treats each modality as a first-class citizen by:

1. Training specialized models per modality with appropriate inductive biases
2. Combining modality predictions via heterogeneous ensembles (stacking, selection, averaging)
3. Providing interpretable feature rankings across all modalities

Why traditional approaches fail:

- Early integration (concatenate raw features) → loses modality-specific structure
- Intermediate integration (shared embeddings) → emphasizes agreement, suppresses modality-unique signals
- Single-model approaches → can't adapt architecture to each modality's semantics


Core Mechanics

1. Modality-Specific Model Training

Train diverse base classifiers per modality using algorithms matched to data structure:

# Assumed imports for these snippets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Genetics: sequence/graph data
gene_models = {
    'lr': LogisticRegression(C=0.01),
    'rf': RandomForestClassifier(n_estimators=100),
    'svm': SVC(kernel='rbf', probability=True),  # probability=True enables predict_proba
    'xgb': XGBClassifier(max_depth=5)
}

# Brain: spatial/temporal features
brain_models = {
    'lr': LogisticRegression(C=0.1),
    'gbdt': LGBMClassifier(num_leaves=31),
    'knn': KNeighborsClassifier(n_neighbors=5)
}

Key insight: Different modalities benefit from different inductive biases (trees for genetics, neighbors for imaging).

2. Late Fusion Strategies

Simple averaging:

ensemble_pred = (gene_pred_proba + brain_pred_proba) / 2

Ensemble selection (Li et al. method):

- Iteratively add models that improve validation performance
- Greedy forward selection with replacement
- Automatically weights models by contribution (a minimal sketch follows below)
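A minimal sketch of that greedy loop (Caruana-style forward selection with replacement), assuming each candidate model's out-of-fold probabilities are precomputed; oof_probs, y_val, and the AUROC objective are illustrative choices, not the paper's exact procedure:

import numpy as np
from sklearn.metrics import roc_auc_score

def greedy_ensemble_selection(oof_probs, y_val, n_iters=50):
    """Greedy forward selection with replacement over candidate models.

    oof_probs: dict mapping model name -> out-of-fold probability vector
    y_val: labels aligned with those vectors
    Returns implicit model weights (selection frequency).
    """
    bag = []                 # selected model names; repeats act as weights
    ensemble_sum = None      # running sum of selected probability vectors
    best_score = -np.inf

    for _ in range(n_iters):
        best_candidate, best_candidate_score = None, best_score
        for name, probs in oof_probs.items():
            trial_sum = probs if ensemble_sum is None else ensemble_sum + probs
            score = roc_auc_score(y_val, trial_sum / (len(bag) + 1))
            if score > best_candidate_score:
                best_candidate, best_candidate_score = name, score
        if best_candidate is None:   # no candidate improves validation AUROC
            break
        bag.append(best_candidate)
        probs = oof_probs[best_candidate]
        ensemble_sum = probs if ensemble_sum is None else ensemble_sum + probs
        best_score = best_candidate_score

    # Implicit weights: how often each model was picked
    names, counts = np.unique(bag, return_counts=True)
    return dict(zip(names, counts / counts.sum()))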

Stacking with meta-learner:

# Stack predictions from all base models (one column per base model;
# works for 1-D probability vectors and 2-D per-modality prediction blocks)
meta_features = np.column_stack([gene_preds, brain_preds])
meta_model = LogisticRegression()
meta_model.fit(meta_features, y_train)

⚠️ Critical: Stacking must be fold-proper to avoid leakage: train the meta-learner only on out-of-fold base predictions.

3. Interpretability via Feature Ranking

Cross-modality feature importance:

1. For each base model, extract feature importances (coefficients, SHAP values, permutation importance)
2. Aggregate via ensemble weights
3. Rank features across all modalities (see Step 4 of the implementation guide below for code)

Result: Identify which genes AND which brain regions drive predictions, weighted by ensemble contribution.


When to Use

Use Ensemble Integration when:

- Modalities have heterogeneous structures (sequences, images, graphs, tables)
- Dataset size is small-to-medium (< 10k samples)
- Missing data is common (not all subjects have all modalities)
- Interpretability is critical for clinical translation
- Baseline comparisons are needed (per-modality vs. fusion performance)
- Computing resources are limited (no end-to-end training needed)

Particularly well-suited for:

- Gene-brain-behavior prediction in neuro-omics
- Multi-site cohort integration with batch effects
- Clinical decision support requiring feature-level explanations
- Research settings exploring which modalities contribute most


When to Defer

⚠️ Defer to more advanced methods when:

- Modalities have strong cross-modal dependencies (e.g., paired image-text)
- Large datasets are available (> 100k samples), enabling end-to-end joint training
- Real-time deployment is required (ensemble overhead too high)
- Shared representations are needed (e.g., cross-modal retrieval tasks)

⚠️ Consider alternatives:

- Two-tower contrastive if you need an aligned embedding space for retrieval
- Early fusion if modalities naturally align (e.g., multi-view data from the same subject)
- Mixture-of-Experts if you need learned routing by modality


Adoption in Our Neuro-Omics Pipeline

Current Implementation

Per-modality models:

- Genetics: LR + LightGBM on Caduceus/DNABERT-2 embeddings (512-D)
- Brain: LR + LightGBM on BrainLM/SwiFT embeddings (512-D)
- Fusion: Ensemble selection or simple stacking with LR meta-learner

Workflow:

# 1. Extract embeddings per modality
python extract_gene_embeddings.py --model caduceus --out gene_emb.npy
python extract_brain_embeddings.py --model brainlm --out brain_emb.npy

# 2. Train per-modality models
python train_per_modality.py --modality gene --models lr,gbdt
python train_per_modality.py --modality brain --models lr,gbdt

# 3. Ensemble integration
python ensemble_fusion.py --strategy stacking --meta_model lr

Evaluation metrics:

- Per-modality AUROC/AUPRC
- Fusion AUROC/AUPRC
- DeLong test: Fusion vs. max(Gene, Brain) (a bootstrap stand-in is sketched below)
- Feature importance rankings
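Neither scikit-learn nor SciPy ships a DeLong test; until one is wired in (e.g., via R's pROC), a paired bootstrap comparison of AUROCs is a common stand-in. A minimal sketch, with all names illustrative:

import numpy as np
from sklearn.metrics import roc_auc_score

def paired_bootstrap_auc_test(y_true, fusion_probs, single_probs,
                              n_boot=2000, seed=42):
    """One-sided paired bootstrap test of AUROC(fusion) > AUROC(best single).

    Resamples subjects with replacement so both models are always compared
    on the same bootstrap sample. Returns the observed delta and a p-value.
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    observed = (roc_auc_score(y_true, fusion_probs)
                - roc_auc_score(y_true, single_probs))

    deltas = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if len(np.unique(y_true[idx])) < 2:  # AUROC needs both classes present
            continue
        deltas.append(roc_auc_score(y_true[idx], fusion_probs[idx])
                      - roc_auc_score(y_true[idx], single_probs[idx]))

    deltas = np.asarray(deltas)
    p_value = np.mean(deltas <= 0)  # fraction of resamples where fusion doesn't win
    return observed, p_value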

Integration with ARPA-H BOM

EI provides the baseline late fusion in our escalation strategy:

1. Ensemble Integration (baseline) ✓ Current
    ↓ If fusion wins (p < 0.05)
2. Two-tower contrastive
    ↓ If gains plateau
3. EI stacking with hub tokens
    ↓ Last resort
4. Full early fusion (TAPE-style)

Why start with EI:

- Establishes whether fusion helps at all before building complex architectures
- Provides an interpretable baseline for regulatory/clinical validation
- Enables per-modality ablations to identify which data types matter
- Computationally cheap to iterate on cohort definitions and confounds


Caveats and Best Practices

⚠️ Leakage Prevention

Problem: If the meta-learner sees in-fold predictions, it overfits to base-model noise.

Solution: Always use out-of-fold predictions for stacking:

# WRONG: Train meta-learner on in-fold training predictions
meta_model.fit(base_preds_train, y_train)  # Leakage!

# RIGHT: Train meta-learner on out-of-fold predictions
from sklearn.model_selection import cross_val_predict

oof_preds = cross_val_predict(base_model, X_train, y_train, cv=5,
                              method='predict_proba')[:, 1]
meta_model.fit(oof_preds.reshape(-1, 1), y_train)

⚠️ Confound Control

Problem: Batch effects (site, scanner) can dominate modality signals.

Solution: Residualize before training base models:

# Per modality, residualize confounds (residualize() is sketched below)
gene_emb_residual = residualize(gene_emb, confounds=['age', 'sex', 'site', 'PCs'])
brain_emb_residual = residualize(brain_emb, confounds=['age', 'sex', 'site', 'FD'])

⚠️ Meta-Learner Simplicity

Problem: Complex meta-learners (deep NNs) can overfit ensemble predictions.

Solution: Use simple meta-learners (LR, Ridge) unless >10k samples:

# Prefer: regularized logistic regression
meta_model = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)

# Avoid unless N is large: deep meta-learners overfit stacked predictions
meta_model = MLPClassifier(hidden_layer_sizes=(64, 32))  # Risk of overfitting

⚠️ Missing Modality Handling

Problem: Not all subjects have both gene and brain data.

Solution: Train modality-specific fallback models:

# Pseudocode: has_both_modalities() etc. stand in for availability checks
# (a self-contained sketch follows below)
if has_both_modalities(subject):
    pred = ensemble_model.predict(gene_emb, brain_emb)
elif has_gene_only(subject):
    pred = gene_model.predict(gene_emb)
elif has_brain_only(subject):
    pred = brain_model.predict(brain_emb)
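A self-contained version of that dispatch, assuming absent modalities are stored as all-NaN embedding rows and that the three models expose the scikit-learn predict_proba API; all names are illustrative:

import numpy as np

def predict_with_fallback(gene_emb, brain_emb, gene_model, brain_model, meta_model):
    """Per-subject prediction that falls back to single-modality models.

    gene_emb / brain_emb: (n_dims,) arrays, all-NaN when the modality is absent.
    Returns a probability for the positive class.
    """
    has_gene = not np.isnan(gene_emb).all()
    has_brain = not np.isnan(brain_emb).all()

    if has_gene and has_brain:
        # Stack base-model probabilities as meta-features, as in the stacking recipe
        meta_x = np.array([[gene_model.predict_proba(gene_emb[None, :])[0, 1],
                            brain_model.predict_proba(brain_emb[None, :])[0, 1]]])
        return meta_model.predict_proba(meta_x)[0, 1]
    if has_gene:
        return gene_model.predict_proba(gene_emb[None, :])[0, 1]
    if has_brain:
        return brain_model.predict_proba(brain_emb[None, :])[0, 1]
    raise ValueError("Subject has no usable modality")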


Practical Implementation Guide

Step 1: Choose Base Models

Diversity is key — use algorithms with different inductive biases:

| Modality            | Recommended Models       | Rationale                      |
|---------------------|--------------------------|--------------------------------|
| Genetics (sequence) | LR, XGBoost, SVM-RBF     | Linear + trees + kernels       |
| Brain (imaging)     | LR, LightGBM, k-NN       | Linear + trees + locality      |
| Behavior (tabular)  | LR, RandomForest, Ridge  | Simple + robust to correlation |

Step 2: Train with Proper CV

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Per-modality training with OOF predictions
for modality in ['gene', 'brain']:
    X, model = get_modality_data(modality)  # pseudocode: embeddings + fresh base model
    oof_preds = np.zeros(len(y))            # preserve subject order, not fold order
    for train_idx, val_idx in skf.split(X, y):
        X_train, y_train = X[train_idx], y[train_idx]
        X_val = X[val_idx]

        model.fit(X_train, y_train)
        # Write into the val positions so OOF preds stay aligned with y
        oof_preds[val_idx] = model.predict_proba(X_val)[:, 1]

    # Save OOF predictions for the meta-learner
    save_oof_predictions(modality, oof_preds)

Step 3: Meta-Learner Training

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Load OOF predictions (aligned with y_train, one vector per modality)
gene_oof = load_oof_predictions('gene')
brain_oof = load_oof_predictions('brain')

# Stack into meta-features
meta_X = np.column_stack([gene_oof, brain_oof])

# Train meta-learner
meta_model = LogisticRegression(C=1.0, max_iter=1000)
meta_model.fit(meta_X, y_train)

# Evaluate on the held-out test set (base models refit on the full training set)
test_preds = np.column_stack([
    gene_model.predict_proba(gene_test)[:, 1],
    brain_model.predict_proba(brain_test)[:, 1]
])
test_auc = roc_auc_score(y_test, meta_model.predict_proba(test_preds)[:, 1])

Step 4: Feature Interpretation

import numpy as np
import shap

# Compute SHAP values for each base model (TreeExplainer assumes tree models,
# e.g. the LightGBM/XGBoost bases; use shap.LinearExplainer for the LR bases)
gene_shap = shap.TreeExplainer(gene_model).shap_values(gene_emb)
brain_shap = shap.TreeExplainer(brain_model).shap_values(brain_emb)

# Weight by ensemble contribution (meta-learner coefficients)
gene_weight = np.abs(meta_model.coef_[0][0])
brain_weight = np.abs(meta_model.coef_[0][1])

# Aggregate feature importance (mean |SHAP| so positive and negative
# contributions don't cancel)
weighted_gene_importance = np.abs(gene_shap).mean(axis=0) * gene_weight
weighted_brain_importance = np.abs(brain_shap).mean(axis=0) * brain_weight

# Rank across all features
all_importance = np.concatenate([weighted_gene_importance, weighted_brain_importance])
top_features = np.argsort(all_importance)[::-1][:20]

Reference Materials

Primary paper:

- Ensemble Integration (Li 2022) — Full paper summary

Related KB resources:

- Integration Strategy — Overall fusion approach
- Design Patterns — Pattern 1: Late Fusion
- CCA + Permutation Recipe — Statistical testing
- Prediction Baselines — Comparison protocol

Integration cards:

- Oncology Multimodal Review — Broader fusion taxonomy

Model documentation:

- Genetics Models — Gene embedding extraction
- Brain Models — Brain embedding extraction

Experiment configs:

- configs/experiments/02_prediction_baselines.yaml — EI implementation template


Next Steps in Our Pipeline

  1. Baseline EI implementation — LR + GBDT per modality with stacking meta-learner
  2. Per-modality ablations — Which modality contributes most? (Gene vs. Brain vs. Fusion)
  3. Feature interpretation — Identify top predictive genes and brain regions
  4. Cohort extension — Test EI on Cha Hospital developmental cohort
  5. Escalation decision — If fusion wins significantly, move to two-tower contrastive

Success criteria for escalation:

- Fusion AUROC > max(Gene, Brain) with p < 0.05 (DeLong test)
- Gains observed across multiple phenotypes (cognitive, diagnostic)
- Stable across cross-validation folds (not driven by outliers)