Skip to content

Prediction Baselines

Inputs - Gene, Brain, and Fusion = [Gene | Brain], all 512-D after preprocessing (embedding_strategies.genetics_gene_fm_pca512_v1, embedding_strategies.smri_free_surfer_pca512_v1, embedding_strategies.fusion_concat_gene_brain_1024_v1). - Tabular mode: raw ROI tables (smri_free_surfer_raw_176) and genetics PGS (genetics_pgs_20traits) for TabPFN / LR / GBDT baselines.

Context in integration plan

  • This recipe is the primary late-fusion baseline: compare Gene-only, Brain-only, and simple Fusion ([Gene | Brain]) features using shallow models.
  • Escalate beyond this (e.g., contrastive two-tower, EI stacking, hub tokens) only if Fusion clearly and consistently outperforms both single-modality baselines.
  • Keep this runbook as the reference when deciding whether more complex integration architectures are justified.

Models - Logistic Regression - penalty=L2, C∈{0.5,1,2}, solver=saga/liblinear, class_weight=balanced, max_iter=5,000. - LightGBM - num_leaves=31, learning_rate=0.05, n_estimators=1,000 with early stopping, scale_pos_weight ≈ N_neg/N_pos. - CatBoost - depth=6–8, learning_rate=0.05, iterations=2,000 with early stopping, loss_function=Logloss, auto class weights. - TabPFN (kb/model_cards/tabpfn.yaml) - Max 10k samples and 500 features per forward pass; chunk folds if N is larger. - Use for tabular baselines (raw ROI, PGS, ROI+PGS fusion) to benchmark whether representation learning beats a tabular FM.

Evaluation - Same CV folds across modalities. - Metrics: AUROC, AUPRC; report mean ± SD across folds. - Significance: DeLong or bootstrap for Fusion vs each single-modality on held-out predictions.

Outputs to save - Per-fold predictions and labels for later DeLong/bootstrap and calibration checks. - Embedding/harmonization IDs used to produce each feature set (copy from experiment config metadata).