Genomics Features¶
Gene FM embedding (genetics_gene_fm_pca512_v1)¶
- Models: Caduceus, Evo 2, HyenaDNA (conceptual), GENERaTOR, DNABERT-2 (see kb/model_cards/).
- RC hygiene:
- RC-equivariant encoders (e.g., Caduceus): verify equivariance with spot checks but no averaging required.
- Non-equivariant encoders (e.g., DNABERT-2, Evo 2, HyenaDNA-style): run on forward and reverse-complement sequences; average token embeddings before pooling, or apply RCCR-style consistency regularization when fine-tuning (see the reverse-complement consistency paper).
- Tokenization: maintain deterministic k-mer/BPE framing; avoid random masking for inference exports.
- Pooling hierarchy:
- Token → exon (mean or CLS).
- Exon → gene (mean, or attention if pathway-weighted).
- Gene set → subject vector (concatenate curated genes; align order with manifest).
- Covariates: residualize the embeddings against age, sex, ancestry PCs 1–10, and sequencing batch.
- Dimensionality: PCA → 512 (fit on train fold).
- Retrieve the latest recipe with: python scripts/manage_kb.py ops strategy genetics_gene_fm_pca512_v1
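The recipe above can be sketched end to end in NumPy. This is a minimal illustration under stated assumptions, not the project's implementation: `embed_fn` is a hypothetical stand-in for a real encoder that returns one pooled vector per sequence, the toy encoder is plain base composition, residualization is ordinary least squares fit on the train fold, and PCA is done via SVD. Averaging the pooled forward and reverse-complement vectors matches "average token embeddings, then mean-pool" only when both strands yield the same number of tokens; with BPE tokenizers that may not hold. A categorical covariate such as sequencing batch would need one-hot encoding before it enters the covariate matrix.

```python
import numpy as np

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string over A/C/G/T/N."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def rc_averaged(seq, embed_fn):
    """Mean of the pooled forward and reverse-complement embeddings."""
    return 0.5 * (embed_fn(seq) + embed_fn(reverse_complement(seq)))

def gene_vector(exon_seqs, embed_fn):
    """Exon -> gene: mean over RC-averaged exon vectors."""
    return np.mean([rc_averaged(s, embed_fn) for s in exon_seqs], axis=0)

def subject_vector(genes, manifest, embed_fn):
    """Gene set -> subject: concatenate gene vectors in manifest order."""
    return np.concatenate([gene_vector(genes[g], embed_fn) for g in manifest])

def fit_residualizer(C_train, X_train):
    """OLS of features on covariates plus intercept, fit on train rows only."""
    A = np.column_stack([np.ones(len(C_train)), C_train])
    beta, *_ = np.linalg.lstsq(A, X_train, rcond=None)
    return beta

def residualize(C, X, beta):
    """Subtract the covariate fit learned on the train fold."""
    A = np.column_stack([np.ones(len(C)), C])
    return X - A @ beta

def fit_pca(X_train, n_components=512):
    """PCA via SVD on train rows; component count is capped by the data rank."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    return mean, Vt[:k]

def apply_pca(X, mean, components):
    return (X - mean) @ components.T

# Toy encoder (assumption): base composition, which is not RC-equivariant.
def toy_embed(seq):
    return np.array([seq.count(b) for b in "ACGT"], dtype=float) / len(seq)

genes = {"GENE_A": ["AACG", "TTGA"], "GENE_B": ["GGCAT"]}  # hypothetical exons
manifest = ["GENE_A", "GENE_B"]
x_subject = subject_vector(genes, manifest, toy_embed)

# Synthetic subjects-by-features matrix with three numeric covariates.
rng = np.random.default_rng(0)
C = rng.normal(size=(60, 3))
X = C @ rng.normal(size=(3, 8)) + rng.normal(size=(60, 8))
train, test = np.arange(40), np.arange(40, 60)

beta = fit_residualizer(C[train], X[train])
R_train = residualize(C[train], X[train], beta)
R_test = residualize(C[test], X[test], beta)
mean, comps = fit_pca(R_train, n_components=512)
Z_test = apply_pca(R_test, mean, comps)
```

Fitting both the residualizer and the PCA on the train fold and only applying them to held-out rows keeps test information out of the learned transforms.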
UKB DNABERT‑2 exports (Yoon/GENNIElab; 111 genes × 768‑d)¶
If you are using the Yoon/GENNIElab shared UKB DNABERT‑2 drop (Google Drive), the raw asset is a directory tree:
- iids.npy, labels.npy, covariates_age.npy, covariates_sex.npy (all length N=28,932 in your current validated drop)
- Per gene: <EMBED_ROOT>/<GENE>/embeddings_{k}_layer_last.npy with 49 chunks and F=768
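A loader for this layout might look like the sketch below. It rests on assumptions the drop description does not confirm: each chunk file is a 2-D array of shape (n_chunk, F), the chunks partition subjects in the same order as iids.npy, and chunks should be concatenated in ascending numeric order of k. The function names are hypothetical. The demo writes small fake chunks to a temporary directory so the block runs without the real data.

```python
import re
import tempfile
from pathlib import Path

import numpy as np

_CHUNK_RE = re.compile(r"embeddings_(\d+)_layer_last\.npy$")

def load_gene_embeddings(embed_root, gene):
    """Concatenate one gene's chunk files along the subject axis."""
    gene_dir = Path(embed_root) / gene
    chunks = []
    for path in gene_dir.glob("embeddings_*_layer_last.npy"):
        match = _CHUNK_RE.search(path.name)
        if match:
            chunks.append((int(match.group(1)), path))
    if not chunks:
        raise FileNotFoundError(f"no embedding chunks found in {gene_dir}")
    # Sort numerically so chunk 10 follows chunk 9 rather than chunk 1.
    chunks.sort(key=lambda item: item[0])
    return np.concatenate([np.load(p) for _, p in chunks], axis=0)

def load_x3d(embed_root, genes, n_expected=None):
    """Stack per-gene matrices into X_3d with shape (N, G, F)."""
    X = np.stack([load_gene_embeddings(embed_root, g) for g in genes], axis=1)
    if n_expected is not None and X.shape[0] != n_expected:
        raise ValueError(f"expected {n_expected} subjects, got {X.shape[0]}")
    return X

# Demo on fake data: 11 one-row chunks per gene, each filled with its index k.
with tempfile.TemporaryDirectory() as root:
    for gene in ("GENE_A", "GENE_B"):
        (Path(root) / gene).mkdir()
        for k in range(11):
            np.save(Path(root) / gene / f"embeddings_{k}_layer_last.npy",
                    np.full((1, 768), float(k)))
    X_demo = load_x3d(root, ["GENE_A", "GENE_B"], n_expected=11)
```

Passing `n_expected=len(iids)` is a cheap guard against a missing or duplicated chunk, since a wrong subject count would otherwise silently misalign embeddings with labels and covariates.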
Recommended subject‑level views (all derived from X_3d ∈ R^{N×G×F}):
- Per-gene scalar reduction (current Stage‑1 baseline in gene-brain-CCA): reduce over embedding dims within each gene (e.g., mean over F=768) → X_ng ∈ R^{N×G} (here G=111).
- If you run PCA with a target ≥ G, it caps at 111 components: k = min(512, G) = 111.
- Wide concat: X_wide ∈ R^{N×(G·F)} then PCA → 256–512 for CCA stability
- Gene-mean pool: X_mean ∈ R^{N×F} (global baseline)
- Gene-max / top‑k mean: X_max ∈ R^{N×F} or X_topk ∈ R^{N×F} (localized baseline)
Tip: standardize each gene block before pooling across genes so variance scale doesn’t dominate max.
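The views listed above reduce to a few array operations on X_3d. The sketch below is one possible reading, with assumptions flagged: "standardize each gene block" is interpreted as one mean and one standard deviation per gene, estimated on train subjects only, and "top-k mean" is interpreted as the mean of the k largest values across genes for each embedding dimension. The function names, `top_k=5`, and the small demo shapes are illustrative choices.

```python
import numpy as np

def standardize_gene_blocks(X_3d, train_idx, eps=1e-8):
    """Z-score each gene block with statistics from the train subjects."""
    mu = X_3d[train_idx].mean(axis=(0, 2), keepdims=True)  # shape (1, G, 1)
    sd = X_3d[train_idx].std(axis=(0, 2), keepdims=True)
    return (X_3d - mu) / (sd + eps)

def subject_views(X_3d, top_k=5):
    """Derive subject-level views from X_3d of shape (N, G, F)."""
    N, G, F = X_3d.shape
    k = min(top_k, G)
    # Ascending sort over the gene axis; the last k entries are the largest.
    sorted_over_genes = np.sort(X_3d, axis=1)
    return {
        "X_ng": X_3d.mean(axis=2),               # (N, G) per-gene scalar
        "X_wide": X_3d.reshape(N, G * F),        # (N, G*F) wide concat
        "X_mean": X_3d.mean(axis=1),             # (N, F) gene-mean pool
        "X_max": X_3d.max(axis=1),               # (N, F) gene-max pool
        "X_topk": sorted_over_genes[:, -k:, :].mean(axis=1),  # (N, F)
    }

# Small synthetic stand-in for the real N=28,932, G=111, F=768 tensor.
rng = np.random.default_rng(0)
X_3d = rng.normal(size=(30, 111, 16))
train_idx = np.arange(20)

X_std = standardize_gene_blocks(X_3d, train_idx)
views = subject_views(X_std, top_k=5)

# PCA on X_ng cannot return more components than it has columns.
n_pca = min(512, views["X_ng"].shape[1])
```

Standardizing before the max and top-k pools matters because those pools select by magnitude, so a gene with a larger raw variance would otherwise be picked more often regardless of signal.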
Tabular genetics features (genetics_pgs_20traits)¶
- 20 curated UKB PGS + ancestry PCs.
- Preprocessing: mean-impute missing PGS, z-score each feature inside the train fold.
- Intended for tabular prediction baselines (including TabPFN) and for fusion with sMRI ROI tables.
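The preprocessing step can be sketched as a fit/transform pair. The source states that z-scoring happens inside the train fold; taking the imputation means from the train fold as well is an assumption made here to avoid leakage. The function names are hypothetical, and a column that is entirely missing in the train fold is not handled.

```python
import numpy as np

def fit_pgs_scaler(X_train):
    """Learn per-feature imputation means and scales from train rows only."""
    mean = np.nanmean(X_train, axis=0)
    filled = np.where(np.isnan(X_train), mean, X_train)
    std = filled.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return mean, std

def transform_pgs(X, mean, std):
    """Mean-impute with train means, then z-score with train statistics."""
    filled = np.where(np.isnan(X), mean, X)
    return (filled - mean) / std

# Synthetic 20-trait PGS table with roughly 5% missing entries.
rng = np.random.default_rng(0)
pgs = rng.normal(loc=2.0, scale=3.0, size=(200, 20))
pgs[rng.random(pgs.shape) < 0.05] = np.nan
train, test = np.arange(150), np.arange(150, 200)

mean, std = fit_pgs_scaler(pgs[train])
Z_train = transform_pgs(pgs[train], mean, std)
Z_test = transform_pgs(pgs[test], mean, std)
```

Imputed entries land exactly at zero after scaling, which is the train-fold mean of each feature.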
Attribution¶
- Leave-one-gene-out (LOGO) ΔAUC with Wilcoxon across folds + FDR control remains the recommended approach once embeddings feed prediction models.
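A minimal sketch of the statistics behind LOGO attribution follows, assuming per-fold AUCs have already been computed for the full model and for each model retrained or re-evaluated with one gene removed. ΔAUC is taken here as full minus leave-one-out, so positive values mean the gene helps. The test uses `scipy.stats.wilcoxon` (signed-rank, two-sided), and FDR control is Benjamini-Hochberg implemented by hand; the choice of Benjamini-Hochberg, the function names, and the synthetic AUCs are assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q

def logo_attribution(auc_full, auc_logo):
    """auc_full: (folds,); auc_logo: (folds, genes) with that gene left out."""
    delta = auc_full[:, None] - auc_logo
    pvals = np.array([wilcoxon(delta[:, g]).pvalue
                      for g in range(delta.shape[1])])
    return delta.mean(axis=0), pvals, benjamini_hochberg(pvals)

# Synthetic example: 10 folds, 4 genes, only gene 0 carries signal.
rng = np.random.default_rng(0)
auc_full = 0.70 + rng.normal(scale=0.01, size=10)
effect = np.array([0.03, 0.0, 0.0, 0.0])
auc_logo = auc_full[:, None] - effect + rng.normal(scale=0.005, size=(10, 4))

mean_delta, pvals, qvals = logo_attribution(auc_full, auc_logo)
```

One practical caveat: with only 5 folds the smallest attainable two-sided signed-rank p-value is 2/2^5 = 0.0625, so no gene can pass a 0.05 threshold; repeated cross-validation or more folds is needed for the test to have any power.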
Long-context genomic FMs (regulatory windows)¶
- For exploratory regulatory-region embeddings (enhancers, promoters, long-range elements):
- Prefer Evo 2 / StripedHyena-2–style models for 100kb–1Mbp contexts (see the Evo 2 paper summary and the systems note on multi-hybrid LMs).
- HyenaDNA provides architectural guidance for single-nucleotide, 1M-token contexts and motivates careful use of sequence-length warm-up when experimenting with long genetic windows.
- Start with shorter windows (e.g., ±100kb around TSS) before escalating to full 1Mbp context for cost reasons.
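Window escalation around a TSS is mostly coordinate arithmetic. The sketch below assumes 0-based, half-open coordinates and clips windows at chromosome ends; the function names and the specific flank schedule (±100kb, ±250kb, ±500kb, the last giving roughly 1Mbp in total) are illustrative assumptions.

```python
def tss_window(tss, flank, chrom_len):
    """Return a 0-based half-open [start, end) window of +/- flank around tss."""
    if not 0 <= tss < chrom_len:
        raise ValueError("tss must lie within the chromosome")
    start = max(0, tss - flank)
    end = min(chrom_len, tss + flank + 1)
    return start, end

def escalating_windows(tss, chrom_len, flanks=(100_000, 250_000, 500_000)):
    """Windows from cheapest to most expensive; wider ones are clipped."""
    return [tss_window(tss, flank, chrom_len) for flank in flanks]

# A TSS close to the chromosome start: wider windows are clipped at 0.
windows = escalating_windows(tss=150_000, chrom_len=1_000_000)
lengths = [end - start for start, end in windows]
```

Because clipping shortens windows near chromosome ends, sequence length varies across genes, which is worth tracking when batching inputs for a long-context model.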