Genomics Features

Gene FM embedding (`genetics_gene_fm_pca512_v1`)

Models: Caduceus, Evo 2, HyenaDNA (conceptual), GENERator, DNABERT-2 (see kb/model_cards/).
RC hygiene:
RC-equivariant encoders (e.g., Caduceus): verify equivariance with spot checks; no averaging is required.
Non-equivariant encoders (e.g., DNABERT-2, Evo 2, HyenaDNA-style): run on both the forward and reverse-complement sequences and average the token embeddings before pooling, or apply RCCR-style consistency regularization when fine-tuning (see the reverse-complement consistency paper).
Tokenization: maintain deterministic k-mer/BPE framing; avoid random masking for inference exports.
Pooling hierarchy:
Token → exon (mean or CLS).
Exon → gene (mean, or attention if pathway-weighted).
Gene set → subject vector (concatenate curated genes; align order with manifest).
Covariates: residualize age, sex, ancestry PCs 1–10, sequencing batch.
Dimensionality: PCA → 512 (fit on train fold).
Retrieve the latest recipe with `python scripts/manage_kb.py ops strategy genetics_gene_fm_pca512_v1`.
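The RC-hygiene step for non-equivariant encoders can be sketched as follows. This is a minimal illustration, not the recipe's implementation: `toy_encode` is a hypothetical stand-in for a real encoder, and the sketch assumes one token per nucleotide so that forward and reverse-complement tokens align position by position. With k-mer or BPE tokenizers that alignment does not hold in general and would need separate handling.

```python
import numpy as np

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical stand-in for a non-equivariant encoder (NOT a real model API):
# one fixed random vector per base plus a position-dependent offset.
_rng = np.random.default_rng(0)
_BASE_TABLE = {base: _rng.normal(size=8) for base in "ACGT"}


def toy_encode(seq: str) -> np.ndarray:
    """Return token embeddings of shape (len(seq), 8), one token per nucleotide."""
    pos = np.arange(len(seq))[:, None] / len(seq)
    return np.stack([_BASE_TABLE[base] for base in seq]) + pos


def rc_averaged_tokens(seq: str, encode=toy_encode) -> np.ndarray:
    """Average forward and reverse-complement token embeddings."""
    fwd = encode(seq)
    rev = encode(reverse_complement(seq))
    # Token i of the reverse-complement run covers position L-1-i of the
    # original sequence, so flip it before averaging.
    return 0.5 * (fwd + rev[::-1])


def pooled(seq: str, encode=toy_encode) -> np.ndarray:
    """Mean-pool the RC-averaged token embeddings."""
    return rc_averaged_tokens(seq, encode).mean(axis=0)


def rc_gap(seq: str, encode=toy_encode) -> float:
    """Spot check: max abs gap between pooled forward and pooled RC embeddings."""
    fwd = encode(seq).mean(axis=0)
    rev = encode(reverse_complement(seq)).mean(axis=0)
    return float(np.abs(fwd - rev).max())
```

For the toy encoder, `rc_gap` is non-zero on a typical sequence, while `pooled(seq)` and `pooled(reverse_complement(seq))` agree after averaging. The same `rc_gap` check could serve as the spot check for an encoder that is expected to be RC-invariant after pooling.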

UKB DNABERT‑2 exports (Yoon/GENNIElab; 111 genes × 768‑d)

If you are using the Yoon/GENNIElab shared UKB DNABERT‑2 drop (Google Drive), the raw asset is a directory tree:
- `iids.npy`, `labels.npy`, `covariates_age.npy`, `covariates_sex.npy` (all of length N=28,932 in your current validated drop)
- Per gene: `<EMBED_ROOT>/<GENE>/embeddings_{k}_layer_last.npy`, in 49 chunks with F=768
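A loader for this layout might look like the sketch below. It rests on assumptions the drop description does not confirm: that each chunk holds a contiguous block of subjects with shape (n_chunk, F), and that concatenating chunks in ascending numeric `k` reproduces the order of `iids.npy`. The function names are hypothetical, and the row order should be verified against `iids.npy` before relying on the result.

```python
from pathlib import Path
import re

import numpy as np

_CHUNK_RE = re.compile(r"embeddings_(\d+)_layer_last\.npy$")


def load_gene(embed_root, gene):
    """Concatenate one gene's chunk files into a single (N, F) array."""
    gene_dir = Path(embed_root) / gene
    chunks = []
    for path in gene_dir.iterdir():
        match = _CHUNK_RE.match(path.name)
        if match:
            chunks.append((int(match.group(1)), path))
    if not chunks:
        raise FileNotFoundError(f"no embedding chunks under {gene_dir}")
    # Sort numerically so chunk 10 follows chunk 9 rather than chunk 1.
    chunks.sort(key=lambda item: item[0])
    # ASSUMPTION: chunks split subjects along axis 0, in ascending k.
    return np.concatenate([np.load(path) for _, path in chunks], axis=0)


def load_x3d(embed_root, genes, n_subjects):
    """Stack per-gene arrays into X_3d of shape (N, G, F), in the given gene order."""
    blocks = []
    for gene in genes:
        x = load_gene(embed_root, gene)
        if x.shape[0] != n_subjects:
            raise ValueError(f"{gene}: expected {n_subjects} rows, got {x.shape[0]}")
        blocks.append(x)
    return np.stack(blocks, axis=1)
```

At N=28,932, G=111, and F=768 the full tensor holds about 2.5 billion values, so reducing each gene as it is loaded may be preferable to materializing X_3d in memory.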

Recommended subject‑level views (all derived from X_3d ∈ R^{N×G×F}):
- Per-gene scalar reduction (current Stage‑1 baseline in gene-brain-CCA): reduce over embedding dims within each gene (e.g., mean over F=768) → X_ng ∈ R^{N×G} (here G=111).
  - If you run PCA with a target ≥ G, it caps at 111 components: k = min(512, G) = 111.
- Wide concat: X_wide ∈ R^{N×(G·F)}, then PCA → 256–512 for CCA stability.
- Gene-mean pool: X_mean ∈ R^{N×F} (global baseline).
- Gene-max / top‑k mean: X_max ∈ R^{N×F} or X_topk ∈ R^{N×F} (localized baseline).

Tip: standardize each gene block before pooling across genes so that differences in variance scale do not dominate the max.
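The views above can be derived from X_3d roughly as follows. This is a sketch under stated assumptions: "standardize each gene block" is read here as one mean and one standard deviation per gene, computed over subjects and embedding dims, and `top_k=5` is an arbitrary illustrative value. In a cross-validated pipeline those statistics would be fit on the train fold only.

```python
import numpy as np


def standardize_gene_blocks(x3d, eps=1e-8):
    """Z-score each gene block using one mean/std per gene (an assumed reading)."""
    mu = x3d.mean(axis=(0, 2), keepdims=True)
    sd = x3d.std(axis=(0, 2), keepdims=True)
    return (x3d - mu) / (sd + eps)


def subject_views(x3d, top_k=5):
    """Build the subject-level views from X_3d of shape (N, G, F)."""
    n, g, f = x3d.shape
    z = standardize_gene_blocks(x3d)
    # Sort along the gene axis and keep the top_k largest values per dim.
    top = np.sort(z, axis=1)[:, -top_k:, :]
    return {
        "X_ng": x3d.mean(axis=2),         # (N, G): scalar per gene
        "X_wide": x3d.reshape(n, g * f),  # (N, G*F): gene-major concat
        "X_mean": z.mean(axis=1),         # (N, F): global baseline
        "X_max": z.max(axis=1),           # (N, F): localized baseline
        "X_topk": top.mean(axis=1),       # (N, F): mean of top_k genes per dim
    }


def pca_rank(target, n_samples, n_features):
    """PCA cannot return more components than min(n_samples, n_features)."""
    return min(target, n_samples, n_features)
```

For the per-gene scalar view, `pca_rank(512, 28932, 111)` gives 111, matching the cap noted above.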

Tabular genetics features (`genetics_pgs_20traits`)

20 curated UKB PGS + ancestry PCs.
Preprocessing: mean-impute missing PGS, z-score each feature inside the train fold.
Intended for tabular prediction baselines (including TabPFN) and for fusion with sMRI ROI tables.
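The preprocessing step can be sketched as a fit/apply pair so that test-fold rows never influence the statistics. The function names are hypothetical, and computing the standard deviation after imputation (rather than over observed values only) is one possible choice, not something the recipe specifies.

```python
import numpy as np


def fit_pgs_scaler(x_train):
    """Learn per-feature imputation means and scales from the train fold only."""
    mean = np.nanmean(x_train, axis=0)
    filled = np.where(np.isnan(x_train), mean, x_train)
    # ASSUMPTION: std is computed after mean-imputation.
    std = filled.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # guard against constant features
    return mean, std


def apply_pgs_scaler(x, mean, std):
    """Mean-impute with train-fold means, then z-score with train-fold stats."""
    filled = np.where(np.isnan(x), mean, x)
    return (filled - mean) / std
```

A typical use would call `fit_pgs_scaler` once per training fold and pass the returned `mean` and `std` to `apply_pgs_scaler` for both the train and held-out rows.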

Attribution

Leave-one-gene-out (LOGO) ΔAUC with Wilcoxon across folds + FDR control remains the recommended approach once embeddings feed prediction models.
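The statistical half of LOGO can be sketched as below, given a precomputed matrix of per-fold ΔAUC values (full-model AUC minus gene-removed AUC). Several choices here are assumptions: the test is taken to be the paired Wilcoxon signed-rank test, FDR control is taken to be Benjamini-Hochberg, and the p-value is computed by exact enumeration of sign patterns to keep the example NumPy-only. In practice `scipy.stats.wilcoxon` would be the usual choice.

```python
from itertools import product

import numpy as np


def signed_rank_p(deltas):
    """Exact two-sided Wilcoxon signed-rank p-value (enumerates 2^n sign patterns)."""
    d = np.asarray(deltas, dtype=float)
    d = d[d != 0]  # drop zero differences, the classic Wilcoxon convention
    n = d.size
    if n == 0:
        return 1.0
    a = np.abs(d)
    # Average ranks of |d|: count of smaller values + (count of ties + 1) / 2.
    ranks = np.array([(a < v).sum() + ((a == v).sum() + 1) / 2 for v in a])
    w_plus = ranks[d > 0].sum()
    center = ranks.sum() / 2
    null = np.array([np.dot(signs, ranks) for signs in product((0, 1), repeat=n)])
    return float(np.mean(np.abs(null - center) >= abs(w_plus - center) - 1e-12))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q


def logo_significance(delta_auc, alpha=0.05):
    """delta_auc: (n_genes, n_folds) array. Returns p-values, q-values, and a mask."""
    p = np.array([signed_rank_p(row) for row in delta_auc])
    q = benjamini_hochberg(p)
    return p, q, q <= alpha
```

With five folds the smallest attainable two-sided p-value from the exact test is 2/32 = 0.0625, so more folds or repeated cross-validation may be needed before any gene can pass a 0.05 threshold. The enumeration is only practical for small fold counts (roughly 20 or fewer).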

Long-context genomic FMs (regulatory windows)

For exploratory regulatory-region embeddings (enhancers, promoters, long-range elements):
Prefer Evo 2 / StripedHyena-2–style models for 100kb–1Mbp contexts (see the Evo 2 paper summary and the systems note on multi-hybrid LMs).
HyenaDNA provides architectural guidance for single-nucleotide, 1M-token contexts and motivates careful use of sequence-length warm-up when experimenting with long genetic windows.
Start with shorter windows (e.g., ±100kb around TSS) before escalating to full 1Mbp context for cost reasons.
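The window escalation can be expressed as simple coordinate arithmetic. The sketch below assumes 0-based, half-open coordinates and a hypothetical flank schedule (100kb, 250kb, 500kb). The final ±500kb step is one way to reach roughly 1Mbp of total context, not a value taken from the guidance above.

```python
def tss_window(tss, flank, chrom_length):
    """Return a 0-based half-open [start, end) window of +/- flank bp around tss,
    clipped to the chromosome bounds."""
    start = max(0, tss - flank)
    end = min(chrom_length, tss + flank + 1)
    return start, end


def escalating_windows(tss, chrom_length, flanks=(100_000, 250_000, 500_000)):
    """Windows for a shortest-first schedule (the flank values are illustrative)."""
    return [tss_window(tss, flank, chrom_length) for flank in flanks]
```

Clipping means windows near chromosome ends come out shorter than 2 × flank + 1, which is worth tracking if downstream pooling assumes a fixed length.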