🧬 Genetics Foundation Models

DNA sequence foundation models for genomic representation learning

This section documents the genetics foundation models used to extract gene-level embeddings from raw genomic sequences (DNA/RNA) for downstream integration with brain imaging, behavioral phenotypes, and clinical outcomes.

📋 Overview

All genetics foundation models (FMs) documented here:

- Operate on nucleotide sequences (A, C, G, T) rather than pre-computed variant calls or SNP arrays
- Support gene-level embeddings via forward/reverse-complement (RC) averaging and pooling strategies
- Enable interpretability through attribution methods like LOGO ΔAUC
- Are pretrained on large genomic corpora (human reference genomes, multi-species datasets, or RefSeq)

🎯 Model Registry

| Model | Architecture | Key Feature | Integration Role | Documentation |
|:------|:-------------|:------------|:-----------------|:--------------|
| Caduceus | Mamba (BiMamba) + RC-equivariance | Strand-robust, efficient long-context | Primary gene encoder for UK Biobank WES | Code Walkthrough |
| DNABERT-2 | BERT (multi-species) | BPE tokenization, cross-species transfer | Alternative gene encoder; comparison baseline | Code Walkthrough |
| Evo 2 | StripedHyena (1M context) | Ultra-long-range dependencies | Exploratory; regulatory region capture | Code Walkthrough |
| GENERator | Generative 6-mer LM | Generative modeling, sequence design | Reference for generative vs discriminative | Code Walkthrough |
| HyenaDNA | Hyena implicit convolutions (1M context) | Single-nucleotide, ultra-long genomic modeling | Conceptual long-context genomics reference | Code Walkthrough |

🔄 Usage Workflow

# 1. Extract gene sequences from hg38 reference (GENCODE annotations)
# 2. Tokenize with model-specific vocabulary (6-mer, BPE, or single-nucleotide)
# 3. Load pretrained checkpoint (Caduceus, DNABERT-2, Evo2, etc.)
# 4. Forward pass → per-position embeddings
# 5. Verify RC equivariance (optional but recommended):
#     embed(seq) ≈ embed(reverse_complement(seq))
# 6. Mean pool over gene → gene-level vector
# 7. Concatenate gene set → subject genotype embedding
# 8. Log: gene_list, reference_version, embedding_strategy_id
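The logging in step 8 can be as simple as a small JSON manifest stored next to the embeddings. The sketch below is one hypothetical way to do it: the function name `build_manifest`, the extra `gene_list_sha256` field, and the example `embedding_strategy_id` value are assumptions for illustration, not part of this project's code.

```python
import hashlib
import json


def build_manifest(gene_list, reference_version, embedding_strategy_id):
    """Return a JSON-serializable record of how an embedding set was produced.

    Field names beyond the three listed in step 8 are illustrative assumptions.
    """
    genes = sorted(gene_list)  # sort so the hash does not depend on input order
    digest = hashlib.sha256("\n".join(genes).encode("utf-8")).hexdigest()
    return {
        "gene_list": genes,
        "gene_list_sha256": digest,
        "reference_version": reference_version,
        "embedding_strategy_id": embedding_strategy_id,
    }


manifest = build_manifest(
    gene_list=["BDNF", "APOE", "COMT"],
    reference_version="hg38",
    embedding_strategy_id="caduceus_rcavg_meanpool_v1",  # hypothetical ID
)
print(json.dumps(manifest, indent=2))
```

Hashing the sorted gene list gives a short fingerprint that makes it easy to spot two embedding sets built from different gene sets.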

Detailed steps:

1. Extract sequences from reference genome (hg38) for target genes
2. Tokenize using model-specific vocabularies (k-mers, BPE, or single-nucleotide)
3. Embed forward and reverse-complement sequences
4. Pool to gene-level representation (mean/CLS depending on model)
5. Project to 512-D for cross-modal alignment with brain embeddings
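The tokenization and pooling steps above can be illustrated with toy helpers. This is a minimal sketch in plain Python, not any model's real tokenizer or pooling code: each pretrained model ships its own tokenizer, and all function names here are hypothetical. The non-overlapping k-mer split and the handling of a short trailing fragment are assumptions.

```python
def tokenize_single_nucleotide(seq):
    """One token per base (upper-cased)."""
    return list(seq.upper())


def tokenize_kmers(seq, k=6):
    """Split into non-overlapping k-mers.

    Assumption: a trailing fragment shorter than k is kept as its own token.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def mean_pool(position_embeddings):
    """Average equal-length per-position vectors into one gene-level vector."""
    n = len(position_embeddings)
    if n == 0:
        raise ValueError("cannot pool an empty sequence")
    dim = len(position_embeddings[0])
    return [sum(vec[d] for vec in position_embeddings) / n for d in range(dim)]


def concat_gene_vectors(gene_vectors, gene_order):
    """Concatenate gene-level vectors in a fixed, explicit gene order."""
    subject_vector = []
    for gene in gene_order:
        subject_vector.extend(gene_vectors[gene])
    return subject_vector


tokens = tokenize_kmers("ACGTACGTAC", k=6)
gene_vec = mean_pool([[1.0, 2.0], [3.0, 4.0]])
subject = concat_gene_vectors(
    {"BDNF": gene_vec, "APOE": [0.5, 0.5]},
    gene_order=["APOE", "BDNF"],
)
print(tokens, gene_vec, subject)
```

Passing the gene order explicitly keeps the concatenated subject vector comparable across subjects and cohorts.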

🔑 Key Considerations

RC-equivariance

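DNA is double-stranded, so a locus can be read from either strand, and a sequence and its reverse complement (RC) describe the same region. Models like Caduceus build RC-equivariance into their BiMamba blocks to avoid strand bias. For non-equivariant models, manually average forward and RC embeddings.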
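For a non-equivariant model, the manual averaging can be sketched as follows. `toy_embed` is a deliberately strand-sensitive stand-in (base fractions), not a real model; in practice `embed_fn` would wrap a pretrained encoder and return a pooled vector. All names are illustrative assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rc_average(embed_fn, seq):
    """Average the pooled embeddings of seq and its reverse complement."""
    forward = embed_fn(seq)
    reverse = embed_fn(reverse_complement(seq))
    return [(a + b) / 2.0 for a, b in zip(forward, reverse)]


def toy_embed(seq):
    """Stand-in encoder: fraction of A, C, G, T (changes under RC)."""
    seq = seq.upper()
    return [seq.count(base) / len(seq) for base in "ACGT"]


seq = "AACG"
averaged = rc_average(toy_embed, seq)
# The averaged embedding is the same whichever strand is passed in.
assert averaged == rc_average(toy_embed, reverse_complement(seq))
print(averaged)
```

This version averages pooled (gene-level) vectors. When averaging per-position embeddings instead, the RC output has to be reversed along the sequence axis first so that positions line up.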

Variant handling

Foundation models operate on reference sequences by default. To incorporate subject-specific variants:

1. Patch reference with VCF alleles
2. Re-embed variant sequences
3. Compare downstream performance (ΔAUC) between reference and variant embeddings (exploratory)
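The patching step can be sketched for the simplest case, single-nucleotide substitutions. This is an illustration under stated assumptions rather than this project's pipeline: variants arrive as already-parsed `(pos, ref, alt)` tuples with 1-based positions as in VCF, and indels, phasing, and multi-allelic sites are out of scope. The function name and arguments are hypothetical.

```python
def patch_reference(ref_seq, variants, region_start):
    """Apply single-nucleotide variants to a reference sequence.

    ref_seq:      reference sequence for the gene region
    variants:     iterable of (pos, ref, alt), pos is a 1-based genomic coordinate
    region_start: 1-based genomic coordinate of ref_seq[0]
    """
    bases = list(ref_seq)
    for pos, ref, alt in variants:
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError(f"only SNVs are handled in this sketch: {pos} {ref}>{alt}")
        idx = pos - region_start
        if not 0 <= idx < len(bases):
            raise ValueError(f"position {pos} lies outside the region")
        if bases[idx].upper() != ref.upper():
            # A mismatch usually means a wrong reference build or wrong coordinates.
            raise ValueError(f"REF mismatch at {pos}: expected {ref}, found {bases[idx]}")
        bases[idx] = alt
    return "".join(bases)


patched = patch_reference("ACGTAC", [(1002, "G", "A")], region_start=1000)
print(patched)
```

The REF check is worth keeping even in exploratory runs, since silent coordinate or build mismatches would otherwise produce plausible-looking but wrong variant sequences.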

Attribution

Use LOGO (Leave-One-Gene-Out) ΔAUC, the change in AUC when a single gene's embedding is withheld, to assess which genes contribute most to downstream prediction tasks (e.g., MDD risk, cognitive scores). See Yoon et al. BioKDD 2025 for protocol details.
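The leave-one-gene-out loop can be sketched in plain Python. This is a simplified illustration, not the protocol from the cited paper: it assumes a caller-supplied `score_fn` that turns a gene-to-embeddings mapping into per-subject scores, and it does not cover cross-validation or retraining choices. The gene names and `toy_score` are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def logo_delta_auc(gene_embeddings, labels, score_fn):
    """Return {gene: AUC(all genes) - AUC(all genes except this one)}."""
    if len(gene_embeddings) < 2:
        raise ValueError("need at least two genes to leave one out")
    full_auc = roc_auc(labels, score_fn(gene_embeddings))
    deltas = {}
    for gene in gene_embeddings:
        reduced = {g: v for g, v in gene_embeddings.items() if g != gene}
        deltas[gene] = full_auc - roc_auc(labels, score_fn(reduced))
    return deltas


def toy_score(embeddings):
    """Stand-in predictor: sum the first coordinate over the available genes."""
    n_subjects = len(next(iter(embeddings.values())))
    return [sum(vecs[i][0] for vecs in embeddings.values()) for i in range(n_subjects)]


labels = [0, 0, 1, 1]
embeddings = {
    "GENE_A": [[0.1], [0.2], [0.8], [0.9]],  # tracks the label
    "GENE_B": [[0.5], [0.5], [0.5], [0.5]],  # constant, uninformative
}
deltas = logo_delta_auc(embeddings, labels, toy_score)
print(deltas)
```

A large positive ΔAUC marks a gene whose removal hurts prediction; values near zero suggest the gene adds little beyond the others.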

🔗 Integration Targets

Genetics embeddings are integrated with:

- sMRI IDPs (structural phenotypes) via CCA, late fusion, or contrastive alignment
- fMRI embeddings (e.g., BrainLM, Brain-JEPA) for gene–brain–behavior triangulation
- Behavioral phenotypes (cognitive scores, psychiatric diagnoses) via multimodal prediction
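Of the fusion options listed above, late fusion is the simplest to illustrate: each modality gets its own predictor, and only the per-subject predictions are combined. The sketch below assumes each modality already yields a probability per subject; the weighting scheme, modality names, and function name are hypothetical and do not reflect the project's fusion protocol.

```python
def late_fusion(modality_probs, weights=None):
    """Weighted average of per-subject probabilities across modalities.

    modality_probs: dict of modality name -> list of probabilities (same subject order)
    weights:        optional dict of modality name -> non-negative weight
    """
    names = sorted(modality_probs)
    if not names:
        raise ValueError("at least one modality is required")
    n_subjects = len(modality_probs[names[0]])
    if any(len(modality_probs[m]) != n_subjects for m in names):
        raise ValueError("all modalities must cover the same subjects")
    if weights is None:
        weights = {m: 1.0 for m in names}  # default: unweighted mean
    total = sum(weights[m] for m in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(weights[m] * modality_probs[m][i] for m in names) / total
        for i in range(n_subjects)
    ]


fused = late_fusion({
    "genetics": [0.25, 0.75],
    "smri": [0.5, 0.5],
})
print(fused)
```

Because the modalities stay separate until the final step, a missing or low-quality modality can be down-weighted without retraining the others.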

Learn more:

- Integration Strategy - Fusion protocols
- Modality Features: Genomics - Preprocessing specs

📦 Source Repositories


**All genetics FM source code is tracked in** `external_repos/`:

| Model | GitHub Repository | Local Clone |
|:------|:------------------|:------------|
| **Caduceus** | [kuleshov-group/caduceus](https://github.com/kuleshov-group/caduceus) | `external_repos/caduceus/` |
| **DNABERT-2** | [Zhihan1996/DNABERT2](https://github.com/Zhihan1996/DNABERT2) | `external_repos/dnabert2/` |
| **Evo 2** | [ArcInstitute/evo2](https://github.com/ArcInstitute/evo2) | `external_repos/evo2/` |
| **GENERator** | [GenerTeam/GENERator](https://github.com/GenerTeam/GENERator) | `external_repos/generator/` |
| **HyenaDNA** | [HazyResearch/hyena-dna](https://github.com/HazyResearch/hyena-dna) | `external_repos/hyena/` |

**Each model has three interconnected resources:**

- **Code Walkthrough** → Step-by-step implementation guide
- **YAML Model Card** → Structured metadata and specs
- **Integration Recipe** → Embedding extraction and fusion protocols

🚀 Next Steps

✅ Validate gene embedding reproducibility across cohorts (UK Biobank WES, Cha Hospital panel sequencing)
✅ Benchmark LOGO ΔAUC stability under different embedding projection dimensions (256, 512, 1024)
🔬 Explore regulatory region embeddings (enhancers, promoters) with long-context models like Evo 2