🧬 Genetics Foundation Models

DNA sequence foundation models for genomic representation learning

This section documents the genetics foundation models used to extract gene-level embeddings from raw genomic sequences (DNA/RNA) for downstream integration with brain imaging, behavioral phenotypes, and clinical outcomes.


📋 Overview

All genetics FMs documented here:

  • Operate on nucleotide sequences (A, C, G, T) rather than pre-computed variant calls or SNP arrays
  • Support gene-level embeddings via forward/reverse-complement (RC) averaging and pooling strategies
  • Enable interpretability through attribution methods like LOGO ΔAUC
  • Are pretrained on large genomic corpora (human reference genomes, multi-species datasets, or RefSeq)

🎯 Model Registry

| Model | Architecture | Key Feature | Integration Role | Documentation |
|:------|:-------------|:------------|:-----------------|:--------------|
| Caduceus | Mamba (BiMamba) + RC-equivariance | Strand-robust, efficient long-context | Primary gene encoder for UK Biobank WES | Code Walkthrough |
| DNABERT-2 | BERT (multi-species) | BPE tokenization, cross-species transfer | Alternative gene encoder; comparison baseline | Code Walkthrough |
| Evo 2 | StripedHyena (1M context) | Ultra-long-range dependencies | Exploratory; regulatory region capture | Code Walkthrough |
| GENERator | Generative 6-mer LM | Generative modeling, sequence design | Reference for generative vs discriminative | Code Walkthrough |
| HyenaDNA | Hyena implicit convolutions (1M context) | Single-nucleotide, ultra-long genomic modeling | Conceptual long-context genomics reference | Code Walkthrough |

🔄 Usage Workflow

# 1. Extract gene sequences from hg38 reference (GENCODE annotations)
# 2. Tokenize with model-specific vocabulary (6-mer, BPE, or single-nucleotide)
# 3. Load pretrained checkpoint (Caduceus, DNABERT-2, Evo2, etc.)
# 4. Forward pass → per-position embeddings
# 5. Verify RC equivariance (optional but recommended):
#    embed(seq) ≈ embed(reverse_complement(seq))
# 6. Mean pool over gene → gene-level vector
# 7. Concatenate gene set → subject genotype embedding
# 8. Log: gene_list, reference_version, embedding_strategy_id
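
Step 8 of the workflow could be recorded along the following lines. This is a minimal sketch using only the standard library; the function name `embedding_run_record`, the added hash field, and the example strategy identifier are assumptions for illustration, not part of the project's actual logging code.

```python
import hashlib
import json


def embedding_run_record(gene_list, reference_version, embedding_strategy_id):
    """Build a JSON-serializable provenance record for one embedding run."""
    # Gene order is preserved on purpose: it fixes the layout of the
    # concatenated subject-level vector, so the hash is order-sensitive.
    genes = list(gene_list)
    digest = hashlib.sha256("\n".join(genes).encode("utf-8")).hexdigest()
    return {
        "gene_list": genes,
        "gene_list_sha256": digest,
        "reference_version": reference_version,
        "embedding_strategy_id": embedding_strategy_id,
    }


record = embedding_run_record(
    ["BDNF", "APOE", "COMT"],                      # illustrative gene symbols
    reference_version="hg38",
    embedding_strategy_id="caduceus_rc_mean_v1",   # hypothetical identifier
)
print(json.dumps(record, indent=2))
```

Storing a hash next to the list makes it cheap to check that two runs used the same genes in the same order before comparing their embeddings.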

Detailed steps:

  1. Extract sequences from reference genome (hg38) for target genes
  2. Tokenize using model-specific vocabularies (k-mers, BPE, or single-nucleotide)
  3. Embed forward and reverse-complement sequences
  4. Pool to gene-level representation (mean/CLS depending on model)
  5. Project to 512-D for cross-modal alignment with brain embeddings
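
Steps 2 and 4 above can be sketched in plain Python. The helpers below are illustrative assumptions, not any model's real tokenizer: whether a given model uses overlapping or non-overlapping k-mers, and how it treats a trailing fragment, depends on the checkpoint's own tokenizer. The per-token vectors stand in for a real forward pass.

```python
def tokenize_single_nucleotide(seq):
    """One token per base."""
    return list(seq.upper())


def tokenize_kmers(seq, k=6):
    """Non-overlapping k-mers; a trailing fragment shorter than k is kept as-is."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def mean_pool(position_embeddings):
    """Average per-position vectors (a list of equal-length lists) into one vector."""
    n = len(position_embeddings)
    dim = len(position_embeddings[0])
    return [sum(vec[d] for vec in position_embeddings) / n for d in range(dim)]


tokens = tokenize_kmers("ACGTACGTACGTAC")   # ['ACGTAC', 'GTACGT', 'AC']

# Stand-in for model output: one 2-D vector per token (hypothetical values).
fake_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
gene_vector = mean_pool(fake_embeddings)    # [1.0, 1.0]
```

With a real model, padding and special tokens would normally be masked out before pooling so they do not dilute the gene-level vector.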

🔑 Key Considerations

RC-equivariance

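Double-stranded DNA can be read from either strand, so a sequence and its reverse complement describe the same locus. Models like Caduceus combine bidirectional Mamba (BiMamba) blocks with built-in RC-equivariance to avoid strand bias. For non-equivariant models, manually average the forward and RC embeddings.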
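
Manual averaging for a non-equivariant model might look like the sketch below. `toy_embed` is a hypothetical strand-sensitive embedder used only to make the example runnable; in practice `embed` would wrap a pretrained model's forward pass followed by pooling.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def rc_averaged_embedding(embed, seq):
    """Average the pooled embeddings of seq and its reverse complement.

    `embed` is any callable mapping a DNA string to a fixed-length vector.
    """
    fwd = embed(seq)
    rev = embed(reverse_complement(seq))
    return [(f + r) / 2.0 for f, r in zip(fwd, rev)]


def toy_embed(seq):
    """Hypothetical strand-sensitive embedder: fractions of A, C, G, T."""
    return [seq.count(base) / len(seq) for base in "ACGT"]


seq = "AACCCGT"
a = rc_averaged_embedding(toy_embed, seq)
b = rc_averaged_embedding(toy_embed, reverse_complement(seq))
assert a == b  # the averaged embedding no longer depends on the strand
```

Averaging after pooling gives a strand-invariant gene vector. Averaging per-position embeddings instead would require reversing the RC output along the sequence axis first.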

Variant handling

Foundation models operate on reference sequences by default. To incorporate subject-specific variants:

  • Patch reference with VCF alleles
  • Re-embed variant sequences
  • Compare ΔAUC between reference and variant embeddings (exploratory)
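
The patching step can be sketched as follows. The function is a simplified assumption: it takes variants as `(pos, ref, alt)` tuples with 1-based positions as in VCF, applies one allele per site, and assumes the variants do not overlap. Reading the VCF, phasing, and heterozygous genotypes are out of scope here.

```python
def patch_reference(ref_seq, variants, region_start):
    """Apply variants to a reference slice and return the patched sequence.

    ref_seq      -- reference bases for the region (e.g. one gene)
    variants     -- iterable of (pos, ref, alt), 1-based positions as in VCF
    region_start -- 1-based genomic coordinate of ref_seq[0]
    """
    seq = ref_seq
    # Apply right-to-left so indels do not shift the offsets of edits
    # that still have to be applied further upstream.
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        offset = pos - region_start
        if offset < 0 or offset + len(ref) > len(ref_seq):
            raise ValueError(f"variant at {pos} falls outside the region")
        if seq[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        seq = seq[:offset] + alt + seq[offset + len(ref):]
    return seq


reference = "ACGTACGTAC"       # toy region starting at coordinate 101
variants = [
    (103, "G", "T"),           # SNV
    (106, "CG", "C"),          # 1-bp deletion
]
patched = patch_reference(reference, variants, region_start=101)
# patched == "ACTTACTAC"
```

The REF check catches coordinate or reference-build mismatches (for example, hg19 variants applied to an hg38 sequence) before any embedding is computed.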

Attribution

Use LOGO (Leave-One-Gene-Out) ΔAUC to assess which genes contribute most to downstream prediction tasks (e.g., MDD risk, cognitive scores). See Yoon et al. BioKDD 2025 for protocol details.
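
The general idea of leave-one-gene-out attribution can be illustrated with a small pure-Python sketch. It is a generic reading of LOGO ΔAUC, not the protocol from the cited paper, which may differ in details such as retraining per left-out gene or cross-validation. `sum_scores` is a toy stand-in for a trained predictor, and each gene is reduced to one scalar feature per subject for brevity.

```python
def roc_auc(labels, scores):
    """AUC as the probability a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def logo_delta_auc(gene_features, labels, score_fn):
    """Full-model AUC minus the AUC obtained with each gene left out.

    gene_features -- dict: gene name -> list of per-subject feature values
    score_fn      -- callable taking such a dict, returning per-subject scores
    """
    full_auc = roc_auc(labels, score_fn(gene_features))
    deltas = {}
    for gene in gene_features:
        reduced = {g: v for g, v in gene_features.items() if g != gene}
        deltas[gene] = full_auc - roc_auc(labels, score_fn(reduced))
    return deltas


def sum_scores(features):
    """Toy stand-in for a trained predictor: sum the features per subject."""
    return [sum(vals) for vals in zip(*features.values())]


labels = [1, 1, 0, 0]
gene_features = {
    "GENE_A": [0.9, 0.8, 0.1, 0.2],   # informative
    "GENE_B": [0.5, 0.5, 0.5, 0.5],   # constant, carries no signal
}
deltas = logo_delta_auc(gene_features, labels, sum_scores)
# deltas == {"GENE_A": 0.5, "GENE_B": 0.0}
```

A large positive ΔAUC means performance drops when the gene is removed. Values near zero suggest the gene adds little beyond the remaining genes.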


🔗 Integration Targets

Genetics embeddings are integrated with:

  • sMRI IDPs (structural phenotypes) via CCA, late fusion, or contrastive alignment
  • fMRI embeddings (e.g., BrainLM, Brain-JEPA) for gene–brain–behaviour triangulation
  • Behavioral phenotypes (cognitive scores, psychiatric diagnoses) via multimodal prediction

Learn more:

  • Integration Strategy - Fusion protocols
  • Modality Features: Genomics - Preprocessing specs


📦 Source Repositories

**All genetics FM source code is tracked in** `external_repos/`:

| Model | GitHub Repository | Local Clone |
|:------|:------------------|:------------|
| **Caduceus** | [kuleshov-group/caduceus](https://github.com/kuleshov-group/caduceus) | `external_repos/caduceus/` |
| **DNABERT-2** | [Zhihan1996/DNABERT2](https://github.com/Zhihan1996/DNABERT2) | `external_repos/dnabert2/` |
| **Evo 2** | [ArcInstitute/evo2](https://github.com/ArcInstitute/evo2) | `external_repos/evo2/` |
| **GENERator** | [GenerTeam/GENERator](https://github.com/GenerTeam/GENERator) | `external_repos/generator/` |
| **HyenaDNA** | [HazyResearch/hyena-dna](https://github.com/HazyResearch/hyena-dna) | `external_repos/hyena/` |

**Each model has three interconnected resources:**

- **Code Walkthrough** → Step-by-step implementation guide
- **YAML Model Card** → Structured metadata and specs
- **Integration Recipe** → Embedding extraction and fusion protocols

🚀 Next Steps

  • ✅ Validate gene embedding reproducibility across cohorts (UK Biobank WES, Cha Hospital panel sequencing)
  • ✅ Benchmark LOGO ΔAUC stability under different embedding projection dimensions (256, 512, 1024)
  • 🔬 Explore regulatory region embeddings (enhancers, promoters) with long-context models like Evo 2