Genetics Foundation Models
DNA sequence foundation models for genomic representation learning
This section documents the genetics foundation models used to extract gene-level embeddings from raw genomic sequences (DNA/RNA) for downstream integration with brain imaging, behavioral phenotypes, and clinical outcomes.
Overview
All genetics FMs documented here:
- Operate on nucleotide sequences (A, C, G, T) rather than pre-computed variant calls or SNP arrays
- Support gene-level embeddings via forward/reverse-complement (RC) averaging and pooling strategies
- Enable interpretability through attribution methods like LOGO ΔAUC
- Are pretrained on large genomic corpora (human reference genomes, multi-species datasets, or RefSeq)
Model Registry
| Model | Architecture | Key Feature | Integration Role | Documentation |
|---|---|---|---|---|
| Caduceus | Mamba (BiMamba) + RC-equivariance | Strand-robust, efficient long-context | Primary gene encoder for UK Biobank WES | Code Walkthrough |
| DNABERT-2 | BERT (multi-species) | BPE tokenization, cross-species transfer | Alternative gene encoder; comparison baseline | Code Walkthrough |
| Evo 2 | StripedHyena (1M context) | Ultra-long-range dependencies | Exploratory; regulatory region capture | Code Walkthrough |
| GENERator | Generative 6-mer LM | Generative modeling, sequence design | Reference for generative vs discriminative | Code Walkthrough |
| HyenaDNA | Hyena implicit convolutions (1M context) | Single-nucleotide, ultra-long genomic modeling | Conceptual long-context genomics reference | Code Walkthrough |
Usage Workflow
# 1. Extract gene sequences from the reference genome (hg38)
# 2. Tokenize with model-specific vocabulary (6-mer, BPE, or single-nucleotide)
# 3. Load pretrained checkpoint (Caduceus, DNABERT-2, Evo 2, etc.)
# 4. Forward pass → per-position embeddings
# 5. Verify RC equivariance (optional but recommended):
#    embed(seq) ≈ embed(reverse_complement(seq))
# 6. Mean pool over gene → gene-level vector
# 7. Concatenate gene set → subject genotype embedding
# 8. Log: gene_list, reference_version, embedding_strategy_id
Detailed steps:
- Extract sequences from reference genome (hg38) for target genes
- Tokenize using model-specific vocabularies (k-mers, BPE, or single-nucleotide)
- Embed forward and reverse-complement sequences
- Pool to gene-level representation (mean/CLS depending on model)
- Project to 512-D for cross-modal alignment with brain embeddings
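The pooling, concatenation, and projection steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the function names (`mean_pool`, `concat_genes`, `project`) are hypothetical, the per-position embeddings are toy values standing in for real model output, and the projection matrix is assumed to be supplied by the caller (in practice it would be learned and map to 512 dimensions).

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(position_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Average per-position embeddings (L positions x D dims) into one D-dim vector."""
    if not position_embeddings:
        raise ValueError("cannot pool an empty list of position embeddings")
    n = len(position_embeddings)
    dim = len(position_embeddings[0])
    return [sum(pos[d] for pos in position_embeddings) / n for d in range(dim)]


def concat_genes(gene_vectors: Dict[str, Vector], gene_list: Sequence[str]) -> Vector:
    """Concatenate gene vectors in the order given by gene_list.

    The order must be fixed and logged, otherwise feature columns are not
    comparable across subjects or runs.
    """
    subject_vector: Vector = []
    for gene in gene_list:
        subject_vector.extend(gene_vectors[gene])
    return subject_vector


def project(vector: Sequence[float], weights: Sequence[Sequence[float]]) -> Vector:
    """Linear projection; weights has shape (out_dim, in_dim)."""
    for row in weights:
        if len(row) != len(vector):
            raise ValueError("projection row length must match input dimension")
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


# Toy usage: two genes, 2-D embeddings, projected down to 2 dimensions.
gene_vectors = {
    "GENE_A": mean_pool([[1.0, 2.0], [3.0, 4.0]]),  # hypothetical model output
    "GENE_B": mean_pool([[0.0, 0.0], [2.0, 2.0]]),
}
subject = concat_genes(gene_vectors, ["GENE_A", "GENE_B"])
projected = project(subject, [[1, 0, 0, 0], [0, 0, 0, 1]])  # placeholder weights
```

Keeping `gene_list` as an explicit argument mirrors the logging step in the workflow: the same list that is logged is the one that determines column order.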
Key Considerations
RC-equivariance
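Double-stranded DNA can be read from either strand, so a sequence and its reverse complement (RC) describe the same locus. Models like Caduceus build RC-equivariance into their BiMamba architecture to avoid strand bias. For non-equivariant models, manually average the forward and RC embeddings.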
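A minimal sketch of manual forward/RC averaging follows. The `embed` argument is a hypothetical callable that returns a pooled (sequence-level) vector; `toy_embed` is a deliberately strand-sensitive stand-in, not a real model. Note that this averages pooled vectors: for per-position embeddings, the RC output would first have to be reversed so that positions line up.

```python
from typing import Callable, List

# Complement table covering upper/lower case and the ambiguity code N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rc_averaged_embedding(seq: str, embed: Callable[[str], List[float]]) -> List[float]:
    """Average the pooled embeddings of a sequence and its reverse complement."""
    forward = embed(seq)
    reverse = embed(reverse_complement(seq))
    return [(f + r) / 2.0 for f, r in zip(forward, reverse)]


def toy_embed(seq: str) -> List[float]:
    """Stand-in encoder (assumption): base frequencies, which differ between strands."""
    upper = seq.upper()
    return [upper.count(base) / len(upper) for base in "ACGT"]


seq = "AACG"
averaged = rc_averaged_embedding(seq, toy_embed)
# The averaged vector is identical whichever strand is supplied.
same_for_both_strands = averaged == rc_averaged_embedding(reverse_complement(seq), toy_embed)
```

The same comparison, applied to a real encoder with a tolerance instead of exact equality, can serve as the optional RC check in the usage workflow.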
Variant handling
Foundation models operate on reference sequences by default. To incorporate subject-specific variants:
- Patch reference with VCF alleles
- Re-embed variant sequences
- Compare ΔAUC between reference and variant embeddings (exploratory)
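The patching step can be illustrated with a small helper. This is a sketch under stated assumptions: variants are given as hypothetical `(pos, ref, alt)` tuples whose positions are 1-based (as in VCF) but already converted to coordinates within the extracted gene sequence, they do not overlap, and only one haplotype is built. Real pipelines would also need to handle genotype phasing and multi-allelic records.

```python
from typing import List, Tuple

# (1-based position within the sequence, REF allele, ALT allele)
Variant = Tuple[int, str, str]


def patch_reference(ref_seq: str, variants: List[Variant]) -> str:
    """Apply non-overlapping variants to a reference sequence.

    Variants are applied from right to left so that insertions and deletions
    do not shift the coordinates of variants that still need to be applied.
    """
    patched = ref_seq
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1  # convert to 0-based index
        observed = patched[start:start + len(ref)]
        if observed.upper() != ref.upper():
            raise ValueError(
                f"REF mismatch at position {pos}: expected {ref!r}, found {observed!r}"
            )
        patched = patched[:start] + alt + patched[start + len(ref):]
    return patched


reference = "ACGTACGT"
variants = [
    (2, "C", "T"),   # single-nucleotide substitution
    (4, "TA", "T"),  # one-base deletion
]
subject_seq = patch_reference(reference, variants)
```

The REF check is worth keeping: a mismatch usually means the VCF and the extracted sequence use different reference builds or coordinate offsets.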
Attribution
Use LOGO (Leave-One-Gene-Out) ΔAUC to assess which genes contribute most to downstream prediction tasks (e.g., MDD risk, cognitive scores). See Yoon et al. BioKDD 2025 for protocol details.
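As a rough illustration of the idea, the sketch below assumes ΔAUC for a gene is the AUC of a model using all genes minus the AUC after that gene's feature columns are removed and the model is refit; the cited protocol may define or estimate it differently. The classifier is a hypothetical stand-in (projection onto the difference of class means), and `gene_slices` is an assumed mapping from gene name to its column range in the concatenated subject vector.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid_scores(train_x: Matrix, train_y: Sequence[int], test_x: Matrix) -> List[float]:
    """Stand-in classifier: score = dot product with (positive mean - negative mean)."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    dim = len(train_x[0])
    direction = [
        sum(x[d] for x in pos) / len(pos) - sum(x[d] for x in neg) / len(neg)
        for d in range(dim)
    ]
    return [sum(w * v for w, v in zip(direction, x)) for x in test_x]


def drop_columns(rows: Matrix, start: int, stop: int) -> Matrix:
    """Remove columns [start, stop) from every row."""
    return [list(r[:start]) + list(r[stop:]) for r in rows]


def logo_delta_auc(
    train_x: Matrix, train_y: Sequence[int],
    test_x: Matrix, test_y: Sequence[int],
    gene_slices: Dict[str, Tuple[int, int]],
) -> Dict[str, float]:
    """Return full-model AUC minus leave-one-gene-out AUC for each gene."""
    full_auc = auc(centroid_scores(train_x, train_y, test_x), test_y)
    deltas = {}
    for gene, (start, stop) in gene_slices.items():
        reduced_train = drop_columns(train_x, start, stop)
        reduced_test = drop_columns(test_x, start, stop)
        reduced_auc = auc(centroid_scores(reduced_train, train_y, reduced_test), test_y)
        deltas[gene] = full_auc - reduced_auc
    return deltas


# Toy data: GENE_A (column 0) separates the classes, GENE_B (column 1) does not.
train_x = [[1.0, 0.25], [0.75, 0.75], [0.25, 0.25], [0.0, 0.75]]
train_y = [1, 1, 0, 0]
test_x = [[0.75, 0.5], [0.25, 0.5], [1.0, 0.0], [0.0, 1.0]]
test_y = [1, 0, 1, 0]
deltas = logo_delta_auc(train_x, train_y, test_x, test_y,
                        {"GENE_A": (0, 1), "GENE_B": (1, 2)})
```

A large positive ΔAUC indicates that removing the gene hurts prediction; values near zero suggest the gene adds little beyond the remaining set. On real data the estimate should be averaged over cross-validation folds rather than a single split.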
Integration Targets
Genetics embeddings are integrated with:
- sMRI IDPs (structural phenotypes) via CCA, late fusion, or contrastive alignment
- fMRI embeddings (e.g., BrainLM, Brain-JEPA) for gene–brain–behaviour triangulation
- Behavioral phenotypes (cognitive scores, psychiatric diagnoses) via multimodal prediction
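Of the strategies listed above, contrastive alignment is the least self-explanatory, so a small sketch may help. It uses a symmetric InfoNCE loss, which is one common contrastive objective and is chosen here only for illustration; the source does not state which objective is used. It assumes gene and brain embeddings have already been projected to the same dimension and that row i of each input belongs to the same subject. The function names and the temperature value are assumptions.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _cross_entropy(logits: Sequence[float], target: int) -> float:
    """Numerically stable -log softmax(logits)[target]."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]


def info_nce(
    gene_vecs: Sequence[Sequence[float]],
    brain_vecs: Sequence[Sequence[float]],
    temperature: float = 0.07,
) -> float:
    """Symmetric InfoNCE: matched subject pairs are positives, all others negatives."""
    if not gene_vecs or len(gene_vecs) != len(brain_vecs):
        raise ValueError("inputs must be non-empty and paired row by row")
    genes = [l2_normalize(v) for v in gene_vecs]
    brains = [l2_normalize(v) for v in brain_vecs]
    n = len(genes)
    # sim[i][j]: scaled cosine similarity between subject i's genes and subject j's brain.
    sim = [[sum(a * b for a, b in zip(g, br)) / temperature for br in brains] for g in genes]
    gene_to_brain = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    brain_to_gene = sum(_cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (gene_to_brain + brain_to_gene)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
# The loss is lower when each subject's gene vector matches their own brain vector.
```

In training, this loss would be minimized with respect to the projection layers of both modalities; the pure-Python version here only shows how the quantity is computed.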
Learn more:
- Integration Strategy - Fusion protocols
- Modality Features: Genomics - Preprocessing specs
Source Repositories
**All genetics FM source code is tracked in** `external_repos/`:

| Model | GitHub Repository | Local Clone |
|:------|:------------------|:------------|
| **Caduceus** | [kuleshov-group/caduceus](https://github.com/kuleshov-group/caduceus) | `external_repos/caduceus/` |
| **DNABERT-2** | [Zhihan1996/DNABERT2](https://github.com/Zhihan1996/DNABERT2) | `external_repos/dnabert2/` |
| **Evo 2** | [ArcInstitute/evo2](https://github.com/ArcInstitute/evo2) | `external_repos/evo2/` |
| **GENERator** | [GenerTeam/GENERator](https://github.com/GenerTeam/GENERator) | `external_repos/generator/` |
| **HyenaDNA** | [HazyResearch/hyena-dna](https://github.com/HazyResearch/hyena-dna) | `external_repos/hyena/` |

**Each model has three interconnected resources:**

- **Code Walkthrough**: Step-by-step implementation guide
- **YAML Model Card**: Structured metadata and specs
- **Integration Recipe**: Embedding extraction and fusion protocols

Next Steps
- Validate gene embedding reproducibility across cohorts (UK Biobank WES, Cha Hospital panel sequencing)
- Benchmark LOGO ΔAUC stability under different embedding projection dimensions (256, 512, 1024)
- Explore regulatory region embeddings (enhancers, promoters) with long-context models like Evo 2