Evo 2¶
Overview¶
Type: Ultra-long-context DNA foundation model
Architecture: StripedHyena 2 (Hyena + attention)
Modality: Nucleotide sequences (DNA/RNA)
Primary use: Regulatory region embeddings with 1M token context
Purpose & Design Philosophy¶
Evo2 extends DNA foundation models to 1 million token contexts using StripedHyena 2 architecture (hybrid Hyena operators + attention layers). This enables modeling entire genes with full regulatory context (promoters, enhancers, 3D loop anchors) in a single forward pass, capturing long-range genomic interactions that shorter-context models miss.
Key innovation: 1M context via sub-quadratic Hyena operators → whole-locus modeling including distal regulatory elements.
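The sub-quadratic claim comes from evaluating Hyena's long convolutions with FFTs, which cost O(L log L) rather than attention's O(L²). A minimal NumPy sketch of that core operation (the gating and learned filter parameterization of real Hyena blocks are omitted):

```python
import numpy as np

def fft_long_conv(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Global convolution via FFT: O(L log L) instead of attention's O(L^2)."""
    L = len(x)
    n = 2 * L  # zero-pad so the circular FFT convolution equals linear convolution
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)
    return y[:L]  # keep the causal prefix

# A 1M-token sequence needs a few FFTs, not a 1M x 1M attention matrix.
x = np.random.randn(1_048_576)
k = np.exp(-np.arange(1_048_576) / 1e4)  # toy decaying filter
y = fft_long_conv(x, k)
```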
Architecture Highlights¶
- Backbone: StripedHyena 2 (alternating Hyena convolution + multi-head attention)
- Context length: 1,048,576 tokens (~1 Mb of genomic sequence)
- Tokenization: Single-base (byte-level), preserving single-nucleotide resolution (see the sketch after this list)
- Pretraining: Autoregressive next-token prediction on genomes spanning all domains of life
- Output: Per-position embeddings → region/gene pooling
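For intuition, a minimal sketch of single-base tokenization; the `VOCAB` mapping here is hypothetical, and the real Evo2 tokenizer's ids and special tokens may differ:

```python
# Hypothetical single-base vocabulary; the real Evo2 tokenizer is byte-level
# and its exact id mapping and special tokens may differ.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def tokenize(seq: str) -> list[int]:
    """One token per nucleotide; ambiguous bases fall back to N."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]

print(tokenize("ACGTn"))  # [0, 1, 2, 3, 4]
```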
Integration Strategy¶
For Neuro-Omics KB¶
Embedding recipe: genetics_regulatory_evo2_v1 (exploratory; a code sketch follows the steps below)
- Extract extended gene loci (gene + 100kb upstream/downstream for regulatory context)
- Tokenize with the Evo2 single-base vocabulary
- Forward pass → per-position embeddings for full locus
- Reverse-complement (RC) handling: Evo2 is not explicitly RC-equivariant → average forward and RC embeddings
- Pool over gene CDS → gene embedding
- Optionally extract regulatory region embeddings (promoter, enhancers) separately
- Project to 512-D via PCA
- Residualize: age, sex, ancestry PCs, batch
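A minimal sketch of the recipe's post-processing, assuming per-position embeddings come from a hypothetical `embed_locus` wrapper around the Evo2 forward pass (not the library's actual API); `covariates` is a numeric design matrix with categoricals (sex, batch) one-hot encoded:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def rc_averaged_gene_embedding(seq: str, cds: slice, embed_locus) -> np.ndarray:
    """RC-average per-position embeddings, then mean-pool over the CDS.

    embed_locus(seq) -> (L, d) array; hypothetical wrapper, not Evo2's actual API.
    """
    fwd = embed_locus(seq)                            # (L, d), forward strand
    rev = embed_locus(reverse_complement(seq))[::-1]  # flip back to forward coords
    return (0.5 * (fwd + rev))[cds].mean(axis=0)      # (d,) gene vector

def project_and_residualize(X: np.ndarray, covariates: np.ndarray, dim: int = 512) -> np.ndarray:
    """PCA to 512-D, then regress out age/sex/ancestry-PC/batch effects."""
    Z = PCA(n_components=dim).fit_transform(X)        # X: (n, d) stacked embeddings
    return Z - LinearRegression().fit(covariates, Z).predict(covariates)
```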
Fusion targets:
- Gene expression prediction: Regulatory context improves gene-phenotype links
- Enhancer-gene mapping: Identify distal elements affecting brain-expressed genes
- 3D genome modeling: Capture loop anchors and TAD boundaries (exploratory)
For ARPA-H Brain-Omics Models¶
Evo2 enables whole-locus genetic representations for Brain-Omics systems:
- 1M context captures regulatory grammar spanning hundreds of kilobases
- Critical for brain-specific enhancers distant from their target genes
- Can embed entire pathways or multi-gene clusters in a single pass
- Blueprint for ultra-long-context multimodal architectures (e.g., long-range EEG patterns)
Embedding Extraction Workflow¶
# 1. Extract extended loci (gene ± 100kb from hg38)
# 2. Tokenize with the Evo2 single-base vocabulary
# 3. Load pretrained Evo2 checkpoint
# 4. Forward pass (may require chunking if >1M tokens)
# 5. Extract embeddings for:
# - Gene CDS (coding sequence)
# - Promoter (-2kb to TSS)
# - Predicted enhancers (if annotated)
# 6. RC-average forward + reverse-complement
# 7. Pool each region → separate vectors or concatenate
# 8. Log: context_length, regulatory_elements_included
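A sketch of steps 1, 5, and 8, using pyfaidx for FASTA access; the `hg38.fa` path, coordinates, and strand handling are illustrative assumptions:

```python
import logging
from pyfaidx import Fasta  # pip install pyfaidx

genome = Fasta("hg38.fa")  # assumed local reference; path is illustrative
FLANK = 100_000            # +/- 100 kb of regulatory context (step 1)
MAX_CTX = 1_048_576        # Evo2 context limit

def extract_locus(chrom: str, gene_start: int, gene_end: int) -> tuple[str, int]:
    """Step 1: extract gene +/- 100 kb, clipped to the model's context window."""
    start = max(0, gene_start - FLANK)
    end = min(gene_end + FLANK, start + MAX_CTX)
    seq = genome[chrom][start:end].seq.upper()
    logging.info("context_length=%d chrom=%s start=%d", len(seq), chrom, start)  # step 8
    return seq, start  # offset makes region coordinates locus-relative

def promoter_slice(tss: int, locus_start: int, strand: str = "+") -> slice:
    """Step 5: promoter window, -2 kb to TSS (mirrored for minus-strand genes)."""
    rel = tss - locus_start
    return slice(max(0, rel - 2_000), rel) if strand == "+" else slice(rel, rel + 2_000)
```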
Strengths & Limitations¶
Strengths¶
- Ultra-long context: a 1M-token window captures distal regulatory elements
- Whole-locus modeling: No need to manually select regulatory windows
- Sub-quadratic scaling: Hyena operators enable long context without full attention cost
- Regulatory grammar: Can learn enhancer-promoter interactions end-to-end
Limitations¶
- Massive memory footprint: 1M context requires high-memory GPUs (80GB+ A100/H100)
- Slower inference: even with Hyena operators, a 1M-token forward pass is far slower than short-context models
- Overkill for coding sequences: typical coding sequences span only a few kilobases and don't need 1M context
- Checkpoint availability: Fewer public weights vs. DNABERT-2/Caduceus
When to Use Evo2¶
✅ Use when:
- Regulatory context is needed for brain-specific gene expression
- Studying long-range enhancer-promoter interactions
- Sufficient compute is available (80GB+ GPU, large batch sizes)
- Exploring 3D genome structure embeddings

⚠️ Defer until:
- Caduceus/DNABERT-2 baselines are complete
- Regulatory element analysis becomes critical
- GPU resources are available for long-context experiments

⚠️ Consider alternatives:
- Caduceus: for coding sequences without regulatory context
- DNABERT-2: for standard gene embeddings with manageable compute
- GENERator: if generative modeling is the priority
Reference Materials¶
Knowledge Base Resources¶
Curated materials in this KB:
- Paper Summary (PDF Notes): Evo2 (2024)
- Code walkthrough: Evo2 walkthrough
- Model card (YAML): kb/model_cards/evo2.yaml
- Paper card (YAML): kb/paper_cards/evo2_2024.yaml
Integration recipes:
- Modality Features: Genomics - Integration Strategy
Original Sources¶
Source code repositories:
- Local copy: external_repos/evo2/
- Official GitHub: ArcInstitute/evo2
Original paper:
- Title: "Genome modeling and design across all domains of life with Evo 2"
- Authors: Arc Institute Team
- Published: bioRxiv preprint, February 2025
- Link: bioRxiv:2025.02.18.638918
- PDF Notes: evo2_2024.pdf
Next Steps in Our Pipeline¶
- Pilot study: Embed 5-10 brain-expressed genes with known distal enhancers
- Context ablation: Test 10kb vs. 100kb vs. 1M context for gene-brain CCA (a toy harness is sketched after this list)
- Memory profiling: Document GPU requirements and chunking strategies
- Enhancer-gene links: Compare Evo2 regulatory embeddings vs. eQTL databases
- ARPA-H vision: Explore Evo2-style long context for other modalities (EEG, longitudinal)
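For the context ablation, a toy harness under stated assumptions: `embed_subjects` is a hypothetical hook into the embedding pipeline, and the data here are random placeholders:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
brain_feats = rng.normal(size=(200, 64))  # placeholder imaging-derived features

def embed_subjects(context_bp: int) -> np.ndarray:
    """Hypothetical hook: per-subject features from Evo2 embeddings at one context length."""
    return rng.normal(size=(200, 128))  # placeholder; wire to the real pipeline

def mean_cca(X: np.ndarray, Y: np.ndarray, k: int = 5) -> float:
    """Mean of the first k canonical correlations."""
    U, V = CCA(n_components=k).fit_transform(X, Y)
    return float(np.mean([np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(k)]))

for context_bp in (10_000, 100_000, 1_048_576):
    print(f"context={context_bp:>9,}  mean_cca={mean_cca(embed_subjects(context_bp), brain_feats):.3f}")
```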
Engineering Notes¶
- GPU requirements: 80GB+ A100 or H100 for full 1M context
- Chunk long sequences if needed; aggregate chunk embeddings carefully (see the sketch below)
- Log context length used (may be <1M for most genes)
- RC-averaging doubles compute; consider caching forward embeddings
- When comparing to short-context models, isolate regulatory contribution via ablation
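For the chunking note above, a minimal sketch of length-weighted aggregation over overlapping chunks; `embed_chunk` is a hypothetical wrapper returning per-position embeddings:

```python
import numpy as np

def chunked_embedding(seq: str, embed_chunk, chunk: int = 1_048_576, overlap: int = 4_096) -> np.ndarray:
    """Length-weighted mean over overlapping chunks of an over-long locus.

    embed_chunk(s) -> (len(s), d) per-position embeddings (hypothetical wrapper).
    Trimming the duplicated overlap prevents double-counting boundary positions.
    """
    step = chunk - overlap
    pooled, weights = [], []
    for start in range(0, len(seq), step):
        emb = embed_chunk(seq[start:start + chunk])
        if start > 0:
            emb = emb[overlap:]  # positions already covered by the previous chunk
        if len(emb) == 0:
            break
        pooled.append(emb.mean(axis=0))
        weights.append(len(emb))
        if start + chunk >= len(seq):
            break
    return np.average(pooled, axis=0, weights=weights)

# Usage with a stand-in embedder (random features, small sizes for illustration):
rng = np.random.default_rng(0)
vec = chunked_embedding("ACGT" * 1_000, lambda s: rng.normal(size=(len(s), 8)),
                        chunk=1_024, overlap=128)
```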