HyenaDNA

Overview

Type: Long-context DNA foundation model
Architecture: Decoder-only stack of Hyena operators (implicit convolutions)
Modality: Nucleotide sequences (DNA)
Primary use (conceptual in KB): Reference architecture for 1M-token genomic modeling

Purpose & Design Philosophy

HyenaDNA demonstrates that sub-quadratic sequence operators can scale genomic language models to 1M-token contexts at single-nucleotide resolution, breaking the context-length barrier imposed by quadratic attention while preserving fine-grained variant information.^arXiv:2306.15794 It is trained as a next-nucleotide predictor on the human reference genome and evaluated on standard regulatory-element benchmarks, showing that carefully designed implicit convolutions can match or exceed attention-based DNA language models with far fewer parameters and less pretraining data.
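
As a concrete illustration of the pretraining objective, the minimal sketch below shows character-level tokenization and a causal next-nucleotide loss. The vocabulary, helper names, and the `dna_model` placeholder are illustrative assumptions, not HyenaDNA's actual code.

```python
# Hypothetical sketch of HyenaDNA-style pretraining: single-nucleotide
# (character-level) tokens plus a standard causal next-token loss.
import torch
import torch.nn.functional as F

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # one token per nucleotide

def tokenize(seq: str) -> torch.Tensor:
    """Map a DNA string to integer token ids, one id per base."""
    return torch.tensor([VOCAB.get(base, VOCAB["N"]) for base in seq.upper()])

def next_nucleotide_loss(dna_model, seq: str) -> torch.Tensor:
    """Causal LM objective: predict base t+1 from bases <= t.

    `dna_model` stands in for any decoder-only genomic LM that returns
    per-position logits of shape (batch, length, |VOCAB|).
    """
    ids = tokenize(seq).unsqueeze(0)        # (1, L)
    logits = dna_model(ids[:, :-1])         # (1, L-1, |VOCAB|)
    targets = ids[:, 1:]                    # shift by one position
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```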

Architecture Highlights

  • Operators: Hyena implicit convolutions with data-controlled gating (no self-attention); a toy operator sketch follows this list.
  • Context length: Up to 1,000,000 tokens (1Mbp) with character-level tokenization.
  • Training tricks: Sequence-length warm-up schedule, gradient checkpointing for ultralong inputs, soft prompts for downstream adaptation.
  • Outputs: Per-position logits/embeddings suitable for downstream pooling into gene-, enhancer-, or window-level features; a pooling sketch also follows this list.
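
To make the operator bullet concrete, here is a toy, single-order sketch of a gated long-convolution block in the spirit of Hyena. The real operator parameterizes its filters implicitly (a small network over positional features) and composes several projections; the class and parameter names below are illustrative only.

```python
# Toy Hyena-flavoured block: causal long convolution (via FFT, O(L log L))
# combined with data-controlled elementwise gating -- no self-attention.
import torch
import torch.nn as nn

class GatedLongConv(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)   # value and gate branches
        # HyenaDNA generates this filter implicitly; an explicit per-channel
        # parameter is kept here purely for brevity.
        self.filter = nn.Parameter(0.02 * torch.randn(d_model, max_len))
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D), L <= max_len
        B, L, D = x.shape
        v, g = self.in_proj(x).chunk(2, dim=-1)
        n = 2 * L                                         # zero-pad -> linear (causal) conv
        k_f = torch.fft.rfft(self.filter[:, :L], n=n)     # (D, n//2 + 1)
        v_f = torch.fft.rfft(v.transpose(1, 2), n=n)      # (B, D, n//2 + 1)
        y = torch.fft.irfft(v_f * k_f, n=n)[..., :L]      # keep the causal part
        y = y.transpose(1, 2)                             # back to (B, L, D)
        return self.out_proj(torch.sigmoid(g) * y)        # data-controlled gating
```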
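
The outputs bullet implies a pooling step from per-nucleotide embeddings to gene-, enhancer-, or window-level features; a minimal sketch follows. The function name and pooling choices are assumptions, not a KB API.

```python
# Pool per-nucleotide embeddings of shape (L, D) over an interval [start, end).
import torch

def pool_interval(embeddings: torch.Tensor, start: int, end: int,
                  how: str = "mean") -> torch.Tensor:
    """Collapse a window of per-position embeddings into one feature vector."""
    window = embeddings[start:end]            # (end - start, D)
    if how == "mean":
        return window.mean(dim=0)
    if how == "max":
        return window.max(dim=0).values
    raise ValueError(f"unknown pooling: {how}")
```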

HyenaDNA is not currently vendored as code in this KB; instead, the generic StripedHyena codebase in external_repos/hyena/ is used for architectural code walkthroughs.

Integration Strategy

For Neuro-Omics KB

HyenaDNA is tracked as a long-context genomics reference:

  • Informs the design of ultra-long-context pipelines built around Evo 2 (StripedHyena 2) for regulatory-region and whole-locus embeddings.
  • Motivates experimenting with 100kb–1Mbp windows when studying distal regulatory effects on brain-related genes.
  • Suggests that sequence-length warm-up and soft prompting should be standard recipes when introducing Hyena/StripedHyena operators into neuro-omics models; a warm-up sketch follows this list.
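
As a rough illustration of the warm-up recipe, the sketch below grows the training context length in stages toward the 1Mbp target. The stage boundaries and function name are assumptions for illustration, not the schedule reported in the paper.

```python
# Hedged sketch of a sequence-length warm-up curriculum: train on short
# contexts first, then progressively unlock longer windows.
def context_length_at_step(
    step: int,
    stages=((0, 1_024), (10_000, 32_768), (20_000, 262_144), (30_000, 1_000_000)),
) -> int:
    """Return the context length to train with at a given optimizer step."""
    length = stages[0][1]
    for start_step, ctx_len in stages:
        if step >= start_step:
            length = ctx_len
    return length

# Usage inside a training loop:
#   L = context_length_at_step(step)
#   batch = batch[:, :L]   # crop sequences to the current curriculum length
```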

Concrete embeddings in this KB currently use Caduceus, DNABERT-2, Evo 2, and GENERaTOR; HyenaDNA is kept as a design anchor and potential future encoder once public checkpoints and code stabilise.

Reference Materials

Knowledge Base Resources

  • Paper summary: docs/generated/kb_curated/papers-md/hyenadna_2023.md
  • Paper card (YAML): kb/paper_cards/hyenadna_2023.yaml
  • Model card (YAML): kb/model_cards/hyenadna.yaml
  • Code walkthrough: hyena_walkthrough.md (StripedHyena core)

Original Sources

  • Paper: HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution (NeurIPS 2023)^arXiv:2306.15794
  • Hyena / StripedHyena code: see StripedHyena GitHub and related Hyena project repositories referenced in the paper.