Skip to content

Free Data Sources for FM Benchmarking

This page lists the real, publicly available datasets included in this benchmark hub, with proper attribution to their original sources.


🧬 DNA Sequence Data (Genomic Benchmarks)

These datasets are from the Genomic Benchmarks collection and are ready to use in toy_data/genomics/dna_sequences/.

Data Source

DNA sequence benchmarks are from Genomic Benchmarks by GreΕ‘ovΓ‘ et al.

Enhancer Classification (Cohn et al.)

Real ChIP-seq derived enhancer sequences

Split Sequences Source
Train 20,843 ChIP-seq experiments
Test 6,948 ChIP-seq experiments
toy_data/genomics/dna_sequences/enhancers_cohn/
β”œβ”€β”€ train.tsv  (sequence, label)
└── test.tsv   (sequence, label)

Run benchmark:

python -m fmbench run --suite SUITE-DNA-CLASS --model configs/model_evo2.yaml --out results/

Promoter Classification (EPD Database)

Real curated promoters from Eukaryotic Promoter Database

Split Sequences Source
Train 27,097 EPD database
Test 9,034 EPD database
toy_data/genomics/dna_sequences/promoters_nontata/
β”œβ”€β”€ train.tsv
└── test.tsv

Regulatory Elements (Ensembl)

Real annotations from ENCODE + Roadmap Epigenomics

Split Sequences Classes Notes
Train 1,500 enhancer, promoter, open_chromatin Toy version (500/class)
Test 57,713 enhancer, promoter, open_chromatin Full test set

Full Dataset

The toy train set is a stratified subsample. Full training set (231,348 samples) available from Genomic Benchmarks.

toy_data/genomics/dna_sequences/regulatory_ensembl/
β”œβ”€β”€ train.tsv
└── test.tsv

Nucleotide Transformer Benchmark Suite

The standard multi-task benchmark for DNA FMs

Split Sequences Tasks Notes
Train 1,500 18 tasks Toy version (500/class)
Test 38,822 18 tasks Full test set

Full Dataset

The toy train set is a stratified subsample. Full training set (493,242 samples) available from InstaDeepAI.

Tasks include:

  • Enhancer prediction
  • Promoter prediction (TATA / non-TATA)
  • Splice site prediction (donor / acceptor)
  • Histone modifications (H3, H3K4me3, H3K36me3, etc.)
toy_data/genomics/dna_sequences/nucleotide_transformer/
β”œβ”€β”€ train.tsv  (sequence, name, label, task)
└── test.tsv

What makes this data "real"?

Dataset Experimental Source Not Synthetic Because
Enhancers (Cohn) ChIP-seq Binding sites measured in cells
Promoters (EPD) TSS mapping Curated from experimental evidence
Regulatory (Ensembl) ENCODE + Roadmap 1000s of ChIP-seq/DNase-seq experiments
NT Benchmark Multiple Industry-standard benchmark from papers

🧬 Single-Cell RNA-seq Data

PBMC 3K Dataset

Pre-processed with real cell type labels

Metric Value
Cells 2,638
Genes 1,838
Classes 8 cell types
toy_data/genomics/pbmc_classification/
β”œβ”€β”€ X.npy      (expression matrix)
β”œβ”€β”€ y.npy      (labels)
β”œβ”€β”€ metadata.csv
└── label_map.csv

Data Source

PBMC data from 10x Genomics - "3k PBMCs from a Healthy Donor" dataset. Processed using Scanpy with cell type annotations from Louvain clustering.

Run benchmark:

python -m fmbench run --suite SUITE-GEN-CLASS-001 --model configs/model_geneformer.yaml --dataset DS-PBMC --out results/

Additional Data Sources

More DNA Sequence Data

Source URL Data Type
UCSC Genome Browser hgdownload.soe.ucsc.edu Reference genomes
ENCODE Portal encodeproject.org ChIP-seq, ATAC-seq
Roadmap Epigenomics roadmapepigenomics.org Epigenomic marks
ClinVar ncbi.nlm.nih.gov/clinvar Pathogenic variants

More Single-Cell Data

Source URL Data Type
10x Genomics 10xgenomics.com/datasets scRNA-seq
CellxGene cellxgene.cziscience.com Annotated atlases
Human Cell Atlas humancellatlas.org Multi-tissue atlases

Brain Imaging Data

Source URL Registration
OpenNeuro openneuro.org None required
HCP humanconnectome.org Required
UK Biobank ukbiobank.ac.uk Application required

Download More Data

# Activate environment
source scripts/activate_fmbench.sh

# Download additional genomic benchmarks
python << 'EOF'
from datasets import load_dataset

# Drosophila enhancers
ds = load_dataset("katarinagresova/Genomic_Benchmarks_drosophila_enhancers_stark")

# Mouse enhancers  
ds = load_dataset("katarinagresova/Genomic_Benchmarks_dummy_mouse_enhancers_ensembl")

# Coding vs intergenic
ds = load_dataset("katarinagresova/Genomic_Benchmarks_demo_coding_vs_intergenomic_seqs")
EOF

πŸ“š Citations & Acknowledgments

If you use data from this benchmark hub, please cite the original sources:

Genomic Benchmarks (DNA Sequence Data)

GREŠOVÁ, Katarína, et al. Genomic Benchmarks: A Collection of Datasets for Genomic Sequence Classification. bioRxiv, 2022.

URL: https://www.biorxiv.org/content/10.1101/2022.06.08.495248

@article{gresova2022genomic,
  title={Genomic Benchmarks: A Collection of Datasets for Genomic Sequence Classification},
  author={Gresova, Katarina and Martinek, Vlastimil and Cechak, David and Simecek, Petr and Alexiou, Panagiotis},
  journal={bioRxiv},
  year={2022},
  publisher={Cold Spring Harbor Laboratory},
  url={https://www.biorxiv.org/content/10.1101/2022.06.08.495248}
}

Nucleotide Transformer (NT Benchmark Data)

DALLA-TORRE, Hugo, et al. The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics. bioRxiv, 2023.

URL: https://www.biorxiv.org/content/10.1101/2023.01.11.523679

@article{dalla2023nucleotide,
  title={The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics},
  author={Dalla-Torre, Hugo and Gonzalez, Liam and Mendoza-Revilla, Javier and others},
  journal={bioRxiv},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

10x Genomics PBMC 3K (Single-Cell RNA-seq Data)

10x Genomics. 3k PBMCs from a Healthy Donor. 10x Genomics Datasets, 2016.

URL: https://www.10xgenomics.com/datasets/3-k-pbm-cs-from-a-healthy-donor-1-standard-1-1-0

Scanpy (Single-Cell Analysis)

WOLF, F. Alexander, ANGERER, Philipp, and THEIS, Fabian J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 2018.

@article{wolf2018scanpy,
  title={SCANPY: large-scale single-cell gene expression data analysis},
  author={Wolf, F Alexander and Angerer, Philipp and Theis, Fabian J},
  journal={Genome biology},
  volume={19},
  number={1},
  pages={1--5},
  year={2018},
  publisher={Springer}
}

ENCODE Consortium (Regulatory Data)

The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 2012.

Roadmap Epigenomics (Regulatory Data)

Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature, 2015.

Eukaryotic Promoter Database (Promoter Data)

DREOS, RenΓ©, et al. The eukaryotic promoter database in its 30th year: focus on non-vertebrate organisms. Nucleic Acids Research, 2017.


License

  • Genomic Benchmarks: Apache 2.0
  • 10x Genomics PBMC: Free for research use
  • ENCODE/Roadmap data: Public domain

Back to Home View Benchmarks