Free Data Sources for FM Benchmarking
This page lists the real, publicly available datasets included in this benchmark hub, with proper attribution to their original sources.
𧬠DNA Sequence Data (Genomic Benchmarks)
These datasets are from the Genomic Benchmarks collection and are ready to use in toy_data/genomics/dna_sequences/.
Data Source
DNA sequence benchmarks are from Genomic Benchmarks by GreΕ‘ovΓ‘ et al.
Enhancer Classification (Cohn et al.)
Real ChIP-seq derived enhancer sequences
| Split | Sequences | Source |
|---|---|---|
| Train | 20,843 | ChIP-seq experiments |
| Test | 6,948 | ChIP-seq experiments |
toy_data/genomics/dna_sequences/enhancers_cohn/
βββ train.tsv (sequence, label)
βββ test.tsv (sequence, label)
Run benchmark:
python -m fmbench run --suite SUITE-DNA-CLASS --model configs/model_evo2.yaml --out results/
Promoter Classification (EPD Database)
Real curated promoters from Eukaryotic Promoter Database
| Split | Sequences | Source |
|---|---|---|
| Train | 27,097 | EPD database |
| Test | 9,034 | EPD database |
toy_data/genomics/dna_sequences/promoters_nontata/
βββ train.tsv
βββ test.tsv
Regulatory Elements (Ensembl)
Real annotations from ENCODE + Roadmap Epigenomics
| Split | Sequences | Classes | Notes |
|---|---|---|---|
| Train | 1,500 | enhancer, promoter, open_chromatin | Toy version (500/class) |
| Test | 57,713 | enhancer, promoter, open_chromatin | Full test set |
Full Dataset
The toy train set is a stratified subsample. Full training set (231,348 samples) available from Genomic Benchmarks.
toy_data/genomics/dna_sequences/regulatory_ensembl/
βββ train.tsv
βββ test.tsv
Nucleotide Transformer Benchmark Suite
The standard multi-task benchmark for DNA FMs
| Split | Sequences | Tasks | Notes |
|---|---|---|---|
| Train | 1,500 | 18 tasks | Toy version (500/class) |
| Test | 38,822 | 18 tasks | Full test set |
Full Dataset
The toy train set is a stratified subsample. Full training set (493,242 samples) available from InstaDeepAI.
Tasks include:
- Enhancer prediction
- Promoter prediction (TATA / non-TATA)
- Splice site prediction (donor / acceptor)
- Histone modifications (H3, H3K4me3, H3K36me3, etc.)
toy_data/genomics/dna_sequences/nucleotide_transformer/
βββ train.tsv (sequence, name, label, task)
βββ test.tsv
Data Source
From InstaDeepAI/nucleotide_transformer_downstream_tasks_revised. See Dalla-Torre et al. (2023).
What makes this data "real"?
| Dataset | Experimental Source | Not Synthetic Because |
|---|---|---|
| Enhancers (Cohn) | ChIP-seq | Binding sites measured in cells |
| Promoters (EPD) | TSS mapping | Curated from experimental evidence |
| Regulatory (Ensembl) | ENCODE + Roadmap | 1000s of ChIP-seq/DNase-seq experiments |
| NT Benchmark | Multiple | Industry-standard benchmark from papers |
𧬠Single-Cell RNA-seq Data
PBMC 3K Dataset
Pre-processed with real cell type labels
| Metric | Value |
|---|---|
| Cells | 2,638 |
| Genes | 1,838 |
| Classes | 8 cell types |
toy_data/genomics/pbmc_classification/
βββ X.npy (expression matrix)
βββ y.npy (labels)
βββ metadata.csv
βββ label_map.csv
Data Source
PBMC data from 10x Genomics - "3k PBMCs from a Healthy Donor" dataset. Processed using Scanpy with cell type annotations from Louvain clustering.
Run benchmark:
python -m fmbench run --suite SUITE-GEN-CLASS-001 --model configs/model_geneformer.yaml --dataset DS-PBMC --out results/
Additional Data Sources
More DNA Sequence Data
| Source | URL | Data Type |
|---|---|---|
| UCSC Genome Browser | hgdownload.soe.ucsc.edu | Reference genomes |
| ENCODE Portal | encodeproject.org | ChIP-seq, ATAC-seq |
| Roadmap Epigenomics | roadmapepigenomics.org | Epigenomic marks |
| ClinVar | ncbi.nlm.nih.gov/clinvar | Pathogenic variants |
More Single-Cell Data
| Source | URL | Data Type |
|---|---|---|
| 10x Genomics | 10xgenomics.com/datasets | scRNA-seq |
| CellxGene | cellxgene.cziscience.com | Annotated atlases |
| Human Cell Atlas | humancellatlas.org | Multi-tissue atlases |
Brain Imaging Data
| Source | URL | Registration |
|---|---|---|
| OpenNeuro | openneuro.org | None required |
| HCP | humanconnectome.org | Required |
| UK Biobank | ukbiobank.ac.uk | Application required |
Download More Data
# Activate environment
source scripts/activate_fmbench.sh
# Download additional genomic benchmarks
python << 'EOF'
from datasets import load_dataset
# Drosophila enhancers
ds = load_dataset("katarinagresova/Genomic_Benchmarks_drosophila_enhancers_stark")
# Mouse enhancers
ds = load_dataset("katarinagresova/Genomic_Benchmarks_dummy_mouse_enhancers_ensembl")
# Coding vs intergenic
ds = load_dataset("katarinagresova/Genomic_Benchmarks_demo_coding_vs_intergenomic_seqs")
EOF
π Citations & Acknowledgments
If you use data from this benchmark hub, please cite the original sources:
Genomic Benchmarks (DNA Sequence Data)
GREΕ OVΓ, KatarΓna, et al. Genomic Benchmarks: A Collection of Datasets for Genomic Sequence Classification. bioRxiv, 2022.
URL: https://www.biorxiv.org/content/10.1101/2022.06.08.495248
@article{gresova2022genomic,
title={Genomic Benchmarks: A Collection of Datasets for Genomic Sequence Classification},
author={Gresova, Katarina and Martinek, Vlastimil and Cechak, David and Simecek, Petr and Alexiou, Panagiotis},
journal={bioRxiv},
year={2022},
publisher={Cold Spring Harbor Laboratory},
url={https://www.biorxiv.org/content/10.1101/2022.06.08.495248}
}
Nucleotide Transformer (NT Benchmark Data)
DALLA-TORRE, Hugo, et al. The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics. bioRxiv, 2023.
URL: https://www.biorxiv.org/content/10.1101/2023.01.11.523679
@article{dalla2023nucleotide,
title={The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics},
author={Dalla-Torre, Hugo and Gonzalez, Liam and Mendoza-Revilla, Javier and others},
journal={bioRxiv},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
10x Genomics PBMC 3K (Single-Cell RNA-seq Data)
10x Genomics. 3k PBMCs from a Healthy Donor. 10x Genomics Datasets, 2016.
URL: https://www.10xgenomics.com/datasets/3-k-pbm-cs-from-a-healthy-donor-1-standard-1-1-0
Scanpy (Single-Cell Analysis)
WOLF, F. Alexander, ANGERER, Philipp, and THEIS, Fabian J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 2018.
@article{wolf2018scanpy,
title={SCANPY: large-scale single-cell gene expression data analysis},
author={Wolf, F Alexander and Angerer, Philipp and Theis, Fabian J},
journal={Genome biology},
volume={19},
number={1},
pages={1--5},
year={2018},
publisher={Springer}
}
ENCODE Consortium (Regulatory Data)
The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 2012.
Roadmap Epigenomics (Regulatory Data)
Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature, 2015.
Eukaryotic Promoter Database (Promoter Data)
DREOS, RenΓ©, et al. The eukaryotic promoter database in its 30th year: focus on non-vertebrate organisms. Nucleic Acids Research, 2017.
License
- Genomic Benchmarks: Apache 2.0
- 10x Genomics PBMC: Free for research use
- ENCODE/Roadmap data: Public domain