# AI4H-Inspired Foundation Model Benchmarks
Standardized, AI4H-aligned benchmarks for genetics and brain imaging foundation models — designed to be runnable locally, with public, comparable results.
- Run a toy benchmark
- Submit my results (`eval.yaml`)
- View leaderboards
## What you can run right now
```bash
pip install -e .
python -m fmbench generate-toy-data
python -m fmbench run --suite SUITE-TOY-CLASS --model configs/model_dummy_classifier.yaml --out results/toy_run
```
You should get two concrete artifacts:
- `results/toy_run/report.md`: a human-readable report
- `results/toy_run/eval.yaml`: a machine-readable record for submission
Example `eval.yaml` (what you submit):

```yaml
eval_id: SUITE-TOY-CLASS-dummy_classifier-YYYYMMDD-HHMMSS
benchmark_id: BM-TOY-CLASS
model_ids:
  candidate: my_model_id
dataset_id: DS-TOY-FMRI-CLASS
run_metadata:
  runner: fmbench
  suite_id: SUITE-TOY-CLASS
metrics:
  AUROC: 0.82
  Accuracy: 0.76
status: Completed
```
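If you want to see how such a record maps onto plain Python, here is a minimal sketch that assembles the same fields as a dict. The helper name and timestamp format are illustrative — `fmbench` generates this file for you:

```python
# Sketch: assemble an eval.yaml-style record in Python. Field names follow
# the example above; build_eval_record is a hypothetical helper, since
# fmbench writes this file itself at the end of a run.
from datetime import datetime, timezone

def build_eval_record(model_id, metrics):
    """Build a dict matching the submission schema shown above."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return {
        "eval_id": f"SUITE-TOY-CLASS-{model_id}-{stamp}",
        "benchmark_id": "BM-TOY-CLASS",
        "model_ids": {"candidate": model_id},
        "dataset_id": "DS-TOY-FMRI-CLASS",
        "run_metadata": {"runner": "fmbench", "suite_id": "SUITE-TOY-CLASS"},
        "metrics": metrics,
        "status": "Completed",
    }

record = build_eval_record("my_model_id", {"AUROC": 0.82, "Accuracy": 0.76})
```

Serializing `record` with any YAML library reproduces the structure of the example above.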
## 🔄 How submissions work (fully automated)
Our leaderboard updates automatically when you submit results — no manual review delay.
```mermaid
flowchart LR
    subgraph local ["🖥️ Your Machine (Private)"]
        A["python -m fmbench run"] --> B["report.md"]
        A --> C["eval.yaml"]
    end
    subgraph github ["☁️ GitHub (Public)"]
        C --> D["Open Issue\n+ paste YAML"]
        D --> E["🤖 Bot extracts\n& validates"]
        E --> F["Auto-commit\nto evals/"]
        F --> G["🏆 Leaderboard\nrebuilds"]
        G --> H["📄 Docs deploy\nto GitHub Pages"]
    end
    style local fill:#e8f5e9,stroke:#4caf50
    style github fill:#e3f2fd,stroke:#2196f3
```
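The "validates" step in the pipeline above boils down to schema checks on the pasted YAML. A minimal sketch of what such a check might look like — the function name, required-field set, and messages are illustrative, not the actual bot code:

```python
# Sketch: the kind of validation a submission bot can run on a parsed
# eval.yaml payload. Field names follow the example record in this README;
# validate_submission and its messages are hypothetical.
REQUIRED_FIELDS = {"eval_id", "benchmark_id", "model_ids",
                   "dataset_id", "run_metadata", "metrics", "status"}

def validate_submission(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - record.keys())]
    metrics = record.get("metrics", {})
    if not isinstance(metrics, dict) or not metrics:
        problems.append("metrics must be a non-empty mapping")
    else:
        problems += [f"metric {k} is not numeric"
                     for k, v in metrics.items()
                     if not isinstance(v, (int, float))]
    return problems
```

A record that passes returns `[]`; anything else is reported back on the issue instead of being committed.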
- **Your model stays private.** Weights, code, and training data never leave your machine; you only share metrics + metadata.
- **Zero manual steps.** GitHub Actions validates your `eval.yaml`, commits it to the repo, and rebuilds the leaderboard automatically.
- **Minutes, not days.** From submission to leaderboard appearance: ~2-3 minutes, not weeks of review.
- **AI4H compliant.** Follows ITU/WHO FG-AI4H DEL3 standards for local evaluation with standardized reporting.
## What stays private vs what is shared
| Item | Shared publicly? | Notes |
|---|---|---|
| Benchmark code | ✅ | This repository |
| Toy datasets | ✅ | `toy_data/` |
| Metrics + run metadata | ✅ | Submitted via `eval.yaml` |
| Model weights | ❌ | Never leave your machine |
| Model code | ❌ | Never leave your machine |
| Training data | ❌ | Never leave your machine |
This matches the AI4H DEL3 idea of local evaluation with standardized reporting. See AI4H Alignment.
## ⚠️ Data disclaimer
**This repo contains TOY DATA only.**
Full-scale datasets are NOT included; we provide small subsamples for pipeline testing.
- Toy data (included): 100–27,000 samples for validating your integration
- Full genomics data: Download from HuggingFace
- Brain imaging data: Requires institutional access (UK Biobank, HCP, etc.)
| Using Toy Data | Using Full Data |
|---|---|
| ✅ Verify pipeline works | ✅ Get publishable metrics |
| ⚠️ High variance metrics | ✅ Stable, reproducible scores |
| ⚠️ Not for publication | ⚠️ Requires external download |
See Data Sources for download links.
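The "high variance metrics" warning in the table above is easy to demonstrate: on toy-sized samples, even a coin-flip classifier's accuracy swings widely from run to run. A purely illustrative sketch (not part of `fmbench`):

```python
# Sketch: why toy-sized runs give noisy metrics. Score a coin-flip
# classifier on many resamples and measure the accuracy spread.
# Purely illustrative; not part of fmbench.
import random

def accuracy_spread(n_samples, n_trials=100, seed=0):
    """Max minus min accuracy of a random classifier across trials."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_trials):
        hits = sum(rng.random() < 0.5 for _ in range(n_samples))
        accs.append(hits / n_samples)
    return max(accs) - min(accs)

spread_small = accuracy_spread(100)     # toy-scale sample
spread_large = accuracy_spread(10_000)  # closer to full-scale
```

The spread at 100 samples is an order of magnitude larger than at 10,000 — which is why toy-data scores are for pipeline validation, not publication.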
## Start here (recommended workflow)
1. **Pick a suite**: start with `SUITE-TOY-CLASS` (toy fMRI-like classification).
2. **Wrap your model locally**: provide a small Python wrapper + a model config YAML.
3. **Run**: `fmbench run` (and optionally `fmbench run-robustness`).
4. **Inspect outputs**: read `report.md`, then submit `eval.yaml`.
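For step 2, a wrapper can be very small. The sketch below shows the general shape — the class name and `fit`/`predict_proba` surface are assumptions for illustration; check the Models docs for the adapter interface `fmbench` actually expects:

```python
# Sketch of a local model wrapper (step 2 above). The class name and the
# fit/predict_proba interface are hypothetical -- see the Models docs for
# the real fmbench adapter API. The nearest-centroid logic is a stand-in
# for your actual model.
import numpy as np

class MyModelWrapper:
    """Adapts a local binary classifier to a simple fit/predict_proba surface."""

    def fit(self, X, y):
        # Stand-in "training": remember the mean feature vector per class.
        X, y = np.asarray(X, float), np.asarray(y)
        self._mean = {c: X[y == c].mean(axis=0) for c in (0, 1)}
        return self

    def predict_proba(self, X):
        # Score by relative distance to each class mean:
        # closer to the class-1 centroid -> score nearer 1.
        X = np.asarray(X, float)
        d0 = np.linalg.norm(X - self._mean[0], axis=1)
        d1 = np.linalg.norm(X - self._mean[1], axis=1)
        return d0 / (d0 + d1 + 1e-12)
```

The model config YAML then just points the runner at this wrapper; your weights and code stay on your machine.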
For the full walkthrough, see Start Here / Researcher Workflow.
## Robustness testing
Test how your model handles noise, artifacts, and perturbations:
```bash
python -m fmbench run-robustness \
  --model configs/model_dummy_classifier.yaml \
  --data toy_data/neuro/robustness \
  --out results/robustness_eval
```
This produces rAUC (Reverse Area Under Curve) scores quantifying output stability under perturbations like channel dropout, Gaussian noise, and temporal shifts.
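The general idea behind such a score can be sketched as follows: evaluate a metric at increasing perturbation severities and summarize the degradation curve by its area. This is an illustrative sketch, not `fmbench`'s implementation — the exact rAUC definition may differ, and the function names and Gaussian-noise perturbation here are assumptions:

```python
# Sketch: summarize robustness as the area under a metric-vs-severity
# curve. fmbench's exact rAUC definition may differ; the perturbation
# (Gaussian noise) and the names here are illustrative.
import numpy as np

def robustness_curve(score_fn, X, y, severities, seed=0):
    """Evaluate score_fn on X perturbed with noise at each severity."""
    rng = np.random.default_rng(seed)
    return [score_fn(X + rng.normal(0.0, s, X.shape), y) for s in severities]

def curve_area(severities, scores):
    """Trapezoidal area under the curve, normalized to the severity range."""
    area = sum(0.5 * (scores[i] + scores[i - 1]) * (severities[i] - severities[i - 1])
               for i in range(1, len(scores)))
    return area / (severities[-1] - severities[0])
```

A model whose metric holds up under perturbation keeps the area close to the clean-data score; a brittle model's curve (and area) drops quickly.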
## Contributing
We welcome contributions! You can:
- Submit benchmark results: Submission Guide
- Propose new protocols: Open a Discussion
- Add model adapters: See Models
## Documentation map
- Leaderboards: Leaderboards
- Submit results: Submission Guide
- Models catalog: Models
- Data specifications: fMRI, sMRI, Genomics
- Protocols (recipes): CCA & permutation, Prediction baselines, Partial correlations
- Design / standards: AI4H alignment