# AI4H-Inspired Foundation Model Benchmarks
Standardized, AI4H-aligned benchmarks for genetics and brain imaging foundation models — designed to be runnable locally, with public, comparable results.
- Run a toy benchmark
- Submit my results (`eval.yaml`)
- View leaderboards
## What you can run right now
```bash
pip install -e .
python -m fmbench generate-toy-data
python -m fmbench run --suite SUITE-TOY-CLASS --model configs/model_dummy_classifier.yaml --out results/toy_run
```
You should get two concrete artifacts:
- `results/toy_run/report.md`: a human-readable report
- `results/toy_run/eval.yaml`: a machine-readable record for submission
Example `eval.yaml` (what you submit):

```yaml
eval_id: SUITE-TOY-CLASS-dummy_classifier-YYYYMMDD-HHMMSS
benchmark_id: BM-TOY-CLASS
model_ids:
  candidate: my_model_id
dataset_id: DS-TOY-FMRI-CLASS
run_metadata:
  runner: fmbench
  suite_id: SUITE-TOY-CLASS
metrics:
  AUROC: 0.82
  Accuracy: 0.76
status: Completed
```
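If you want to see how such a record maps onto plain Python, here is a minimal sketch that assembles the same fields as a dict. The helper name and timestamp format are illustrative — `fmbench` generates this file for you:

```python
# Sketch: assemble an eval.yaml-style record in Python. Field names follow
# the example above; build_eval_record is a hypothetical helper, since
# fmbench writes this file itself at the end of a run.
from datetime import datetime, timezone

def build_eval_record(model_id, metrics):
    """Build a dict matching the submission schema shown above."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return {
        "eval_id": f"SUITE-TOY-CLASS-{model_id}-{stamp}",
        "benchmark_id": "BM-TOY-CLASS",
        "model_ids": {"candidate": model_id},
        "dataset_id": "DS-TOY-FMRI-CLASS",
        "run_metadata": {"runner": "fmbench", "suite_id": "SUITE-TOY-CLASS"},
        "metrics": metrics,
        "status": "Completed",
    }

record = build_eval_record("my_model_id", {"AUROC": 0.82, "Accuracy": 0.76})
```

Serializing `record` with any YAML library reproduces the structure of the example above.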
## 🔄 How submissions work (fully automated)
Our leaderboard updates automatically when you submit results — no manual review delay.
```mermaid
flowchart LR
    subgraph local ["🖥️ Your Machine (Private)"]
        A["python -m fmbench run"] --> B["report.md"]
        A --> C["eval.yaml"]
    end
    subgraph github ["☁️ GitHub (Public)"]
        C --> D["Open Issue\n+ paste YAML"]
        D --> E["🤖 Bot extracts\n& validates"]
        E --> F["Auto-commit\nto evals/"]
        F --> G["🏆 Leaderboard\nrebuilds"]
        G --> H["📄 Docs deploy\nto GitHub Pages"]
    end
    style local fill:#e8f5e9,stroke:#4caf50
    style github fill:#e3f2fd,stroke:#2196f3
```
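The "validates" step in the pipeline above boils down to schema checks on the pasted YAML. A minimal sketch of what such a check might look like — the function name, required-field set, and messages are illustrative, not the actual bot code:

```python
# Sketch: the kind of validation a submission bot can run on a parsed
# eval.yaml payload. Field names follow the example record in this README;
# validate_submission and its messages are hypothetical.
REQUIRED_FIELDS = {"eval_id", "benchmark_id", "model_ids",
                   "dataset_id", "run_metadata", "metrics", "status"}

def validate_submission(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - record.keys())]
    metrics = record.get("metrics", {})
    if not isinstance(metrics, dict) or not metrics:
        problems.append("metrics must be a non-empty mapping")
    else:
        problems += [f"metric {k} is not numeric"
                     for k, v in metrics.items()
                     if not isinstance(v, (int, float))]
    return problems
```

A record that passes returns `[]`; anything else is reported back on the issue instead of being committed.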
- **Your model stays private.** Weights, code, and training data never leave your machine; you only share metrics + metadata.
- **Zero manual steps.** GitHub Actions validates your `eval.yaml`, commits it to the repo, and rebuilds the leaderboard automatically.
- **Minutes, not days.** From submission to leaderboard appearance: ~2-3 minutes, not weeks of review.
- **AI4H compliant.** Follows ITU/WHO FG-AI4H DEL3 standards for local evaluation with standardized reporting.
## What stays private vs what is shared
| Item | Shared publicly? | Notes |
|---|---|---|
| Benchmark code | ✅ | This repository |
| Toy datasets | ✅ | `toy_data/` |
| Metrics + run metadata | ✅ | Submitted via `eval.yaml` |
| Model weights | ❌ | Never leave your machine |
| Model code | ❌ | Never leave your machine |
| Training data | ❌ | Never leave your machine |
This matches the AI4H DEL3 idea of local evaluation with standardized reporting. See AI4H Alignment.
## ⚠️ Data disclaimer
**This repo contains TOY DATA only.**
Full-scale datasets are NOT included; we provide small subsamples for pipeline testing.
- Toy data (included): 100–27,000 samples for validating your integration
- Full genomics data: Download from HuggingFace
- Brain imaging data: Requires institutional access (UK Biobank, HCP, etc.)
| Using Toy Data | Using Full Data |
|---|---|
| ✅ Verify pipeline works | ✅ Get publishable metrics |
| ⚠️ High variance metrics | ✅ Stable, reproducible scores |
| ⚠️ Not for publication | ⚠️ Requires external download |
See Data Sources for download links.
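The "high variance metrics" warning in the table above is easy to demonstrate: on toy-sized samples, even a coin-flip classifier's accuracy swings widely from run to run. A purely illustrative sketch (not part of `fmbench`):

```python
# Sketch: why toy-sized runs give noisy metrics. Score a coin-flip
# classifier on many resamples and measure the accuracy spread.
# Purely illustrative; not part of fmbench.
import random

def accuracy_spread(n_samples, n_trials=100, seed=0):
    """Max minus min accuracy of a random classifier across trials."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_trials):
        hits = sum(rng.random() < 0.5 for _ in range(n_samples))
        accs.append(hits / n_samples)
    return max(accs) - min(accs)

spread_small = accuracy_spread(100)     # toy-scale sample
spread_large = accuracy_spread(10_000)  # closer to full-scale
```

The spread at 100 samples is an order of magnitude larger than at 10,000 — which is why toy-data scores are for pipeline validation, not publication.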
## Start here (recommended workflow)
1. **Pick a suite**: start with `SUITE-TOY-CLASS` (toy fMRI-like classification).
2. **Wrap your model locally**: provide a small Python wrapper + a model config YAML.
3. **Run**: `fmbench run` (and optionally `fmbench run-robustness`).
4. **Inspect outputs**: read `report.md`, then submit `eval.yaml`.
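For step 2, a wrapper can be very small. The sketch below shows the general shape — the class name and `fit`/`predict_proba` surface are assumptions for illustration; check the Models docs for the adapter interface `fmbench` actually expects:

```python
# Sketch of a local model wrapper (step 2 above). The class name and the
# fit/predict_proba interface are hypothetical -- see the Models docs for
# the real fmbench adapter API. The nearest-centroid logic is a stand-in
# for your actual model.
import numpy as np

class MyModelWrapper:
    """Adapts a local binary classifier to a simple fit/predict_proba surface."""

    def fit(self, X, y):
        # Stand-in "training": remember the mean feature vector per class.
        X, y = np.asarray(X, float), np.asarray(y)
        self._mean = {c: X[y == c].mean(axis=0) for c in (0, 1)}
        return self

    def predict_proba(self, X):
        # Score by relative distance to each class mean:
        # closer to the class-1 centroid -> score nearer 1.
        X = np.asarray(X, float)
        d0 = np.linalg.norm(X - self._mean[0], axis=1)
        d1 = np.linalg.norm(X - self._mean[1], axis=1)
        return d0 / (d0 + d1 + 1e-12)
```

The model config YAML then just points the runner at this wrapper; your weights and code stay on your machine.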
For the full walkthrough, see Start Here / Researcher Workflow.
## Robustness testing
Test how your model handles noise, artifacts, and perturbations:
```bash
python -m fmbench run-robustness \
  --model configs/model_dummy_classifier.yaml \
  --data toy_data/neuro/robustness \
  --out results/robustness_eval
```
This produces rAUC (Reverse Area Under Curve) scores quantifying output stability under perturbations like channel dropout, Gaussian noise, and temporal shifts.
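The general idea behind such a score can be sketched as follows: evaluate a metric at increasing perturbation severities and summarize the degradation curve by its area. This is an illustrative sketch, not `fmbench`'s implementation — the exact rAUC definition may differ, and the function names and Gaussian-noise perturbation here are assumptions:

```python
# Sketch: summarize robustness as the area under a metric-vs-severity
# curve. fmbench's exact rAUC definition may differ; the perturbation
# (Gaussian noise) and the names here are illustrative.
import numpy as np

def robustness_curve(score_fn, X, y, severities, seed=0):
    """Evaluate score_fn on X perturbed with noise at each severity."""
    rng = np.random.default_rng(seed)
    return [score_fn(X + rng.normal(0.0, s, X.shape), y) for s in severities]

def curve_area(severities, scores):
    """Trapezoidal area under the curve, normalized to the severity range."""
    area = sum(0.5 * (scores[i] + scores[i - 1]) * (severities[i] - severities[i - 1])
               for i in range(1, len(scores)))
    return area / (severities[-1] - severities[0])
```

A model whose metric holds up under perturbation keeps the area close to the clean-data score; a brittle model's curve (and area) drops quickly.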
## Contributing
We welcome contributions! You can:
- Submit benchmark results: Submission Guide
- Propose new protocols: Open a Discussion
- Add model adapters: See Models
## Documentation map
- Leaderboards: Leaderboards
- Submit results: Submission Guide
- Models catalog: Models
- Data specifications: fMRI, sMRI, Genomics
- Protocols (recipes): CCA & permutation, Prediction baselines, Partial correlations
- Design / standards: AI4H alignment