AI4H-Inspired Foundation Model Benchmarks

Standardized, AI4H-aligned benchmarks for genetics and brain imaging foundation models — designed to be runnable locally, with public, comparable results.

Run a toy benchmark · Submit my results (eval.yaml) · View leaderboards


What you can run right now

pip install -e .
python -m fmbench generate-toy-data
python -m fmbench run --suite SUITE-TOY-CLASS --model configs/model_dummy_classifier.yaml --out results/toy_run

You should get two concrete artifacts:

  • results/toy_run/report.md: a human-readable report
  • results/toy_run/eval.yaml: a machine-readable record for submission

Example eval.yaml (what you submit):

eval_id: SUITE-TOY-CLASS-dummy_classifier-YYYYMMDD-HHMMSS
benchmark_id: BM-TOY-CLASS
model_ids:
  candidate: my_model_id
dataset_id: DS-TOY-FMRI-CLASS
run_metadata:
  runner: fmbench
  suite_id: SUITE-TOY-CLASS
metrics:
  AUROC: 0.82
  Accuracy: 0.76
status: Completed
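Before pasting the YAML into an issue, you can sanity-check the parsed document locally. A minimal sketch; the required-field list here is an assumption inferred from the example above, not the submission bot's actual schema:

```python
# Hypothetical pre-submission check; the real validator runs in the
# repo's GitHub Actions workflow and may enforce a stricter schema.
REQUIRED_KEYS = {"eval_id", "benchmark_id", "model_ids",
                 "dataset_id", "run_metadata", "metrics", "status"}

def check_eval(doc: dict) -> list[str]:
    """Return a list of problems found in a parsed eval.yaml (empty = OK)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - doc.keys())]
    # Metrics should map metric names to plain numbers.
    for name, value in doc.get("metrics", {}).items():
        if not isinstance(value, (int, float)):
            problems.append(f"metric {name!r} is not numeric")
    return problems
```

Load your `eval.yaml` with any YAML parser (e.g. PyYAML's `yaml.safe_load`) and pass the resulting dict to `check_eval` before submitting.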

🔄 How submissions work (fully automated)

Our leaderboard updates automatically when you submit results — no manual review delay.

flowchart LR
    subgraph local ["🖥️ Your Machine (Private)"]
        A["python -m fmbench run"] --> B["report.md"]
        A --> C["eval.yaml"]
    end

    subgraph github ["☁️ GitHub (Public)"]
        C --> D["Open Issue\n+ paste YAML"]
        D --> E["🤖 Bot extracts\n& validates"]
        E --> F["Auto-commit\nto evals/"]
        F --> G["🏆 Leaderboard\nrebuilds"]
        G --> H["📄 Docs deploy\nto GitHub Pages"]
    end

    style local fill:#e8f5e9,stroke:#4caf50
    style github fill:#e3f2fd,stroke:#2196f3

  • Your model stays private: weights, code, and training data never leave your machine. You share only metrics and metadata.
  • Zero manual steps: GitHub Actions validates your eval.yaml, commits it to the repo, and rebuilds the leaderboard automatically.
  • Minutes, not days: from submission to leaderboard appearance takes roughly 2–3 minutes, not weeks of review.
  • AI4H compliant: follows ITU/WHO FG-AI4H DEL3 standards for local evaluation with standardized reporting.

Submit your results →


What stays private vs what is shared

| Item | Shared publicly? | Notes |
| --- | --- | --- |
| Benchmark code | Yes | This repository |
| Toy datasets | Yes | toy_data/ |
| Metrics + run metadata | Yes | Submitted via eval.yaml |
| Model weights | No | Never leave your machine |
| Model code | No | Never leaves your machine |
| Training data | No | Never leaves your machine |

This matches the AI4H DEL3 idea of local evaluation with standardized reporting. See AI4H Alignment.


⚠️ Data disclaimer

This repo contains TOY DATA only

Full-scale datasets are NOT included. We provide small subsamples for pipeline testing.

  • Toy data (included): 100–27,000 samples for validating your integration
  • Full genomics data: Download from HuggingFace
  • Brain imaging data: Requires institutional access (UK Biobank, HCP, etc.)

| Using Toy Data | Using Full Data |
| --- | --- |
| ✅ Verify the pipeline works | ✅ Get publishable metrics |
| ⚠️ High-variance metrics | ✅ Stable, reproducible scores |
| ⚠️ Not for publication | ⚠️ Requires external download |

See Data Sources for download links.


Getting started

  1. Pick a suite: start with SUITE-TOY-CLASS (toy fMRI-like classification).
  2. Wrap your model locally: provide a small Python wrapper plus a model config YAML.
  3. Run: fmbench run (and optionally fmbench run-robustness).
  4. Inspect the outputs: read report.md, then submit eval.yaml.
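The wrapper in step 2 might look something like the following. The exact interface fmbench expects is not documented on this page, so treat the class and method names as placeholders to adapt:

```python
import numpy as np

class MyModelWrapper:
    """Hypothetical adapter between a private model and the benchmark runner.

    The real interface fmbench expects may differ; this only sketches the
    shape: load weights once, then map a batch of inputs to class
    probabilities. Nothing here is uploaded anywhere.
    """

    def __init__(self, checkpoint_path: str):
        # Load your private weights here; they never leave your machine.
        self.checkpoint_path = checkpoint_path

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        # Replace with a real forward pass; this stub returns uniform
        # probabilities over two classes for every sample.
        n = X.shape[0]
        return np.full((n, 2), 0.5)
```

A model config YAML would then point the runner at this wrapper (module path, class name, checkpoint path), mirroring configs/model_dummy_classifier.yaml.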

Start Here / Researcher Workflow


Robustness testing

Test how your model handles noise, artifacts, and perturbations:

python -m fmbench run-robustness \
    --model configs/model_dummy_classifier.yaml \
    --data toy_data/neuro/robustness \
    --out results/robustness_eval

This produces rAUC (Reverse Area Under Curve) scores that quantify how stable the model's outputs remain under perturbations such as channel dropout, Gaussian noise, and temporal shifts.
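The page does not spell out the rAUC formula; one common reading is the normalized area under the metric-vs-perturbation-strength curve, which you could sanity-check yourself along these lines (an assumed definition for illustration, not necessarily what fmbench implements):

```python
import numpy as np

def rauc(severities, scores):
    """Normalized area under the score-vs-perturbation-strength curve.

    scores[i] is a metric (e.g. AUROC) measured at perturbation strength
    severities[i]. A perfectly robust model keeps its score flat and gets
    1.0; a model that degrades quickly gets a lower value.
    """
    s = np.asarray(severities, dtype=float)
    y = np.asarray(scores, dtype=float)
    widths = np.diff(s)                           # trapezoidal rule
    area = np.sum(widths * (y[:-1] + y[1:]) / 2.0)
    return float(area / (s[-1] - s[0]))           # normalize by severity span
```

For example, a score that stays at 1.0 across all severities yields rAUC = 1.0, while a linear drop from 1.0 to 0.0 yields 0.5.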


Contributing

We welcome contributions! You can:


Documentation map