# FM Benchmark Submission Guide
Welcome! This guide explains how to submit your Foundation Model's evaluation results to the leaderboard.
## 🔄 How it works (fully automated)
Our submission pipeline is fully automated — your results appear on the leaderboard within minutes, not days.
```mermaid
flowchart TB
    subgraph local ["🖥️ Your Machine"]
        A["1. Run fmbench locally"] --> B["2. Get eval.yaml"]
    end
    subgraph submit ["📤 Submit"]
        B --> C["3. Open GitHub Issue"]
        C --> D["4. Paste eval.yaml"]
    end
    subgraph auto ["🤖 Automated (GitHub Actions)"]
        D --> E["5. Bot validates YAML"]
        E --> F["6. Commits to evals/"]
        F --> G["7. Rebuilds leaderboard"]
        G --> H["8. Deploys to Pages"]
    end
    H --> I["🎉 Live on leaderboard!"]
    style local fill:#e8f5e9,stroke:#4caf50
    style submit fill:#fff3e0,stroke:#ff9800
    style auto fill:#e3f2fd,stroke:#2196f3
```
**No manual review required.** The bot validates your submission automatically. If the YAML is valid, it's added immediately; if there's an error, you'll get feedback in the issue comments.
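As a rough illustration of what that validation step checks, the sketch below verifies that an `eval.yaml` contains the required top-level fields listed under "What to submit" later in this guide. The actual GitHub Action may enforce additional rules, so treat this as an approximation rather than the bot's real code.

```python
# check_eval.py -- illustrative sketch of the bot's validation step.
# The required-field list mirrors this guide; the real bot may check more.
import sys
import yaml

REQUIRED_FIELDS = ["eval_id", "benchmark_id", "model_ids",
                   "dataset_id", "run_metadata", "metrics", "status"]

def validate(path):
    """Return a list of problems; an empty list means the file looks valid."""
    with open(path) as f:
        record = yaml.safe_load(f)
    problems = [f"missing field: {field}"
                for field in REQUIRED_FIELDS if field not in record]
    if "model_ids" in record and "candidate" not in record["model_ids"]:
        problems.append("model_ids must contain a 'candidate' entry")
    return problems

if __name__ == "__main__":
    issues = validate(sys.argv[1])
    print("\n".join(issues) or "eval.yaml looks valid")
```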
## Quick Start
### Option 1: Use the CLI + GitHub Issue (Recommended)
```bash
# 1. Clone and install
git clone https://github.com/allison-eunse/ai4h-inspired-fm-benchmark-hub.git
cd ai4h-inspired-fm-benchmark-hub
pip install -e .

# 2. Generate toy data for testing
python -m fmbench generate-toy-data

# 3. Run your model
python -m fmbench run \
  --suite SUITE-TOY-CLASS \
  --model path/to/your_model_config.yaml \
  --out results/my_model_run

# 4. Submit via GitHub Issue (attach eval.yaml)
```
Submit here: **Open a pre-filled submission issue**
### Option 2: Manual YAML Submission (Pull Request)
Create evaluation YAML files manually and submit via PR.
## Step-by-Step Guide

### 1. Create Your Model Configuration
Create a YAML file in `configs/` describing your model:

```yaml
# configs/model_my_awesome_fm.yaml
model_id: my_awesome_fm
name: "My Awesome Foundation Model"
version: "1.0.0"

# How to load the model
type: python_class
import_path: "my_package.models:MyModel"
init_kwargs:
  checkpoint: "path/to/checkpoint.pt"

# Metadata
modality: ["fMRI", "EEG"]  # or "genomics", "MRI"
paper: "https://arxiv.org/abs/xxxx.xxxxx"
code: "https://github.com/my-org/my-model"
license: "MIT"
```
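Here `import_path` follows the `module:Class` convention, and `init_kwargs` are passed to the class constructor. The sketch below shows what a matching class might look like; the `predict` interface is an assumption for illustration, so consult the fmbench documentation for the exact methods the harness calls on the loaded object.

```python
# my_package/models.py -- hypothetical wrapper matching the config above.
# The predict() signature is an assumed interface, not the documented API.
import numpy as np

class MyModel:
    def __init__(self, checkpoint):
        # init_kwargs from the YAML arrive here, e.g. checkpoint="path/to/checkpoint.pt"
        self.checkpoint = checkpoint
        # ... load weights from self.checkpoint ...

    def predict(self, inputs):
        """Return class probabilities for a batch of inputs."""
        # Placeholder forward pass; a real model would run inference here.
        n = np.asarray(inputs).shape[0]
        return np.full((n, 2), 0.5)
```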
### 2. Run Evaluation

#### Classification Task
```bash
python -m fmbench run \
  --suite SUITE-NEURO-CLASS-001 \
  --model configs/model_my_awesome_fm.yaml \
  --out results/my_fm_neuro
```
#### Robustness Evaluation
```bash
python -m fmbench run-robustness \
  --model configs/model_my_awesome_fm.yaml \
  --data toy_data/neuro/robustness \
  --out results/my_fm_robustness
```
### 3. Review Your Results
Check the generated files:
```text
results/my_fm_neuro/
├── eval.yaml    # Evaluation record (submit this!)
└── report.md    # Human-readable report
```
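It's worth loading the record and eyeballing the numbers before you submit, for example:

```python
# Quick sanity check of the generated record before submitting it.
import yaml

with open("results/my_fm_neuro/eval.yaml") as f:
    record = yaml.safe_load(f)

print(record["eval_id"], record["status"])
for name, value in record["metrics"].items():
    # Top-level metrics are numbers; stratified metrics are nested dicts.
    if isinstance(value, (int, float)):
        print(f"{name}: {value:.3f}")
```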
#### What to submit (minimum required)

To get onto the leaderboard, you only need to submit one file: your `eval.yaml`.

Minimum required fields (anything else is optional but encouraged):
- `eval_id`: unique run identifier
- `benchmark_id`: must match an existing benchmark in `benchmarks/`
- `model_ids.candidate`: your model ID (string)
- `dataset_id`: dataset identifier used for the run
- `run_metadata`: at least `runner`, plus hardware/runtime if available
- `metrics`: task-appropriate metrics (e.g., `AUROC`, `Accuracy`, robustness rAUCs)
- `status`: `Completed` (or explain failures)
Minimal example:

```yaml
eval_id: SUITE-TOY-CLASS-my_model-2025-12-18-120000
benchmark_id: BM-TOY-CLASS
model_ids:
  candidate: my_model_id
dataset_id: DS-TOY-FMRI-CLASS
run_metadata:
  date: "2025-12-18"
  runner: fmbench
  suite_id: SUITE-TOY-CLASS
  hardware: "1x A100 40GB"
  runtime_seconds: 123.4
metrics:
  AUROC: 0.82
  Accuracy: 0.76
status: Completed
```
What a “good” submission includes (recommended):

- Exact command + config used (`suite_id`, `model_config`, `output_dir`)
- Robustness metrics when applicable (`dropout_rAUC`, `noise_rAUC`, etc.)
- Stratified metrics under `metrics.stratified` (site/sex/age_group/etc.)
- Optional: attach `report.md` for quick human review
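Putting those recommendations together, the extra fields might look like this (values are hypothetical; the field names come from the list above):

```yaml
run_metadata:
  suite_id: SUITE-NEURO-CLASS-001
  model_config: configs/model_my_awesome_fm.yaml
  output_dir: results/my_fm_neuro
metrics:
  AUROC: 0.92
  dropout_rAUC: 0.82
  noise_rAUC: 0.75
  stratified:
    site:
      Site_A: {AUROC: 0.92, N: 100}
```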
### 4. Submit via Pull Request

1. Fork the repository
2. Add your files:
   - `configs/model_your_model.yaml` - Model configuration
   - `evals/YOUR_EVAL_ID.yaml` - Evaluation results
   - (Optional) `models/your_model.yaml` - Detailed model metadata
3. Create a PR with the title: `[Submission] Your Model Name - Benchmark Name`
## Evaluation YAML Format

Your `evals/*.yaml` file should follow this format:
```yaml
# evals/SUITE-NEURO-CLASS-my_fm-20240101-120000.yaml
eval_id: SUITE-NEURO-CLASS-my_fm-20240101-120000
benchmark_id: BM-FMRI-GRANULAR  # Must match existing benchmark
model_ids:
  candidate: my_awesome_fm
dataset_id: DS-TOY-FMRI-CLASS  # Or your dataset
run_metadata:
  date: "2024-01-01"
  runner: "fmbench"
  hardware: "NVIDIA A100 80GB"
  runtime_seconds: 3600

# Overall metrics
metrics:
  AUROC: 0.92
  Accuracy: 0.88
  F1-Score: 0.87

  # Stratified metrics (for granular leaderboards)
  stratified:
    scanner:
      Siemens:
        AUROC: 0.94
        Accuracy: 0.90
        N: 150
      GE:
        AUROC: 0.89
        Accuracy: 0.85
        N: 120
      Philips:
        AUROC: 0.91
        Accuracy: 0.87
        N: 100
    acquisition_type:
      resting_state:
        AUROC: 0.93
        Accuracy: 0.89
        N: 200
      task_based:
        AUROC: 0.90
        Accuracy: 0.86
        N: 170
    site:
      Site_A:
        AUROC: 0.92
        N: 100
      Site_B:
        AUROC: 0.91
        N: 120
status: Completed
```
## Stratification Categories
To appear in granular sub-leaderboards, include these stratified metrics:
### For fMRI/Neuro

| Category | Values |
|---|---|
| `scanner` | `Siemens`, `GE`, `Philips` |
| `acquisition_type` | `resting_state`, `task_based`, `naturalistic` |
| `field_strength` | `1.5T`, `3T`, `7T` |
| `preprocessing` | `fmriprep`, `hcp`, `conn`, `minimal` |
| `site` | Your site names |
### For Genomics

| Category | Values |
|---|---|
| `sequencing_platform` | `Illumina`, `PacBio`, `ONT` |
| `cell_type` | `T-cell`, `B-cell`, `Monocyte`, etc. |
| `tissue` | `Blood`, `Brain`, `Liver`, etc. |
### For All Benchmarks

| Category | Values |
|---|---|
| `sex` | `M`, `F` |
| `age_group` | `age_20-40`, `age_40-60`, `age_60-80`, `age_80-100` |
| `condition` | `Healthy`, `Clinical`, etc. |
| `ethnicity` | Your categories |
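If you compute stratified metrics yourself rather than through the CLI, the pattern is simply grouping predictions by a metadata column and scoring each group. A minimal sketch with pandas and scikit-learn (the `label`/`score` column names are hypothetical):

```python
# Per-group AUROC for one stratification column (e.g., "scanner").
# Assumes a dataframe with true labels, model scores, and metadata columns;
# the column names are illustrative, not fmbench requirements.
import pandas as pd
from sklearn.metrics import roc_auc_score

def stratified_auroc(df: pd.DataFrame, by: str) -> dict:
    out = {}
    for group, sub in df.groupby(by):
        out[group] = {
            "AUROC": round(roc_auc_score(sub["label"], sub["score"]), 3),
            "N": len(sub),
        }
    return out

# e.g. stratified_auroc(df, by="scanner")
# -> {"Siemens": {"AUROC": 0.94, "N": 150}, "GE": {...}, ...}
```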
## Report Quality Metrics (for Generation Tasks)
If your model generates reports/text, include these metrics:
```yaml
metrics:
  # Linguistic
  bleu: 42.5
  rouge_l: 0.68
  bertscore: 0.89

  # Clinical
  clinical_accuracy: 0.92
  finding_recall: 0.88
  finding_precision: 0.91
  hallucination_rate: 0.05  # Lower is better
  omission_rate: 0.08       # Lower is better

  # Aggregate
  report_quality_score: 0.87
```
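One way to produce the linguistic metrics is with the widely used `sacrebleu` and `rouge-score` packages; this is a sketch of standard tooling, not necessarily what fmbench runs internally:

```python
# Compute BLEU and ROUGE-L for generated reports (standard tooling sketch).
import sacrebleu
from rouge_score import rouge_scorer

hypotheses = ["no acute intracranial abnormality"]
references = ["no acute intracranial abnormality identified"]

# sacrebleu reports BLEU on a 0-100 scale, matching `bleu: 42.5` above.
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(references[0], hypotheses[0])["rougeL"].fmeasure

print(f"bleu: {bleu:.1f}, rouge_l: {rouge_l:.2f}")
```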
## Robustness Metrics
For robustness evaluations, include:
```yaml
metrics:
  robustness_score: 0.78
  dropout_rAUC: 0.82
  noise_rAUC: 0.75
  line_noise_rAUC: 0.80
  perm_equivariance: 0.85
  shift_sensitivity: 0.79

  # For genomics
  masking_rAUC: 0.77
  expression_rAUC: 0.81
```
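For intuition, an rAUC-style score is typically the area under a performance-versus-perturbation-severity curve, normalized so that 1.0 means no degradation. The sketch below uses that common definition; fmbench's exact formula may differ, so check its documentation before comparing numbers.

```python
# Relative AUC under a perturbation sweep: trapezoidal area under the
# (severity, score) curve, normalized by the clean (severity-0) score.
# This is one common definition, assumed here for illustration.
def relative_auc(severities, scores):
    area = sum(
        (s1 - s0) * (y0 + y1) / 2
        for s0, s1, y0, y1 in zip(severities, severities[1:], scores, scores[1:])
    )
    span = severities[-1] - severities[0]
    return area / (span * scores[0])  # 1.0 == no degradation at any severity

# AUROC measured at increasing noise levels:
print(relative_auc([0.0, 0.25, 0.5, 1.0], [0.92, 0.88, 0.80, 0.65]))  # ~0.87
```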
## PR Checklist
Before submitting, ensure:
- [ ] Model config YAML is valid
- [ ] Eval YAML has correct `benchmark_id`
- [ ] Metrics include both overall and stratified results
- [ ] All required metadata is present
- [ ] Results are reproducible (include seed if applicable)
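A quick local parse catches syntax errors in both YAML files before you open the PR (assuming PyYAML is available):

```python
# Parse both files locally to catch YAML syntax errors before opening the PR.
import yaml

for path in ["configs/model_your_model.yaml", "evals/YOUR_EVAL_ID.yaml"]:
    with open(path) as f:
        yaml.safe_load(f)
    print(f"{path}: OK")
```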
## Questions?

- Open an Issue
- Check existing Discussions
Thank you for contributing to open FM benchmarking! 🎉