🏆 Foundation Model Leaderboards
Benchmark Hub Overview
📊 7 Benchmarks | 🤖 12 Models | 📋 35 Evaluations
What is this? This page ranks AI models for healthcare applications. Higher-ranked models perform better on standardized tests.
How to read it: Each table lists models from best (🥇) to lowest-ranked. Click "How are scores calculated?" under each table for details on what the numbers mean.
Example: what a real submission looks like
This is a real, end-to-end run using the built-in baseline model. Your submission should look the same: a local run that produces `report.md` and `eval.yaml`.
| Model ID | Suite / Benchmark | Task | AUROC | dropout rAUC | noise rAUC |
|---|---|---|---|---|---|
| dummy_classifier | SUITE-TOY-CLASS / BM-TOY-CLASS | Toy fMRI-like classification | 0.5597 | 0.7760 | 0.7867 |
Artifacts: Example classification eval.yaml · Example classification report.md · Example robustness eval.yaml · Example robustness report.md
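If you want to see the shape of those two artifacts without downloading the examples, here is a hedged sketch that writes a minimal `eval.yaml` and `report.md` for the baseline run shown in the table. The field names are hypothetical; the real schema is defined by the benchmark toolkit.

```python
import tempfile
from pathlib import Path

out = Path(tempfile.mkdtemp())  # stand-in for your local run directory

# Hypothetical schema: the keys below simply mirror the columns of the
# example table above, not the toolkit's actual format.
eval_yaml = """\
model_id: dummy_classifier
suite: SUITE-TOY-CLASS
benchmark: BM-TOY-CLASS
task: Toy fMRI-like classification
metrics:
  auroc: 0.5597
  dropout_rauc: 0.7760
  noise_rauc: 0.7867
"""

report_md = """\
# dummy_classifier on BM-TOY-CLASS

- Task: Toy fMRI-like classification
- AUROC: 0.5597
- dropout rAUC: 0.7760
- noise rAUC: 0.7867
"""

(out / "eval.yaml").write_text(eval_yaml)
(out / "report.md").write_text(report_md)
```

A real submission would produce these files from an actual evaluation run rather than hard-coded numbers.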
📊 Metric Cheat Sheet
Use this as a general reference for the metrics that appear on the leaderboards.
Area Under ROC Curve (AUROC)
- What it measures: How well the model can tell apart different categories (e.g., healthy vs. diseased)
- Typical range: 0.5 (random guessing) → 1.0 (perfect separation)
- Example: An AUROC of 0.85 means the model correctly ranks a positive case higher than a negative case 85% of the time.
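That pairwise-ranking interpretation can be checked directly. A minimal pure-Python sketch (illustrative only, not the evaluation code used for these leaderboards):

```python
def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive is
    scored above a randomly chosen negative; ties count as half a win.
    O(n^2) pairwise version, fine for small demos."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of the 4 positive/negative pairs are ranked correctly -> 0.75
print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```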
Accuracy
- What it measures: The percentage of predictions the model got right
- Typical range: 0.0 (all wrong) → 1.0 (all correct)
- Example: An accuracy of 0.92 means the model correctly classified 92 out of every 100 samples.
F1 Score
- What it measures: A balanced measure that considers both false alarms and missed cases
- Typical range: 0.0 (poor) → 1.0 (perfect balance of precision and recall)
- Example: An F1 of 0.85 indicates the model has a good balance between catching real cases and avoiding false alarms.
Correlation
- What it measures: How closely the model's predictions match the actual values
- Typical range: -1.0 (perfect inverse) → 0 (no relationship) → 1.0 (perfect match)
- Example: A correlation of 0.78 means the model's outputs track reasonably well with the true values.
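"Correlation" here is assumed to mean Pearson's r: covariance normalized by both standard deviations. A minimal sketch:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))  # close to 1.0
```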
Robustness Score
- What it measures: How stable and reliable the model is when data quality isn't perfect
- Typical range: 0.0 (performance collapses with any noise) → 1.0 (completely stable)
- Example: A robustness score of 0.82 means the model maintains most of its accuracy even when data has noise or missing values.
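The example table at the top of this page also reports `dropout rAUC` and `noise rAUC`. Their exact formulas aren't spelled out here, so as a hedged illustration only, assume a robustness-style score is the fraction of clean performance retained under a perturbation:

```python
def retained_fraction(clean_score, perturbed_score):
    """Hypothetical robustness score: fraction of clean performance kept
    when the input is perturbed (dropout, noise), clipped to [0, 1].
    This is one common convention, not necessarily the toolkit's."""
    return max(0.0, min(1.0, perturbed_score / clean_score))

# AUROC drops from 0.80 clean to 0.66 under noise
print(retained_fraction(0.80, 0.66))  # close to 0.825
```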
Report Quality Score
- What it measures: An overall measure of how good the AI-generated medical reports are
- Typical range: 0.0 (poor quality) → 1.0 (excellent quality)
- Example: A score of 0.85 indicates the model generates reports that are mostly accurate, complete, and well-structured.
Clinical Accuracy
- What it measures: Are the medical findings in the generated report actually correct?
- Typical range: 0.0 (all findings wrong) → 1.0 (all findings correct)
- Example: A clinical accuracy of 0.92 means 92% of the medical findings in the report are verified as correct.
Hallucination Rate
- What it measures: How often the AI makes up information that isn't supported by the input data
- Typical range: 0.0 (no hallucinations, the ideal) → 1.0 (everything is made up)
- Example: A hallucination rate of 0.05 means only 5% of generated content is unsupported by the input, which is quite good.
BERTScore
- What it measures: How similar the generated text is to the reference text in meaning (not just exact words)
- Typical range: 0.0 (completely different meaning) → 1.0 (semantically identical)
- Example: A BERTScore of 0.87 indicates the generated report conveys very similar clinical meaning to the expert reference.
🧭 Jump To
🧬 Genomics
🎯 Classification
DNA Enhancer Classification
*Benchmark for classifying DNA sequences as enhancers or non-enhancers. Enhancers are distal regulatory elements that activate gene expression. Accurate enhancer prediction is critical for understanding gene regulation and identifying disease-associated variants.*
Podium (top 3): 🥇 HyenaDNA (0.788) · 🥈 Caduceus (0.745) · 🥉 Evo 2 (0.745)
6 models ranked by AUROC:
| Rank | Model | Score | Level | Details |
|---|---|---|---|---|
| 🥇 | HyenaDNA 🏆 | 0.7883 | 🔶 Fair | DS-DNA-ENHANCER, 2025-12-18T21:03:03.285801 |
| 🥈 | Caduceus | 0.7453 | 🔶 Fair | Human Enhancers (Coh, 2025-12-19T12:00:12.636691 |
| 🥉 | Evo 2 | 0.7453 | 🔶 Fair | Human Enhancers (Coh, 2025-12-19T12:00:13.160707 |
| 4 | DNABERT-2 | 0.7365 | 🔶 Fair | Human Enhancers (Coh, 2025-12-18T18:44:24.678525 |
| 5 | HyenaDNA | 0.7365 | 🔶 Fair | Human Enhancers (Coh, 2025-12-18T18:44:17.006557 |
| 6 | kmer_k6 | 0.7365 | 🔶 Fair | Human Enhancers (Coh, 2025-12-18T18:44:08.075706 |
Quick Comparison
🥇 HyenaDNA leads with AUROC = 0.7883
- Gap to 🥈 Caduceus: +0.0430
- Score spread (best to worst): 0.0518
📊 How are scores calculated for this benchmark? (click to expand)
📋 What this leaderboard measures
- Benchmark: `BM-DNA-ENHANCER` → DNA Enhancer Classification
- Domain: Genetics, Genomics - Regulatory Element Prediction
- Task type: Classification
- Datasets used in the table above:
  - `DS-DNA-ENHANCER` → DS-DNA-ENHANCER
  - `DS-DNA-ENHANCERS-COHN` → Human Enhancers (Cohn et al.)
- Typical sample size in these runs: ~6250 samples (train + test combined)
- Primary ranking metric: `AUROC` (the score column in the table)
🎯 Primary metric for this leaderboard
- Metric: `AUROC`
- What it measures: How well the model can tell apart different categories (e.g., healthy vs. diseased)
- Typical range: 0.5 (random guessing) → 1.0 (perfect separation)
📖 For a full explanation of this and other metrics, see the Metric Cheat Sheet near the top of this page.
🧩 How This Metric Fits This Task
Different tasks emphasize different aspects of performance.
Here's how this metric should be interpreted for this benchmark:
For classification tasks (e.g., disease vs. no disease), this metric helps you understand how reliably the model separates different outcome groups.
💡 Tip: In addition to raw accuracy, look at metrics like AUROC and F1 Score, especially when classes are imbalanced (when positive cases are rare).
📊 Performance Tiers
What Do the Scores Mean?
We group models into performance tiers to help you quickly understand how ready they are for different uses.
| Score Range | Rating | Interpretation | Suitable For |
|---|---|---|---|
| ≥ 0.90 | ⭐ Excellent | Top-tier, consistently reliable | Clinical pilots (with oversight) |
| 0.80 – 0.89 | ✅ Good | Strong performance, real promise | Validation studies |
| 0.70 – 0.79 | 🔶 Fair | Moderate, has limitations | Research only |
| < 0.70 | 🔻 Developing | Needs improvement | Early research |
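The tier lookup is a simple threshold mapping. A sketch matching the table above (plain labels, emoji omitted):

```python
def tier(score):
    """Map a primary-metric score to the performance tier in the table."""
    if score >= 0.90:
        return "Excellent"
    if score >= 0.80:
        return "Good"
    if score >= 0.70:
        return "Fair"
    return "Developing"

print(tier(0.8720), tier(0.7883))  # Good Fair
```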
Important Context
These thresholds are general guidelines.
The acceptable score depends on:
- The specific clinical application
- Risk level of the use case
- Whether AI assists or replaces human judgment
Always consult domain experts when evaluating fitness for a particular use case.
🏅 How We Determine Rankings
Models are ranked following these principles:
1️⃣ Primary metric determines rank
The model with the highest score in the main metric ranks first.
For metrics where lower is better (like error rates), the lowest score wins.
2️⃣ Ties are broken by secondary metrics
If two models have identical primary scores, we look at other relevant metrics.
3️⃣ Best run per model
If a model was evaluated multiple times (e.g., with different settings), only its best result appears on the leaderboard.
4️⃣ Reproducibility required
All results must be reproducible. We record:
- Evaluation date
- Dataset used
- Configuration details
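Rules 1 and 3 above can be sketched in a few lines. Field names like `model` and `score` are illustrative, not the toolkit's actual schema, and tie-breaking by secondary metrics (rule 2) is omitted for brevity:

```python
def build_leaderboard(runs, lower_is_better=False):
    """Keep each model's best run, then rank by the primary metric."""
    best = {}
    for run in runs:
        m = run["model"]
        if m not in best:
            best[m] = run
        elif (run["score"] < best[m]["score"]) == lower_is_better:
            best[m] = run  # this run beats the model's previous best
    return sorted(best.values(), key=lambda r: r["score"],
                  reverse=not lower_is_better)

runs = [
    {"model": "HyenaDNA", "score": 0.7365},
    {"model": "HyenaDNA", "score": 0.7883},  # only the best run survives
    {"model": "Caduceus", "score": 0.7453},
]
print([r["model"] for r in build_leaderboard(runs)])  # ['HyenaDNA', 'Caduceus']
```

For error-rate-style metrics, passing `lower_is_better=True` flips both the best-run selection and the sort order.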
🏥 Why This Matters for Healthcare AI
Healthcare AI has higher stakes than many other AI applications.
A model that works 95% of the time might sound good, but that 5% could mean missed diagnoses or incorrect treatments.
That's why we:
✅ Use multiple metrics to capture different aspects of performance
✅ Test robustness to real-world data quality issues
✅ Require transparency about evaluation conditions
✅ Follow international standards for healthcare AI assessment
🌐 Standards Alignment
This benchmark follows the ITU/WHO Focus Group on AI for Health (FG-AI4H) framework.
This ensures our evaluations are:
| Quality | What it means |
|---|---|
| Rigorous | Following established scientific methodology |
| Comparable | Using standardized metrics across models |
| Trustworthy | Aligned with WHO/ITU recommendations |
DNA Promoter Classification
*Benchmark for classifying DNA sequences as promoters or non-promoters. Promoters are regulatory regions at transcription start sites (TSS). This benchmark focuses on non-TATA promoters, which lack the canonical TATA box and represent ~75% of human promoters.*
Podium (top 3): 🥇 HyenaDNA (0.872) · 🥈 Evo 2 (0.859) · 🥉 Caduceus (0.859)
6 models ranked by AUROC:
| Rank | Model | Score | Level | Details |
|---|---|---|---|---|
| 🥇 | HyenaDNA 🏆 | 0.8720 | ✅ Good | DS-DNA-PROMOTER, 2025-12-18T21:03:12.030852 |
| 🥈 | Evo 2 | 0.8594 | ✅ Good | Human Non-TATA Promo, 2025-12-19T12:00:13.671201 |
| 🥉 | Caduceus | 0.8594 | ✅ Good | Human Non-TATA Promo, 2025-12-19T12:00:12.829913 |
| 4 | DNABERT-2 | 0.8357 | ✅ Good | Human Non-TATA Promo, 2025-12-18T18:44:27.391206 |
| 5 | kmer_k6 | 0.8357 | ✅ Good | Human Non-TATA Promo, 2025-12-18T18:44:10.847321 |
| 6 | HyenaDNA | 0.8357 | ✅ Good | Human Non-TATA Promo, 2025-12-18T18:44:19.651418 |
Quick Comparison
🥇 HyenaDNA leads with AUROC = 0.8720
- Gap to 🥈 Evo 2: +0.0126
- Score spread (best to worst): 0.0363
📊 How are scores calculated for this benchmark? (click to expand)
📋 What this leaderboard measures
- Benchmark: `BM-DNA-PROMOTER` → DNA Promoter Classification
- Domain: Genetics, Genomics - Promoter Prediction
- Task type: Classification
- Datasets used in the table above:
  - `DS-DNA-PROMOTER` → DS-DNA-PROMOTER
  - `DS-DNA-PROMOTERS-NONTATA` → Human Non-TATA Promoters (EPD)
- Typical sample size in these runs: ~6250 samples (train + test combined)
- Primary ranking metric: `AUROC` (the score column in the table)
🧩 The primary-metric explanation, performance tiers, ranking rules, and standards-alignment notes for this benchmark are identical to those listed under DNA Enhancer Classification above.
Cell Type Annotation
Predicting cell types from single-cell RNA-seq data.
2 models ranked by AUROC:
| Rank | Model | Score | Level | Details |
|---|---|---|---|---|
| 🥇 | Baseline (Random/Majority) 🏆 | 0.0000 | 🔻 Developing | PBMC 3k (processed, , 2025-12-18 |
| 🥈 | geneformer | 0.0000 | 🔻 Developing | PBMC 3k (processed, , 2025-12-18 |
Quick Comparison
🥇 Baseline (Random/Majority) ranks first, tied with geneformer at AUROC = 0.0000
- Gap to 🥈 geneformer: +0.0000
📊 How are scores calculated for this benchmark? (click to expand)
📋 What this leaderboard measures
- Benchmark: `BM-002` → Cell Type Annotation
- Domain: Genomics, Single-cell Transcriptomics
- Task type: Classification
- Datasets used in the table above:
  - `DS-PBMC` → PBMC 3k (processed, with cell type labels)
- Primary ranking metric: `AUROC` (the score column in the table)
🧩 The primary-metric explanation, performance tiers, ranking rules, and standards-alignment notes for this benchmark are identical to those listed under DNA Enhancer Classification above.
🧠 Brain Imaging (MRI/fMRI)
🎯 Classification
Toy Classification Benchmark
A toy benchmark for testing the pipeline.
2 models ranked by AUROC:
| Rank | Model | Score | Level | Details |
|---|---|---|---|---|
| 🥇 | Baseline (Random/Majority) 🏆 | 0.5597 | 🔻 Developing | Toy fMRI Classificat, 2025-11-27 |
| 🥈 | BrainLM | 0.5193 | 🔻 Developing | Toy fMRI Classificat, 2025-11-27 |
Quick Comparison
🥇 Baseline (Random/Majority) leads with AUROC = 0.5597
- Gap to 🥈 BrainLM: +0.0404
📊 How are scores calculated for this benchmark? (click to expand)
📋 What this leaderboard measures
- Benchmark: `BM-TOY-CLASS` → Toy Classification Benchmark
- Domain: Neurology
- Task type: Classification
- Datasets used in the table above:
  - `DS-TOY-FMRI-CLASS` → Toy fMRI Classification
- Primary ranking metric: `AUROC` (the score column in the table)
🧩 The primary-metric explanation, performance tiers, ranking rules, and standards-alignment notes for this benchmark are identical to those listed under DNA Enhancer Classification above.
🔀 Classification/Reconstruction
fMRI Foundation Model Benchmark (Granular)
2 models ranked by AUROC:
| Rank | Model | Score | Level | Details |
|---|---|---|---|---|
| 🥇 | Brain-JEPA 🏆 | 1.0000 | ⭐ Excellent | DS-TOY-FMRI, 2025-12-19T12:00:49.427678 |
| 🥈 | BrainLM | 1.0000 | ⭐ Excellent | DS-TOY-FMRI, 2025-12-19T12:00:49.423857 |
Quick Comparison
🥇 Brain-JEPA ranks first, tied with BrainLM at AUROC = 1.0000
- Gap to 🥈 BrainLM: +0.0000
📊 How are scores calculated for this benchmark? (click to expand)
📋 What this leaderboard measures
- Benchmark: `BM-FMRI-GRANULAR` → fMRI Foundation Model Benchmark (Granular)
- Domain: Neurology, Functional Brain Imaging Analysis
- Task type: Classification/Reconstruction
- Datasets used in the table above:
  - `DS-TOY-FMRI` → DS-TOY-FMRI
- Typical sample size in these runs: ~200 samples (train + test combined)
- Primary ranking metric: `AUROC` (the score column in the table)
🧩 The primary-metric explanation, performance tiers, ranking rules, and standards-alignment notes for this benchmark are identical to those listed under DNA Enhancer Classification above.
🔄 Reconstruction
Brain Time-Series Modeling
Evaluating ability to reconstruct masked fMRI voxel time-series.
No submissions yet
Be the first! See Submission Guide
📊 Other Benchmarks
Foundation Model Robustness Evaluation
| Rank | Model | Score | Level | Details |
|---|---|---|---|---|
| 🥇 | geneformer 🏆 | 0.9995 | ⭐ Excellent | neuro/robustness, 2025-11-27 |
| 🥈 | BrainLM | 0.9451 | ⭐ Excellent | DS-TOY-FMRI-ROBUSTNE, 2025-12-19T12:01:52.781177 |
| 🥉 | Brain-JEPA | 0.9377 | ⭐ Excellent | DS-TOY-FMRI-ROBUSTNE, 2025-12-19T12:01:52.789369 |
| 4 | SWIFT | 0.9234 | ⭐ Excellent | DS-TOY-FMRI-ROBUSTNE, 2025-12-18T21:25:36.388271 |
| 5 | Baseline (Random/Majority) | 0.7810 | 🔶 Fair | neuro/robustness, 2025-11-27 |
🚀 Add Your Model
Want your model on this leaderboard?
- Download the benchmark toolkit
- Run locally on your model (your code stays private!)
- Submit results via GitHub Issue
📥 Get Started · 📖 Submission Guide
Aligned with ITU/WHO FG-AI4H standards for healthcare AI evaluation.