
๐Ÿ† Foundation Model Leaderboards

Benchmark Hub Overview

📊 7 Benchmarks | 🤖 12 Models | 📈 35 Evaluations

What is this? This page ranks AI models for healthcare applications. Higher-ranked models perform better on standardized tests.

How to read it: Each table shows models from best (🥇) to developing (📈). Click "How are scores calculated?" for details on what the numbers mean.

Example: what a real submission looks like

This is a real, end-to-end run using the built-in baseline model. Your submission should look like this: a local run that produces report.md + eval.yaml.

Model ID: dummy_classifier
Suite / Benchmark: SUITE-TOY-CLASS / BM-TOY-CLASS
Task: Toy fMRI-like classification
AUROC: 0.5597 · dropout rAUC: 0.7760 · noise rAUC: 0.7867

Artifacts: Example classification eval.yaml · Example classification report.md · Example robustness eval.yaml · Example robustness report.md
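
For orientation, here is a minimal sketch of how a local run might write these two artifacts. The field names below are illustrative assumptions, not the official schema; the linked example eval.yaml is the authoritative reference for the exact format.

```python
# Hypothetical sketch only: the field names are assumptions, not the official
# schema. Consult the linked "Example classification eval.yaml" for the real format.
import yaml

results = {
    "model_id": "dummy_classifier",
    "suite": "SUITE-TOY-CLASS",
    "benchmark": "BM-TOY-CLASS",
    "task": "Toy fMRI-like classification",
    "metrics": {"auroc": 0.5597, "dropout_rauc": 0.7760, "noise_rauc": 0.7867},
}

# Machine-readable artifact for the leaderboard pipeline.
with open("eval.yaml", "w") as f:
    yaml.safe_dump(results, f, sort_keys=False)

# Short human-readable companion report.
with open("report.md", "w") as f:
    f.write(f"# {results['model_id']} on {results['benchmark']}\n\n")
    for name, value in results["metrics"].items():
        f.write(f"- {name}: {value:.4f}\n")
```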


๐Ÿ“ Metric Cheat Sheet

Use this as a general reference for the metrics that appear on the leaderboards.

Area Under ROC Curve (AUROC)

  • What it measures: How well the model can tell apart different categories (e.g., healthy vs. diseased)
  • Typical range: 0.5 (random guessing) → 1.0 (perfect separation)
  • Example: An AUROC of 0.85 means the model correctly ranks a positive case higher than a negative case 85% of the time.

Accuracy

  • What it measures: The percentage of predictions the model got right
  • Typical range: 0.0 (all wrong) → 1.0 (all correct)
  • Example: An accuracy of 0.92 means the model correctly classified 92 out of every 100 samples.

F1 Score

  • What it measures: A balanced measure that considers both false alarms and missed cases
  • Typical range: 0.0 (poor) → 1.0 (perfect balance of precision and recall)
  • Example: An F1 of 0.85 indicates the model has a good balance between catching real cases and avoiding false alarms.

Correlation

  • What it measures: How closely the model's predictions match the actual values
  • Typical range: -1.0 (perfect inverse) → 0 (no relationship) → 1.0 (perfect match)
  • Example: A correlation of 0.78 means the model's outputs track reasonably well with the true values.
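
If you want to reproduce the four metrics above for your own predictions, here is a minimal sketch using NumPy and scikit-learn (assuming binary 0/1 labels and predicted probabilities; the benchmark toolkit's own evaluation code may differ in details):

```python
# Minimal sketch: computing AUROC, accuracy, F1, and correlation for a binary
# classifier. Assumes `y_true` holds 0/1 labels and `y_prob` predicted probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.9, 0.3, 0.55, 0.2])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities into hard labels

auroc = roc_auc_score(y_true, y_prob)     # ranking quality, 0.5 = chance
acc = accuracy_score(y_true, y_pred)      # fraction of correct predictions
f1 = f1_score(y_true, y_pred)             # balance of precision and recall
corr = np.corrcoef(y_true, y_prob)[0, 1]  # Pearson correlation of outputs vs. targets

print(f"AUROC={auroc:.3f}  Accuracy={acc:.3f}  F1={f1:.3f}  Correlation={corr:.3f}")
```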

Robustness Score

  • What it measures: How stable and reliable the model is when data quality isn't perfect
  • Typical range: 0.0 (performance collapses with any noise) → 1.0 (completely stable)
  • Example: A robustness score of 0.82 means the model maintains most of its accuracy even when data has noise or missing values.
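
The robustness columns in the example submission near the top of this page (dropout rAUC, noise rAUC) express this idea as a relative score. As a rough illustration only, and not necessarily the toolkit's exact definition, one common formulation is the ratio of perturbed to clean AUROC:

```python
# Illustrative only: one way to express robustness as the ratio of perturbed
# to clean AUROC. The benchmark toolkit may define its score differently.
import numpy as np
from sklearn.metrics import roc_auc_score

def relative_auc(y_true, probs_clean, probs_perturbed):
    """Perturbed AUROC divided by clean AUROC (1.0 = fully stable)."""
    clean = roc_auc_score(y_true, probs_clean)
    perturbed = roc_auc_score(y_true, probs_perturbed)
    return perturbed / clean

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
clean_probs = np.clip(y * 0.6 + rng.normal(0.2, 0.15, size=200), 0, 1)
noisy_probs = np.clip(clean_probs + rng.normal(0, 0.2, size=200), 0, 1)  # simulated noise

print(f"noise rAUC ≈ {relative_auc(y, clean_probs, noisy_probs):.3f}")
```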

Report Quality Score

  • What it measures: An overall measure of how good the AI-generated medical reports are
  • Typical range: 0.0 (poor quality) → 1.0 (excellent quality)
  • Example: A score of 0.85 indicates the model generates reports that are mostly accurate, complete, and well-structured.

Clinical Accuracy

  • What it measures: Are the medical findings in the generated report actually correct?
  • Typical range: 0.0 (all findings wrong) → 1.0 (all findings correct)
  • Example: A clinical accuracy of 0.92 means 92% of the medical findings in the report are verified as correct.

Hallucination Rate

  • What it measures: How often the AI makes up information that isn't supported by the input data
  • Typical range: 0.0 (no hallucinations – ideal) → 1.0 (everything is made up)
  • Example: A hallucination rate of 0.05 means only 5% of generated content is unsupported by the input – quite good!

BERTScore

  • What it measures: How similar the generated text is to the reference text in meaning (not just exact words)
  • Typical range: 0.0 (completely different meaning) → 1.0 (semantically identical)
  • Example: A BERTScore of 0.87 indicates the generated report conveys very similar clinical meaning to the expert reference.
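
As a rough illustration, a BERTScore-style comparison can be computed with the open-source bert-score package (assuming it is installed and can download a pretrained model); the exact model and settings used for these leaderboards may differ:

```python
# Illustration only: semantic similarity between a generated report and a
# reference using the bert-score package; the hub's exact settings may differ.
from bert_score import score

candidates = ["Mild cardiomegaly with no acute infiltrate."]
references = ["Heart size is mildly enlarged; lungs are clear."]

precision, recall, f1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {f1.mean().item():.3f}")
```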



🧬 Genomics

🎯 Classification

DNA Enhancer Classification

Benchmark for classifying DNA sequences as enhancers or non-enhancers. Enhancers are distal regulatory elements that activate gene expression. Accurate enhancer prediction is critical for understanding gene regulation and identifying disease-associated variants.

                    ๐Ÿ†                    

              ๐Ÿฅ‡ HyenaDNA              
                 (0.788)                 
             โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—             
             โ•‘               โ•‘             
   ๐Ÿฅˆ Caduceus   โ•‘               โ•‘   ๐Ÿฅ‰  Evo 2     
      (0.745)      โ•‘               โ•‘      (0.745)      
  โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•               โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—  
  โ•‘                                       โ•‘  
โ•โ•โ•ฉโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฉโ•โ•

6 models ranked by AUROC:

Rank Model Score Level Details
🥇 HyenaDNA 👑 0.7883 🔶 Fair DS-DNA-ENHANCER, 2025-12-18T21:03:03.285801
🥈 Caduceus 0.7453 🔶 Fair Human Enhancers (Coh, 2025-12-19T12:00:12.636691
🥉 Evo 2 0.7453 🔶 Fair Human Enhancers (Coh, 2025-12-19T12:00:13.160707
🏅 DNABERT-2 0.7365 🔶 Fair Human Enhancers (Coh, 2025-12-18T18:44:24.678525
🏅 HyenaDNA 0.7365 🔶 Fair Human Enhancers (Coh, 2025-12-18T18:44:17.006557
🎖️ kmer_k6 0.7365 🔶 Fair Human Enhancers (Coh, 2025-12-18T18:44:08.075706

Quick Comparison

🥇 HyenaDNA leads with AUROC = 0.7883

  • Gap to 🥈 Caduceus: +0.0430
  • Score spread (best to worst): 0.0518
๐Ÿ“ How are scores calculated for this benchmark? (click to expand)

📂 What this leaderboard measures

  • Benchmark: BM-DNA-ENHANCER – DNA Enhancer Classification
  • Domain: Genetics, Genomics - Regulatory Element Prediction
  • Task type: Classification
  • Datasets used in the table above:
      • DS-DNA-ENHANCER – DS-DNA-ENHANCER
      • DS-DNA-ENHANCERS-COHN – Human Enhancers (Cohn et al.)
  • Typical sample size in these runs: ~6250 samples (train + test combined)
  • Primary ranking metric: AUROC (the score column in the table)



🎯 Primary metric for this leaderboard

  • Metric: AUROC
  • What it measures: How well the model can tell apart different categories (e.g., healthy vs. diseased)
  • Typical range: 0.5 (random guessing) → 1.0 (perfect separation)

🔎 For a full explanation of this and other metrics, see the Metric Cheat Sheet near the top of this page.



🧠 How This Metric Fits This Task

Different tasks emphasize different aspects of performance.

Here's how this metric should be interpreted for this benchmark:


For classification tasks (e.g., disease vs. no disease), this metric helps you understand how reliably the model separates different outcome groups.

💡 Tip: In addition to raw accuracy, look at metrics like AUROC and F1 Score, especially when classes are imbalanced (when positive cases are rare).



📊 Performance Tiers

What Do the Scores Mean?

We group models into performance tiers to help you quickly understand how ready they are for different uses.


Score Range Rating Interpretation Suitable For
≥ 0.90 ⭐ Excellent Top-tier, consistently reliable Clinical pilots (with oversight)
0.80 – 0.89 ✅ Good Strong performance, real promise Validation studies
0.70 – 0.79 🔶 Fair Moderate, has limitations Research only
< 0.70 📈 Developing Needs improvement Early research
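
For readers who want to reproduce the tier labels programmatically, here is a minimal sketch of the thresholds in the table above (assuming a higher-is-better primary metric):

```python
# Minimal sketch of the tier thresholds in the table above
# (assumes the primary metric is higher-is-better).
def performance_tier(score: float) -> str:
    if score >= 0.90:
        return "⭐ Excellent"
    if score >= 0.80:
        return "✅ Good"
    if score >= 0.70:
        return "🔶 Fair"
    return "📈 Developing"

print(performance_tier(0.7883))  # -> 🔶 Fair (HyenaDNA on the enhancer benchmark)
```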


Important Context

These thresholds are general guidelines.

The acceptable score depends on:

  • The specific clinical application
  • Risk level of the use case
  • Whether AI assists or replaces human judgment

Always consult domain experts when evaluating fitness for a particular use case.



๐Ÿ“ How We Determine Rankings

Models are ranked following these principles:


1๏ธโƒฃ Primary metric determines rank

The model with the highest score in the main metric ranks first.

For metrics where lower is better (like error rates), the lowest score wins.


2๏ธโƒฃ Ties are broken by secondary metrics

If two models have identical primary scores, we look at other relevant metrics.


3๏ธโƒฃ Best run per model

If a model was evaluated multiple times (e.g., with different settings), only its best result appears on the leaderboard.


4๏ธโƒฃ Reproducibility required

All results must be reproducible. We record:

  • Evaluation date
  • Dataset used
  • Configuration details
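
As an illustration of how these principles fit together (not the hub's actual pipeline), a sketch that keeps each model's best run and then sorts by the primary metric with a secondary tie-breaker might look like this:

```python
# Illustrative sketch of the ranking principles above: keep each model's best
# run, then sort descending by the primary metric, breaking ties with a secondary one.
# Model names and scores here are generic placeholders, not leaderboard data.
runs = [
    {"model": "model_a", "auroc": 0.81, "f1": 0.74},
    {"model": "model_a", "auroc": 0.78, "f1": 0.70},
    {"model": "model_b", "auroc": 0.81, "f1": 0.76},
    {"model": "model_c", "auroc": 0.69, "f1": 0.60},
]

best = {}
for run in runs:
    current = best.get(run["model"])
    if current is None or run["auroc"] > current["auroc"]:
        best[run["model"]] = run  # principle 3: only the best run per model counts

leaderboard = sorted(best.values(),
                     key=lambda r: (r["auroc"], r["f1"]),
                     reverse=True)  # principles 1 and 2: primary metric, then tie-breaker

for rank, run in enumerate(leaderboard, start=1):
    print(rank, run["model"], f"{run['auroc']:.4f}")
```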



๐Ÿฅ Why This Matters for Healthcare AI

Healthcare AI has higher stakes than many other AI applications.

A model that works 95% of the time might sound good, but that 5% could mean missed diagnoses or incorrect treatments.


That's why we:

✅ Use multiple metrics to capture different aspects of performance

✅ Test robustness to real-world data quality issues

✅ Require transparency about evaluation conditions

✅ Follow international standards for healthcare AI assessment



๐ŸŒ Standards Alignment

This benchmark follows the ITU/WHO Focus Group on AI for Health (FG-AI4H) framework.


This ensures our evaluations are:

Quality What it means
Rigorous Following established scientific methodology
Comparable Using standardized metrics across models
Trustworthy Aligned with WHO/ITU recommendations



DNA Promoter Classification

Benchmark for classifying DNA sequences as promoters or non-promoters. Promoters are regulatory regions at transcription start sites (TSS). This benchmark focuses on non-TATA promoters, which lack the canonical TATA box and represent ~75% of human promoters.

                    ๐Ÿ†                    

              ๐Ÿฅ‡ HyenaDNA              
                 (0.872)                 
             โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—             
             โ•‘               โ•‘             
   ๐Ÿฅˆ  Evo 2     โ•‘               โ•‘   ๐Ÿฅ‰ Caduceus   
      (0.859)      โ•‘               โ•‘      (0.859)      
  โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•               โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—  
  โ•‘                                       โ•‘  
โ•โ•โ•ฉโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฉโ•โ•

6 models ranked by AUROC:

Rank Model Score Level Details
🥇 HyenaDNA 👑 0.8720 ✅ Good DS-DNA-PROMOTER, 2025-12-18T21:03:12.030852
🥈 Evo 2 0.8594 ✅ Good Human Non-TATA Promo, 2025-12-19T12:00:13.671201
🥉 Caduceus 0.8594 ✅ Good Human Non-TATA Promo, 2025-12-19T12:00:12.829913
🏅 DNABERT-2 0.8357 ✅ Good Human Non-TATA Promo, 2025-12-18T18:44:27.391206
🏅 kmer_k6 0.8357 ✅ Good Human Non-TATA Promo, 2025-12-18T18:44:10.847321
🎖️ HyenaDNA 0.8357 ✅ Good Human Non-TATA Promo, 2025-12-18T18:44:19.651418

Quick Comparison

🥇 HyenaDNA leads with AUROC = 0.8720

  • Gap to 🥈 Evo 2: +0.0126
  • Score spread (best to worst): 0.0363
๐Ÿ“ How are scores calculated for this benchmark? (click to expand)

📂 What this leaderboard measures

  • Benchmark: BM-DNA-PROMOTER – DNA Promoter Classification
  • Domain: Genetics, Genomics - Promoter Prediction
  • Task type: Classification
  • Datasets used in the table above:
      • DS-DNA-PROMOTER – DS-DNA-PROMOTER
      • DS-DNA-PROMOTERS-NONTATA – Human Non-TATA Promoters (EPD)
  • Typical sample size in these runs: ~6250 samples (train + test combined)
  • Primary ranking metric: AUROC (the score column in the table)



The primary metric (AUROC), its interpretation for classification tasks, the performance tiers, the ranking methodology, and the standards alignment for this benchmark are identical to those described under DNA Enhancer Classification above.



Cell Type Annotation

Predicting cell types from single-cell RNA-seq data.

2 models ranked by AUROC:

Rank Model Score Level Details
🥇 Baseline (Random/Majority) 👑 0.0000 📈 Developing PBMC 3k (processed, , 2025-12-18
🥈 geneformer 0.0000 📈 Developing PBMC 3k (processed, , 2025-12-18

Quick Comparison

🥇 Baseline (Random/Majority) leads with AUROC = 0.0000

  • Gap to 🥈 geneformer: +0.0000
๐Ÿ“ How are scores calculated for this benchmark? (click to expand)

📂 What this leaderboard measures

  • Benchmark: BM-002 – Cell Type Annotation
  • Domain: Genomics, Single-cell Transcriptomics
  • Task type: Classification
  • Datasets used in the table above:
      • DS-PBMC – PBMC 3k (processed, with cell type labels)
  • Primary ranking metric: AUROC (the score column in the table)



The primary metric (AUROC), its interpretation for classification tasks, the performance tiers, the ranking methodology, and the standards alignment for this benchmark are identical to those described under DNA Enhancer Classification above.



🧠 Brain Imaging (MRI/fMRI)

🎯 Classification

Toy Classification Benchmark

A toy benchmark for testing the pipeline.

2 models ranked by AUROC:

Rank Model Score Level Details
🥇 Baseline (Random/Majority) 👑 0.5597 📈 Developing Toy fMRI Classificat, 2025-11-27
🥈 BrainLM 0.5193 📈 Developing Toy fMRI Classificat, 2025-11-27

Quick Comparison

🥇 Baseline (Random/Majority) leads with AUROC = 0.5597

  • Gap to 🥈 BrainLM: +0.0404
๐Ÿ“ How are scores calculated for this benchmark? (click to expand)

📂 What this leaderboard measures

  • Benchmark: BM-TOY-CLASS – Toy Classification Benchmark
  • Domain: Neurology
  • Task type: Classification
  • Datasets used in the table above:
      • DS-TOY-FMRI-CLASS – Toy fMRI Classification
  • Primary ranking metric: AUROC (the score column in the table)



The primary metric (AUROC), its interpretation for classification tasks, the performance tiers, the ranking methodology, and the standards alignment for this benchmark are identical to those described under DNA Enhancer Classification above.



📋 Classification/Reconstruction

fMRI Foundation Model Benchmark (Granular)

2 models ranked by AUROC:

Rank Model Score Level Details
🥇 Brain-JEPA 👑 1.0000 ⭐ Excellent DS-TOY-FMRI, 2025-12-19T12:00:49.427678
🥈 BrainLM 1.0000 ⭐ Excellent DS-TOY-FMRI, 2025-12-19T12:00:49.423857

Quick Comparison

🥇 Brain-JEPA leads with AUROC = 1.0000

  • Gap to 🥈 BrainLM: +0.0000
๐Ÿ“ How are scores calculated for this benchmark? (click to expand)

📂 What this leaderboard measures

  • Benchmark: BM-FMRI-GRANULAR – fMRI Foundation Model Benchmark (Granular)
  • Domain: Neurology, Functional Brain Imaging Analysis
  • Task type: Classification/Reconstruction
  • Datasets used in the table above:
      • DS-TOY-FMRI – DS-TOY-FMRI
  • Typical sample size in these runs: ~200 samples (train + test combined)
  • Primary ranking metric: AUROC (the score column in the table)



The primary metric (AUROC), its interpretation for classification tasks, the performance tiers, the ranking methodology, and the standards alignment for this benchmark are identical to those described under DNA Enhancer Classification above.



🔄 Reconstruction

Brain Time-Series Modeling

Evaluating ability to reconstruct masked fMRI voxel time-series.
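
No submissions or official scoring code are linked yet. As a rough illustration only, masked-reconstruction quality is often summarized by error and correlation computed only at the positions that were hidden from the model:

```python
# Rough illustration (not the official scoring code): evaluate a masked
# time-series reconstruction only at the masked positions.
import numpy as np

rng = np.random.default_rng(0)
true = rng.normal(size=(10, 200))                        # 10 voxels x 200 timepoints
mask = rng.random(true.shape) < 0.15                     # 15% of entries masked out
recon = true + rng.normal(scale=0.3, size=true.shape)    # stand-in for a model's output

mse = float(np.mean((recon[mask] - true[mask]) ** 2))
corr = float(np.corrcoef(recon[mask], true[mask])[0, 1])
print(f"masked MSE = {mse:.3f}, masked correlation = {corr:.3f}")
```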

No submissions yet

Be the first! See Submission Guide

📋 Other Benchmarks

Foundation Model Robustness Evaluation

Rank Model Score Level Details
🥇 geneformer 👑 0.9995 ⭐ Excellent neuro/robustness, 2025-11-27
🥈 BrainLM 0.9451 ⭐ Excellent DS-TOY-FMRI-ROBUSTNE, 2025-12-19T12:01:52.781177
🥉 Brain-JEPA 0.9377 ⭐ Excellent DS-TOY-FMRI-ROBUSTNE, 2025-12-19T12:01:52.789369
🏅 SWIFT 0.9234 ⭐ Excellent DS-TOY-FMRI-ROBUSTNE, 2025-12-18T21:25:36.388271
🏅 Baseline (Random/Majority) 0.7810 🔶 Fair neuro/robustness, 2025-11-27

🚀 Add Your Model

Want your model on this leaderboard?

  1. Download the benchmark toolkit
  2. Run locally on your model (your code stays private!)
  3. Submit results via GitHub Issue

📥 Get Started · 📖 Submission Guide


Aligned with ITU/WHO FG-AI4H standards for healthcare AI evaluation.