AI4H Alignment
Overview
This benchmark hub is explicitly designed to align with the ITU/WHO Focus Group on Artificial Intelligence for Health (FG-AI4H) standards and deliverables. This page documents how our framework implements and extends these standards for foundation model evaluation.
Quick mapping: deliverables → concrete implementation
| AI4H deliverable / concept | What it means (short) | Where it shows up in this hub |
|---|---|---|
| DEL3 — System Requirement Specifications (SyRS) | Define what an evaluation must do and how it’s validated | Benchmark definitions in benchmarks/*.yaml (functional requirements + metrics), and runner outputs (report.md, eval.yaml) produced by fmbench |
| DEL0.1 — Common unified terms | Shared vocabulary so runs are comparable | Consistent IDs/terms across YAMLs: benchmark_id, dataset_id, model_id, eval_id used in benchmarks/, datasets/, models/, evals/ and the docs |
| DEL10.8 — Neurology TDD | Neurology benchmark structure: topic, scope, inputs, metrics | Neurology-style benchmark schemas reflected in benchmarks/*.yaml and supported by stratified metrics in evals/*.yaml (metrics.stratified) |
| DEL7.x / test specifications (in practice) | A runnable test suite definition | Suites in tests/suite_*.yaml (e.g., SUITE-TOY-CLASS) that define how to run + what artifacts should be produced |
ITU FG-AI4H Background
The ITU/WHO Focus Group on AI for Health was established to develop international standards for AI in healthcare, focusing on:
- Safety: Ensuring AI systems are safe for clinical use
- Effectiveness: Establishing evidence-based validation methods
- Transparency: Promoting explainability and interpretability
- Ethics: Addressing fairness, bias, and patient rights
- Interoperability: Enabling cross-system compatibility
Key Deliverables Used
DEL0.1: Common Unified Terms
Reference: ITU-T FG-AI4H-DEL0.1 (2022)
Purpose: Establish standardized terminology for AI4H systems
Our Implementation:
| AI4H Term | Our Usage |
|---|---|
| AI Solution | Foundation models (e.g., BrainLM, Geneformer) |
| Benchmarking Run | Evaluation instance (eval_id in results) |
| Reference Implementation | Baseline models (logistic regression, random forest) |
| Health Topic | Domain area (e.g., "Functional Brain Imaging", "Genomics") |
| AI Task | ML task type (classification, reconstruction, regression) |
| Test Dataset | Standardized evaluation data (e.g., PBMC 3k, HCP fMRI) |
Example from our schema:
# From benchmarks/bm_fmri_granular.yaml
benchmark_id: BM-FMRI-GRANULAR
health_topic: Functional Brain Imaging Analysis
ai_task: Classification/Reconstruction
DEL3: AI4H Requirement Specifications
Reference: ITU-T FG-AI4H-DEL3 (2023)
Purpose: Define System Requirements Specification (SyRS) framework for AI4H systems
Our Implementation:
1. Functional Requirements (DEL3 Section 4)
We define functional requirements for each benchmark:
# Example: Cell type annotation benchmark
benchmark_id: CELL-TYPE-ANNOTATION
functional_requirements:
- input: Single-cell RNA-seq count matrix
- output: Cell type labels from standardized ontology
- performance_threshold: F1 > 0.80 (vs random baseline)
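A `performance_threshold` string like the one above could be checked against a run's metrics roughly as follows. This is a minimal sketch, not part of fmbench: the helper names `parse_threshold` and `meets_threshold`, the set of supported operators, and the lower-cased metric keys in `results` are all assumptions made for illustration.

```python
import operator
import re

# Comparison operators this sketch understands (an assumption, not an fmbench feature).
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def parse_threshold(spec):
    """Parse a string such as 'F1 > 0.80 (vs random baseline)' into (metric, op, value)."""
    match = re.match(r"\s*([A-Za-z0-9_\-]+)\s*(>=|<=|>|<)\s*([0-9]*\.?[0-9]+)", spec)
    if match is None:
        raise ValueError(f"Unrecognized threshold spec: {spec!r}")
    metric, op, value = match.groups()
    # Metric names are lower-cased so 'F1' matches a results key 'f1'.
    return metric.lower(), op, float(value)


def meets_threshold(spec, results):
    """Return True if results[metric] satisfies the comparison in spec."""
    metric, op, value = parse_threshold(spec)
    return _OPS[op](results[metric], value)
```

For example, `meets_threshold("F1 > 0.80 (vs random baseline)", {"f1": 0.83})` evaluates to `True`. Any trailing annotation in parentheses is ignored by the parser.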
2. Performance Requirements (DEL3 Section 6)
Our leaderboards track multiple performance dimensions:
- Accuracy metrics: AUROC, F1-score, balanced accuracy
- Robustness: rAUC scores under perturbations
- Resource usage: Inference time, memory footprint
- Fairness: Stratified performance by demographic groups
Example:
# From fmbench/runners.py
metrics = {
    'auroc': roc_auc_score(y_true, y_prob),
    'accuracy': accuracy_score(y_true, y_pred),
    'f1_score': f1_score(y_true, y_pred, average='weighted'),
    'stratified': {
        'by_age': compute_stratified_metrics(y_true, y_pred, age_groups),
        'by_sex': compute_stratified_metrics(y_true, y_pred, sex_groups),
    }
}
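The resource-usage dimension (inference time, memory footprint) is not shown in the snippet above. One way it might be captured with the standard library is sketched below. `measure_resources` and `predict_fn` are hypothetical names, and `tracemalloc` only sees Python-level allocations, so GPU memory or memory held by native libraries would need a different tool.

```python
import time
import tracemalloc


def measure_resources(predict_fn, batch):
    """Run predict_fn(batch) once and report wall-clock time and peak Python memory."""
    tracemalloc.start()
    start = time.perf_counter()
    predictions = predict_fn(batch)
    elapsed = time.perf_counter() - start
    # get_traced_memory() returns (current, peak) in bytes since tracemalloc.start().
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    usage = {
        "inference_time_s": elapsed,
        "peak_memory_mb": peak_bytes / 1e6,
    }
    return predictions, usage
```

A single call gives a noisy estimate. Repeating the measurement and reporting the median is a common refinement.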
3. Data Requirements (DEL3 Section 4.2)
We enforce standardized data formats:
- Neuroimaging: NIfTI, preprocessed per fMRI specs
- Genomics: AnnData (scRNA-seq), FASTA (DNA), VCF (variants)
- Metadata: YAML schema with required fields
# From datasets/*.yaml
dataset_id: pbmc_68k
name: PBMC 68k
modality: scRNA-seq
n_samples: 68579
preprocessing: scanpy_1.9.1
quality_control:
  min_genes_per_cell: 200
  max_pct_mito: 5
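To illustrate what the `quality_control` thresholds mean in practice, the sketch below filters a few made-up cell records. The record fields (`barcode`, `n_genes`, `pct_mito`) and the inclusive comparisons are assumptions for illustration only. The actual preprocessing is whatever the pipeline named in `preprocessing` performs.

```python
def passes_quality_control(cell, qc):
    """Keep a cell if it has enough detected genes and not too much mitochondrial signal."""
    return (
        cell["n_genes"] >= qc["min_genes_per_cell"]
        and cell["pct_mito"] <= qc["max_pct_mito"]
    )


# Thresholds copied from the dataset YAML above.
qc = {"min_genes_per_cell": 200, "max_pct_mito": 5}

# Hypothetical per-cell summaries (not real data).
cells = [
    {"barcode": "cell_a", "n_genes": 850, "pct_mito": 2.1},
    {"barcode": "cell_b", "n_genes": 150, "pct_mito": 1.0},   # too few genes
    {"barcode": "cell_c", "n_genes": 900, "pct_mito": 12.4},  # too much mitochondrial RNA
]

kept = [cell["barcode"] for cell in cells if passes_quality_control(cell, qc)]
print(kept)  # prints ['cell_a']
```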
4. Validation & Verification (DEL3 Section 5)
Our framework includes:
- ✅ Cross-validation: Stratified k-fold for robust estimates
- ✅ Baseline comparison: Always compare to random, majority, linear baselines
- ✅ Statistical significance: Permutation testing, confidence intervals
- ✅ Confound control: Partial correlations, matched controls
See our analysis recipes.
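As an illustration of the permutation-testing item, the following stdlib-only sketch estimates how often shuffled labels reach the observed accuracy. It is a generic label-permutation test, not necessarily the procedure used in the analysis recipes. The function names and the default of 1000 permutations are assumptions.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_p_value(y_true, y_pred, n_permutations=1000, seed=0):
    """Return (observed accuracy, one-sided p-value) under random label shuffling."""
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    shuffled = list(y_true)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the link between labels and predictions
        if accuracy(shuffled, y_pred) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labeling itself, so p is never exactly 0.
    return observed, (at_least_as_good + 1) / (n_permutations + 1)
```

A small p-value suggests the model's accuracy is unlikely to arise from chance agreement alone. It says nothing about effect size, which is why confidence intervals are listed alongside it.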
DEL10.8: Topic Description Document for Neurology
Reference: ITU-T FG-AI4H-DEL10.8 (2023)
Purpose: Define neurology-specific evaluation standards for the TG-Neuro topic group
Our Implementation:
Benchmark Structure (Following TDD Template)
Each neurology benchmark follows the TDD structure:
# Example: bm_fmri_granular.yaml
benchmark_id: BM-FMRI-GRANULAR
name: fMRI Foundation Model Benchmark (Granular)
# 1. Health Topic (TDD Section 2)
health_topic: Functional Brain Imaging Analysis
health_domain: Neurology
# 2. Scope (TDD Section 3)
scope:
  clinical_context: "Evaluating FM robustness and representation quality"
  population: General population
# 3. Input Data (TDD Section 4)
inputs:
  dataset:
    modality: fMRI
    sequence: BOLD
    preprocessing: fMRIPrep or HCP Pipelines
# 4. Evaluation Metrics (TDD Section 5)
metrics:
  primary: AUROC
  secondary:
    - Accuracy
    - F1-Score
    - Robustness rAUC
  stratification:
    - scanner
    - preprocessing_pipeline
    - acquisition_type
# 5. Clinical Relevance (TDD Section 6)
clinical_relevance:
  use_case: FM robustness and generalization testing
  impact: Reliable brain imaging AI systems
Stratified Evaluation
DEL10.8 emphasizes evaluation across patient subgroups. Our framework automatically computes stratified metrics:
# From fmbench/runners.py
import numpy as np
from sklearn.metrics import accuracy_score

def compute_stratified_metrics(y_true, y_pred, groups):
    """
    Compute metrics for each subgroup.
    Aligns with DEL10.8 Section 5.3: Subgroup analysis.
    Inputs are expected to be NumPy arrays of equal length.
    """
    stratified = {}
    for group_name in np.unique(groups):
        mask = groups == group_name
        if mask.sum() > 10:  # Skip subgroups with 10 or fewer samples
            stratified[group_name] = {
                'accuracy': accuracy_score(y_true[mask], y_pred[mask]),
                'n_samples': int(mask.sum()),  # plain int serializes cleanly to YAML
            }
    return stratified
Example output:
metrics:
  accuracy: 0.85
  auroc: 0.91
  stratified:
    by_age_group:
      '55-65': {accuracy: 0.88, n_samples: 120}
      '65-75': {accuracy: 0.85, n_samples: 200}
      '75+': {accuracy: 0.80, n_samples: 80}
    by_sex:
      M: {accuracy: 0.84, n_samples: 180}
      F: {accuracy: 0.86, n_samples: 220}
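One simple way to read stratified output like this is to look at the spread between the best and worst subgroup. The helper below is a hypothetical post-processing step, not something the source attributes to fmbench or to DEL10.8. It only assumes the `{group: {accuracy, n_samples}}` shape shown above.

```python
def largest_subgroup_gap(stratified):
    """Return (best_group, worst_group, accuracy_gap) for one stratification."""
    best = max(stratified, key=lambda group: stratified[group]["accuracy"])
    worst = min(stratified, key=lambda group: stratified[group]["accuracy"])
    gap = stratified[best]["accuracy"] - stratified[worst]["accuracy"]
    return best, worst, gap


# Values copied from the example output above.
by_age_group = {
    "55-65": {"accuracy": 0.88, "n_samples": 120},
    "65-75": {"accuracy": 0.85, "n_samples": 200},
    "75+": {"accuracy": 0.80, "n_samples": 80},
}

best, worst, gap = largest_subgroup_gap(by_age_group)
print(f"{best} vs {worst}: gap of {gap:.2f}")  # prints 55-65 vs 75+: gap of 0.08
```

Small subgroups give noisy accuracies, so a gap is best interpreted together with `n_samples`.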
Extensions Beyond AI4H Standards
While we align with AI4H deliverables, we extend the framework to address foundation model-specific challenges:
1. Robustness Testing
Motivation: Foundation models must handle real-world data variability (noise, artifacts, missing data).
Our Framework:
- Inspired by brainaug-lab methodology
- Tests model resilience to controlled perturbations
- Produces rAUC (Reverse Area Under Curve) scores
Probes:
- Channel dropout (missing sensors)
- Gaussian noise (SNR variation)
- Line noise (50/60 Hz artifacts)
- Channel permutation (equivariance test)
- Temporal shift (timing jitter)
# Run robustness evaluation
python -m fmbench run-robustness \
--model configs/model_brainlm.yaml \
--data toy_data/neuro/robustness \
--out results/robustness_eval
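Two of the probes (Gaussian noise at a target SNR, channel dropout) can be sketched with the standard library as below, together with a trapezoidal summary of score versus perturbation severity. All function names are hypothetical. This page does not define how fmbench computes rAUC, so `normalized_area_under_curve` is only an illustrative stand-in and should not be read as the rAUC formula.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise so the signal-to-noise ratio is about snr_db."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_std = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def drop_channels(recording, drop_fraction, rng):
    """Zero out a random subset of channels; recording is a list of channels."""
    n_drop = round(drop_fraction * len(recording))
    dropped = set(rng.sample(range(len(recording)), n_drop))
    return [
        [0.0] * len(channel) if index in dropped else list(channel)
        for index, channel in enumerate(recording)
    ]


def normalized_area_under_curve(severities, scores):
    """Trapezoidal area under score-vs-severity, divided by the severity range."""
    area = 0.0
    for i in range(1, len(severities)):
        width = severities[i] - severities[i - 1]
        area += width * (scores[i] + scores[i - 1]) / 2
    return area / (severities[-1] - severities[0])
```

A model whose score stays flat as severity increases gets an area close to its clean score, while a model that degrades quickly gets a lower one.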
2. Multi-Modal Evaluation
Challenge: Many foundation models integrate multiple data types (e.g., imaging + genomics).
Our Approach:
- CCA-based cross-modal alignment testing
- Multi-modal fusion benchmarks
- Modality-specific and joint performance metrics
See: CCA & Permutation Testing
3. Interpretability & Explainability
Planned Features (aligned with DEL3 Section 8):
- Attention map visualization
- Feature attribution (SHAP, Integrated Gradients)
- Embedding space interpretability
Compliance Checklist
Use this checklist to verify AI4H alignment for new benchmarks:
- [ ] Terminology: Uses standardized AI4H terms (DEL0.1)
- [ ] Health Topic: Clearly defined clinical context (DEL10.8 Section 2)
- [ ] Input Specs: Documented data format and preprocessing (DEL3 Section 4.2)
- [ ] Metrics: Primary and secondary metrics defined (DEL3 Section 6)
- [ ] Baselines: Comparison to reference implementations (DEL3 Section 7)
- [ ] Stratification: Performance across relevant subgroups (DEL10.8 Section 5.3)
- [ ] Clinical Relevance: Justification for clinical use case (DEL10.8 Section 6)
- [ ] Reproducibility: Code, data, and results publicly available
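Parts of this checklist could be checked automatically once a benchmark definition has been loaded into a dictionary. The sketch below is one possible approach. The mapping from checklist items to field paths is an assumption modeled on the example benchmark on this page, and items such as Terminology, Baselines, and Reproducibility still need human review.

```python
# Checklist item -> dotted field path in a benchmark definition.
# These paths are assumptions based on the example benchmark YAML, not a fixed schema.
CHECKLIST_FIELDS = {
    "Health Topic": "health_topic",
    "Input Specs": "inputs.dataset.preprocessing",
    "Metrics (primary)": "metrics.primary",
    "Metrics (secondary)": "metrics.secondary",
    "Stratification": "metrics.stratification",
    "Clinical Relevance": "clinical_relevance.use_case",
}


def lookup(mapping, dotted_path):
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    current = mapping
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def unmet_checklist_items(benchmark):
    """Return the checklist items whose field is missing or empty."""
    return [
        item
        for item, path in CHECKLIST_FIELDS.items()
        if lookup(benchmark, path) in (None, "", [], {})
    ]
```

An empty return value means every automatically checkable item has a non-empty field. It does not by itself establish alignment with the deliverables.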
Governance & Contribution
Adding New Benchmarks
To propose a new benchmark aligned with AI4H standards:
- Define the Health Topic (following DEL10.8 template)
- Specify Input/Output (following DEL3 Section 4)
- Choose Metrics (primary + secondary, with clinical justification)
- Implement Reference Baselines (see Prediction Baselines)
- Document Clinical Relevance
- Submit PR with benchmark YAML + documentation
Citing AI4H Deliverables
When publishing results from this benchmark hub, please cite:
@techreport{itu_ai4h_del3_2023,
title = {AI4H requirement specifications},
author = {{ITU-T Focus Group on AI for Health}},
year = {2023},
institution = {International Telecommunication Union},
number = {FG-AI4H-DEL3},
url = {https://www.itu.int/dms_pub/itu-t/opb/fg/T-FG-AI4H-2023-11-PDF-E.pdf}
}
@techreport{itu_ai4h_del10_8_2023,
title = {Topic Description Document for the Topic Group on AI for neurological disorders (TG-Neuro)},
author = {{ITU-T Focus Group on AI for Health}},
year = {2023},
institution = {International Telecommunication Union},
number = {FG-AI4H-DEL10.8},
url = {https://www.itu.int/dms_pub/itu-t/opb/fg/T-FG-AI4H-2023-20-PDF-E.pdf}
}
References
- ITU-T Focus Group on AI for Health. (2022). Common unified terms in artificial intelligence for health (DEL0.1). PDF
- ITU-T Focus Group on AI for Health. (2023). AI4H requirement specifications (DEL3). PDF
- ITU-T Focus Group on AI for Health. (2023). Topic Description Document for the Topic Group on AI for neurological disorders (TG-Neuro) (DEL10.8). PDF
- Wiegand, T., et al. (2019). WHO and ITU establish benchmarking process for artificial intelligence in health. The Lancet, 394(10192), 9-11.
Contact & Feedback
For questions about AI4H alignment or to suggest improvements:
- GitHub Issues: Report an issue
- ITU FG-AI4H: Official website
© ITU 2025. AI4H deliverables are available under the Creative Commons Attribution-Non Commercial-Share Alike 3.0 IGO licence (CC BY-NC-SA 3.0 IGO).