Data schemas
genetics_embeddings.parquet
Columns: eid, embedding_dim, source_model, layer, vector
dnabert2_embedding_tree/ (Yoon/GENNIElab UKB exports)
Directory-style export used by the Yoon/GENNIElab DNABERT‑2 UKB embedding drops (Google Drive).
At embedding root:
- iids.npy (shape (N,), dtype str/int): subject IDs in row order
- labels.npy (shape (N,), dtype int): aligned labels
- covariates_age.npy (shape (N,))
- covariates_sex.npy (shape (N,))
Per gene folder:
- <EMBED_ROOT>/<GENE>/embeddings_{k}_layer_last.npy (float32, shape (n_k, F))
- Typical: F=768, k=1..49, and Σ_k n_k = N
Alignment contract (critical):
- Row r ↔ iids[r] and the same r aligns with labels/age/sex.
- Chunk row counts match across genes → row order is stable across genes.
Re-check with python scripts/check_embedding_alignment.py --embed-root ... --gene-list ... --n-chunks 49.
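The alignment contract above can also be checked by hand. The following is a minimal sketch, not the repository's scripts/check_embedding_alignment.py: the function names load_gene_embeddings and check_alignment are hypothetical, and it assumes iids.npy is stored as a plain (non-object) array so that np.load works without pickling.

```python
from pathlib import Path

import numpy as np


def load_gene_embeddings(embed_root, gene, n_chunks=49):
    """Concatenate embeddings_{k}_layer_last.npy for k = 1..n_chunks, in order."""
    gene_dir = Path(embed_root) / gene
    chunks = [
        np.load(gene_dir / f"embeddings_{k}_layer_last.npy")
        for k in range(1, n_chunks + 1)
    ]
    # Return the stacked (N, F) matrix and the per-chunk row counts n_k.
    return np.concatenate(chunks, axis=0), [c.shape[0] for c in chunks]


def check_alignment(embed_root, genes, n_chunks=49):
    """Raise ValueError if the row counts break the alignment contract."""
    root = Path(embed_root)
    # Assumption: iids.npy is a fixed-width str/int array (no pickled objects).
    n = np.load(root / "iids.npy").shape[0]

    # labels / age / sex must have one entry per subject.
    for name in ("labels", "covariates_age", "covariates_sex"):
        rows = np.load(root / f"{name}.npy").shape[0]
        if rows != n:
            raise ValueError(f"{name}.npy has {rows} rows, expected {n}")

    reference = None
    for gene in genes:
        emb, counts = load_gene_embeddings(root, gene, n_chunks)
        # The chunk sizes must sum to N.
        if emb.shape[0] != n:
            raise ValueError(f"{gene}: {emb.shape[0]} rows, expected {n}")
        # Chunk row counts must be identical across genes.
        if reference is None:
            reference = counts
        elif counts != reference:
            raise ValueError(f"{gene}: chunk sizes differ from the first gene")
    return n
```

Matching row counts are a necessary condition only. They do not prove that row r of every gene belongs to iids[r], so this check complements the export's documented ordering rather than replacing it.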
brain_idps.parquet
Columns: eid, site, modality (sMRI or fMRI)
- Selected IDPs:
  - sMRI: FreeSurfer 7.x aparc.stats (cortical thickness, ~68 regions) + aseg.stats (subcortical volumes, ~40 structures) + surface area → ~176 features
  - fMRI: parcel-wise BOLD statistics (e.g., ROI mean), connectivity matrices (optional), or direct FM embeddings (BrainLM, Brain-JEPA)
- Confounds:
  - intracranial_volume (sMRI)
  - mean_fd (mean framewise displacement, fMRI)
  - tsnr (temporal SNR, fMRI)
  - euler_number (FreeSurfer QC metric, sMRI)
ukb_fmri_roi_mean.npy
Simple baseline representation built from UKB ROI time series.
- fmri_eids_180.npy (shape (N_fmri,), dtype int/str): EID per row
- fmri_X_180.npy (shape (N_fmri, 180), dtype float32): ROI feature vector per subject (time‑mean), HCP MMP1 180‑ROI layout
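Since each feature row is identified only by its entry in the EID array, pairing these features with another table is a match on IDs. The sketch below is one possible way to do that, assuming the two sources do not share row order and that IDs are unique within each array. The helper name match_rows is hypothetical, and both ID arrays are cast to str because the dtype may be int or str.

```python
import numpy as np


def match_rows(eids_a, eids_b):
    """Return (common_ids, idx_a, idx_b) so that
    eids_a[idx_a] and eids_b[idx_b] list the same subjects in the same order."""
    a = np.asarray(eids_a).astype(str)
    b = np.asarray(eids_b).astype(str)
    # intersect1d sorts the shared IDs and reports where each one sits in a and b.
    return np.intersect1d(a, b, return_indices=True)


# Toy example with made-up IDs: integer EIDs on one side, string EIDs on the other.
fmri_eids = np.array([3, 1, 2])
other_eids = np.array(["2", "3", "9"])
common, idx_fmri, idx_other = match_rows(fmri_eids, other_eids)
```

Indexing the feature matrix with idx_fmri (for example X[idx_fmri]) and the other table with idx_other then yields row-aligned arrays restricted to the shared subjects.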
Optional variants:
- fmri_eids_360.npy / fmri_X_360.npy for native 360‑d variants
- If mixing 180/360 sources, a controlled fallback can map 360→180 by reshaping (360,)→(2,180) and averaging halves.
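The 360→180 fallback described above can be sketched as follows. This assumes the 360-d vector stores two contiguous 180-parcel blocks in the same parcel order (presumably the two hemispheres); if the blocks are interleaved or ordered differently, the average mixes unrelated parcels. The function name fold_360_to_180 is hypothetical.

```python
import numpy as np


def fold_360_to_180(x):
    """Average the two 180-entry halves of a (360,) or (N, 360) array."""
    x = np.asarray(x)
    if x.shape[-1] != 360:
        raise ValueError(f"expected last dimension 360, got {x.shape[-1]}")
    # (..., 360) -> (..., 2, 180): row 0 is entries 0..179, row 1 is entries 180..359.
    halves = x.reshape(*x.shape[:-1], 2, 180)
    # Mean over the pair axis gives one value per parcel index.
    return halves.mean(axis=-2)


single = fold_360_to_180(np.arange(360, dtype=np.float32))   # shape (180,)
batch = fold_360_to_180(np.zeros((3, 360), dtype=np.float32))  # shape (3, 180)
```

Working on the last axis lets the same function handle a single subject and a full (N_fmri, 360) matrix.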
participants.parquet
Columns: eid, age, sex, income_bin, pcs_1..pcs_10, site, mdd_label
splits.json
Fields: fold_id, train/val/test EID lists, seed, created_at
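As an illustration of these fields, the sketch below parses one fold record and checks that no EID appears in more than one partition. The exact JSON layout is an assumption (a single object with keys fold_id, train, val, test, seed, created_at), all values are made up, and check_split is a hypothetical helper, not part of the repository.

```python
import json

# Hypothetical record: key names follow the field list, values are invented.
EXAMPLE_FOLD = """
{
  "fold_id": 0,
  "train": [1000001, 1000002, 1000003],
  "val": [1000004],
  "test": [1000005, 1000006],
  "seed": 42,
  "created_at": "2024-01-01T00:00:00Z"
}
"""


def check_split(fold):
    """Raise ValueError if any EID is shared between train, val, and test."""
    parts = {name: set(fold[name]) for name in ("train", "val", "test")}
    for a, b in (("train", "val"), ("train", "test"), ("val", "test")):
        overlap = parts[a] & parts[b]
        if overlap:
            raise ValueError(f"{len(overlap)} EIDs appear in both {a} and {b}")
    # Return the number of unique EIDs per partition.
    return {name: len(ids) for name, ids in parts.items()}


sizes = check_split(json.loads(EXAMPLE_FOLD))
```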
Validation:
- Dataset cards: python scripts/manage_kb.py validate datasets
- Yoon/GENNIElab DNABERT‑2 export roots: python scripts/check_embedding_alignment.py ...