Data schemas

genetics_embeddings.parquet

  • eid
  • embedding_dim
  • source_model
  • layer
  • vector

dnabert2_embedding_tree/ (Yoon/GENNIElab UKB exports)

Directory-style export used by the Yoon/GENNIElab DNABERT‑2 UKB embedding drops (Google Drive).

At embedding root:

  • iids.npy (shape (N,), dtype str/int): subject IDs in row order
  • labels.npy (shape (N,), dtype int): aligned labels
  • covariates_age.npy (shape (N,))
  • covariates_sex.npy (shape (N,))

Per gene folder:

  • <EMBED_ROOT>/<GENE>/embeddings_{k}_layer_last.npy (float32, shape (n_k, F))
  • Typical: F=768, k=1..49, and Σ_k n_k = N

Alignment contract (critical):

  • Row r of a gene's embeddings (chunks concatenated in order of k) corresponds to iids[r], and the same r indexes labels/age/sex.
  • Chunk row counts match across genes → row order is stable across genes.

Re-check with python scripts/check_embedding_alignment.py --embed-root ... --gene-list ... --n-chunks 49.
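The alignment contract above can also be spot-checked in a few lines of NumPy. The following is a minimal sketch, not the contents of scripts/check_embedding_alignment.py: the function names load_gene_embeddings and check_alignment are hypothetical, and it assumes chunks are numbered k=1..n_chunks and that iids.npy is stored as a plain string or integer array (an object-dtype array would additionally need allow_pickle=True).

```python
from pathlib import Path

import numpy as np


def load_gene_embeddings(embed_root, gene, n_chunks=49):
    """Stack one gene's chunk files into an (N, F) array (hypothetical helper).

    Returns the stacked array and the per-chunk row counts [n_1, ..., n_K].
    """
    gene_dir = Path(embed_root) / gene
    chunks = [
        np.load(gene_dir / f"embeddings_{k}_layer_last.npy")
        for k in range(1, n_chunks + 1)
    ]
    return np.concatenate(chunks, axis=0), [c.shape[0] for c in chunks]


def check_alignment(embed_root, genes, n_chunks=49):
    """Raise ValueError if any array breaks the row-alignment contract."""
    root = Path(embed_root)
    n = np.load(root / "iids.npy").shape[0]

    # Root-level arrays must all have N rows.
    for name in ("labels", "covariates_age", "covariates_sex"):
        rows = np.load(root / f"{name}.npy").shape[0]
        if rows != n:
            raise ValueError(f"{name}.npy has {rows} rows, expected {n}")

    # Every gene must sum to N rows and share the same chunk row counts.
    ref_counts = None
    for gene in genes:
        emb, counts = load_gene_embeddings(root, gene, n_chunks)
        if emb.shape[0] != n:
            raise ValueError(f"{gene}: {emb.shape[0]} rows, expected {n}")
        if ref_counts is None:
            ref_counts = counts
        elif counts != ref_counts:
            raise ValueError(f"{gene}: chunk row counts differ from first gene")
    return n
```

Matching row counts are a necessary condition only; they do not prove that rows were written in the same subject order, so the repository script remains the reference check.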

brain_idps.parquet

  • eid
  • site
  • modality (sMRI or fMRI)
  • Selected IDPs:
      • sMRI: FreeSurfer 7.x aparc.stats (cortical thickness, ~68 regions) + aseg.stats (subcortical volumes, ~40 structures) + surface area → ~176 features
      • fMRI: Parcel-wise BOLD statistics (e.g., ROI mean), connectivity matrices (optional), or direct FM embeddings (BrainLM, Brain-JEPA)
  • Confounds:
      • intracranial_volume (sMRI)
      • mean_fd (mean framewise displacement, fMRI)
      • tsnr (temporal SNR, fMRI)
      • euler_number (FreeSurfer QC metric, sMRI)

ukb_fmri_roi_mean.npy

Simple baseline representation built from UKB ROI time series.

  • fmri_eids_180.npy (shape (N_fmri,), dtype int/str): EID per row
  • fmri_X_180.npy (shape (N_fmri, 180), dtype float32): ROI feature vector per subject (time‑mean), HCP MMP1 180‑ROI layout

Optional variants:

  • fmri_eids_360.npy / fmri_X_360.npy for native 360‑d variants
  • If mixing 180/360 sources, a controlled fallback can map 360→180 by reshaping (360,)→(2,180) and averaging the two halves.
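The 360→180 fallback can be sketched as below. The function name fold_360_to_180 is hypothetical, and the sketch assumes the 360 entries are ordered as two consecutive blocks of 180 parcels (for example one hemisphere after the other) with matching parcel order inside each block; if the source file interleaves hemispheres, this averaging would pair the wrong ROIs.

```python
import numpy as np


def fold_360_to_180(x360):
    """Average the two 180-entry halves of a 360-d ROI vector.

    Accepts shape (360,) or (N, 360) and returns (180,) or (N, 180).
    Assumes entries [0:180] and [180:360] list the same parcels in the
    same order (hypothetical layout, verify against the source atlas).
    """
    x360 = np.asarray(x360)
    if x360.shape[-1] != 360:
        raise ValueError(f"expected last dimension 360, got {x360.shape[-1]}")
    # (..., 360) -> (..., 2, 180), then mean over the axis of size 2.
    return x360.reshape(*x360.shape[:-1], 2, 180).mean(axis=-2)
```

Calling fold_360_to_180(fmri_X_360) on an (N_fmri, 360) array yields an (N_fmri, 180) array that can be stacked with native 180‑d rows, provided the EID arrays are kept in the same row order.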

participants.parquet

  • eid
  • age
  • sex
  • income_bin
  • pcs_1..pcs_10
  • site
  • mdd_label

splits.json

  • fold_id
  • train / val / test EID lists
  • seed
  • created_at
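A quick leakage check on a split is a natural companion to this schema. The sketch below is an illustration only: the function name check_split is hypothetical, and it assumes one fold is a JSON object with the keys listed above, where train, val, and test are each a flat list of EIDs. The actual layout of splits.json (for example, a list of folds) may differ.

```python
import json
from itertools import combinations


def check_split(split):
    """Verify that train/val/test EID lists are duplicate-free and disjoint.

    `split` is one fold as a dict with keys "train", "val", "test"
    (assumed layout). Returns the size of each partition.
    """
    sets = {}
    for name in ("train", "val", "test"):
        eids = split[name]
        sets[name] = set(eids)
        if len(sets[name]) != len(eids):
            raise ValueError(f"duplicate EIDs inside '{name}'")

    # No subject may appear in more than one partition.
    for a, b in combinations(sets, 2):
        overlap = sets[a] & sets[b]
        if overlap:
            raise ValueError(f"{len(overlap)} EIDs shared by '{a}' and '{b}'")
    return {name: len(s) for name, s in sets.items()}


# Example with an inline fold; a file would be read with json.load instead.
fold = json.loads('{"fold_id": 0, "train": [1, 2, 3], "val": [4], "test": [5]}')
sizes = check_split(fold)
```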

Validation:

  • Dataset cards: python scripts/manage_kb.py validate datasets
  • Yoon/GENNIElab DNABERT‑2 export roots: python scripts/check_embedding_alignment.py ...