# Me-LLaMA Code Walkthrough
KB references: Me-LLaMA paper note
## Overview
Me-LLaMA is a suite of open-source medical foundation models (13B/70B) developed through continual pre-training and instruction tuning of LLaMA 2. It leverages a heterogeneous corpus of biomedical literature (PubMed), clinical notes (MIMIC-IV/MIMIC-CXR), and general-domain data to balance domain specificity with general reasoning. The repository provides evaluation scripts, training recipes, and inference examples for both the base and chat-aligned versions. (external_repos/me-lamma/README.md, lines 33-36)
## At-a-Glance
| Architecture | Params / Scale | Training data | Inputs | Key capabilities | Repo |
|---|---|---|---|---|---|
| LLaMA-2 based (Continual Pre-training + LoRA Tuning) | 13B & 70B parameters | 129B tokens mixed corpus (15:1:4 biomedical:clinical:general) | Text (clinical notes, papers, guidelines) | Medical reasoning, instruction following, zero-shot evaluation on PubMedQA/MedQA. | GitHub / PhysioNet |
## Environment & Hardware Notes
- Installation: Requires `torch` and `transformers`. Evaluation dependencies are managed via `poetry` in `src/medical-evaluation`. (external_repos/me-lamma/README.md, lines 144-152)
- Compute: Developed on A100 GPUs (160x for pre-training) and H100 GPUs (8x for instruction tuning). Local inference runs on standard GPU setups via Hugging Face pipelines. (external_repos/me-lamma/README.md, lines 80-89)
- Access: Models require PhysioNet credentialed access; datasets are available via the Hugging Face collection. (external_repos/me-lamma/README.md, lines 41-44)
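Because the weights are distributed through PhysioNet rather than pulled from the Hugging Face Hub, local loading typically starts from a downloaded directory. A minimal sketch, assuming an illustrative local path and bf16 loading (the path and device settings below are assumptions, not taken from the README):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical path to the Me-LLaMA 13B checkpoint downloaded from PhysioNet.
MODEL_DIR = "/data/models/me-llama-13b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
# bf16 keeps the 13B model within a single modern GPU; device_map="auto"
# (requires the accelerate package) shards the 70B variant across GPUs.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```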
## Key Components
### Training Pipeline (README.md)
The training strategy emphasizes continual pre-training followed by instruction tuning:
1. Continual Pre-training: 129B tokens mixed from PubMed, MIMIC, and RedPajama in a 15:1:4 ratio. Uses AdamW, a cosine learning-rate scheduler with a 0.05 warmup ratio, and bf16 precision with DeepSpeed parallelism.
2. Instruction Tuning: 214K samples trained for 3 epochs using LoRA parameter-efficient fine-tuning on H100s.
This approach mitigates catastrophic forgetting while injecting specialized medical knowledge. (external_repos/me-lamma/README.md, lines 64-91)
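A minimal sketch of the instruction-tuning stage described above, using `peft` and `transformers` (the LoRA rank/alpha, target modules, batch sizes, and dataset wiring are illustrative assumptions; only the 3 epochs, cosine schedule, 0.05 warmup, and bf16 setting come from the README):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("path/to/Me-LLaMA")

# Illustrative LoRA hyperparameters -- the README does not pin these values.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

# Settings that mirror the README: 3 epochs, cosine schedule, 5% warmup, bf16.
args = TrainingArguments(
    output_dir="me-llama-instruct",
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
    per_device_train_batch_size=4,  # assumption
    gradient_accumulation_steps=8,  # assumption
)
# A transformers Trainer (or trl's SFTTrainer) would then consume `model`,
# `args`, and the 214K-sample instruction dataset.
```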
### Inference Stack (README.md)
Inference uses standard Hugging Face `transformers`. The README provides snippets for both high-level `pipeline` usage and low-level `AutoModelForCausalLM` control:
Basic Generation:

```python
from transformers import pipeline

# High-level pipeline: handles tokenization, generation, and decoding in one call.
pipe = pipeline("text-generation", model="path/to/Me-LLaMA")
print(pipe("The medical condition is characterized by", num_return_sequences=1))
```

Granular Control:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("path/to/Me-LLaMA")
model = AutoModelForCausalLM.from_pretrained("path/to/Me-LLaMA")

# "[INPUT]" is a placeholder for the actual prompt text.
input_ids = tokenizer("[INPUT]", return_tensors="pt").input_ids
gen = model.generate(input_ids, max_length=50)
print(tokenizer.decode(gen[0]))
```

(external_repos/me-lamma/README.md, lines 93-138)
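For longer clinical generations, the same `generate` call accepts the standard decoding controls. A hedged extension of the low-level snippet above (the sampling values are illustrative defaults, not settings taken from the README):

```python
# Continues from the "Granular Control" snippet above.
gen = model.generate(
    input_ids,
    max_new_tokens=256,                  # cap newly generated tokens instead of total length
    do_sample=True,                      # sample rather than greedy-decode
    temperature=0.7,                     # illustrative value, not prescribed by the repo
    top_p=0.9,                           # illustrative value, not prescribed by the repo
    pad_token_id=tokenizer.eos_token_id, # silences the missing-pad-token warning
)
print(tokenizer.decode(gen[0], skip_special_tokens=True))
```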
### Evaluation Harness (src/medical-evaluation)
The repository includes an evaluation suite under `src/medical-evaluation`, managed with `poetry` and driven by `src/eval.py`. It supports:
- Hugging Face Models: Evaluate local or Hub models (e.g., hf-causal-vllm) against medical benchmarks (PUBMEDQA, MedQA, BioNLI, etc.).
- Commercial APIs: Compare against GPT-4 by swapping the model argument.
- Metrics: Includes BARTScore integration (src/metrics/BARTScore).
Run Example:

```bash
poetry run python src/eval.py \
    --model "hf-causal-vllm" \
    --model_args "pretrained=meta-llama/Llama-2-7b-chat-hf" \
    --tasks "PUBMEDQA,MedQA,MedMCQA,..."
```

(external_repos/me-lamma/README.md, lines 139-190)
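For generation-style outputs, the bundled BARTScore code can also be invoked directly from Python. A minimal sketch, assuming `src/metrics/BARTScore` ships the `BARTScorer` class from the upstream BARTScore project and using an illustrative checkpoint:

```python
import sys
sys.path.append("src/metrics/BARTScore")  # make the bundled scorer importable

from bart_score import BARTScorer  # class name from the upstream BARTScore project

# facebook/bart-large-cnn is a common BARTScore backbone; confirm against the
# repo's own configuration before comparing numbers across runs.
scorer = BARTScorer(device="cuda:0", checkpoint="facebook/bart-large-cnn")
scores = scorer.score(
    ["Generated discharge summary ..."],  # candidate model outputs
    ["Reference discharge summary ..."],  # gold references
    batch_size=4,
)
print(scores)
```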
## Integration Hooks
- Benchmark Alignment: Use the task list in `scripts/run_evaluation.sh` ("PUBMEDQA,MedQA...") as a standard checklist for evaluating new KB models.
- Dataset Collection: The referenced Hugging Face collection is a valuable resource for populating `kb/datasets/`.
- Baseline Comparisons: Use the provided GPT-4 evaluation scripts to establish strong baselines for neuro-omics tasks.