Flamingo

Overview

Type: Visual language model (VLM)
Architecture: Perceiver-augmented vision encoder + gated cross-attention over a causal LM
Modality: Images, videos, text (interleaved)
Primary use: Few-shot multimodal understanding and text generation

Purpose & Design Philosophy

Flamingo extends large language models to the visual domain by bridging frozen vision and language backbones with a lightweight Perceiver Resampler and gated cross-attention layers. Instead of fine-tuning on each downstream task, Flamingo is trained once on large-scale web multimodal data and then adapted via in-context examples, mirroring GPT-3-style few-shot prompting for text (see arXiv:2204.14198).

Key idea: keep powerful pretrained vision and language models intact, and add minimal, well-behaved connectors that enable multimodal reasoning without catastrophic forgetting.
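
To make the few-shot interface concrete, here is a representative interleaved prompt in the style used by OpenFlamingo's README; the <image> and <|endofchunk|> markers are described under Architecture Highlights below, and the captions are illustrative.

```python
# Two in-context examples followed by a query image; the model completes the
# text after the final "<image>" token. Captions here are illustrative.
prompt = (
    "<image>An image of two cats.<|endofchunk|>"
    "<image>An image of a bathroom sink.<|endofchunk|>"
    "<image>An image of"
)
```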

Architecture Highlights

  • Vision encoder: Pretrained CLIP/OpenCLIP-style ViT (e.g., ViT-L/14) for images or video frames.
  • Perceiver Resampler: Converts variable-resolution feature maps into a fixed set of visual tokens via cross-attention from learnable latent queries.
  • Language model: Pretrained causal LM (e.g., Chinchilla/MPT/RedPajama), largely frozen.
  • GATED XATTN-DENSE layers: Inserted between LM blocks to cross-attend from text tokens to visual tokens, with tanh-gated residuals for stable training; a minimal sketch of the Resampler and the gating follows this list.
  • Interleaved inputs: Sequences of images/videos and text with <image> and <|endofchunk|> markers; image-causal masking ensures each text token attends only to the visual tokens of the most recent preceding image.
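
Below is a minimal PyTorch sketch of the two connectors, compressed for readability. The published model stacks several Resampler layers (with latents concatenated into the keys/values) and also gates a feed-forward sublayer, so treat the dimensions and class names here as illustrative rather than as the reference implementation.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress a variable number of visual features into `num_latents` tokens."""
    def __init__(self, dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # learnable queries
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: [batch, n_patches, dim] from a frozen ViT; n_patches may vary.
        b = visual_feats.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)      # [b, num_latents, dim]
        out, _ = self.attn(q, visual_feats, visual_feats)    # latents attend to patches
        out = out + self.ff(out)
        return out                                           # fixed [b, num_latents, dim]

class GatedCrossAttentionBlock(nn.Module):
    """Text tokens cross-attend to visual tokens; a tanh gate initialized at zero
    keeps the frozen LM's behavior intact at the start of training."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))             # tanh(0) = 0 -> identity

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(text, visual, visual)
        return text + torch.tanh(self.gate) * attended       # gated residual
```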

For implementation details, see the OpenFlamingo factory and Flamingo wrapper in the code walkthrough.
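
As a quick orientation before reading the walkthrough, the factory is typically invoked along these lines (argument names and values follow the OpenFlamingo README; the pinned versions in the walkthrough may differ):

```python
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,  # insert a gated xattn layer after every LM block
)
```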

Integration Strategy

For Neuro-Omics KB

Flamingo is not a primary model in current neuro-omics experiments but serves as a design reference for:

  • Scan-conditioned report generation: Replace the CLIP encoder with brain encoders (BrainLM, BrainMT, Brain Harmony) so fMRI/sMRI tokens play the role of image tokens.
  • Multimodal adapters: Reuse the Perceiver Resampler concept for compressing high-dimensional brain features into a fixed number of tokens (see the sketch after this list).
  • LLM semantic bridge: Use Flamingo-style gated cross-attention to inject brain/genetics embeddings into language models (see kb/model_cards/llm_semantic_bridge.yaml).
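
As a hedged illustration of the adapter idea, the PerceiverResampler sketch from the Architecture Highlights section could be pointed at brain features instead of ViT patches. The tensor shapes and encoder pairing below are hypothetical, not a committed interface.

```python
import torch

# Hypothetical: 400 parcel embeddings of width 1024 from a brain encoder
# (e.g., BrainLM); the resampler compresses them to 64 tokens for the LM.
resampler = PerceiverResampler(dim=1024, num_latents=64)
fmri_features = torch.randn(2, 400, 1024)   # [batch, n_parcels, dim]
brain_tokens = resampler(fmri_features)     # [batch, 64, 1024], fixed size
```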

For ARPA-H Brain-Omics Models

Flamingo illustrates how to:

  • Keep foundation encoders frozen while adding small multimodal connectors (see the sketch after this list).
  • Structure interleaved multimodal sequences that include context examples followed by a query.
  • Build few-shot-capable architectures without task-specific heads for every benchmark.
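
The freezing pattern itself is only a few lines of PyTorch. The attribute names below (perceiver, gated_xattn_layers) are placeholders for whatever connector modules a given stack exposes, not OpenFlamingo's actual attribute names.

```python
# Freeze everything, then re-enable gradients only for the added connectors.
for p in model.parameters():
    p.requires_grad = False
for connector in (model.perceiver, model.gated_xattn_layers):  # placeholder names
    for p in connector.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")
```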

These patterns carry over to Brain–Omics–LLM stacks that must reason jointly over genetics, brain imaging, and clinical text.

Reference Materials

Knowledge Base Resources

  • Paper summary: docs/generated/kb_curated/papers-md/flamingo_2022.md
  • Paper card (YAML): kb/paper_cards/flamingo_2022.yaml
  • Code walkthrough: docs/code_walkthroughs/flamingo_walkthrough.md
  • Model card (YAML): kb/model_cards/flamingo.yaml

Original Sources