Flamingo: a Visual Language Model for Few-Shot Learning
Authors: Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, et al. (DeepMind)
Year: 2022
Venue: NeurIPS 2022
1. Classification
- Domain Category:
  - Vision / VLM / Multimodal FM
  - Flamingo is a visual language model (VLM) that integrates vision and language for few-shot learning on image and video understanding tasks.
- FM Usage Type:
  - Core FM development + multimodal FM or cross-modal integration
- Key Modalities:
  - Images (high-resolution, from web data)
  - Videos (short clips, ~22 seconds on average)
  - Text (interleaved captions, questions, answers)
2. Executive Summary
Flamingo is a family of Visual Language Models (VLMs) that achieve state-of-the-art few-shot learning on image and video understanding tasks by being prompted with a few input/output examples—analogous to GPT-3's few-shot text learning. The model bridges pretrained vision and language models through novel architectural components: a Perceiver Resampler that converts variable-size visual features into fixed visual tokens, and GATED XATTN-DENSE layers that condition frozen language models on visual representations via gated cross-attention. Flamingo handles arbitrarily interleaved sequences of images/videos and text, enabling natural few-shot prompting. Trained on billions of web-scraped multimodal examples (interleaved image-text from webpages, image-text pairs, video-text pairs) without task-specific annotations, a single Flamingo model achieves new state-of-the-art few-shot performance on 16 diverse benchmarks and outperforms fine-tuned models on 6 tasks despite using only 32 examples (1000× less data). The largest model (Flamingo-80B) sets new records on VQA and captioning tasks.
3. Problem Setup and Motivation
- Scientific / practical problem:
  - Current vision-language models require extensive task-specific fine-tuning with thousands of annotated examples.
  - Contrastive models (e.g., CLIP) enable zero-shot classification but lack generative capabilities for open-ended tasks like captioning and VQA.
- Goal: Build a model that rapidly adapts to new vision-language tasks using only a few examples, analogous to GPT-3's few-shot learning for text.
- Why this is hard:
  - Bridging vision and language: Vision encoders and language models are trained separately; connecting them effectively while preserving the knowledge of both is non-trivial.
  - Handling interleaved multimodal sequences: Few-shot learning requires processing sequences like (image₁, text₁), (image₂, text₂), ..., (query_image, ?).
  - Variable-size visual inputs: Images and videos have variable resolutions and lengths; language models expect fixed-size token sequences.
  - Large-scale training data: Few-shot learning requires massive pretraining on diverse multimodal data (billions of examples).
  - Training stability: Combining frozen pretrained models with new trainable components requires careful initialization and gating mechanisms.
4. Data and Modalities
- Pretraining data:
  - M3W (MultiModal MassiveWeb): ~43M webpages with interleaved images and text (up to 5 images per sequence, 256 text tokens).
  - ALIGN: 1.8B image-text pairs with alt-text descriptions.
  - LTIP (Long Text & Image Pairs): 312M image-text pairs with longer, higher-quality descriptions.
  - VTP (Video & Text Pairs): 27M short videos (~22 seconds on average) with sentence descriptions.
- Modalities:
  - Images: High-resolution images from webpages and image-text pairs.
  - Videos: Short video clips (frames sampled at 1 FPS) with temporal embeddings.
  - Text: Captions, questions, answers, and descriptions, interleaved with visual content.
- Preprocessing / representation:
  - Vision encoder: Pretrained NFNet-F6 (Normalizer-Free ResNet) with contrastive pretraining; outputs a 2D spatial grid of features flattened to a 1D sequence.
  - Perceiver Resampler: Converts a variable number of visual features into a fixed set of 64 visual tokens using learned latent queries.
  - Text: Tokenized with the language model's tokenizer; special tokens `<image>` and `<EOC>` (end of chunk) mark where visual inputs occur and where each text chunk ends (see the sketch below).
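To make the interleaved input format concrete, here is a minimal Python sketch of how a few-shot prompt could be assembled with the `<image>` and `<EOC>` tokens. The `build_few_shot_prompt` helper is an illustrative assumption, not the authors' code; the "Output:" phrasing follows the captioning-style prompts described in the paper.

```python
# Minimal sketch (not the authors' code): assembling an interleaved few-shot prompt.
# Each visual input is marked in the text stream by an <image> token, and each
# support example's text is terminated by <EOC> (end of chunk); the actual images
# are fed to the vision encoder separately, in the same order as the <image> tokens.

from typing import List


def build_few_shot_prompt(support_texts: List[str], query_text: str) -> str:
    chunks = []
    for text in support_texts:
        # One support example: image placeholder, then its associated text.
        chunks.append(f"<image>{text}<EOC>")
    # Query: image placeholder followed by the partial text the model must continue.
    chunks.append(f"<image>{query_text}")
    return "".join(chunks)


# Example: a 2-shot captioning prompt (three images are supplied to the vision encoder).
prompt = build_few_shot_prompt(
    support_texts=[
        "Output: A dog catching a frisbee in a park.",
        "Output: Two children building a sandcastle on a beach.",
    ],
    query_text="Output:",
)
print(prompt)
```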
5. Model / Foundation Model
- Model Type:
  - Multimodal autoregressive language model that generates text conditioned on interleaved visual and textual inputs.
- Is it a new FM or an existing one?
  - New FM. Flamingo introduces a new family of VLMs with specific architectural innovations for few-shot learning.
- Key components and innovations:
| Aspect | Details |
|---|---|
| Vision encoder | Pretrained NFNet-F6 (frozen) with contrastive pretraining |
| Perceiver Resampler | Converts variable visual features → fixed 64 visual tokens via learned queries |
| Language model | Chinchilla LM (1.4B, 7B, 70B) - frozen |
| GATED XATTN-DENSE layers | Interleaved between LM layers: gated cross-attention + gated FF, initialized at 0 |
| Image-causal masking | Text tokens attend only to immediately preceding image, not all previous images |
| Model sizes | Flamingo-3B, 9B, 80B (based on Chinchilla 1.4B, 7B, 70B) |
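The Perceiver Resampler row above can be sketched as follows: a small set of learned latent queries cross-attends to a variable-length sequence of visual features and always returns 64 visual tokens. This is a simplified PyTorch sketch, not the DeepMind implementation; layer count, dimensions, and normalization details are placeholder assumptions.

```python
# Minimal PyTorch sketch (simplified) of the Perceiver Resampler idea: learned
# latent queries cross-attend to a variable-length sequence of visual features,
# producing a fixed number of visual tokens (64 in the paper) regardless of
# image resolution or number of video frames.

import torch
import torch.nn as nn


class PerceiverResampler(nn.Module):
    def __init__(self, dim: int = 1024, num_latents: int = 64,
                 depth: int = 2, num_heads: int = 8):
        super().__init__()
        # Learned latent queries: these become the fixed-size visual token output.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "ff": nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                    nn.GELU(), nn.Linear(4 * dim, dim)),
                "norm_q": nn.LayerNorm(dim),
                "norm_kv": nn.LayerNorm(dim),
            })
            for _ in range(depth)
        ])

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches_or_frames, dim) -- variable length.
        b = visual_feats.shape[0]
        x = self.latents.unsqueeze(0).expand(b, -1, -1)   # (b, 64, dim)
        for layer in self.layers:
            # Keys/values include both the visual features and the current latents.
            kv = layer["norm_kv"](torch.cat([visual_feats, x], dim=1))
            q = layer["norm_q"](x)
            attn_out, _ = layer["attn"](q, kv, kv)
            x = x + attn_out
            x = x + layer["ff"](x)
        return x                                          # (b, 64, dim)


# Usage: 8 frames x 257 patches flattened into one long sequence still yields 64 tokens.
feats = torch.randn(2, 8 * 257, 1024)
print(PerceiverResampler()(feats).shape)  # torch.Size([2, 64, 1024])
```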
- Training setup (high level):
  - Objective: Autoregressive text generation conditioned on interleaved visual inputs.
  - Loss: Weighted sum of per-dataset negative log-likelihoods.
  - Training strategy: Gradient accumulation over all datasets (outperforms round-robin); see the sketch after this list.
  - Few-shot adaptation: No fine-tuning; simply prompt with (image, text) example pairs followed by the query.
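A minimal sketch of the objective and accumulation strategy listed above, assuming a hypothetical `model` that returns the mean per-token negative log-likelihood for a batch; the dataset names and weights are placeholders, not the paper's values.

```python
# Minimal sketch (illustrative, not the paper's training code) of the objective:
# a weighted sum of per-dataset negative log-likelihoods, with gradients
# accumulated across all datasets before each optimizer step (the strategy the
# paper found to outperform round-robin updates).

import torch


def training_step(model, optimizer, dataset_batches, dataset_weights):
    """dataset_batches: {name: (visual_inputs, text_tokens)} -- one batch per dataset.
    dataset_weights: {name: lambda_m} -- per-dataset loss weights (tuned hyperparameters).
    """
    optimizer.zero_grad()
    total_loss = 0.0
    for name, (visual, text) in dataset_batches.items():
        # Negative log-likelihood of the text given the interleaved visual inputs.
        nll = model(visual, text)            # assumed to return mean NLL per token
        weighted = dataset_weights[name] * nll
        weighted.backward()                  # accumulate gradients across datasets
        total_loss += weighted.item()
    optimizer.step()                         # single update over the combined gradient
    return total_loss
```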
6. Multimodal / Integration Aspects (If Applicable)
- Modalities integrated:
  - Vision (images/videos) and text, through late fusion with cross-attention.
- How integration works:
  - Vision and language are processed separately by frozen encoders, then fused via GATED XATTN-DENSE layers (sketched after this list).
  - The Perceiver Resampler bridges the vision encoder and the language model by converting visual features into a fixed set of visual tokens.
  - Interleaved sequences: Arbitrary mixing of visual and textual inputs is supported through image-causal masking.
- Why this integration is useful / new capabilities:
  - Few-shot learning: The model adapts to new tasks by seeing a few (image, text) examples.
  - Open-ended generation: Can generate captions, answers, and descriptions conditioned on images/videos.
  - Multi-image reasoning: Processes sequences of multiple images with interleaved text (e.g., visual dialogue).
  - Zero-shot capabilities: Works out of the box on tasks not seen during training.
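The GATED XATTN-DENSE fusion mentioned above can be sketched as follows (simplified PyTorch, assuming single scalar tanh gates and omitting the image-causal cross-attention mask). Because both gates start at 0, each inserted block is initially an identity map, so the frozen LM's behavior is unchanged at the start of training.

```python
# Minimal PyTorch sketch (simplified, not the DeepMind implementation) of a
# GATED XATTN-DENSE block: gated cross-attention from text to visual tokens plus
# a gated feed-forward layer, inserted between frozen LM layers. Both tanh gates
# are initialized at 0, so the block starts as an identity function.

import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)
        # Scalar gates initialized to zero => tanh(0) = 0 => block is initially identity.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, dim) from the frozen LM layer below.
        # visual_tokens: (batch, 64, dim) from the Perceiver Resampler.
        # (The paper additionally masks cross-attention so each text token only sees
        # its most recent preceding image; that mask is omitted in this sketch.)
        q = self.norm(text_hidden)
        attn_out, _ = self.attn(q, visual_tokens, visual_tokens)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x  # passed on to the next (frozen) LM layer


# At initialization the output equals the input, so the pretrained LM is preserved:
h = torch.randn(2, 16, 1024)
v = torch.randn(2, 64, 1024)
print(torch.allclose(GatedCrossAttentionBlock()(h, v), h))  # True
```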
7. Experiments and Results
- Benchmarks:
  - 16 diverse tasks: VQAv2, OK-VQA, COCO captioning, TextVQA, VizWiz, MSRVTTQA, VATEX, VisDial, HatefulMemes, etc.
- Baselines:
  - Fine-tuned task-specific models, CLIP, and other VLMs.
- Key findings (trends):
  - State-of-the-art few-shot performance: Flamingo-80B sets a new few-shot SotA on all 16 tasks with 4-32 shots.
  - Outperforms fine-tuned models on 6 tasks despite using only 32 examples (vs. thousands for fine-tuning).
  - Performance by task:
    - VQAv2: 67.6% (32-shot) vs. 80.2% (fine-tuned SotA)
    - COCO captioning: 113.8 CIDEr (32-shot) vs. 143.3 (fine-tuned)
    - Strong video understanding on MSRVTTQA, VATEX, and NextQA
  - Scaling: Performance improves with model size (3B → 9B → 80B) and with the number of shots (0 → 4 → 32).
- Ablations:
  - The Perceiver Resampler outperforms plain Transformer and MLP alternatives.
  - The gating mechanism (tanh gates initialized at 0) improves training stability and performance.
  - Image-causal masking (attending only to the immediately preceding image) outperforms attending to all previous images; see the masking sketch after this list.
  - Dataset weighting is crucial; gradient accumulation over all datasets outperforms round-robin sampling.
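The image-causal masking ablated above can be illustrated with a small sketch. It assumes the positions of the `<image>` tokens in the text stream are known; each text token is allowed to cross-attend only to the single most recent preceding image, while dependence on earlier images arises indirectly through the LM's causal self-attention over text.

```python
# Minimal sketch (illustrative, not the paper's code) of image-causal masking for
# cross-attention: each text token may attend only to the visual tokens of the
# image/video that most recently precedes it in the interleaved sequence.

import torch


def image_causal_mask(image_positions: list, text_len: int, num_images: int) -> torch.Tensor:
    """image_positions: index in the text stream where each <image> token occurs.
    Returns a (text_len, num_images) boolean mask: True = attention allowed.
    """
    mask = torch.zeros(text_len, num_images, dtype=torch.bool)
    for t in range(text_len):
        # Images whose <image> token appears at or before position t.
        preceding = [i for i, pos in enumerate(image_positions) if pos <= t]
        if preceding:
            mask[t, preceding[-1]] = True  # attend only to the latest preceding image
    return mask


# Example: two images at positions 0 and 5 in a 10-token text stream.
print(image_causal_mask([0, 5], text_len=10, num_images=2).int())
# Tokens 0-4 attend to image 0; tokens 5-9 attend to image 1.
```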
8. Strengths, Limitations, and Open Questions
Strengths:
- Powerful few-shot learning: Achieves SotA on many tasks with just 4-32 examples, dramatically reducing annotation requirements.
- Open-ended generation: Can generate free-form text (captions, answers) unlike contrastive models (CLIP).
- Handles diverse tasks: Single model works on classification, captioning, VQA, dialogue, video understanding.
- Leverages pretrained models: Effectively combines frozen vision and language models, preserving their knowledge.
- Scalable architecture: Works across model sizes (3B to 80B) with consistent improvements.
Limitations:
- Still behind fine-tuned models: On some tasks, fine-tuned models with thousands of examples outperform Flamingo's few-shot performance.
- Compute intensive: Training on billions of examples and 80B parameters requires massive compute resources.
- Limited to vision-language: Doesn't handle other modalities (audio, 3D, biological data).
- Frozen encoders: Cannot adapt vision or language encoders to new domains without retraining.
Open questions and future directions:
- How can few-shot performance be further improved to match or exceed fine-tuned models across all tasks?
- Can similar architectures be extended to other modalities (audio, 3D scenes, biological data)?
- How to make training more compute-efficient while maintaining few-shot capabilities?
- Can the gated cross-attention mechanism be adapted to biological multimodal settings (gene-brain-behavior)?
9. Context and Broader Impact
- Position in the landscape:
  - Flamingo demonstrates that large-scale web-data training enables powerful in-context learning (previously seen only in text-only LLMs) for multimodal tasks.
  - Bridges the gap between contrastive models (CLIP) and generative models, offering both zero-shot and few-shot capabilities.
- Relation to well-known ideas:
  - Extends GPT-3's few-shot learning paradigm to vision-language tasks.
  - Uses Perceiver-style cross-attention for vision-language bridging.
  - Combines frozen encoders (preserving pretrained knowledge) with trainable connectors (enabling multimodal fusion).
- Why this paper is a useful reference:
  - For multimodal FM research: Provides a blueprint for bridging vision and language FMs with minimal trainable parameters.
  - For gene-brain-behavior integration: Architectural principles (Perceiver Resampler, gated cross-attention, interleaved sequences) could be adapted to biological multimodal settings.
  - For few-shot learning: Demonstrates the power of large-scale web data for enabling in-context learning.
10. Key Takeaways (Bullet Summary)
- Problem:
  - Vision-language models require extensive fine-tuning; the goal is to enable few-shot learning like GPT-3.
- Method / model:
  - Flamingo is a family of VLMs (3B to 80B) that bridge frozen vision encoders and language models via a Perceiver Resampler and GATED XATTN-DENSE layers.
  - Trained on billions of web-scraped multimodal examples (interleaved image-text, image-text pairs, video-text pairs).
- Results:
  - State-of-the-art few-shot performance on 16 benchmarks; outperforms fine-tuned models on 6 tasks with only 32 examples.
  - The largest model (Flamingo-80B) sets new records on VQA and captioning.
- Why it matters:
  - Shows that large-scale web data enables powerful in-context learning for multimodal tasks.
  - Demonstrates effective architectural patterns for bridging frozen pretrained models.
  - Provides a reference for adapting few-shot learning to biological multimodal settings (gene-brain-behavior).