Multimodal AI

Definition

Multimodal AI handles multiple input (and sometimes output) modalities in one system, e.g. vision-language models (VLMs) for image question answering and captioning, or embodied agents that combine vision and language.

It extends NLP and computer vision by aligning or fusing modalities (text, image, audio, video). CLIP-style models learn a shared embedding space for retrieval and zero-shot classification; generative VLMs (e.g. LLaVA, GPT-4V) perform captioning, QA, and reasoning over images. Multimodal models are used in agents, image-augmented RAG, and accessibility tools.
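The zero-shot classification mentioned above can be sketched with a few lines of NumPy: given an image embedding and text embeddings of class prompts (e.g. "a photo of a dog") that live in the same CLIP-style shared space, classification reduces to a softmax over cosine similarities. The embeddings and the temperature value here are illustrative placeholders, not outputs of a real encoder.

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def zero_shot_classify(image_emb, class_prompt_embs, temperature=0.07):
    """Score an image against text embeddings of class prompts.

    image_emb: (dim,) embedding of the image.
    class_prompt_embs: (num_classes, dim) embeddings of text prompts,
        one per class, from the same shared space as the image encoder.
    Returns a probability distribution over classes.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_prompt_embs / np.linalg.norm(class_prompt_embs, axis=1, keepdims=True)
    sims = txt @ img / temperature  # cosine similarities, sharpened by temperature
    return softmax(sims)
```

No retraining is needed to add a class: appending one more prompt embedding extends the classifier, which is what makes this "zero-shot".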

How it works

Each modality (text, image, and optionally others) passes through an encoder (separate or shared). Fusion is either a shared embedding space (e.g. CLIP, where a contrastive loss pulls matching text-image pairs together) or a unified transformer that attends over all modalities. The output can be a classification, a caption, an answer, or the next token. Alignment is learned from paired data via contrastive (CLIP) or generative (captioning, VLM) objectives. At inference, encode all inputs, then run the fusion model to produce the output.
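The contrastive objective described above can be sketched as a symmetric cross-entropy over a batch's similarity matrix, as in CLIP. This is a minimal NumPy sketch of the loss only; the encoder outputs and the temperature value are assumed placeholders, and a real implementation would use an autodiff framework to train the encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosines.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matching (image, text) pairs.

    img_emb, txt_emb: (batch, dim) outputs of the two encoders;
    row i of each array is assumed to be a matching pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # the diagonal holds the matching pairs

    def cross_entropy(logits, labels):
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pushes matching pairs toward high similarity and mismatched pairs toward low similarity, which is exactly what yields the shared embedding space used for retrieval and zero-shot classification.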

Use cases

Multimodal models fit when the task combines two or more modalities (e.g. text and image) in its input or output.

  • Image captioning, visual QA, and document understanding (text + image)
  • Cross-modal retrieval (e.g. search images by text, or vice versa)
  • Embodied agents and robots that use vision and language
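The cross-modal retrieval use case above reduces to nearest-neighbor search in the shared space: embed the text query with the text encoder, embed the gallery of images with the image encoder, and rank by cosine similarity. The embeddings in this sketch are illustrative; a real system would obtain them from a trained model such as CLIP and typically use an approximate-nearest-neighbor index at scale.

```python
import numpy as np

def retrieve(query_emb, gallery_embs, k=3):
    """Rank gallery items (e.g. image embeddings) by cosine similarity
    to a query embedding (e.g. a text embedding) from the same space.

    Returns the indices of the top-k items and their similarity scores.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                    # cosine similarity of each gallery item
    top = np.argsort(-sims)[:k]     # indices sorted by descending similarity
    return top, sims[top]
```

Searching images by text and text by image use the same function with the roles of query and gallery swapped, since both modalities live in one space.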
