Multimodal AI

Definition

Multimodal AI handles multiple input (and sometimes output) modalities in one system, e.g. vision-language models (VLMs) for image question answering and captioning, or embodied agents that combine vision and language.

It extends NLP and computer vision by aligning or fusing modalities (text, image, audio, video). CLIP-style models learn a shared embedding space for retrieval and zero-shot classification; generative VLMs (e.g. LLaVA, GPT-4V) perform captioning, QA, and reasoning over images. Multimodal models are used in agents, image-augmented RAG, and accessibility tools.
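The zero-shot classification mentioned above can be sketched with a few lines of NumPy: given an image embedding and text embeddings of class prompts (e.g. "a photo of a dog") that live in the same CLIP-style shared space, classification reduces to a softmax over cosine similarities. The embeddings and the temperature value here are illustrative placeholders, not outputs of a real encoder.

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def zero_shot_classify(image_emb, class_prompt_embs, temperature=0.07):
    """Score an image against text embeddings of class prompts.

    image_emb: (dim,) embedding of the image.
    class_prompt_embs: (num_classes, dim) embeddings of text prompts,
        one per class, from the same shared space as the image encoder.
    Returns a probability distribution over classes.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_prompt_embs / np.linalg.norm(class_prompt_embs, axis=1, keepdims=True)
    sims = txt @ img / temperature  # cosine similarities, sharpened by temperature
    return softmax(sims)
```

No retraining is needed to add a class: appending one more prompt embedding extends the classifier, which is what makes this "zero-shot".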

How it works

Each modality (text, image, and optionally others) passes through an encoder (separate or shared). Fusion is either a shared embedding space (e.g. CLIP, where a contrastive loss pulls matching text-image pairs together) or a unified transformer that attends over all modalities. The output can be a classification, a caption, an answer, or the next token. Alignment is learned from paired data via contrastive (CLIP) or generative (captioning, VLM) objectives. At inference, encode all inputs, then run the fusion model to produce the output.
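The contrastive objective described above can be sketched as a symmetric cross-entropy over a batch's similarity matrix, as in CLIP. This is a minimal NumPy sketch of the loss only; the encoder outputs and the temperature value are assumed placeholders, and a real implementation would use an autodiff framework to train the encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosines.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matching (image, text) pairs.

    img_emb, txt_emb: (batch, dim) outputs of the two encoders;
    row i of each array is assumed to be a matching pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # the diagonal holds the matching pairs

    def cross_entropy(logits, labels):
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pushes matching pairs toward high similarity and mismatched pairs toward low similarity, which is exactly what yields the shared embedding space used for retrieval and zero-shot classification.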

Use cases

Multimodal models fit when the task combines two or more modalities (e.g. text and image) in its input or output.

  • Image captioning, visual QA, and document understanding (text + image)
  • Cross-modal retrieval (e.g. search images by text, or vice versa)
  • Embodied agents and robots that use vision and language
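The cross-modal retrieval use case above reduces to nearest-neighbor search in the shared space: embed the text query with the text encoder, embed the gallery of images with the image encoder, and rank by cosine similarity. The embeddings in this sketch are illustrative; a real system would obtain them from a trained model such as CLIP and typically use an approximate-nearest-neighbor index at scale.

```python
import numpy as np

def retrieve(query_emb, gallery_embs, k=3):
    """Rank gallery items (e.g. image embeddings) by cosine similarity
    to a query embedding (e.g. a text embedding) from the same space.

    Returns the indices of the top-k items and their similarity scores.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                    # cosine similarity of each gallery item
    top = np.argsort(-sims)[:k]     # indices sorted by descending similarity
    return top, sims[top]
```

Searching images by text and text by image use the same function with the roles of query and gallery swapped, since both modalities live in one space.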
