Multimodal AI

Definition

Multimodal AI processes multiple input (and sometimes output) modalities in a single system: for example, vision-language models (VLMs) for image QA and captioning, or embodied agents that use vision and language together.

It extends NLP and computer vision by aligning or fusing modalities (text, image, audio, video). CLIP-style models learn a shared embedding space for retrieval and zero-shot classification; generative VLMs (e.g., LLaVA, GPT-4V) perform captioning, QA, and reasoning over images. Multimodal models are used in agents, RAG with images, and accessibility tools.
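The zero-shot classification idea can be sketched in a few lines: embed the image and each candidate label's text into the shared space, then pick the label whose embedding is closest. The embeddings below are toy stand-ins, not outputs of a real encoder.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy stand-ins for real encoder outputs: in a CLIP-style model these
# would come from learned text and image encoders sharing one space.
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
}
image_embedding = [0.8, 0.2, 0.1]  # pretend image-encoder output

# Zero-shot classification: choose the label whose text embedding is
# closest to the image embedding.
best = max(label_embeddings,
           key=lambda lbl: cosine(label_embeddings[lbl], image_embedding))
print(best)  # → a photo of a cat
```

The same similarity-ranking step underlies cross-modal retrieval; only the candidate set changes.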

How it works

Each modality (text, image, and optionally others) passes through an encoder, either separate or shared. Fusion can be a shared embedding space (e.g., CLIP: a contrastive loss pulls matching text-image pairs close) or a unified transformer that attends over all modalities. The output can be a classification, caption, answer, or next token. Alignment is learned via contrastive (CLIP) or generative (captioning, VLM) objectives on paired data. At inference, embed or encode all inputs, then run the fusion model to produce the output.
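The contrastive objective mentioned above can be illustrated with a minimal, dependency-free sketch of a symmetric InfoNCE-style loss: given an N×N text-image similarity matrix whose diagonal holds the matched pairs, cross-entropy is computed in both directions. This is a simplification of what CLIP-style training does per batch, with hand-picked toy similarities.

```python
import math

def info_nce(sim_matrix, temperature=0.07):
    # Symmetric contrastive (InfoNCE-style) loss over an NxN similarity
    # matrix; the correct match for row i is column i (the diagonal).
    n = len(sim_matrix)

    def ce_rows(m):
        # Mean cross-entropy where the target class for row i is column i.
        loss = 0.0
        for i in range(n):
            logits = [s / temperature for s in m[i]]
            mx = max(logits)  # subtract max for numerical stability
            log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
            loss += log_z - logits[i]
        return loss / n

    transposed = [list(col) for col in zip(*sim_matrix)]
    # Average the image->text and text->image directions.
    return 0.5 * (ce_rows(sim_matrix) + ce_rows(transposed))

# Matched pairs (diagonal) more similar than mismatches -> low loss.
aligned = [[0.9, 0.1], [0.2, 0.8]]
shuffled = [[0.1, 0.9], [0.8, 0.2]]
print(info_nce(aligned) < info_nce(shuffled))  # → True
```

In a real system the similarity matrix is the (scaled) dot product of normalized text and image embeddings for a batch, and the loss gradients update both encoders.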

Use cases

Multimodal models fit when the task combines two or more modalities (e.g., text and image) in input or output.

  • Image captioning, visual QA, and document understanding (text + image)
  • Cross-modal retrieval (e.g., search images by text, or vice versa)
  • Embodied agents and robots that use vision and language
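The cross-modal retrieval case above reduces to nearest-neighbor search in the shared space: pre-compute image embeddings once, then rank them against a text query embedding. The index and query vectors below are hypothetical placeholders for real encoder outputs.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical pre-computed image embeddings (in practice, the output of
# the image encoder, stored in a vector index).
image_index = {
    "beach.jpg":  [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg":   [0.2, 0.2, 0.9],
}
# Pretend text-encoder output for a query like "sunny beach".
query_embedding = [0.85, 0.15, 0.05]

# Rank images by similarity to the text query, most similar first.
ranked = sorted(image_index,
                key=lambda name: cosine(image_index[name], query_embedding),
                reverse=True)
print(ranked[0])  # → beach.jpg
```

At scale the sorted-list scan is replaced by an approximate nearest-neighbor index, but the ranking criterion is the same.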

External documentation

See also