Multimodal AI

Definition

Multimodal AI processes multiple input (and sometimes output) modalities in one system: e.g. vision-language models (VLMs) for image QA and captioning, or embodied agents that use vision and language.

It extends NLP and computer vision by aligning or fusing modalities (text, image, audio, video). CLIP-style models learn a shared embedding space for retrieval and zero-shot classification; generative VLMs (e.g. LLaVA, GPT-4V) perform captioning, QA, and reasoning over images. Typical uses include agents, RAG over images, and accessibility tools.
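As a sketch of the shared-embedding idea, the toy example below classifies an image embedding against text-prompt embeddings by cosine similarity. The vectors are made-up stand-ins for real encoder outputs, and `zero_shot_classify` is a hypothetical helper, not an actual CLIP API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def zero_shot_classify(image_emb, label_embs):
    """Pick the text label whose embedding is closest to the image embedding."""
    return max(label_embs, key=lambda label: cosine(image_emb, label_embs[label]))

# Toy 3-d embeddings standing in for CLIP encoder outputs (hypothetical values).
labels = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
}
image = [0.8, 0.2, 0.1]  # pretend output of the image encoder
print(zero_shot_classify(image, labels))  # -> a photo of a cat
```

In a real system the label set can change at inference time without retraining, which is what makes the shared space useful for zero-shot classification.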

How it works

Each modality (text, image, and optionally others) is passed through an encoder (separate or shared). Fusion can be a shared embedding space (e.g. CLIP: a contrastive loss pulls matching text-image pairs close together) or a unified transformer that attends over all modalities. The output can be a classification, caption, answer, or next token. Alignment is learned from paired data via contrastive (CLIP) or generative (captioning, VLM) objectives. At inference, embed or encode all inputs, then run the fusion model to produce the output.
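The contrastive objective can be illustrated with a minimal, framework-free sketch of the symmetric InfoNCE loss used in CLIP-style training. This is plain Python over pre-normalized toy embeddings, not a real training loop, and `clip_contrastive_loss` is an illustrative name:

```python
import math

def clip_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired, L2-normalized embeddings:
    each image should match its own caption (row i, column i) and vice versa."""
    n = len(img_embs)
    # Temperature-scaled similarity matrix: sims[i][j] = <image_i, text_j> / T
    sims = [[sum(a * b for a, b in zip(img_embs[i], txt_embs[j])) / temperature
             for j in range(n)] for i in range(n)]

    def cross_entropy(row, target):
        # Numerically stable log-softmax cross-entropy for one row.
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        return log_z - row[target]

    loss_i2t = sum(cross_entropy(sims[i], i) for i in range(n)) / n          # image -> text
    loss_t2i = sum(cross_entropy([sims[i][j] for i in range(n)], j)
                   for j in range(n)) / n                                    # text -> image
    return (loss_i2t + loss_t2i) / 2

# Perfectly aligned pairs give a near-zero loss; shuffled pairs give a large one.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # -> True
```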

Use cases

Multimodal models fit when the task combines two or more modalities (e.g. text and image) in its input or output.

  • Image captioning, visual QA, and document understanding (text + image)
  • Cross-modal retrieval (e.g. searching images by text, or vice versa)
  • Embodied agents and robots that use vision and language
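The cross-modal retrieval case above can be sketched as a nearest-neighbor search over pre-computed embeddings from a shared text/image space. File names, vectors, and the `search_images` helper are all hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search_images(query_emb, image_index, top_k=2):
    """Rank stored image embeddings by similarity to a text query embedding."""
    ranked = sorted(image_index.items(),
                    key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical pre-computed 2-d embeddings from a shared encoder.
index = {
    "beach.jpg":  [0.9, 0.1],
    "forest.jpg": [0.1, 0.9],
    "city.jpg":   [0.5, 0.5],
}
query = [0.95, 0.05]  # pretend encoding of the text "sunny beach"
print(search_images(query, index, top_k=1))  # -> ['beach.jpg']
```

Because both modalities live in one space, the same index answers text-to-image and image-to-image queries; production systems swap the linear scan for an approximate nearest-neighbor index.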

External documentation

See also