
Multimodal AI

Definition

Multimodal AI handles multiple input (and sometimes output) modalities in a single system: e.g., vision-language models (VLMs) for image QA and captioning, or embodied agents that use vision and language.

It extends NLP and computer vision by aligning or fusing modalities (text, image, audio, video). CLIP-style models learn a shared embedding space for retrieval and zero-shot classification; generative VLMs (e.g., LLaVA, GPT-4V) perform captioning, QA, and reasoning over images. Multimodal models are used in agents, RAG with images, and accessibility tools.
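Zero-shot classification with a shared embedding space works by comparing an image embedding against text embeddings of the candidate labels. A minimal sketch, assuming CLIP-style encoders have already produced the vectors (the embeddings here are placeholders, not outputs of a real model):

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Pick the label whose prompt embedding is closest to the image embedding.

    image_emb: (d,) vector from a hypothetical image encoder.
    label_embs: (n, d) embeddings of prompts like "a photo of a cat",
                from the matching text encoder (same shared space).
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity to each label prompt
    return labels[int(np.argmax(sims))]
```

No image-specific training is needed: adding a new class is just adding a new text prompt, which is what makes the approach "zero-shot".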

How it works

Each modality (text, image, and optionally others) is passed through encoders (separate or shared). Fusion can happen in a shared embedding space (e.g., CLIP, where a contrastive loss pulls matching text-image pairs close together) or in a unified transformer that attends over all modalities. The output can be a classification, a caption, an answer, or the next token. Alignment is learned on paired data via contrastive (CLIP) or generative (captioning, VLM) objectives. At inference time, embed or encode all inputs, then run the fusion model to produce the output.
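The contrastive objective above can be sketched in a few lines. This is a simplified version of the symmetric InfoNCE loss used by CLIP, operating on placeholder embedding matrices rather than real encoder outputs; the temperature value is illustrative:

```python
import numpy as np

def l2_normalize(x):
    # project rows onto the unit sphere so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss on a batch of paired embeddings.

    Row i of img_emb matches row i of txt_emb; every other row in the
    batch serves as a negative. Minimizing this pulls matching pairs
    together and pushes mismatched pairs apart in the shared space.
    """
    img, txt = l2_normalize(img_emb), l2_normalize(txt_emb)
    logits = img @ txt.T / temperature            # (batch, batch) similarity matrix
    n = logits.shape[0]

    def log_softmax(z):
        # numerically stable log-softmax along each row
        z = z - z.max(axis=1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

    # diagonal entries are the matching pairs, scored in both directions
    loss_i2t = -log_softmax(logits)[np.arange(n), np.arange(n)].mean()
    loss_t2i = -log_softmax(logits.T)[np.arange(n), np.arange(n)].mean()
    return (loss_i2t + loss_t2i) / 2
```

Perfectly aligned pairs drive the loss toward zero, while shuffled (mismatched) pairs produce a high loss, which is exactly the gradient signal that shapes the shared embedding space.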

Use cases

Multimodal models fit when the task combines two or more modalities (e.g., text and image) in its input or output.

  • Image captioning, visual QA, and document understanding (text + image)
  • Cross-modal retrieval (e.g., searching images by text, or vice versa)
  • Embodied agents and robots that use vision and language
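The cross-modal retrieval case in the list above reduces to a nearest-neighbor search in the shared embedding space. A minimal sketch, with placeholder vectors standing in for the outputs of real text and image encoders:

```python
import numpy as np

def search_images_by_text(query_emb, image_embs, top_k=3):
    """Return indices of the top_k images most similar to a text query.

    query_emb: (d,) embedding of the query text; image_embs: (n, d)
    embeddings of the image collection, both produced by encoders
    trained into the same shared space.
    """
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ q                        # cosine similarity per image
    return np.argsort(-sims)[:top_k].tolist()
```

The same function searches text by image if the roles of the arguments are swapped, since both modalities live in one space; at scale, the brute-force `argsort` is typically replaced by an approximate nearest-neighbor index.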

External documentation

See also