Computer vision (CV)

Definition

Computer vision enables machines to interpret images and video through tasks such as classification, detection, segmentation, tracking, and generation. Convolutional neural networks (CNNs) and vision transformers are the core building blocks.

It overlaps with multimodal learning when vision is combined with language (e.g. vision-language models, VLMs). Generative CV uses diffusion models or GANs. Most pipelines pair a backbone (feature extraction) with a task head, and transfer learning from ImageNet or a similar dataset is standard.
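
The backbone-plus-head split can be illustrated with a toy sketch in plain Python. Everything here is hypothetical and chosen for illustration only: `toy_backbone` averages pixel tiles instead of running a real network such as ResNet or ViT, and `linear_head` uses hand-picked weights rather than learned ones.

```python
from typing import Callable, List

Image = List[List[float]]   # grayscale image as rows of pixel values
Features = List[float]      # flat feature vector


def toy_backbone(image: Image, patch: int = 2) -> Features:
    """Hypothetical feature extractor: mean intensity of each patch x patch tile."""
    height, width = len(image), len(image[0])
    features = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            features.append(sum(tile) / len(tile))
    return features


def linear_head(weights: List[List[float]],
                bias: List[float]) -> Callable[[Features], List[float]]:
    """Task head: a single linear layer mapping features to per-class logits."""
    def head(features: Features) -> List[float]:
        return [sum(w * f for w, f in zip(row, features)) + b
                for row, b in zip(weights, bias)]
    return head


def predict(image: Image, backbone, head) -> int:
    """Run backbone then head and return the index of the largest logit."""
    logits = head(backbone(image))
    return max(range(len(logits)), key=logits.__getitem__)


# 4x4 image whose right half is bright.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
features = toy_backbone(image)
# Hand-picked weights: class 0 sums the left tiles, class 1 sums the right tiles.
head = linear_head(weights=[[1, 0, 1, 0], [0, 1, 0, 1]], bias=[0, 0])
predicted_class = predict(image, toy_backbone, head)
```

Because the head only sees the feature vector, the same backbone could in principle be reused with a different head for another task, which is the idea behind transfer learning.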

How it works

An image (or video frame) is fed into a backbone (e.g. ResNet, ViT) that produces features (spatial feature maps or patch tokens). A head (one or more layers) maps those features to the output: classification (logits per class), detection (boxes + classes), segmentation (a mask per pixel), or generation (e.g. diffusion). Backbones are usually pretrained on large datasets (e.g. ImageNet) and then fine-tuned together with the head on the target task. Data augmentation, normalization, and loss and head design (e.g. focal loss, a mask head) are task-specific.
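
As one example of task-specific loss design, the following is a minimal sketch of binary focal loss for a single prediction, written in plain Python. The function name and signature are illustrative assumptions; the defaults `gamma=2.0` and `alpha=0.25` are commonly used values, not settings taken from this page.

```python
import math


def focal_loss(prob_positive: float, target: int,
               gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    prob_positive is the predicted probability of the positive class,
    target is 1 for positive and 0 for negative.
    """
    # p_t is the probability the model assigned to the true class.
    p_t = prob_positive if target == 1 else 1.0 - prob_positive
    # alpha_t balances positives (alpha) against negatives (1 - alpha).
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # Clamp to avoid log(0) for fully confident wrong predictions.
    p_t = min(max(p_t, 1e-12), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, target=1)   # confident and correct
hard = focal_loss(0.1, target=1)   # confident and wrong
```

The factor `(1 - p_t) ** gamma` shrinks the loss of well-classified examples, so `hard` is much larger than `easy`. With `gamma=0.0` and `alpha=1.0` the expression reduces to ordinary cross-entropy on a positive example.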

Use cases

Computer vision is used wherever you need to interpret or generate images and video (detection, segmentation, recognition).

  • Object detection, instance segmentation, and tracking
  • Image classification and recognition (e.g. medical or satellite imagery)
  • Video understanding and action recognition
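
Detection and tracking pipelines commonly compare boxes by intersection over union (IoU), for example to match a prediction to a ground-truth box or to associate boxes across frames. A minimal sketch in plain Python, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates (a convention chosen here for illustration):

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return intersection / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection 1, union 7
```

Identical boxes give an IoU of 1.0 and disjoint boxes give 0.0.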
