Computer vision (CV)

Definition

Computer vision enables machines to interpret images and video through tasks such as classification, detection, segmentation, tracking, and generation. Convolutional neural networks (CNNs) and vision transformers are the core building blocks.

It overlaps with multimodal AI when vision is combined with language (e.g. vision-language models, VLMs). Generative CV uses diffusion models or GANs. Most pipelines pair a backbone (feature extraction) with a task head; transfer learning from ImageNet or a similar dataset is standard.

How it works

The image (or video frame) is fed into a backbone (e.g. ResNet, ViT) that produces features (spatial feature maps or patch tokens). A head (one or more layers) maps the features to the output: classification (logits per class), detection (boxes + classes), segmentation (a per-pixel mask), or generation (e.g. diffusion). Backbones are usually pretrained on large datasets (e.g. ImageNet), then fine-tuned together with the head on the target task. Data augmentation, normalization, and loss and head design (e.g. focal loss, mask head) are task-specific.
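
The backbone-plus-head flow can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a real model: the `backbone` function is a hand-written stand-in for a pretrained network (it just summarizes each patch by its mean and maximum), the head weights are hypothetical values chosen for the example, and `focal_loss` uses the common form -(1 - p_t)^gamma * log(p_t) without the optional class-weighting term.

```python
import math


def backbone(image, patch=2):
    """Toy stand-in for a pretrained backbone: one feature vector per patch.

    Each non-overlapping patch x patch block becomes a 2-feature "token"
    (mean intensity, max intensity). A real ResNet or ViT learns its features.
    """
    h, w = len(image), len(image[0])
    tokens = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            block = [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            tokens.append([sum(block) / len(block), max(block)])
    return tokens


def classification_head(tokens, weights, bias):
    """Average-pool the tokens, then apply one linear layer to get logits."""
    dim = len(tokens[0])
    pooled = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [
        sum(w_d * x_d for w_d, x_d in zip(w_row, pooled)) + b
        for w_row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example; gamma=0 reduces to cross-entropy."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# A 4x4 grayscale "image": dark on the left, bright on the right.
image = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.2, 0.8, 1.0],
    [0.1, 0.1, 0.9, 0.8],
]

tokens = backbone(image)  # 4 patch tokens, 2 features each

# Hypothetical head parameters for 3 classes (not trained values).
weights = [[1.0, -0.5], [-1.0, 0.5], [0.2, 0.2]]
bias = [0.0, 0.0, 0.0]

logits = classification_head(tokens, weights, bias)
probs = softmax(logits)
loss = focal_loss(probs, target=0)
```

In practice the backbone would come from a library with pretrained weights and the head would be trained by gradient descent; the sketch only shows how the shapes flow from image to tokens to logits to a loss value.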

Use cases

Computer vision is used wherever you need to interpret or generate images and video (detection, segmentation, recognition).

  • Object detection, instance segmentation, and tracking
  • Image classification and recognition (e.g. medical, satellite imagery)
  • Video understanding and action recognition
