Computer vision (CV)

Definition

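Computer vision enables machines to interpret images and video. Core tasks include classification, detection, segmentation, tracking, and generation. Convolutional neural networks (CNNs) and vision transformers (ViTs) are the core building blocks.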

It overlaps with multimodal learning when vision is combined with language (e.g. vision-language models, VLMs). Generative CV uses diffusion models or GANs. Most pipelines pair a backbone (feature extraction) with a task head, and transfer learning from ImageNet or a similar dataset is standard.
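The backbone-plus-head split can be illustrated with a dependency-free toy. This is only a sketch: the `backbone` here is a hand-written stand-in (mean and max intensity per patch) for a real pretrained network, and the names `backbone`, `classification_head`, `weights`, and `bias` are hypothetical, chosen for illustration.

```python
import math

def backbone(image, patch=2):
    """Toy stand-in for a pretrained backbone (assumption: not a real network).
    Splits a 2D grayscale image into non-overlapping patches and emits one
    feature vector per patch: [mean intensity, max intensity]."""
    h, w = len(image), len(image[0])
    tokens = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            vals = [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            tokens.append([sum(vals) / len(vals), max(vals)])
    return tokens

def classification_head(tokens, weights, bias):
    """Task head: average-pool the patch tokens, then apply a single linear
    layer that produces one logit per class."""
    dim = len(tokens[0])
    pooled = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A 4x4 grayscale image: dark on the left, bright on the right.
image = [[0.0, 0.1, 0.9, 1.0],
         [0.2, 0.1, 0.8, 0.9],
         [0.0, 0.0, 1.0, 1.0],
         [0.1, 0.0, 0.9, 0.8]]

# Illustrative head parameters: 3 classes x 2 features per token.
weights = [[1.0, -0.5], [-1.0, 0.5], [0.2, 0.2]]
bias = [0.0, 0.0, 0.0]

tokens = backbone(image)                              # features: 4 patch tokens
logits = classification_head(tokens, weights, bias)   # one logit per class
probs = softmax(logits)                               # class probabilities
```

In a transfer-learning setup the backbone's parameters would come from pretraining and only the head (and optionally the backbone) would be updated on the target task. Swapping `classification_head` for a detection or segmentation head leaves `backbone` unchanged.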

How it works

The image (or video frame) is fed into a backbone (e.g. ResNet, ViT), which outputs features (spatial feature maps or patch tokens). A head (one or more layers) maps the features to the output: classification (one logit per class), detection (boxes + classes), segmentation (a mask label per pixel), or generation (e.g. diffusion). Backbones are usually pretrained on large datasets (e.g. ImageNet) and then fine-tuned together with the head on the target task. Data augmentation, normalization, and loss and head design (e.g. focal loss, mask head) are task-specific.
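As one example of task-specific loss design, the focal loss mentioned above down-weights examples the model already classifies well. It can be sketched for the binary case as follows. This is a minimal, framework-free sketch: the function name `binary_focal_loss`, the `eps` clamp, and the default `gamma=2.0` are assumptions made for illustration, and the optional `alpha` class weighting is off by default.

```python
import math

def binary_focal_loss(prob, target, gamma=2.0, alpha=None, eps=1e-12):
    """Focal loss for one binary prediction.

    prob   : predicted probability of the positive class, in [0, 1]
    target : ground-truth label, 0 or 1
    gamma  : focusing parameter; gamma = 0 reduces to cross-entropy
    alpha  : optional weight for the positive class (1 - alpha for negatives)
    """
    # p_t is the probability the model assigns to the true class.
    p_t = prob if target == 1 else 1.0 - prob
    p_t = min(max(p_t, eps), 1.0)  # clamp to avoid log(0)
    # The modulating factor (1 - p_t)^gamma shrinks the loss on easy examples.
    loss = -((1.0 - p_t) ** gamma) * math.log(p_t)
    if alpha is not None:
        loss *= alpha if target == 1 else 1.0 - alpha
    return loss

easy = binary_focal_loss(0.95, 1)                # confident and correct
hard = binary_focal_loss(0.30, 1)                # under-confident on a positive
ce_hard = binary_focal_loss(0.30, 1, gamma=0.0)  # plain cross-entropy
```

With `gamma` above zero, the well-classified example contributes far less than the hard one, which is the reason this loss is used in detection settings where easy background examples dominate.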

Use cases

Computer vision is used wherever you need to interpret or generate images and video (detection, segmentation, recognition).

  • Object detection, instance segmentation, and tracking
  • Image classification and recognition (e.g. medical, satellite)
  • Video understanding and action recognition

External documentation

See also