Computer vision (CV)

Definition

Computer vision enables machines to interpret images and video. Typical tasks are classification, detection, segmentation, tracking, and generation. CNNs and vision transformers are the core building blocks.

It overlaps with multimodal AI when vision and language are combined (e.g. VLMs). Generative CV uses diffusion models or GANs. Most pipelines consist of a backbone (feature extraction) plus a task head; transfer learning from ImageNet or a similar dataset is standard.
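The backbone-plus-head split can be illustrated with a deliberately tiny sketch in plain Python. Everything here is an assumption made for illustration: `backbone` is a toy stand-in (average pooling over patches) for a pretrained network such as ResNet or ViT, `head` is a single linear layer, and the weights are random rather than trained.

```python
import math
import random

def backbone(image, patch=2):
    """Toy stand-in for a pretrained backbone (hypothetical):
    average-pool non-overlapping patches into a flat feature vector."""
    h, w = len(image), len(image[0])
    feats = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            block = [image[i][j]
                     for i in range(r, min(r + patch, h))
                     for j in range(c, min(c + patch, w))]
            feats.append(sum(block) / len(block))
    return feats

def head(features, weights, bias):
    """Linear classification head: one logit per class."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn logits into class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
image = [[rng.random() for _ in range(4)] for _ in range(4)]  # 4x4 grayscale "image"
features = backbone(image)                                    # 4 patch features
num_classes = 3
# Random, untrained head weights; in practice these are learned on the target task.
weights = [[rng.uniform(-1, 1) for _ in features] for _ in range(num_classes)]
bias = [0.0] * num_classes
logits = head(features, weights, bias)
probs = softmax(logits)
```

In a transfer-learning setup, the part played by `backbone` would come with pretrained weights, while the part played by `head` would be initialized fresh and trained on the target task.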

How it works

The image (or video frame) is fed into a backbone (e.g. ResNet, ViT), which outputs features (spatial feature maps or patch tokens). A head (one or more layers) maps the features to the output: classification (logits per class), detection (boxes + classes), segmentation (a per-pixel mask), or generation (e.g. diffusion). Backbones are usually pretrained on large datasets (e.g. ImageNet) and then fine-tuned together with the head on the target task. Data augmentation, normalization, and loss and head design (e.g. focal loss, mask head) are task-specific.
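As one example of task-specific loss design, here is a minimal sketch of the binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), for a single prediction. The function names are hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are commonly used values assumed here rather than anything prescribed by this page.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction (illustrative sketch).

    p: predicted probability of the positive class, strictly in (0, 1)
    y: ground-truth label, 0 or 1
    """
    p_t = p if y == 1 else 1.0 - p                 # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha     # class-balancing weight
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p, y):
    """Plain binary cross-entropy, for comparison."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

easy = focal_loss(0.95, 1)   # confident and correct: heavily down-weighted
hard = focal_loss(0.10, 1)   # confident and wrong: keeps most of its loss
```

With `gamma=0.0` and `alpha=1.0`, the loss for a positive example reduces to plain cross-entropy, which makes the modulating factor's effect easy to check.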

Use cases

Computer vision is used wherever you need to interpret or generate images and video (detection, segmentation, recognition).

Object detection, instance segmentation, and tracking
Image classification and recognition (e.g. medical, satellite)
Video understanding and action recognition
