Computer vision (CV)
Definition
Computer vision enables machines to interpret images and video: classification, detection, segmentation, tracking, and generative tasks. Convolutional neural networks (CNNs) and vision transformers (ViTs) are the core building blocks.
It overlaps with multimodal learning when vision is combined with language (e.g. vision-language models, VLMs). Generative CV uses diffusion models or GANs. Most pipelines follow a backbone (feature extraction) plus a task head; transfer learning from ImageNet or similar datasets is standard.
How it works
The image (or video frame) is fed into a backbone (e.g. ResNet, ViT) that outputs features (spatial feature maps or patch tokens). A head (one or more layers) maps features to the output: classification (logits per class), detection (boxes + classes), segmentation (a mask per pixel), or generation (e.g. diffusion). Backbones are usually pretrained on large datasets (e.g. ImageNet) and then fine-tuned together with the head on the target task. Data augmentation, normalization, and loss design (e.g. focal loss for class imbalance in detection, per-pixel losses for segmentation masks) are task-specific.
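The backbone-plus-head pattern can be sketched in plain NumPy. This is only a toy illustration: the "backbone" below is a random projection with a ReLU standing in for a real pretrained CNN or ViT, and the 128-dim feature size and 10-class head are illustrative choices, not fixed conventions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "backbone": maps a 32x32x3 image to a 128-dim feature vector.
# A real backbone (ResNet, ViT) stacks conv/attention layers and is pretrained.
W_backbone = rng.standard_normal((32 * 32 * 3, 128)) * 0.01

def backbone(image):
    # Flatten the image and project; ReLU gives nonnegative features.
    return np.maximum(image.reshape(-1) @ W_backbone, 0.0)

# Task head: a single linear layer mapping features to 10 class logits.
# In transfer learning, the backbone is frozen or fine-tuned while this
# head is trained on the target task.
W_head = rng.standard_normal((128, 10)) * 0.01

def classify(image):
    logits = backbone(image) @ W_head
    return int(np.argmax(logits))  # predicted class index

image = rng.random((32, 32, 3)).astype(np.float32)
features = backbone(image)
print(features.shape)  # -> (128,)
```

Swapping the head (boxes + classes for detection, a per-pixel mask for segmentation) while reusing the same backbone features is what makes the pattern so widely applicable.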
Use cases
Computer vision is used wherever you need to interpret or generate images and video (detection, segmentation, recognition).
- Object detection, instance segmentation, and tracking
- Image classification and recognition (e.g. medical, satellite)
- Video understanding and action recognition
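For the detection and tracking tasks above, predictions are typically boxes with class labels and confidence scores, and the standard building block for evaluating and post-processing them (e.g. in non-maximum suppression) is intersection-over-union (IoU). A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle corners.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))   # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30))) # disjoint boxes -> 0.0
```

An IoU threshold (commonly 0.5) then decides whether a predicted box counts as a match for a ground-truth box.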