Computer vision (CV)
Definition
Computer vision enables machines to interpret images and video: classification, detection, segmentation, tracking, and generative tasks. Convolutional neural networks (CNNs) and vision transformers (ViTs) are the core building blocks.
It overlaps with multimodal learning when vision is combined with language (e.g. vision-language models, VLMs). Generative CV uses diffusion models or GANs. Most pipelines follow a backbone (feature extraction) plus a task head; transfer learning from ImageNet or similar datasets is standard.
How it works
The image (or video frame) is fed into a backbone (e.g. ResNet, ViT) that outputs features (spatial feature maps or patch tokens). A head (one or more layers) maps features to the output: classification (logits per class), detection (boxes + classes), segmentation (a mask per pixel), or generation (e.g. diffusion). Backbones are usually pretrained on large datasets (e.g. ImageNet) and then fine-tuned together with the head on the target task. Data augmentation, normalization, and loss design (e.g. focal loss for class imbalance in detection, per-pixel losses for segmentation masks) are task-specific.
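The backbone-plus-head pattern can be sketched in plain NumPy. This is only a toy illustration: the "backbone" below is a random projection with a ReLU standing in for a real pretrained CNN or ViT, and the 128-dim feature size and 10-class head are illustrative choices, not fixed conventions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "backbone": maps a 32x32x3 image to a 128-dim feature vector.
# A real backbone (ResNet, ViT) stacks conv/attention layers and is pretrained.
W_backbone = rng.standard_normal((32 * 32 * 3, 128)) * 0.01

def backbone(image):
    # Flatten the image and project; ReLU gives nonnegative features.
    return np.maximum(image.reshape(-1) @ W_backbone, 0.0)

# Task head: a single linear layer mapping features to 10 class logits.
# In transfer learning, the backbone is frozen or fine-tuned while this
# head is trained on the target task.
W_head = rng.standard_normal((128, 10)) * 0.01

def classify(image):
    logits = backbone(image) @ W_head
    return int(np.argmax(logits))  # predicted class index

image = rng.random((32, 32, 3)).astype(np.float32)
features = backbone(image)
print(features.shape)  # -> (128,)
```

Swapping the head (boxes + classes for detection, a per-pixel mask for segmentation) while reusing the same backbone features is what makes the pattern so widely applicable.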
Use cases
Computer vision is used wherever you need to interpret or generate images and video (detection, segmentation, recognition).
- Object detection, instance segmentation, and tracking
- Image classification and recognition (e.g. medical, satellite)
- Video understanding and action recognition
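For the detection and tracking tasks above, predictions are typically boxes with class labels and confidence scores, and the standard building block for evaluating and post-processing them (e.g. in non-maximum suppression) is intersection-over-union (IoU). A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle corners.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))   # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30))) # disjoint boxes -> 0.0
```

An IoU threshold (commonly 0.5) then decides whether a predicted box counts as a match for a ground-truth box.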