Computer vision (CV)

Definition

Computer vision enables machines to interpret images and video. Core tasks include classification, detection, segmentation, tracking, and generation. Convolutional neural networks (CNNs) and vision transformers (ViTs) are the core building blocks.

It overlaps with multimodal learning when vision is combined with language (e.g. vision-language models, VLMs). Generative CV uses diffusion models or GANs. Most pipelines pair a backbone (feature extraction) with a task head, and transfer learning from ImageNet or a similar dataset is standard.

How it works

An image (or video frame) is fed into a backbone (e.g. ResNet, ViT) that produces features (spatial feature maps or patch tokens). A head (one or more layers) maps those features to the output: classification (per-class logits), detection (boxes + classes), segmentation (a per-pixel mask), or generation (e.g. diffusion). Backbones are usually pretrained on large datasets (e.g. ImageNet), then fine-tuned together with the head on the target task. Data augmentation, normalization, and loss and head design (e.g. focal loss, a mask head) are task-specific.
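The backbone-plus-head flow can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a real model. The `backbone` here just averages non-overlapping patches, standing in for a pretrained ResNet or ViT. The head is a single linear layer with random weights. The focal loss is the basic multi-class form without class weighting. All function names and the 4x4 input are hypothetical.

```python
import math
import random


def backbone(image, patch=2):
    """Toy feature extractor: mean intensity of each non-overlapping patch.

    Stand-in for a pretrained backbone; returns one feature per patch.
    """
    h, w = len(image), len(image[0])
    feats = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            block = [image[y][x]
                     for y in range(i, min(i + patch, h))
                     for x in range(j, min(j + patch, w))]
            feats.append(sum(block) / len(block))
    return feats


def linear_head(features, weights, bias):
    """Classification head: one linear layer mapping features to per-class logits."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    With gamma=0 this reduces to ordinary cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Hypothetical 4x4 grayscale "image" with a checkerboard of 2x2 blocks.
image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [1.0, 1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0]]

features = backbone(image)                      # four patch features

rng = random.Random(0)                          # fixed seed for repeatability
num_classes = 3
weights = [[rng.uniform(-1, 1) for _ in features] for _ in range(num_classes)]
bias = [0.0] * num_classes

logits = linear_head(features, weights, bias)   # one logit per class
probs = softmax(logits)
loss = focal_loss(logits, target=0)
```

In a real pipeline the backbone weights would come from pretraining and both parts would be updated by gradient descent during fine-tuning. The sketch only covers the forward pass and the loss computation.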

Use cases

Computer vision is used wherever images and video need to be interpreted or generated (detection, segmentation, recognition).

  • Object detection, instance segmentation, and tracking
  • Image classification and recognition (e.g. medical, satellite)
  • Video understanding and action recognition
