Convolutional neural networks (CNNs)
Definition
CNNs use convolutional layers to capture local patterns (edges, textures) and build hierarchical features. They are the standard backbone for image classification, detection, and segmentation.
Unlike dense (fully connected) neural networks, convolutions share weights across space, making them translation-equivariant and efficient for images and other grid-like data. Convolutional layers also appear in vision transformers as patch-embedding front ends.
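Weight sharing and translation equivariance can be seen in a minimal sketch: the same small kernel is reused at every position, so shifting the input shifts the output by the same amount. A 1D toy example (all values made up for illustration):

```python
# Minimal 1D convolution (cross-correlation, as in deep learning
# libraries): one kernel `w` is shared across every position of `x`.
def conv1d(x, w):
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

x = [0, 0, 1, 0, 0, 0]        # a single "blip" in the signal
w = [1, 2, 1]                 # one shared 3-tap filter
y = conv1d(x, w)              # [1, 2, 1, 0]

# Translation equivariance: shift the input right by one sample...
x_shift = [0] + x[:-1]
y_shift = conv1d(x_shift, w)  # [0, 1, 2, 1] -- the output shifts too
```

Because the filter has only `k` weights regardless of input length, a convolutional layer needs far fewer parameters than a dense layer over the same input.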
How it works
The image (or feature map) is fed into convolutional layers: each filter slides over the input and computes dot products, producing activation maps that highlight local patterns (edges, textures). Pooling (e.g. max pooling) downsamples the maps spatially, reducing their size and adding slight translation invariance. Deeper convolutional layers see larger receptive fields and capture more abstract features (parts, objects). The final classification (or detection/segmentation) head is usually one or more dense layers on top of the flattened or pooled features. Training uses the same backpropagation and gradient descent as other deep learning models.
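The pipeline above can be sketched end to end for a single channel and a single filter. This is a pure-Python illustration with made-up weights, not a trained model; real networks stack many filters and layers and learn the weights by backpropagation:

```python
# Forward pass: convolution -> ReLU -> max pooling -> flatten -> dense head.

def conv2d(img, kernel):
    # Slide the kernel over the image (valid padding, stride 1),
    # taking a dot product at each position.
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(img[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(len(img[0]) - kw + 1)]
            for i in range(len(img) - kh + 1)]

def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    # 2x2 max pooling with stride 2 halves each spatial dimension.
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

def dense(vec, weights, bias):
    return [sum(v * w for v, w in zip(vec, ws)) + b
            for ws, b in zip(weights, bias)]

# Toy 6x6 "image" with a bright 2x2 blob, one 3x3 edge-like filter,
# and a 2-class head with arbitrary weights.
img = [[1.0 if 2 <= i <= 3 and 2 <= j <= 3 else 0.0 for j in range(6)]
       for i in range(6)]
kernel = [[1, 0, -1]] * 3

fmap = max_pool2x2(relu(conv2d(img, kernel)))   # 6x6 -> 4x4 -> 2x2
flat = [v for row in fmap for v in row]         # flatten for the head
logits = dense(flat, [[0.5] * 4, [-0.5] * 4], [0.0, 0.0])
```

Following the shapes makes the structure concrete: a 3x3 valid convolution shrinks 6x6 to 4x4, pooling halves that to 2x2, and the dense head maps the 4 flattened values to 2 class scores.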
Use cases
CNNs are the standard choice for tasks where spatial structure matters: images, video, and other 2D/3D grid-like signals.
- Image classification (e.g. object recognition, medical image analysis)
- Object detection and instance segmentation
- Video analysis and action recognition