
Model compression

Definition

Model compression reduces the size, latency, or memory footprint of a model so it can run on edge devices or with limited compute. Common methods include pruning, quantization, and knowledge distillation.

Use it when the full model is too large for deployment (e.g. LLMs on edge, real-time serving). Trade off accuracy against size and latency; several methods are often combined. See infrastructure for how compressed models are served at scale.

How it works

You start from a large model and apply one or more compression steps. Pruning removes low-importance weights or structures (unstructured or channel-wise). Quantization stores weights (and optionally activations) in lower precision (e.g. INT8). Distillation trains a smaller model (the student) to mimic the large one (the teacher) via soft labels or intermediate representations. The result is a smaller, faster model; accuracy is validated on a dev set. Methods are often combined (e.g. prune then quantize, or distill then quantize) and may require fine-tuning to recover accuracy.
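The pruning and quantization steps above can be sketched in a few lines. This is a minimal NumPy illustration, not a production pipeline: it applies unstructured magnitude pruning (zeroing the smallest weights) followed by symmetric per-tensor INT8 quantization to a random weight matrix; the sparsity level and rounding scheme are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)

# Unstructured magnitude pruning: zero out the 80% of weights
# with the smallest absolute value (sparsity level is illustrative).
sparsity = 0.8
threshold = np.quantile(np.abs(weights), sparsity)
pruned = np.where(np.abs(weights) >= threshold, weights, 0.0)

# Symmetric per-tensor INT8 quantization: map [-max, +max] onto [-127, 127].
scale = np.abs(pruned).max() / 127.0
q = np.clip(np.round(pruned / scale), -127, 127).astype(np.int8)

# Dequantize to measure the error quantization introduced.
dq = q.astype(np.float32) * scale
print(f"sparsity: {np.mean(pruned == 0):.2f}")
print(f"max quantization error: {np.abs(pruned - dq).max():.4f}")
```

In practice frameworks handle this (e.g. structured pruning utilities and post-training or quantization-aware quantization in the major deep learning libraries), and per-channel scales usually beat the single per-tensor scale shown here.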

Use cases

Model compression is used when you need smaller or faster models for edge, mobile, or cost-sensitive production.

  • Deploying large models on edge or mobile devices with limited memory
  • Reducing inference latency and cost in production
  • Combining pruning, quantization, and distillation for maximum compression
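The distillation component mentioned above trains the student against the teacher's soft labels. A minimal sketch of the classic temperature-scaled distillation loss (the KL-divergence formulation from Hinton et al.), in plain NumPy; the logits here are made-up examples:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened probabilities.

    Scaled by T**2 so gradient magnitudes stay comparable as T varies,
    following the standard distillation formulation.
    """
    p = softmax(teacher_logits, T)            # teacher's soft targets
    log_q = np.log(softmax(student_logits, T))
    return float(np.mean(np.sum(p * (np.log(p) - log_q), axis=-1)) * T**2)

teacher = np.array([[4.0, 1.0, -2.0]])        # example logits
print(distillation_loss(teacher, teacher))    # student matches teacher: ~0
print(distillation_loss(np.zeros((1, 3)), teacher))  # mismatch: > 0
```

In training this term is typically mixed with the ordinary cross-entropy on hard labels, weighted by a hyperparameter.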

External documentation

See also