Knowledge distillation
Definition
Knowledge distillation trains a smaller student model to match the outputs (and sometimes intermediate representations) of a larger teacher. The student benefits from the teacher's soft labels and can run with less compute.
It is a model compression technique that preserves more of the teacher's behavior than training the student on hard labels alone. It is used for BERT → DistilBERT, for compressing large LLMs into smaller variants, and for transferring knowledge from ensembles.
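A toy sketch of the soft-vs.-hard-label contrast, assuming a 3-class task; the logits and class count are invented for illustration and are not from the original text.

```python
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([4.0, 2.5, 0.5])           # hypothetical teacher output
hard_label = F.one_hot(torch.tensor(0), num_classes=3)    # ground truth: [1, 0, 0]
soft_labels = F.softmax(teacher_logits, dim=-1)            # roughly [0.80, 0.18, 0.02]

# The soft labels tell the student that class 1 is far more plausible than
# class 2, a relative ranking the one-hot hard label alone does not carry.
```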
How it works
The teacher (large model) produces logits (or embeddings) on the training data. The student (smaller model) is trained to match the teacher's logits (e.g. via KL divergence with temperature scaling) in addition to, or instead of, the hard labels (ground truth). The temperature softens the teacher's distribution so the student learns from the dark knowledge (the relative scores across classes). Optionally, intermediate layers or attention maps can also be matched. The student is trained with a mix of distillation loss and task loss; after training, it runs with the student's capacity and latency.
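A minimal PyTorch sketch of the combined objective described above. The weighting alpha, the temperature value, and the teacher/student/batch names are illustrative assumptions, not details from the original text.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Mix the soft-label (KL) loss against the teacher with the hard-label task loss."""
    # Soften both distributions with the temperature so the relative class
    # scores ("dark knowledge") carry more signal.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures (common practice).
    kd_loss = F.kl_div(log_student, soft_targets,
                       reduction="batchmean") * temperature ** 2
    # Ordinary cross-entropy against the ground-truth (hard) labels.
    task_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1 - alpha) * task_loss

# Hypothetical training step: the teacher runs frozen, only the student learns.
# with torch.no_grad():
#     teacher_logits = teacher(batch_inputs)
# student_logits = student(batch_inputs)
# loss = distillation_loss(student_logits, teacher_logits, batch_labels)
# loss.backward()
```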
Use cases
Knowledge distillation fits when you want a small, fast student that approximates a large teacher for deployment.
- Training smaller, faster models that approximate large ones (e.g. BERT → DistilBERT)
- Enabling deployment when the teacher is too heavy for production
- Transferring knowledge from ensembles or from multiple teachers