
Edge Reasoning

Definition

Edge reasoning runs lightweight reasoning or inference on edge devices (phones, IoT gateways, cameras, vehicles) rather than in the cloud. The goals are low latency, offline capability, privacy (data stays on the device), and reduced bandwidth, achieved by doing as much work locally as possible.

It combines small or distilled LLMs, model compression (quantization, pruning), and hardware-friendly runtimes (TFLite, ONNX Runtime, Core ML). Techniques such as speculative decoding, early exit, and mixture-of-experts (with small experts) reduce compute per token, so reasoning patterns (e.g., chain-of-thought) remain viable at the edge.
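The quantization step mentioned above can be sketched in a few lines. This is a minimal, framework-free illustration of post-training symmetric int8 quantization; real toolchains (TFLite, ONNX Runtime) quantize per tensor or per channel with calibration data, and the function names here are illustrative only.

```python
def quantize_int8(weights):
    """Map float weights to int8 using a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale of 0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the int8 representation."""
    return [qi * scale for qi in q]

# Storage drops from 32 bits to 8 bits per weight; values are recovered
# only approximately, which is the accuracy trade-off of compression.
weights = [0.82, -1.27, 0.05, 0.0, -0.4]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
```

The reconstruction error per weight is bounded by half the scale, which is why quantization works well for weights with a moderate dynamic range and degrades when a few outliers inflate the scale.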

How it works

An edge device (phone, gateway, embedded system) holds a small or compressed model (e.g., a distilled transformer or a quantized LLM). Input (sensor data, text, or a prompt) is fed to the model; reasoning may be a short chain-of-thought or a single forward pass. Early exit skips later layers when the model is already confident; speculative decoding runs a small draft model locally and optionally verifies with a larger model when online. Output is returned without a round-trip to the cloud (or with an optional cloud fallback).
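The early-exit step above can be sketched as follows: run layers one at a time and stop as soon as an intermediate confidence check clears a threshold. The layers, the confidence function, and the threshold are all toy stand-ins, not a real model.

```python
def run_with_early_exit(x, layers, confidence, threshold=0.9):
    """Return (output, layers_used); skip remaining layers once confident."""
    h = x
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        if confidence(h) >= threshold:
            return h, i          # exit early: later layers are skipped
    return h, len(layers)        # fell through: used the full network

# Toy stand-ins: each "layer" nudges the score upward, and confidence
# is the score itself, clamped to [0, 1].
layers = [lambda h: h + 0.35 for _ in range(4)]
confidence = lambda h: min(max(h, 0.0), 1.0)

out, used = run_with_early_exit(0.3, layers, confidence, threshold=0.9)
# Confidence clears the threshold after two of the four layers, so the
# device spends roughly half the compute of a full forward pass.
```

Easy inputs exit early and hard inputs use the full depth, which is exactly the property that makes per-token compute affordable on constrained hardware.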

Use cases

Edge reasoning applies when you need low-latency or offline reasoning on devices with limited compute and memory.

  • Smart assistants and wearables that answer or act without a constant cloud connection
  • Vehicles and robotics where latency and offline operation are critical
  • Privacy-first apps (health, home) that keep sensitive data on-device
  • Cost and bandwidth reduction by moving simple reasoning from the cloud to the edge
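The cost-and-bandwidth pattern in the last bullet is usually implemented as a routing decision: answer on-device when the local model is confident, and escalate to the cloud only when necessary and possible. The sketch below uses hypothetical placeholder functions, not a real API.

```python
def answer(prompt, local_model, cloud_model=None, online=False, min_conf=0.7):
    """Prefer the edge model; escalate to the cloud only as a last resort."""
    text, conf = local_model(prompt)
    if conf >= min_conf:
        return text, "edge"          # fast path: no network round-trip
    if online and cloud_model is not None:
        return cloud_model(prompt), "cloud"
    return text, "edge-fallback"     # offline: best local effort wins

# Toy models: the local model is unsure about long prompts.
local = lambda p: (f"local:{p}", 0.9 if len(p) < 20 else 0.4)
cloud = lambda p: f"cloud:{p}"
```

Note that the fallback path degrades gracefully: when offline, the device still returns its best local answer rather than failing, which is the behavior the use cases above depend on.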

Pros and cons

Pros:
  • Low latency; no round-trip to the cloud
  • Works offline and in poor connectivity
  • Data stays on device for privacy
  • Lower bandwidth and cloud cost

Cons:
  • Smaller models, less capable than large cloud LLMs
  • Hardware constraints (memory, power, thermal)
  • Trade-off between model size and reasoning quality
  • Requires quantization and compression

External documentation

See also