Local inference

Definition

Local inference means running LLMs, vision models, or other models on your own hardware (a laptop, workstation, on-prem server, or edge device) instead of calling a cloud API. Data never leaves your environment, which protects privacy; running locally also avoids network round-trips, removes per-request API charges, and works offline.

It relies on model compression (quantization, pruning, knowledge distillation) and efficient runtimes so that models fit in limited memory and run on a CPU alone or on consumer GPUs. Tools like Ollama, LM Studio, llama.cpp, vLLM, and TensorFlow Lite enable local inference with minimal setup.
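As an illustration of what "minimal setup" can look like, here is a minimal sketch of sending a prompt to a locally running Ollama server over its HTTP API. It assumes the server is listening on its default local address and that a model has already been pulled; the model name `llama3` and the helper names `build_request` and `generate` are placeholders chosen for this example, not part of any library.

```python
import json
import urllib.request

# Assumed default address of a local Ollama server; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build (but do not send) a POST request for the local server.

    `model` is a placeholder; use the name of a model you have pulled.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False the server replies with one JSON object
    # whose "response" field holds the full completion.
    return body["response"]
```

With the server running, `generate("Explain quantization in one sentence.")` would return the model's answer as a string. The request only ever targets `localhost`, so the prompt stays on the machine.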

How it works

You obtain model weights (e.g. GGUF, SafeTensors) from the Hub or a vendor. A runtime (Ollama, llama.cpp, vLLM, TFLite) loads the model onto a CPU, GPU, or NPU and executes the forward pass. Quantization (to INT8 or INT4, with methods such as GPTQ or AWQ) reduces the memory footprint of the weights so larger models fit; batching and the KV cache improve throughput when serving multiple requests. No network call goes to a cloud API: inference runs entirely on the local machine or cluster.
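To make the effect of quantization on memory concrete, the following back-of-the-envelope sketch estimates the space taken by the weights alone at different bit widths. It is an approximation under stated assumptions: it ignores quantization metadata (scales, zero points), activations, and the KV cache, so real files and runtimes need somewhat more. The function name `weight_memory_gib` and the 7-billion-parameter example are illustrative choices.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB.

    Assumes every parameter is stored at exactly `bits_per_weight` bits,
    which real quantization formats only approximate.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model at three precisions.
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{label}: {weight_memory_gib(7e9, bits):.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts the estimate by a factor of four, which is often the difference between a model that needs a datacenter GPU and one that fits on a consumer card or in laptop RAM.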

Use cases

Local inference fits when privacy, latency, cost, or offline operation matters more than using the largest cloud model.

Privacy-sensitive or regulated data (healthcare, legal, internal docs) that must not leave the network
Low-latency or real-time apps (IDE, assistants) where round-trips to the cloud are unacceptable
Cost control at scale or air-gapped / offline environments
Development and testing without API keys or usage limits

Pros and cons

Pros	Cons
Data stays on your infrastructure	Smaller or quantized models; possible quality drop
No per-token API cost at inference time	You own hardware and ops (GPU, memory, updates)
Works offline and in restricted networks	Throughput and context length limited by hardware
Full control over model version and behavior	Need quantization and compression for larger models
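The hardware limit on context length listed above comes largely from the KV cache, which grows linearly with the number of tokens kept in context. The sketch below estimates that growth for a standard transformer decoder. The function name `kv_cache_bytes` and the model dimensions in the example (32 layers, 32 KV heads, head size 128) are assumptions picked for illustration, not the specification of any particular model.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # 2 bytes assumes FP16 cache entries
) -> int:
    """Approximate KV cache size in bytes.

    Each layer stores one key and one value vector (hence the factor 2)
    per KV head, per token, per sequence in the batch.
    """
    return (
        2 * n_layers * n_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_element
    )


if __name__ == "__main__":
    # Hypothetical model shape with a 4096-token context.
    size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
    print(f"{size / 2**30:.1f} GiB per sequence")
```

Because the estimate scales with both `seq_len` and `batch_size`, doubling the context or the number of concurrent requests doubles the cache, on top of the memory already taken by the weights.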

External documentation

Ollama — Run LLMs locally with a simple API
llama.cpp — C++ inference for LLaMA and compatible models
vLLM — High-throughput server for local or on-prem LLM serving
TensorFlow Lite — On-device inference for mobile and edge
