Speech recognition

Definition

Speech recognition, also called automatic speech recognition (ASR), transcribes audio into text. Related areas include speaker identification, speech synthesis (text-to-speech, TTS), and spoken language understanding.

It bridges multimodal learning (audio as one modality) and NLP (the output is text). Modern ASR is mostly end-to-end neural, and self-supervised pretraining (e.g. wav2vec 2.0) reduces the need for very large labeled datasets. ASR is deployed in voice assistants, captioning, and meeting tools.

How it works

Raw audio (a waveform) is converted to features (e.g. mel spectrograms, filter banks, or learned representations). An acoustic model (e.g. a Conformer or a wav2vec 2.0 encoder) maps these features to frame- or segment-level representations. A decoder (CTC, RNN-T, or attention-based) then produces text as characters or subwords. Modern systems are often end-to-end: a single model maps the waveform or features directly to text. Self-supervised pretraining on unlabeled audio (e.g. wav2vec), followed by fine-tuning on labeled ASR data, improves robustness and reduces the amount of labeled data needed.
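To make the decoder step concrete, here is a minimal sketch of greedy CTC decoding, the simplest of the decoding strategies mentioned above. It assumes the acoustic model has already produced one score per vocabulary symbol for every frame. The function name `ctc_greedy_decode`, the toy vocabulary, and the choice of index 0 for the blank symbol are illustrative assumptions, not part of any particular library.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (a convention, not a requirement)


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Turn per-frame symbol scores into text with greedy CTC decoding."""
    # 1. For each frame, pick the index of the highest-scoring symbol.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Merge consecutive repeats, then 3. drop blank symbols.
    collapsed = [idx for idx, _ in groupby(best_path) if idx != blank]
    return "".join(vocab[idx] for idx in collapsed)


# Hypothetical toy vocabulary and scores, standing in for acoustic model output.
VOCAB = ["<blank>", "h", "i"]
frames = [
    [0.10, 0.80, 0.10],  # "h"
    [0.10, 0.70, 0.20],  # "h" again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.10, 0.70],  # "i"
    [0.10, 0.10, 0.80],  # "i" again (merged)
]

print(ctc_greedy_decode(frames, VOCAB))  # prints "hi"
```

The blank symbol is what lets CTC emit genuinely doubled letters: the frame sequence h, blank, h decodes to "hh", whereas h, h decodes to a single "h". Production systems typically replace the greedy step with beam search, often combined with a language model.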

Use cases

Speech technologies apply whenever the input or output is audio: transcription, voice assistants, speaker identification, and speech synthesis.

Automatic speech recognition (ASR) for transcription and captions
Voice assistants and spoken dialogue systems
Speaker identification and speech synthesis (TTS)
