
Text-to-speech (TTS)

Text-to-speech (TTS) turns written text into spoken audio. It evolved from clunky mechanical machines to modern AI voices that sound almost human. You will find it in accessibility tools, automation, AI voice agents, and content production.

What is text-to-speech (TTS)?

Text-to-speech, or TTS, is technology that turns written text into spoken audio. You hear it in navigation apps, digital assistants, AI voice agents, call centres, e-learning platforms, and audiobooks. Voices range from the basic robotic tones of the early days to highly lifelike AI voices that adjust intonation, emotion, and pace on the fly.

At its core, TTS comes down to three steps:

  1. analyse the text

  2. decide on pronunciation and rhythm

  3. generate the audio

Modern systems use neural models that sound far more natural than the early generations ever managed.

How does TTS work?

1. Text analysis

The model reads the text and identifies sentences, punctuation, numbers, and abbreviations. Smarter systems even pick up on semantics so they can place pauses or emotional cues in the right spots.
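Text normalisation is the concrete heart of this step. A minimal sketch, assuming a tiny hand-made abbreviation table and digit-by-digit number expansion (real TTS front ends use far larger, language-specific resources):

```python
import re

# Illustrative lookup tables only; production systems use much larger,
# language-specific resources and context-aware rules.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def expand_number(match: re.Match) -> str:
    """Spell out an integer digit by digit (e.g. '42' -> 'four two')."""
    return " ".join(ONES[int(d)] for d in match.group())

def normalize(text: str) -> str:
    """Expand abbreviations and digits so later stages see plain words."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"\d+", expand_number, text)

print(normalize("Dr. Smith lives at 42 Main St."))
# -> Doctor Smith lives at four two Main Street
```

A real normaliser would also handle dates, currencies, and ambiguous cases ("St." as Street versus Saint), which is why this stage often needs semantic context.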

2. Linguistic conversion

The system works out how each word should sound: stress, intonation, rhythm, and pace.
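The classic mechanism behind this step is grapheme-to-phoneme conversion: each word is mapped to a phoneme sequence with stress marks. A toy sketch, assuming a two-word ARPAbet-style lexicon invented for illustration (real systems combine large pronunciation dictionaries with trained G2P models for unknown words):

```python
# Toy lexicon; the entries are illustrative, not from a real dictionary.
# Trailing digits mark vowel stress: 1 = primary stress, 0 = unstressed.
LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}

def to_phonemes(sentence: str) -> list[list[str]]:
    """Map each known word to its phoneme sequence, one list per word."""
    phones = []
    for word in sentence.lower().split():
        word = word.strip(".,!?")
        # Naive fallback: spell unknown words letter by letter.
        phones.append(LEXICON.get(word, list(word.upper())))
    return phones

print(to_phonemes("Hello, world!"))
# -> [['HH', 'AH0', 'L', 'OW1'], ['W', 'ER1', 'L', 'D']]
```

Prosody (intonation, rhythm, pace) is layered on top of this phoneme sequence, typically predicted by a statistical or neural model rather than looked up.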

3. Audio generation

The engine converts the prepared text into audio. There are three broad approaches:

  • Formant synthesis: sound generated entirely from acoustic rules that model the resonances of the vocal tract (older, robotic).

  • Concatenative synthesis: real recorded speech fragments stitched together (more natural).

  • Neural TTS: AI models that generate the audio waveform directly from text or intermediate features (very natural, with flexible emotion and pacing).
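To make the oldest approach concrete, here is a bare-bones formant-style sketch: a vowel-like sound approximated by summing sine waves at typical formant frequencies, written to a WAV file with only the Python standard library. The frequencies and amplitudes are rough textbook values for an "ah"-like vowel, chosen for illustration only.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000
# (frequency in Hz, relative amplitude) -- illustrative formant values.
FORMANTS = [(700, 1.0), (1200, 0.5), (2600, 0.25)]

def synth_vowel(duration_s: float = 0.5) -> bytes:
    """Render the summed formant sinusoids to 16-bit mono PCM."""
    n = int(SAMPLE_RATE * duration_s)
    total_amp = sum(a for _, a in FORMANTS)
    samples = []
    for i in range(n):
        t = i / SAMPLE_RATE
        value = sum(a * math.sin(2 * math.pi * f * t) for f, a in FORMANTS)
        samples.append(int(32767 * value / total_amp))  # normalise to int16
    return struct.pack(f"<{n}h", *samples)

with wave.open("vowel.wav", "wb") as wav:
    wav.setnchannels(1)            # mono
    wav.setsampwidth(2)            # 16-bit samples
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(synth_vowel())
```

Real formant synthesisers vary these frequencies over time to move between phonemes, which is what made voices like DECtalk intelligible despite their robotic timbre.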

A short history of TTS

The mechanical era (18th and 19th century)

Wolfgang von Kempelen completed his mechanical speaking machine in 1791, after more than two decades of experiments. It produced speech-like sounds using bellows and reeds. It was not real speech synthesis, but it was a milestone in modelling the human voice.

The electronic era (1930 to 1960)

In 1939, Bell Labs unveiled the Voder, the first electronic speech synthesiser. An operator formed sounds by pressing keys and a foot pedal.

Formant models (1960 to 1980)

Researchers modelled the resonance of the human vocal tract. The output was robotic but intelligible. This led to the first computer-driven TTS systems.

DECtalk and the robot voices (1980 to 1990)

The DECtalk system became iconic. Stephen Hawking famously used a variant of it. The speech was mechanical but useful for accessibility and early call centres.

Concatenative synthesis (1990 to 2010)

TTS shifted to using real audio fragments. The result was much more natural, but harder to adapt. Navigation systems and telephony adopted it at scale.

The neural revolution (2016 onward)

DeepMind introduced WaveNet, followed by models like Tacotron, FastSpeech, Glow-TTS, and VITS. They produce fluid, realistic speech and can shape emotion, style, and context.

The pioneers behind commercial TTS

One name keeps surfacing in any history of speech technology: Lernout & Hauspie (L&H). The Belgian company grew through the 1990s into a global player in speech recognition and synthesis, building commercial TTS voices at a time when the technology was still mostly an academic curiosity. Their products ended up in call centres, screen readers for the visually impaired, medical dictation systems, and consumer electronics. Around L&H, an ecosystem called Flanders Language Valley formed in West Flanders. After L&H collapsed in 2001, many of its engineers moved to Nuance and ScanSoft, both later absorbed into Microsoft. Academic work at KU Leuven and imec on acoustic modelling and neural speech synthesis kept the region active in the field well into the WaveNet era.

The AI wave: new players, new use cases

From 2020 onward, the sector picked up pace again. TTS suddenly became:

  • more natural sounding

  • cheaper to run

  • usable in real time

  • capable of voice cloning

  • practical in telephony, customer service, and media production

Today's startups and scale-ups focus on areas like:

  • AI voice agents for call centres

  • digital brand voices for companies that want a consistent audio identity

  • data annotation and modelling for less-represented languages and dialects

  • voice interfaces for sectors like healthcare, logistics, and education

  • translation combined with TTS for international communication

Bigger companies are also experimenting with their own branded voice models for internal processes and customer contact.

Last Updated: April 18, 2026
Keywords: text-to-speech, TTS, AI voice agent, speech synthesis, NLP, neural TTS, WaveNet, accessibility, speech-to-text, artificial intelligence