
Text-to-speech (TTS)

Text-to-speech (TTS) turns written text into spoken audio. It evolved from clunky mechanical machines to modern AI voices that sound almost human. You will find it in accessibility tools, automation, AI voice agents, and content production.

What is text-to-speech (TTS)?

Text-to-speech, or TTS, is technology that turns written text into spoken audio. You hear it in navigation apps, digital assistants, AI voice agents, call centres, e-learning platforms, and audiobooks. Voices range from the basic robotic tones of the early days to highly lifelike AI voices that adjust intonation, emotion, and pace on the fly.

At its core, TTS comes down to three steps:

  1. analyse the text

  2. decide on pronunciation and rhythm

  3. generate the audio

Modern systems use neural models that sound far more natural than the early generations ever managed.

How does TTS work?

1. Text analysis

The model reads the text and identifies sentences, punctuation, numbers, and abbreviations. Smarter systems even pick up on semantics so they can place pauses or emotional cues in the right spots.
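Text normalisation is the concrete heart of this step. A minimal sketch, assuming a tiny hand-made abbreviation table and digit-by-digit number expansion (real TTS front ends use far larger, language-specific resources):

```python
import re

# Illustrative lookup tables only; production systems use much larger,
# language-specific resources and context-aware rules.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def expand_number(match: re.Match) -> str:
    """Spell out an integer digit by digit (e.g. '42' -> 'four two')."""
    return " ".join(ONES[int(d)] for d in match.group())

def normalize(text: str) -> str:
    """Expand abbreviations and digits so later stages see plain words."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"\d+", expand_number, text)

print(normalize("Dr. Smith lives at 42 Main St."))
# -> Doctor Smith lives at four two Main Street
```

A real normaliser would also handle dates, currencies, and ambiguous cases ("St." as Street versus Saint), which is why this stage often needs semantic context.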

2. Linguistic conversion

The system works out how each word should sound: stress, intonation, rhythm, and pace.
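The classic mechanism behind this step is grapheme-to-phoneme conversion: each word is mapped to a phoneme sequence with stress marks. A toy sketch, assuming a two-word ARPAbet-style lexicon invented for illustration (real systems combine large pronunciation dictionaries with trained G2P models for unknown words):

```python
# Toy lexicon; the entries are illustrative, not from a real dictionary.
# Trailing digits mark vowel stress: 1 = primary stress, 0 = unstressed.
LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}

def to_phonemes(sentence: str) -> list[list[str]]:
    """Map each known word to its phoneme sequence, one list per word."""
    phones = []
    for word in sentence.lower().split():
        word = word.strip(".,!?")
        # Naive fallback: spell unknown words letter by letter.
        phones.append(LEXICON.get(word, list(word.upper())))
    return phones

print(to_phonemes("Hello, world!"))
# -> [['HH', 'AH0', 'L', 'OW1'], ['W', 'ER1', 'L', 'D']]
```

Prosody (intonation, rhythm, pace) is layered on top of this phoneme sequence, typically predicted by a statistical or neural model rather than looked up.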

3. Audio generation

The engine converts the prepared text into audio. There are three broad approaches:

  • Formant synthesis: sound generated entirely from acoustic rules that model the resonances of the vocal tract (older, robotic).

  • Concatenative synthesis: real recorded speech fragments stitched together (more natural).

  • Neural TTS: AI models that generate the audio waveform directly from text or intermediate features (very natural, with flexible emotion and pacing).
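To make the oldest approach concrete, here is a bare-bones formant-style sketch: a vowel-like sound approximated by summing sine waves at typical formant frequencies, written to a WAV file with only the Python standard library. The frequencies and amplitudes are rough textbook values for an "ah"-like vowel, chosen for illustration only.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000
# (frequency in Hz, relative amplitude) -- illustrative formant values.
FORMANTS = [(700, 1.0), (1200, 0.5), (2600, 0.25)]

def synth_vowel(duration_s: float = 0.5) -> bytes:
    """Render the summed formant sinusoids to 16-bit mono PCM."""
    n = int(SAMPLE_RATE * duration_s)
    total_amp = sum(a for _, a in FORMANTS)
    samples = []
    for i in range(n):
        t = i / SAMPLE_RATE
        value = sum(a * math.sin(2 * math.pi * f * t) for f, a in FORMANTS)
        samples.append(int(32767 * value / total_amp))  # normalise to int16
    return struct.pack(f"<{n}h", *samples)

with wave.open("vowel.wav", "wb") as wav:
    wav.setnchannels(1)            # mono
    wav.setsampwidth(2)            # 16-bit samples
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(synth_vowel())
```

Real formant synthesisers vary these frequencies over time to move between phonemes, which is what made voices like DECtalk intelligible despite their robotic timbre.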

A short history of TTS

The mechanical era (18th and 19th century)

Wolfgang von Kempelen completed his mechanical speaking machine in 1791, after more than two decades of experiments. It produced speech-like sounds using bellows and reeds. It was not real speech synthesis, but it was a milestone in modelling the human voice.

The electronic era (1930 to 1960)

In 1939, Bell Labs unveiled the Voder, the first electronic speech synthesiser. An operator formed sounds by pressing keys and a foot pedal.

Formant models (1960 to 1980)

Researchers modelled the resonance of the human vocal tract. The output was robotic but intelligible. This led to the first computer-driven TTS systems.

DECtalk and the robot voices (1980 to 1990)

The DECtalk system became iconic. Stephen Hawking famously used a variant of it. The speech was mechanical but useful for accessibility and early call centres.

Concatenative synthesis (1990 to 2010)

TTS shifted to using real audio fragments. The result was much more natural, but harder to adapt. Navigation systems and telephony adopted it at scale.

The neural revolution (2016 onward)

DeepMind introduced WaveNet, followed by models like Tacotron, FastSpeech, Glow-TTS, and VITS. They produce fluid, realistic speech and can shape emotion, style, and context.

The pioneers behind commercial TTS

One name keeps surfacing in any history of speech technology: Lernout & Hauspie (L&H). The Belgian company grew through the 1990s into a global player in speech recognition and synthesis, building commercial TTS voices at a time when the technology was still mostly an academic curiosity. Their products ended up in call centres, screen readers for the visually impaired, medical dictation systems, and consumer electronics. Around L&H, an ecosystem called Flanders Language Valley formed in West Flanders. After L&H collapsed in 2001, many of its engineers moved to Nuance and ScanSoft, both later absorbed into Microsoft. Academic work at KU Leuven and imec on acoustic modelling and neural speech synthesis kept the region active in the field well into the WaveNet era.

The AI wave: new players, new use cases

From 2020 onward, the sector picked up pace again. TTS suddenly became:

  • more natural sounding

  • cheaper to run

  • usable in real time

  • capable of voice cloning

  • practical in telephony, customer service, and media production

Today's startups and scale-ups focus on areas like:

  • AI voice agents for call centres

  • digital brand voices for companies that want a consistent audio identity

  • data annotation and modelling for less-represented languages and dialects

  • voice interfaces for sectors like healthcare, logistics, and education

  • translation combined with TTS for international communication

Bigger companies are also experimenting with their own branded voice models for internal processes and customer contact.

Last Updated: April 18, 2026
Keywords: text-to-speech, TTS, AI voice agent, speech synthesis, NLP, neural TTS, WaveNet, accessibility, speech-to-text, artificial intelligence