Vocal Synthesis Timeline

A very rough timeline of vocal synthesis technologies. At the moment, just jotting down things from memory. Obviously not comprehensive by any means.

References and citations to come, when I have that built into this wikizet.

There's a lot more to say about speech synthesis than singing synthesis. Too much, actually. So my scope will skew towards singing synthesis and musical applications of speech synthesis.

1930s and 1940s: Voder

In the late 30s and 40s, Bell Labs creates the Voder, an interface for controlling an electronic voice. Despite being synthesized with rudimentary electronic components, the Voder had a surprising range of speech prosody thanks to its interface. It was also notoriously difficult to control, and very few people were capable of performing with it effectively. Such is usually the trade-off with articulatory control of artificial voice: it is hard to build voice-control interfaces with both high ceilings and low floors.

1960s: Early Physical Models

Physically-based computer models of the vocal tract have been around since the 60s, and singing computers have existed for almost as long. In 1962, John L. Kelly and Carol C. Lochbaum publish one of the first software implementations of a physical model of the vocal tract. The year before, this model provided the singing voice in a famous rendition of "Daisy Bell", with musical accompaniment programmed by Max Mathews: the first time a computer was taught to sing, and perhaps one of the earliest significant works of computer music. This work would go on to influence the creation of HAL in 2001: A Space Odyssey, and HAL in turn set the expectations for what a disembodied computer voice should sound like, expectations that can still be felt in today's virtual assistants.
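
The core of the Kelly-Lochbaum model is remarkably small: the vocal tract is approximated as a chain of cylindrical tube sections, and traveling pressure waves scatter at each junction according to a reflection coefficient computed from the neighboring cross-sectional areas. Here is a minimal C sketch of that scattering loop. The area function, boundary reflection values, damping, and sawtooth source are placeholder assumptions of mine, not the 1962 implementation.

    #include <stdio.h>
    #include <math.h>

    #define N 44          /* number of tube sections */
    #define SR 44100      /* sample rate */

    int main(void)
    {
        double area[N];                  /* cross-sectional areas (arbitrary units) */
        double k[N];                     /* reflection coefficients, junctions 1..N-1 */
        double R[N] = {0}, L[N] = {0};   /* right/left traveling waves per section */
        double Rj[N + 1], Lj[N + 1];     /* junction outputs */
        int i, n;

        /* made-up, vaguely "ah"-like area function: narrow near the
           glottis, widening toward the lips */
        for (i = 0; i < N; i++)
            area[i] = 0.5 + 2.5 * (double)i / N;

        /* reflection coefficient at each interior junction */
        for (i = 1; i < N; i++)
            k[i] = (area[i - 1] - area[i]) / (area[i - 1] + area[i]);

        /* one second of raw samples to stdout */
        for (n = 0; n < SR; n++) {
            /* crude stand-in glottal source: a 110 Hz sawtooth */
            double src = 2.0 * fmod(110.0 * n / (double)SR, 1.0) - 1.0;

            /* boundary conditions: partial reflection at glottis and lips */
            Rj[0] = L[0] * 0.75 + src;
            Lj[N] = R[N - 1] * -0.85;

            /* one-multiply scattering at each interior junction */
            for (i = 1; i < N; i++) {
                double w = k[i] * (R[i - 1] + L[i]);
                Rj[i] = R[i - 1] - w;
                Lj[i] = L[i] + w;
            }

            /* waves travel one section per sample, with mild damping */
            for (i = 0; i < N; i++) {
                R[i] = Rj[i] * 0.999;
                L[i] = Lj[i + 1] * 0.999;
            }

            printf("%f\n", R[N - 1]);    /* output tap at the lips */
        }
        return 0;
    }

That is essentially the entire synthesis loop. Everything else in a full model is control: shaping the area function over time, and using a proper glottal source.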

1970s and 1980s

Cheaper hardware leads to concatenative speech synthesis. Decline in Spee

1976: Byte Magazine Volume 00, Number 12: Speech Synthesis https://archive.org/details/byte-magazine-1976-08

1984: MacinTalk demo on the original Macintosh.

1991: Perry Cook and Singing Synthesis

In 1991, Perry Cook publishes a seminal work on articulatory singing synthesis. In addition to creating novel ways of analyzing and discovering vocal tract parameters, Cook builds an interactive GUI for realtime singing control of the DSP model. Thanks to hardware improvements, this is perhaps the earliest point at which realtime control of such models was possible.

Early 2000s

Vocaloid

In the early 2000s, a commercial singing synthesizer known as Vocaloid is born. Under the hood, Vocaloid implements a proprietary form of concatenative synthesis. Voices for Vocaloid are created by meticulously sampling the performances of live singers. Still in development today, Vocaloid has a rich community and is widely considered "cutting-edge" for singing synthesis in the industry.
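
Vocaloid's engine is proprietary, but the basic idea of concatenative synthesis is easy to sketch: select recorded units (diphones, say) from a sample library and splice them end to end, smoothing over the seams. A toy time-domain splice with a linear crossfade might look like the C below. This illustrates the general technique only, not Vocaloid's actual algorithm.

    #include <stddef.h>

    /* Splice a sampled unit onto the end of an output buffer with a
       linear crossfade over `xf` samples. Returns the new length.
       Caller must ensure `out` has room for len - xf + n samples. */
    size_t splice_unit(float *out, size_t len,
                       const float *unit, size_t n, size_t xf)
    {
        size_t i;
        if (xf > len) xf = len;
        if (xf > n)   xf = n;

        /* crossfade region: fade out the old tail, fade in the new unit */
        for (i = 0; i < xf; i++) {
            float a = (float)(i + 1) / (float)(xf + 1);
            out[len - xf + i] = (1.0f - a) * out[len - xf + i] + a * unit[i];
        }

        /* copy the rest of the unit verbatim */
        for (i = xf; i < n; i++)
            out[len - xf + i] = unit[i];

        return len - xf + n;
    }

In a real system, each unit would also be pitch-shifted and time-stretched to match the target melody before splicing, which is where most of the hard engineering lives.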

One of the interesting things about Vocaloid is how it addresses the uncanny valley issues that come up in vocal synthesis. Each voice preset, or "performer", is paired with a cartoon anime character with a personality and backstory. Making them cartoons does a lot to steer them away from the uncanny valley. Unlike most efforts in speech synthesis, fidelity and even intelligibility are less important. As a result, Vocaloid has a distinct signature sound that is artificial yet familiar.

Mullen (2006?): 2D waveguide extension of the Kelly-Lochbaum model

Late 2010s-Present

Machine Learning, TTS

Machine learning models trained on large datasets of recorded speech. Considered to be state of the art for speech synthesis.

Google WaveNet (2016):

https://en.wikipedia.org/wiki/WaveNet
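
One concrete detail from the WaveNet paper: instead of predicting raw floating-point samples, it quantizes audio to 256 levels using mu-law companding and treats each output sample as a 256-way classification. The companding step itself is simple enough to sketch in C (my sketch, not DeepMind's code):

    #include <math.h>

    #define MU 255.0   /* 8-bit mu-law, as in the WaveNet paper */

    /* Compress a sample in [-1, 1] to one of 256 levels. */
    int mulaw_encode(double x)
    {
        double y = copysign(log(1.0 + MU * fabs(x)) / log(1.0 + MU), x);
        return (int)((y + 1.0) * 0.5 * 255.0 + 0.5);   /* quantize to 0..255 */
    }

    /* Expand a quantized level back to a sample in [-1, 1]. */
    double mulaw_decode(int q)
    {
        double y = (double)q / 255.0 * 2.0 - 1.0;
        return copysign((pow(1.0 + MU, fabs(y)) - 1.0) / MU, y);
    }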

2020s

Some singing synthesis using GANs, but not very interactive.

Interactive Web Programs: Pink Trombone and Blob Opera

Faster computers mean more interactive singing models!

Relatively recent developments in the web browser have yielded very interesting interfaces for musical control of artificial voice. In late 2020, Google releases Blob Opera, an interactive a cappella singing quartet of anthropomorphic blobs, which reportedly uses machine learning to produce chord progressions. A few years earlier, Adult Swim releases "Choir", a Web Audio powered quartet with a similar premise. Both are collaborations between David Li and Chris Heinrichs. The vocal models here sound physically based, but I have yet to confirm this, as the source code is not available.

While not deliberately musical like "Blob Opera" or "Choir", Pink Trombone is a fantastic web app developed by Neil Thapen, predating the two efforts by a few years. Touted as a low-level speech synthesizer, the Pink Trombone interface is an anatomical cross-section of the vocal tract that can be manipulated in realtime using the mouse, or touch on mobile. The underlying model is a variation of the Kelly-Lochbaum physical model, driven by an analytical Liljencrants-Fant (LF) glottal model. Pink Trombone serves as the basis of Voc, a port I made of the DSP layer to ANSI C using a literate programming style.
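
The LF model takes a fair amount of algebra to set up, so as a taste of what a parametric glottal source looks like, here is the simpler Rosenberg pulse in C: an opening phase, a closing phase, and silence for the rest of the period. To be clear, this is a stand-in of my own choosing; Pink Trombone and Voc use the LF waveform.

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Rosenberg glottal pulse. `t` is the phase within one period,
       normalized to [0, 1). `t1` is the opening duration and `t2` the
       closing duration (t1 + t2 <= 1); the remainder of the period is
       the closed phase. */
    double rosenberg(double t, double t1, double t2)
    {
        if (t < t1)             /* opening: half-cosine rise */
            return 0.5 * (1.0 - cos(M_PI * t / t1));
        if (t < t1 + t2)        /* closing: quarter-cosine fall */
            return cos(M_PI * (t - t1) / (2.0 * t2));
        return 0.0;             /* closed phase */
    }

Feed something like this into the glottal end of a tract model in place of a raw sawtooth, and the output immediately sounds more voice-like.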

Much of Neil Thapen's work in Pink Trombone can be traced back to Jack Mullen's DSP dissertation on vocal tract modeling with 2D waveguides.

iOS

HOWL comes to mind as a pretty decent singing synthesizer with a novel interface. Circa 2015-2016?