Vocal Synthesis DSP resources

A curated outline of various literature related to Vocal Synthesis, with a particular emphasis on singing synthesis and physically based singing synthesis techniques.

WIP. This is a bit of a braindump at the moment. I don't have links yet since I'm jotting this down quickly. But hopefully there are enough clues to track down this information.

Julius Smith Singing Synthesis Page: good historical background on early (1960s!) physical modelling of the singing voice, and the Kelly-Lochbaum Scattering Junction.

Perry Cook Dissertation. Perry Cook's dissertation on singing synthesis builds on much of the existing literature in articulatory and physically based speech synthesis, but in the context of singers specifically. The major novelty in this work is the use of pulsed noise in the glottal excitation signal to produce more realistic effects. His opening section on "singing vs speech synthesis" is wonderful, and really drives home the fact that singing is far more than "pitched speech".
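To make the "pulsed noise" idea a little more concrete, here is a rough Python sketch of noise that is gated by the glottal cycle, so the breathiness only appears while the glottis is open. This is my own illustration of the general idea, not Cook's actual formulation: the half-sine pulse shape, the function name, and the `open_quotient` and `noise_gain` parameters are all assumptions made up for this example.

```python
import math
import random


def pulsed_noise_excitation(f0, sample_rate, n_samples,
                            open_quotient=0.6, noise_gain=0.1, seed=0):
    """Return a list of excitation samples: glottal pulse plus pulsed noise.

    The pulse is a placeholder half-sine over the open phase of each period.
    The noise is multiplied by the same pulse, so it is silent while the
    glottis is closed and loudest when the glottis is most open.
    """
    rng = random.Random(seed)
    out = []
    phase = 0.0  # position within the current glottal period, 0..1
    for _ in range(n_samples):
        if phase < open_quotient:
            pulse = math.sin(math.pi * phase / open_quotient)
        else:
            pulse = 0.0  # closed phase: no flow, no noise
        noise = rng.uniform(-1.0, 1.0)
        out.append(pulse + noise_gain * pulse * noise)
        phase += f0 / sample_rate
        if phase >= 1.0:
            phase -= 1.0
    return out


# 100 Hz voice at a 10 kHz sample rate: one period is 100 samples.
excitation = pulsed_noise_excitation(100.0, 10000.0, 200)
```

Compare this with simply adding constant white noise to the pulse train: there the noise would be audible through the closed phase as well, which tends to sound like hiss layered on top of the voice rather than part of it.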

Perry Cook's book on realtime physical modelling of sound has a brief section on vocal tract waveguides that is worth reading.

The US patent for the vocal tract waveguide synthesizer (US5528726) is actually reasonably comprehensible. Perry Cook is a co-author, and if you look closely, you'll see similarities between his dissertation and this patent. I believe the equations in the patent are correct, and that the dissertation has typos in some of the formulas. It was nice to have this for comparison.

Hui-ling "Vicky" Lu's dissertation on glottal source modelling is probably one of the most helpful resources I've found for understanding the topic. Similar to Perry Cook, Lu's work is specifically geared towards singing rather than speech, which is quite rare to find. There's also a paper by Lu and Smith that's worth reading alongside the relevant bits of the dissertation.

Jack Mullen PhD dissertation. The primary focus of this thesis was the development of a 2D waveguide mesh to produce a vocal tract physical model, as opposed to the 1D waveguide used in "classical" cylindrical tube models like the Kelly-Lochbaum vocal tract. There's a lot of well written background information on the 1D cylindrical tube model. The derivations used in this thesis served as the baseline for the reflection coefficient computations and waveguide computation used in Pink Trombone, tract, and voc. According to Neil Thapen, the author of Pink Trombone, this was a helpful resource during the development of Pink Trombone. (I should note that some people erroneously claim that PT implements a 2D waveguide mesh. It does not. It is a classic 1D waveguide, which is a bidirectional delay line.)
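For a feel of what the reflection coefficient and junction computations in a 1D tube model look like, here is a minimal Python sketch of the textbook pressure-wave Kelly-Lochbaum junction. It is not code taken from Pink Trombone, tract, or voc, and sign conventions differ between implementations. The function names and the example area values are made up for illustration.

```python
def reflection_coefficients(areas):
    """Return one coefficient per junction between adjacent tube sections.

    Pressure-wave convention: k = (A_left - A_right) / (A_left + A_right).
    """
    coeffs = []
    for a_left, a_right in zip(areas, areas[1:]):
        total = a_left + a_right
        # Two fully closed neighbours would give 0/0; treat as no reflection.
        coeffs.append(0.0 if total == 0 else (a_left - a_right) / total)
    return coeffs


def scatter(k, right_in, left_in):
    """One Kelly-Lochbaum junction in one-multiply form.

    right_in: wave arriving from the left section, travelling right.
    left_in:  wave arriving from the right section, travelling left.
    Returns (right_out, left_out), the waves leaving the junction.
    """
    w = k * (right_in - left_in)
    right_out = right_in + w  # continues into the right section
    left_out = left_in + w    # heads back into the left section
    return right_out, left_out


# Hypothetical cross-sectional areas (arbitrary units), glottis end first.
areas = [1.0, 1.0, 3.0, 0.5]
ks = reflection_coefficients(areas)
```

In a full tract model, each section's travelling waves would sit in a pair of delay lines (one per direction), and `scatter` would be applied at every junction once per sample. Equal neighbouring areas give k = 0 (the wave passes straight through), while a fully closed section gives k = 1 (the pressure wave reflects completely).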

Pink Trombone, while not the easiest code to read (and honestly, DSP code always sucks to read), has been a valuable reference implementation for many of these models, and I owe a lot to Neil Thapen and that project. Much of my work is derived from it. To this day, there are still parts of it I don't fully understand.

Gnuspeech, which I believe is now a dead project, was at one point a state-of-the-art speech synthesizer utilizing articulatory speech synthesis. Its authors actually published a paper on some of the internals of the synthesis they used, which they call a "Tube Resonance Model" (more commonly referred to as a waveguide). That paper is one of the few resources that goes into any detail about implementing Mrayati's "Distinct Region Model" (other than Mrayati's paper, which wasn't easy to access).