Weekly Update 3: Synthesizing the Voice of a Tiny Creature


"goblin remembering" from Gestlings

Suddenly, a Tiny Being

This week, I found myself synthesizing the voice of a Tiny Creature. Here's what it ended up sounding like:

What's special for me here is that the voice is tiny. It is a new trick I managed to teach my singing computer this week. It is something I managed to figure out while exploring some other singing synthesizers and trying to dig into how they work.

Tiny Creatures Are Narrative Magnets

Constructed sounds like these interest me because of the way they conjure up story and character. This sound is entirely synthesized from math and code, yet my brain can't help but put a face and backstory on it. For example, I like to imagine this voice coming from some kind of Tree Sprite, excitedly telling you about their morning adventure searching for mushrooms up by the swamp. This kind of sound design with an attached character narrative around is mechanic I like to play with, as can be seen/heard with goblins and computer on the phone with his mother.

Choir: A Compelling Interactive Singing Barbershop Quartet

The Choir interactive musical interface found on Adult Swim is a big inspiration for me, right up there with Pink Trombone. It's made by the same people who did Blob Opera, only this one was made first, and to me it sounds better (Blob Opera sounds a little over-engineered to me: they got too into making different phonemes with fricatives, and there's too much reverb).

My Previous Attempts To Synthesize Vocal Ensembles

Before we can talk about Synthesize Tiny Voices, we need to talk about Synthesizing singing voice ensembles.

By Singing Computer standards, Choir is an excellent sounding ensemble, and this is one of the primary reasons why I love it so much. Humans singers work very hard to blend when they are singing together. Robot singers tend to have the opposite problem of being so precise that the voices melt into eachother, resulting in a thin texture.

I know how robot singing enembles fail from personal experience. You could say there's some history.

In an experiment I did back in 2021, I worked out a barbershop tag arrangement using 4 instances of an iteration of my vocal synthesizer. In my attempt, I added variations to pitch, vibrato depth, and maybe even vowel shape to try and make the voices pop out more in the mix.

The end result is okay, but it certainly wasn't "Choir", and not even close to richness of the human performed version. Golly, what a sound. Wish I could get stuff to sound like that.

This wasn't my only attempt at ensembles. Later that year, I would try to do ensembles a few more times for looptober. You can hear how the synthesized voices singing together begin to start to sound like an organ:

You can also hear this organ-like in the background vocals in this doo-woppy arrangment here:

Things got a little bit better when I started using more "breathy alto" and less "intense tenor/baritone". The textures in these little "bossa nova" and "jazzy" sketches are a bit of an improvement, but still the voices "clump".

I started trying to add small differences to the voices in ways I knew how, like in pitch and timing. This also helped a bit for this bit inspired by Howard Shore's Passing of the Elves:

This piece, which makes me think of singing angels, has some good vocal textures to it as well, but I think it's less to do with the voices and more to do with "pretty chords" and slides (the pitch sliding was a great suggestion by someone, and it helped a lot):

So, what does Choir do to get their singing synthesizers to sound so good together? Truthfully, I am still not 100% sure, but this week I found some pretty strong clues. (And yes, this will lead up to the tiny voices).

Hunting for Clues in "Choir"

Choir is not Open Source. I can't look at their code and see how they made their singing synthesizer algorithm. But, there are some things available to look at. Choir is made up of a handful of scripting code (JavaScript) and executable code (Web Assembly). The executable code is where the meat of the singing synthesizer stuff is, but also the densest thing to look at. The Javascript is a little bit easier to look at. While it has been obfuscated and minified, there are ways to make it a bit more readable, and this ended up being an entry point to get some hints.

The JavaScript code can be thought of as a kind of middleman between the web browser and the singing DSP code. It loads up the webassembly code containing the singing DSP code, and pokes at it, telling it what notes to play and when to render sound. These communications have human-readable names to it, such as "gurgle" and "hoarseness". There's a lot that can be assumed by underlying model by the kind of things it is being told to do. I had guessed that the underlying model was in the family of physically based articulatory singing synthesis models like the ones I build, and some of the control message names would seem to suppor this this: like "levelsglottis", "levelstract", and "levels_fricative". What's very interesting is that if you have a local copy of Choir, you can change some of these parameters in the code.

One particular parameter of interest to me was something called "tract_scale". Each of the 4 voices had a slightly different scale, set to some value close to 1, like 1.02, 1.03, and 0.93. As an experiment, I isolated the "lead" vocalist, and cranked its "tract_scale" up to 2. When I reloaded the Choir program, I had a voice that sounded like a singing child. Not only that, but it seemed to be singing the same "ah" vowel sound at the same pitch, just from a smaller voice. Counterintuitively, a larger tract scale seemed to create a vocal tract that with a smaller length. Different vowel sounds have particular shapes. This seemd to be taking a particular shape, and stretching or squashing it to fit the overall tract.

Tract scale seems to be one of the ways Choir is able to get such a rich texture. By slightly varying to size of each singers vocal tract, each singer is given a slightly different timbre and vocal quality, which in turn adds more distinctness to the mix.

Teaching the singing computer new tricks...

But was it really that easy? Treating a vowel like a rubber mask and stretching it out to fit on a tract of particular length? That mechanical singing voice that I accidentally miniaturized was a siren call for me. There was no way I wasn't going to have this kind of sound in my system. I now knew what I was missing. I needed tiny voices. I had to try.

With my text editor opened up to the code of my singing synthesizer, I quickly chiselled out a crude equivalent of "cranking the tract scale to 2" like I had done in Choir. Sure enough, the chipmunkification of the voice could be heard. Go figure. I garnished it with a bit of character and enthusiasm, and created the tiny being played at the beginning of this blog post.

This was a only proof of concept, and I had indeed proven the concept. The next phase was actually building something I could use more than once. So, I coded up a new version of the vocal tract component, where the overall length could be adjusted. I taught a critter how to say "oo", then "ah", then how to wildly alternate between the two:

The program to generate this sound was first written in C, and then rewritten in LIL, the scripting language that is shipped with sndkit DSP library and mnolth.


Overall, this week has had some great unexpected discoveries. Vocal tract scaling gives some good amunition for the Gestlings project, whose goal is to design sounds with personality.