Proposal

Here lies the PhD proposal template for the Intelligent Instruments programme, hand-ported to Org for convenience.

Project Title

Provide a working title for your project. This may change during the course of your research.

Possible title(s):

The Unsung Computer: Explorations in Musically Expressive Computer-Augmented Techniques in Articulatory Synthesis.

Less verbose:

Computer-Augmented Techniques in Musical Articulatory Synthesis.

Abstract

Short description of the planned research (max 250 words). This should summarize the project succinctly: the background context, what the key research question is, how it will be answered, and why the project matters.

Vocal (Articulatory) synthesis.

Control.

Articulation control systems.

Dimensionality reduction.

Using contemporary AI techniques.

Performance.

Realtime control.

High-dimensionality vocal tract physical models.

This thesis aims to explore novel techniques for controlling white-box singing and vocal synthesis models using contemporary AI.

---

Articulatory Synthesis is a branch of Speech Synthesis that uses physically based models of the human vocal tract for sound production. This research aims to develop novel techniques that utilize AI to help musically manipulate and perform these models.

Objectives and research questions

List the main objectives of the planned research and the research questions it seeks to answer. This part should contain a description of the innovative aspects of the research and how it will be an original contribution to knowledge in the field and potentially impact society more generally. Explain how your work relates to the Intelligent Instruments Programme.

---

Dimensionality reduction: how can AI be used to take the high-dimensional parameter spaces of vocal synthesizers and reduce them to manageable vectors of expression?

What is possible in a realtime audio scenario?

What are meaningful interfaces for control?

Applying vocal articulation techniques to other instruments and sound: what can we learn from the human singing voice?

---

Taking another stab at this.

The objective of this project is to investigate novel and musically significant ways to perform and control physical models of the human vocal tract using artificial intelligence.

From a technical standpoint, the scope of this project can be reduced to answering this question: How can AI be used to take the high-dimensionality of articulatory vocal synthesizer models and reduce it to more manageable vectors of expression?

Here, a vector of expression refers to any sort of high-level scalar or hyperparameter that can be hooked up to some kind of sensor or interface.
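As a rough illustration of what this reduction could look like, here is a minimal sketch assuming placeholder data and plain PCA rather than whatever technique is ultimately chosen. It compresses frames of tract diameters down to a two-dimensional expression vector and maps that vector back out again.

#+begin_src python
# A minimal sketch (assumptions: placeholder data, plain PCA via numpy's SVD;
# the real project may use a very different technique).
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 200 frames of 45 tract diameters (matching the count
# mentioned later for models like Voc and Pink Trombone). In practice these
# would come from captured performances or analysis of real speech.
frames = rng.uniform(0.1, 3.0, size=(200, 45))

mean = frames.mean(axis=0)
_, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
basis = vt[:2]                       # two principal directions of tract movement

def encode(diameters):
    """Reduce 45 diameters to a 2-D expression vector."""
    return (diameters - mean) @ basis.T

def decode(expression):
    """Map a 2-D expression vector (e.g. from an XY pad) back to 45 diameters."""
    return mean + expression @ basis

# Example: nudge the tract along the first expressive axis.
print(decode(np.array([0.5, 0.0]))[:5])
#+end_src

Whatever ends up in this role, PCA or otherwise, would also have to hold up under the realtime constraints discussed next.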

"Musically meaningful" has technical requirments. Digital musical instruments and interfaces need to be highly responsive in a low-latency realtime audio environment. This will prove to be an interesting problem to solve while using emerging artificial techniques such as machine learning or deep learning, which have toolings and ecosystems that are not usually optimized for realtime use. Form factor is important consideration for musical interfaces. Getting AI solutions working on embedded hardware is another unsolved area of tooling that would be a nice-to-have thing.

This dimensionality reduction problem has many potential answers, and it raises the question: What makes a vector of expression musically meaningful (in the context of controlling an artificial voice)?

This question is indeed a driving force for the craft of digital instrument design and new interfaces for musical expression. Listeners have a particular familiarity with the human voice, so a certain sensitivity is required. It is easy enough to land in the Uncanny Valley, where things become too familiar, yet not familiar enough. So the question on top of this would be:

In the realm of artificially produced vocal and vocal-like sound, what is considered "tolerable"?

---

TL;DR

How can AI be used to take the high-dimensionality of articulatory vocal synthesizer models and reduce it to more meaningful vectors of musical expression?

What makes a vector of expression musically meaningful (in the context of controlling an artificial voice)?

What is considered "tolerable" in the realm of artificially produced vocal and vocal-like sound?

And finally:

How can musical frameworks for controlling synthesized voice be applied to other non-vocal sounding virtual instruments?

All of this is done in an attempt to better connect computer music with the rich culture of music that precedes it.

---

What are the ways that AI can help musically manipulate and perform physically based vocal tract models?

What are the ideal interfaces for articulating disembodied artificial voice? Ideal here meaning designs that are expressive, intuitive, and anthropomorphically sensitive.

How can these new techniques be utilized to push the sonic boundaries of these models?

State of the art and background.

Describe the state of knowledge in the research field as relevant to the current research. What research has already been done and what is already known? Who are the key people and institutions in the field? This section should demonstrate a critical engagement with theories and secondary literature as well as other artefacts that might be relevant to your project.

---

This research topic involves a little bit of DSP, a little bit of HCI/NIME.

DSP-wise: speech and singing synthesis has been with us since the beginning of computer music. Arguably the first piece of computer music was "Daisy Bell" by Max Mathews, which featured a synthesized singing voice. This voice was the Kelly-Lochbaum vocal tract, a white-box physical model that approximated the tract as a series of cylindrical tubes of varying sizes. Sending a glottal signal through it caused vocal-like sounds to come out the other side. The interesting thing is that, being a white-box model, formant frequencies were produced implicitly by adjusting the diameters of the cylindrical tubes. These measurements were obtained by physically measuring real vocal tracts performing certain vocal sounds.
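To make the mechanism concrete, here is a rough sketch (illustrative only, with made-up diameters) of the core Kelly-Lochbaum idea: reflection coefficients derived from neighboring cross-sectional areas, and the scattering that happens at each junction. A real implementation such as Voc or Pink Trombone adds per-section delay lines, losses, a glottal source, and lip/nose boundary reflections.

#+begin_src python
# Rough sketch of a Kelly-Lochbaum tract core: cylindrical sections described
# by (made-up) diameters, with each junction scattering traveling pressure
# waves according to a reflection coefficient derived from the area ratio.
import numpy as np

diameters = np.linspace(0.6, 1.5, 8)      # hypothetical tract shape, glottis -> lips
areas = np.pi * (diameters / 2.0) ** 2

# Reflection coefficient at each junction for a pressure wave moving from
# section i into section i+1: k = (A_i - A_{i+1}) / (A_i + A_{i+1})
k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])

def scatter(right_in, left_in, k):
    """One Kelly-Lochbaum junction: waves arriving from both sides are
    partially transmitted and partially reflected."""
    right_out = (1.0 + k) * right_in - k * left_in   # continues toward the lips
    left_out = k * right_in + (1.0 - k) * left_in    # continues toward the glottis
    return right_out, left_out

print(k)
#+end_src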

In many ways, the Kelly-Lochbaum model can still be considered very close to state-of-the-art in articulatory speech synthesis. Over the decades it has fallen out of style in favor of other, cheaper synthesis techniques, perhaps because memory hardware becoming cheaper made them more practical. Some improvements to the model have been made (such as Mullen's 2D waveguide work).

In the 90s, Perry Cook does cutting-edge work in singing synthesis techniques, extending the original Kelly-Lochbaum model as well as exploring ways to control it in a realtime setting.

We see more interesting interactive uses of this physical model in the 2010s with Pink Trombone by Neil Thapen. Self-described as a "low-level speech synthesizer", Pink Trombone runs in a web browser and uses a 2D representation to manipulate the tongue, vocal tract, and nasal passages of the model. While fun and very rewarding to perform on, its use as a musical instrument is questionable. The DSP layer of this model was adapted by the author into a literate ANSI C program known as Voc. Some musical attempts were made by the author by controlling Voc inside of the Sporth language (another creation of the author).

The challenges of these cylindrical tube models are in managing the diameters. Each must be manipulated in such a way that the set produces the desired vocal formants. In models such as Voc and Pink Trombone, there are 45 individual diameters to manage. Managing these in a musically expressive way remains an open-ended problem. In Pink Trombone, the tract is manipulated using a high-level 2-dimensional macro control (a hyper-parameter) which maps a curve onto the tract diameters. Cook's singing models (SPASM and its voice Sheila) use fewer tract sections, and their control scheme still needs to be reviewed in detail.
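A hypothetical illustration of such a macro control (this is not Pink Trombone's actual formula, just the general idea): a 2-D "tongue" control of position and constriction superimposes a smooth bump onto a resting tract shape.

#+begin_src python
# Hypothetical macro control: map a 2-D (position, constriction) gesture onto
# the full array of tract diameters by subtracting a raised-cosine bump from a
# neutral rest shape. All values here are placeholders.
import numpy as np

N = 45                                    # tract section count, as noted above
rest = np.full(N, 1.5)                    # hypothetical neutral tract shape

def apply_tongue(rest, position, constriction, width=10.0):
    """position: 0..1 along the tract; constriction: 0..1 amount of narrowing."""
    center = position * (N - 1)
    idx = np.arange(N)
    # raised-cosine bump centered on the tongue position
    bump = 0.5 * (1.0 + np.cos(np.pi * np.clip((idx - center) / width, -1.0, 1.0)))
    return rest - constriction * bump

diameters = apply_tongue(rest, position=0.6, constriction=0.8)
#+end_src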

While the author does not know for certain because the code is not available, the "Choir" web toy released by Adult Swim and Blob Opera by Google sound like they both use physical models (probably written by the same person, too). These are both compelling in that they explore a cappella vocal ensembles through a one-to-many interface that makes the user the conductor. This idea has potential to be expanded further.

Meanwhile, in the commercial music industry, the early 2000s sees the beginnings of the cultural phenomenon known as Vocaloid, a singing synthesizer which enjoys great popularity and an active community to this day. Vocaloid uses concatenative synthesis, essentially sampling the utterances of real vocalists and dynamically stitching them together. It's a painstaking process to make a voice, but the end results that can be achieved are quite compelling.

Formant synthesis models like Klatt's are not to be ignored here. Formant synthesis techniques can be thought of as a black-box model of the voice, with the formant frequencies for target vowels pre-derived and used to tune some kind of resonator or filter bank in cascade or parallel.
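As a minimal sketch of the parallel variant (illustrative only; the formant and bandwidth values are approximate textbook figures for an /a/-like vowel, and no proper glottal model is used), a pulse train excites a small bank of two-pole resonators:

#+begin_src python
# Minimal parallel formant synthesis sketch: a pulse train excites a bank of
# two-pole resonators tuned to approximate formant frequencies.
import numpy as np

SR = 44100

def resonator(x, freq, bw, gain=1.0):
    """Two-pole resonator: poles at radius exp(-pi*bw/SR), angle 2*pi*freq/SR."""
    r = np.exp(-np.pi * bw / SR)
    theta = 2.0 * np.pi * freq / SR
    a1, a2 = -2.0 * r * np.cos(theta), r * r
    y = np.zeros(len(x) + 2)
    for n in range(len(x)):
        y[n + 2] = gain * x[n] - a1 * y[n + 1] - a2 * y[n]
    return y[2:]

# crude excitation: a 110 Hz impulse train lasting one second
x = np.zeros(SR)
x[:: SR // 110] = 1.0

out = sum(resonator(x, f, bw) for f, bw in [(700, 80), (1220, 90), (2600, 120)])
out /= np.max(np.abs(out))
#+end_src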

LPC techniques are another family worth mentioning.

In the late 2010s, the speech synthesis world is greeted with deep learning techniques. Google releases WaveNet, which produces very high-fidelity speech results, and there have been some attempts at using RNNs and GANs to produce singing, with varying degrees of fidelity. As far as the author knows, these are all offline and require many hours of upfront computing time to train a model.

Deep learning is certainly considered to be cutting-edge for producing high-fidelity speech, but the non-realtime control and upfront training make it unappealing for musical purposes.

---

TL;DR:

Articulatory synthesis and vocal tract models from the 60s, with some improvements in the decades to follow. Research in musical articulatory synthesis is stagnant.

Perry Cook's (PRC) thesis in the 90s is the key work.

Other more efficient speech synthesis systems got more popular: concatenative, formant, LPC, etc. These aren't as configurable as articulatory synthesis.

Vocaloid is the most notable commercial singing synthesis product (concatenative).

Deep learning models like WaveNet are now considered state-of-the-art for speech synthesis, and the fidelity is quite good. Some initial attempts at getting RNNs and GANs to produce singing voices have been made. Compelling, but not musically interesting (minimal control; the only metric is fidelity).

Control of artificial speech without using a human mouth or reference voice (talkbox, vocoder) is a difficult problem. The human vocal articulation system is very complex.

The Voder developed by Bell Labs in the 1930s was an early attempt to build such an interface. It could produce a surprising level of prosody (cite video), but was notoriously difficult to master (find citation).

Vector synthesis is another approach, particularly with formant control. HOWL by Daniel Clelland is a formant synthesizer featuring a 2-dimensional XY pad divided into 5 regions, each region representing a vowel sound. Very musically rewarding and satisfying to play.

"Pink Trombone" by Neil Thapen: web app that allows one to pull at a virtual vocal tract on screen. Multi-touch. Very compelling concept, but it doesn't sing. This uses a vocal tract model, and the dozens of cylindrical diameters are controlled via empirically derived mathematical function.

"Blob Opera" By Google and "Chorus" by Adult Swim are also musical web interfaces built relatively recently that sing. Both sound like they could be using physical models. These control ensembles rather than individual voices, apparently using machine learning to compose the chords. The arrangement is largely homophonic in nature. Future works in this paradigm could involve more indepently moving polyphonic lines, similar to sacred choral works commonly found in renaissance era (palestrina, etc).

---

Take 2

Research in Articulatory Synthesis for Speech has been relatively stagnant in recent decades, with deep learning being favored instead. Musical applications for Articulatory Synthesis, such as singing, are even rarer to find. We are well overdue for a renaissance.

In the late 30s and 40s, Bell Labs creates the Voder, an interface for controlling an electronic voice. The Voder, despite being synthesized using rudimentary electronic components, had a surprising range of speech prosody thanks to the interface. The Voder was notoriously difficult to control, and very few people were capable of effectively performing with it. Such is usually the trade-off with articulatory control of artificial voice: it is hard to build interfaces for artificial voice control with low floors and high ceilings.

Physically-based computer models of the vocal tract have been around since the 60s, and singing computers have existed for almost as long. In 1962, John L. Kelly and Carol C. Lochbaum publish one of the first software implementations of a physical model of the vocal tract. This model was used the year before as the singing voice in "Daisy Bell" by Max Mathews, the first time a computer would be taught to sing, and perhaps one of the earliest significant works of computer music. This work would go on to influence the creation of HAL in 2001: A Space Odyssey. HAL would then set the expectations for what a disembodied computer voice should be, expectations which can still be felt in today's virtual assistants.

In the 70s and 80s, computer hardware begins to change. Memory becomes cheaper, and mass production of ICs begins. Lower-quality-sounding speech techniques such as LPC, concatenative synthesis, and formant synthesis are able to better leverage the new hardware and are quite fast.

In 1991, Perry Cook publishes a seminal work on articulatory singing synthesis. In addition to creating novel ways for analyzing and discovering vocal tract parameters, Cook also builds an interactive GUI for realtime singing control of the DSP model. This is perhaps the earliest time realtime control of such models was possible, thanks to the hardware improvements.

In the early 2000s, a commercial singing synthesizer known as Vocaloid is born. Under the hood, Vocaloid implements a proprietary form of concatenative synthesis. Voice sounds for Vocaloid are created by meticulously sampling the performances of live singers. Still in development today, Vocaloid has a rich community and is most definitely considered "cutting-edge" for singing synthesis in the industry.

One of the interesting things about Vocaloid is how it addresses the uncanny valley issues that come up when doing vocal synthesis. Each voice preset, or "performer", is paired with a cartoon anime character with a personality and backstory. Making them cartoons does a lot to steer them away from the uncanny valley. Unlike most efforts in speech synthesis, fidelity and even intelligibility are less important. As a result, Vocaloid has a distinct signature sound that is both artificial yet familiar.

Relatively recent developments in the web browser have yielded very interesting interfaces for musical control of artificial voice. In late 2020, Google releases Blob Opera, an interactive a cappella singing quartet of anthropomorphic blobs, allegedly using machine learning to produce chord progressions. A few years earlier, Adult Swim releases "Choir", a web-audio-powered quartet with a similar premise. These are both collaborations by David Li and Chris Heinrichs. The vocal models here sound physically based, but I have yet to confirm this as the source code is not available.

While not deliberately musical like "Blob Opera" or "Choir", Pink Trombone is a fantastic web app developed by Neil Thapen, predating the two efforts by a few years. Touted as a low-level speech synthesizer, the Pink Trombone interface is an anatomical cutaway view of a vocal tract that can be manipulated in realtime using the mouse, or a pointer on mobile. The underlying model is a variation of the Kelly-Lochbaum physical model, utilizing an analytical LF glottal model. Pink Trombone serves as the basis of Voc, a port I made of the DSP layer to ANSI C using a literate programming style.

Much of Neil Thapen's work in Pink Trombone can be traced back to Jack Mullen's DSP dissertation on using 2D waveguides for vocal tract synthesis and control.

The elephants in the room here are the very recent attempts at getting deep learning speech synthesis to sing. A researcher in AI may mistakenly refer to this as the cutting edge. While it is true the output results are very impressive, these are still speech synthesis studies in musicians' clothing, as they tend to focus on fidelity rather than expression.

Research Methodology

Explain which methods you will apply to answer your research questions. How will these methods result in concrete outcomes? Are there any ethical issues that need to be considered? You might want to provide links to online examples of your earlier work here. If relevant, this part should also describe access to facilities, materials, or data that you need.

Take 1

The research methodology utilized will be largely hands-on with a heavy focus on iterative implementations.

There are four major components to this project: interface, mapping, models, and sound transmission. Interface concerns itself with the physical devices a person uses for control. A good interface aims to capture the gestures of a human performer with a high degree of fidelity and convert them to a machine-readable stream or messaging format. The mapping layer receives messages from the interface and converts them to a low-level parameter space that is then sent to the model, which is the articulatory vocal tract synthesizer. The model produces a stream of digital PCM audio which is then converted to an analogue signal and sent to speakers emitting sound: the sound transmission layer.
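Schematically (a sketch only; the function names, shapes, and values below are placeholders, not a real implementation), the signal path looks like this:

#+begin_src python
# Schematic sketch of the four-component signal path described above.
import numpy as np

def interface_read():
    """Interface: capture a performer gesture as a machine-readable message."""
    return {"x": 0.6, "y": 0.3}                  # e.g. a touch position

def mapping(msg):
    """Mapping: convert the high-level gesture into low-level model parameters."""
    # placeholder: expand a 2-D gesture into 45 tract diameters
    return np.full(45, 1.5) - msg["y"] * 0.8 * np.hanning(45)

def model(params, nframes=64):
    """Model: the articulatory synthesizer renders a block of PCM audio."""
    return np.zeros(nframes)                     # stand-in for the DSP layer

def sound_transmission(block):
    """Sound transmission: hand the block to the audio backend / DAC."""
    pass

sound_transmission(model(mapping(interface_read())))
#+end_src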

The research questions proposed in the previous section mainly focus on the mapping and interface components. Models will by and large use DSP algorithms and implementations based on previous research. Minimal focus will be placed on the sound transmission layer, which will make use of whatever sound systems are practically available.

Validation studies.

Ethics: steer clear of deep fakes.

Facilities: most research (mapping) "in-the-box" and software based. Robust hardware prototyping is time consuming and would require assistance. The hope is to utilize off-the-shelf interfaces (MIDI interfaces, tablets, gamepads, etc.), or easy to build interfaces using maker components.

Projects to try to slip in here:

Voc: the starting DSP model to use.

EtherSurface/Orb: Android Apps. Good for rapid prototyping of ideas. Also tests how well instruments work in realtime. Also very portable low-cost demo tool.

Contrenot: Building bespoke interfaces using arduinos and other maker-friendly components. Contrenot pages give some insight into how I approach designing for instruments with simple inputs.

Monolith: my from-scratch live coding environment for sound and DSP with a scheme frontend. Can be used for rapid prototyping.

Take 2

My main research question:

How can AI be used to take the high-dimensionality of articulatory vocal synthesizer models and reduce it to more meaningful vectors of musical expression?

This is mainly iterative software engineering. The starting point is using AI to solve for tract diameters, and writing automated test suites to objectively measure how well this works. The scope of this kind of development and testing will be influenced by the research done to answer the remaining questions.
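To give a sense of what "objectively measure" could mean here (a sketch under assumptions: "synthesize" is a hypothetical stand-in for the articulatory model, and log-spectral distance is just one candidate metric), a test might compare a rendered vowel against a reference recording:

#+begin_src python
# Sketch of an objective, automated check: synthesize a vowel from a candidate
# diameter set and compare its spectrum against a reference recording.
import numpy as np

def spectral_distance(a, b, nfft=2048):
    """Log-magnitude spectral distance between two mono signals."""
    A = np.abs(np.fft.rfft(a[:nfft], nfft)) + 1e-9
    B = np.abs(np.fft.rfft(b[:nfft], nfft)) + 1e-9
    return float(np.mean((np.log(A) - np.log(B)) ** 2))

def test_vowel_match(synthesize, diameters, reference, tolerance=1.0):
    """Unit-test style assertion that a derived diameter set is 'close enough'."""
    rendered = synthesize(diameters)
    assert spectral_distance(rendered, reference) < tolerance
#+end_src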

What makes a vector of expression musically meaningful (in the context of controlling an artificial voice)?

This is where validation studies come into play. "Musically meaningful" will need to be elaborated on; it changes depending on who the listener and performer are.

What is considered "tolerable" in the realm of artificially produced and vocal-like sound?

Validation studies here as well. To get to the bottom of this, a corpus of interactive and non-interactive stimuli will be produced from artificial voice sounds generated with the tools and implementations built.

A starting measurement here could be the level of uncanniness in a particular artificial voice sample. While several studies aim for fidelity, my interest is in finding sonic boundaries that are comfortably artificial.

Non-interactive stimuli like audio/video are very convenient: they can be used in surveys that are easy to distribute, and lots of data points can be gathered. But a lot is missing without interaction.

Interactive stimuli have more technical considerations. They are not as easy to distribute, but more meaningful data comes from them. I would be interested in utilizing the affordances of common computer peripherals like touchscreens, mice, and keyboards to explore parameter spaces and basic interaction mechanics for controlling the voice.

For both validation experiments, I'd like to be able to split participants between those with formal musical training, and those without.

Take 3

In my previous works, I often employ an iterative process for what I refer to as "demo-driven development": the creation of small, tightly scoped works designed to investigate a particular idea, concept, or technical challenge. Typically, this takes the form of some kind of compositional etude or interactive sound toy. An effective demo provides insight during the process of its creation, and is also a vehicle for a concept or idea that others can easily experience and give feedback on. An ideal outcome for a demo is that it generates enough momentum to build the next demo, which in turn gives momentum to the ones after that.

This particular kind of process is important for grounding research in things that are "musically meaningful". It is a great litmus test for ensuring that work is still being done within musical bounds, and not just as a high-level cerebral exercise. After all, if you can't make music with your musical research project, is it really one to begin with?

As a case study, consider my musical DSP library called Soundpipe. This was created after building EtherSurface, when I realized there was a lot of overhead in using the entirety of Csound to grab at only a handful of opcodes. Soundpipe was initially a distillation of some of my favorite Csound algorithms in a small and highly portable C library. In my initial attempts at composing with Soundpipe, I found a tremendous amount of creative friction using the C language. This led to the creation of Sporth, a stack-based language built on top of Soundpipe that could tersely build modular patches. After using Sporth to compose music, it was adapted to work as a realtime live-coding environment, which tightened the creative feedback loop between ideas. Performance became an issue when developing interactive instruments and musical puzzle games on Android, so the Sporth paradigm was split up into an engine (Patchwerk) and a language (Runt). As the complexity of the patches grew and more interactive control was desired, more sophisticated structured languages like Scheme were built on top of these environments, with tight Monome Arc and Grid integration, to build the live-coding computer music environment Monolith.

Validation studies are useful ways to quantify some of the metrics we are trying to find. Especially when it comes to investigating ideal interfaces.

Indicative Timeline

In this section you describe your work plan over the three years. Of course, this will be subject to change in collaboration with lab members, other projects, and input from your supervisors.

Dissertation, Implementation, Demonstration.

Dissertation is the scholarly text.

Implementation is the software from the dissertation.

Demonstrations are bite-sized examples and etudes built using the implementation that present ideas from the dissertation.

The first half of the research period will be spent on building out small demos. Demos and other tangible products are the best way of communicating these initial ideas. These demos will lead to a monolithic implementation as a core library (potentially a literate program). The ideas conveyed in both the demos and the implementation will naturally help drive the dissertation.

The main low-level technical problem involves using AI techniques to find matching parameter spaces given an input speech signal. An initial means to manage these spaces will also be built. On top of this initial work, more focus will be placed on musical gesture analysis from an input vector, as well as gesture synthesis: procedurally generating control signals. With these frameworks in place, the focus turns to ensembles in two main situations: one performer controlling a virtual ensemble, and many performers controlling a virtual ensemble. Various kinds of computer-augmented control will be used to raise or lower the skill level required to perform.
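As a small sketch of what "gesture synthesis" could look like in practice (all parameter values are illustrative; the contour would ultimately drive the glottal source of the model), here is a procedurally generated pitch gesture with portamento and vibrato:

#+begin_src python
# Sketch of procedural gesture synthesis: a pitch contour with portamento and
# vibrato that could drive the glottal source of a vocal tract model.
import numpy as np

SR = 44100

def pitch_gesture(start_hz, end_hz, dur, glide=0.15, vib_hz=5.5, vib_depth=0.3):
    """Return a per-sample fundamental-frequency contour (in Hz)."""
    n = int(dur * SR)
    t = np.arange(n) / SR
    # portamento: exponential glide from the start pitch to the end pitch
    f0 = start_hz + (end_hz - start_hz) * (1.0 - np.exp(-t / glide))
    # vibrato: periodic deviation of vib_depth semitones
    f0 *= 2.0 ** (vib_depth * np.sin(2.0 * np.pi * vib_hz * t) / 12.0)
    return f0

contour = pitch_gesture(220.0, 330.0, dur=1.0)
#+end_src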

Year 1: initial research, low-hanging fruit

The goal in the first year is to become acquainted with areas of research I'm less familiar with. This will be a time to investigate state-of-the-art AI, and to figure out where it is suitable in the context of interactive computer-generated sound and instruments.

First half

Dissertation is a very rough outline based on this proposal. Subject to change. Malleable.

Implementation: Outlining ideas for small proof of concepts (improvements/rewrite of Voc, etc).

Demonstration: Pre-existing work.

Investigate Low-Hanging Fruit: Using AI methods and overfitting to produce diameter parameter spaces from real speech samples (vowels). From there, exploring meaningful ways to shift between many states, and synthesize new states. Also include fricatives later.
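A sketch of what "shifting between states" can start as (the vowel shapes below are placeholders, not measured or learned data): simple linear interpolation in diameter space between derived vowel shapes.

#+begin_src python
# Sketch of state morphing: once diameter sets for individual vowels have been
# derived (by whatever AI method is eventually used), moving between them can
# begin as linear interpolation.
import numpy as np

N = 45
vowel_a = np.full(N, 1.5); vowel_a[28:38] = 2.4    # hypothetical /a/-ish shape
vowel_i = np.full(N, 1.5); vowel_i[20:30] = 0.6    # hypothetical /i/-ish shape

def morph(state_a, state_b, amount):
    """amount in [0, 1]: 0 -> state_a, 1 -> state_b."""
    return (1.0 - amount) * state_a + amount * state_b

# sweep a control value over time to glide between the two vowel shapes
for amount in np.linspace(0.0, 1.0, 5):
    diameters = morph(vowel_a, vowel_i, amount)
#+end_src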

Second Half

Dissertation: More focused idea, input from advisor colleagues.

Implementation: Some working concepts based on previous work.

Demo: Composition/Etude/Toy artfully using the implementation, whose existence further illuminates the big ideas surrounding the dissertation.

Year 2: rough draft, gesture

Gesture is an important part of vocal expressivity. This year will focus on gesture recognition/analysis and synthesis.

First Half

Dissertation: Very solid outline and structure. Research and investigations still ongoing.

Demo: Interactive Composition featuring synthesized gestures.

Second Half

Dissertation: Rough draft or a very solid outline written out. A cohesive idea that should feel like scholarly work.

Implementation: Major milestones met.

Demonstration: Something significant to show.

Year 3: iteration to final product, building a cohesive narrative

First Half:

Dissertation: Mostly written. Almost done with final draft.

Implementation: Core work nearly done. Nearing completion.

Demonstrations: Core Demos Done at this point.

Second half:

Dissertation: Fully written, final revisions and reading.

Implementation: Done. Finishing touches.

Demonstration: Possible work on bonus demos.

Bibliography

List of main sources in the application. The content of the bibliography is outside the page or word count.

There are plenty of online sources on "How to write a successful PhD proposal" and there are various books on the topic, such as "How to Write a Watertight Thesis". We recommend that you seek information from these helpful sources in developing your project proposal.