Research Proposal for AIM program
Initial Things
Links
Application: https://www.aim.qmul.ac.uk/apply/
PhD topics: https://www.aim.qmul.ac.uk/phd-topics/
Relevant Topics for Me (from the PhD list):
Modelling and Synthesizing Articulation On Acoustic and Digital Instruments
Machine Learning of Physical Models
Multimodal AI for musical collaboration in immersive environments
Performance Rendering For Music Generation Systems
My own areas and specialization
Singing Physical Modelling Synthesis
Audio DSP (for realtime systems)
Musical Interaction Design
Computer Music Composition
Gesture Synthesis
Areas To Investigate
Deep Learning: GANs, WaveNets, etc.
Articulatory Synthesis (more advanced)
Late Renaissance Counterpoint, Sacred Choral Music (Palestrina, etc.)
Working Title
Computer Augmented Techniques in Musical Articulatory Synthesis
Outline
What is Articulatory Synthesis?
Particular branch of speech synthesis.
Attempts to model speech physiologically: synthesize sound based on physical construction of the vocal apparatus.
The physically based model breaks down into two components: glottal airflow (the source) and the vocal tract (the filter), known as the source-filter model. A glottal source signal goes into the tract filter, and voice-like sounds come out the other side.
The human vocal tract is approximated as a series of cylindrical tubes of varying diameters using a 1-D digital waveguide (see the sketch at the end of this section).
Different tract shapes create different formants and vowels. Account for turbulent airflow, and you get fricatives and sibilance. Coordinate everything together, and you get speech.
Articulatory synthesis is a white-box model: signals go in, and vowel formants emerge implicitly from the tract geometry. Formant synthesis (also source-filter) is a black-box model, using resonators tuned to pre-derived formant frequencies.
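To make the tube model concrete, here is a minimal, unoptimized sketch of a Kelly-Lochbaum-style 1-D waveguide tract in Python: each tube section is one sample of delay in each direction, adjacent sections scatter according to their area ratio, and a source signal injected at the glottal end comes out the lip end shaped by the tract's resonances. This is a toy (lossless, fixed areas, no fricative noise sources), not a proposal-grade implementation; the function and parameter names are my own.

    import numpy as np

    def kelly_lochbaum(source, areas, glottal_reflection=0.75, lip_reflection=-0.85):
        # Toy Kelly-Lochbaum ladder: one sample of delay per tube section,
        # with scattering at each junction between adjacent sections.
        areas = np.asarray(areas, dtype=float)
        n = len(areas)
        # pressure reflection coefficient at the junction between section i and i+1
        k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])
        fwd = np.zeros(n)              # right-going wave (toward the lips)
        bwd = np.zeros(n)              # left-going wave (toward the glottis)
        out = np.zeros(len(source))
        for t, s in enumerate(source):
            new_fwd = np.zeros(n)
            new_bwd = np.zeros(n)
            new_fwd[0] = s + glottal_reflection * bwd[0]   # inject source at the glottis
            for i in range(n - 1):                         # one-multiply scattering junctions
                w = k[i] * (fwd[i] - bwd[i + 1])
                new_fwd[i + 1] = fwd[i] + w
                new_bwd[i] = bwd[i + 1] + w
            new_bwd[n - 1] = lip_reflection * fwd[n - 1]   # partial reflection at the open lip end
            out[t] = (1 + lip_reflection) * fwd[n - 1]     # the rest radiates as output
            fwd, bwd = new_fwd, new_bwd
        return out

Driving it with an impulse train or a glottal pulse and changing the entries of areas moves the formants around; a uniform tract gives neutral, schwa-like resonances.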
The Musical Context
Singing!
Singing synthesis has different priorities than speech synthesis: musical phrasing vs. prosody, what makes a performance "realistic", how much weight to put on intelligibility, etc.
Ensemble
I am interested in exploring ensembles: 3-6 instances of the model, and how they can realistically sing together.
Why Articulatory Synthesis?
Why this over other speech synthesis techniques? It's low fidelity compared to WaveNet, and computationally expensive compared to techniques like concatenative and formant synthesis.
So what is there?
Expressivity. These systems offer a great deal of parametric control over the sound output, including granular low-level control, which makes them much better suited to musical performance and to people who think about shaping and sculpting sound. No other technique is dynamically flexible the way the articulatory vocal tract model is.
When articulatory synthesis was popular, computers were not fast. They are now fast enough to run high-res instances in realtime.
It's a far more dynamic system than the existing singing models out there. With the right training system, it's far less tedious to create new voices or extend existing ones.
Computer Augmentation
Take these decades-old synthesis techniques and apply new AI methods to them.
Areas Of Study
"singing synthesis" in the context of articulatory synthesis has many different entry points where new AI can be introduced to solve problems.
I have organized the major problems by scale: timbre, gesture, and ensemble.
timbre
timbre: using AI to facilitate parametric sound control of the model. Two classic areas of research: finding tract shapes, and synthesizing glottal flow signals (a toy glottal source sketch follows this subsection).
Perhaps a GAN can be used as a sort of correction filter at the end (style transfer, etc)?
Articulatory Inversion: an area of research that analyzes a voice and returns the corresponding tract shapes.
Glottal Excitation Extraction: given a real voice, extract the underlying glottal source signal.
These are common topics in the speech world, but not in the context of singing.
For music: exploring vocal techniques used in music that fall outside the realm of speech. How can we develop models that sound good as a 4-part SATB ensemble?
"Convincing" as a musical performance but not necessarily realistic.
How do we develop and dynamically generate perceptually distinct voices? Or reduce the distinctness until the ensemble sounds like overdubs of one singer?
This level would be most focused on physical modelling.
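As a concrete reference point for the glottal-source side of the timbre problem, here is a toy Rosenberg-style glottal flow pulse in Python: the kind of crude parametric signal an AI system would be asked to improve on, or whose parameters it would learn to drive. The function name, parameter names, and defaults are my own; a serious implementation would use something like the LF model with learned parameter trajectories.

    import numpy as np

    def rosenberg_pulse(f0, sr, duration, open_quotient=0.6, speed_quotient=2.0):
        # Toy Rosenberg-style glottal flow: a raised-cosine opening segment,
        # a cosine closing segment, then a closed phase of zero flow.
        # open_quotient: fraction of each period the glottis is open.
        # speed_quotient: ratio of opening time to closing time.
        n = int(sr * duration)
        out = np.zeros(n)
        period = sr / f0                                        # samples per glottal cycle
        t_open = open_quotient * period
        t_p = t_open * speed_quotient / (1.0 + speed_quotient)  # opening segment length
        t_n = t_open - t_p                                      # closing segment length
        for i in range(n):
            phase = i % period
            if phase < t_p:                                     # glottis opening
                out[i] = 0.5 * (1.0 - np.cos(np.pi * phase / t_p))
            elif phase < t_open:                                # glottis closing
                out[i] = np.cos(0.5 * np.pi * (phase - t_p) / t_n)
            # else: closed phase, flow stays zero
        return out

    # e.g. feed it through the tract sketch from the earlier section:
    # voice = kelly_lochbaum(rosenberg_pulse(110.0, 44100, 2.0), np.ones(20))

Articulatory inversion and glottal excitation extraction run in the other direction: start from recorded singing and recover tract areas and source parameters like these.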
gesture
gesture: the means of articulating and coordinating the model so that it is musically coherent. Moving between vowel/fricative states and pitches in a musically appropriate "singing" way.
Developing "interpretation", "attitude", and "personality" in a musical setting. Lyricism in computer-sequenced works.
Area of ongoing personal research: gesture synthesis. Building systems that melt sequencing and automation curves together. My current research explores a novel set of systems and DSP algorithms that produce audio-rate line signals with timing relative to an external control signal (a periodic ramp). The lines and curves generated in this system can also dynamically adapt to tempo fluctuations in the external signal in realtime, without any preconceived tempo information (a toy sketch of this idea follows this subsection).
This level would be most focused on building smart articulation and performance systems to control the underlying singing model.
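For illustration only, here is a toy Python version of the ramp-clocked line idea, not the actual system from my research: a breakpoint envelope specified in beats is rendered at audio rate by accumulating the increments of an external periodic ramp (one 0-to-1 cycle per beat), so the envelope's timing stretches and shrinks with whatever tempo the ramp implies, with no tempo value ever supplied.

    import numpy as np

    def gesture_line(ext_ramp, breakpoints):
        # Toy ramp-clocked breakpoint line. ext_ramp: audio-rate phasor, one
        # 0->1 cycle per beat. breakpoints: (beat_position, value) pairs with
        # ascending beat positions. Tempo is never given; elapsed beats are
        # recovered by accumulating the ramp's per-sample increments.
        beats = np.array([b for b, _ in breakpoints], dtype=float)
        values = np.array([v for _, v in breakpoints], dtype=float)
        out = np.zeros(len(ext_ramp))
        musical_time = 0.0                      # elapsed time in beats
        prev_phase = ext_ramp[0]
        for i, phase in enumerate(ext_ramp):
            delta = phase - prev_phase
            if delta < -0.5:                    # ramp wrapped: crossed a beat boundary
                delta += 1.0
            musical_time += max(delta, 0.0)
            prev_phase = phase
            # piecewise-linear interpolation of the envelope in beat time
            out[i] = np.interp(musical_time, beats, values)
        return out

    # e.g. a two-beat pitch glide from A3 to E4 that follows the clock's tempo:
    # ramp = (np.arange(88200) * 2.0 / 44100) % 1.0        # 2 beats/sec phasor at 44.1 kHz
    # glide = gesture_line(ramp, [(0.0, 220.0), (2.0, 329.63), (4.0, 329.63)])

If the phasor's rate drifts, the glide's duration in seconds drifts with it, which is the adaptive-timing behavior described above.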
ensemble
ensemble: coordinating many voices together in a highly polyphonic setting, with dynamic adaptation. Things like phrasing with context-based awareness (tempo, micro-timing, timbre, etc.) and procedural generation (a toy sketch follows this subsection).
Handling one-to-many relationships between human performers and synthetic voices.
Creating networked ensembles of many humans performing singing instruments together in VR.
A horde of robots singing Palestrina.
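A toy sketch of the ensemble idea in Python, assuming the voice-model sketches above as stand-ins: each "singer" gets its own entry offset, detune, and tract length so the voices stay perceptually distinct rather than collapsing into identical overdubs. A real system would vary far more (tract shapes, vibrato, phrasing) and coordinate the voices with context awareness rather than random offsets.

    import numpy as np

    def render_ensemble(f0, sr, duration, n_voices=4, seed=0):
        # Toy unison stack: the same pitch sung by several independent voices,
        # each slightly detuned, entering slightly late, and with a different
        # tract length, then mixed. Reuses rosenberg_pulse() and
        # kelly_lochbaum() from the earlier sketches as a stand-in voice model.
        rng = np.random.default_rng(seed)
        n = int(sr * duration)
        mix = np.zeros(n)
        for _ in range(n_voices):
            detune = 2.0 ** (rng.uniform(-10, 10) / 1200.0)   # +/- 10 cents
            entry = int(rng.uniform(0.0, 0.03) * sr)          # up to 30 ms late entry
            n_sections = int(rng.integers(16, 23))            # vary tract length a little
            source = rosenberg_pulse(f0 * detune, sr, duration)
            voice = kelly_lochbaum(source, np.ones(n_sections))
            mix[entry:] += voice[:n - entry]
        return mix / n_voices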
What this isn't
This isn't speech. Words and intelligibility aren't the point. The focus is on approaching the voice as a musical instrument. It's not about the lyrics.
Fidelity. The focus is not on so-called "deepfakes" or voice cloning technology. Analysis techniques utilizing real voice samples, and even building models that approximate distinct singers, are not out of scope, but success is measured in musical expressivity: can we build singing digital instruments and interfaces from the ground up that are intuitive and compelling to play?
This is not NLP or TTS. There is no text-to-speech, and nothing about formal language models. The focus will be on the sound itself, not what it is saying. Text-to-speech is NOT a suitable approach for musical control.
Personal Statement
I teach computers how to sing?
Previous Academic Background
Computer Music At Berklee
More DSP and programming at Stanford
I build Musical Software Ecosystems: custom DSP engines, DSLs, live coding environments, novel interfaces and digital instruments.
Design of quirky and unique interfaces that encourage playfulness and exploration.
Why the Voice? It's the only time computer music has a sense of humor.
Why this Research?
As a composer, I have a deep fascination for getting computers to sing, and for exploring the implications of what exactly that means. I would be very grateful for the opportunity to pursue this through a sustained period of intensive study.
Why QMUL?
I know of QMUL from my networks at CCRMA and Berklee.
A highly regarded institution for music tech at the graduate level, with lots of talented people and creative projects. The crowd is diverse, not just STEM people, and this is very important to me as an artist-musician-researcher.
Working Draft
See QMUL/prop.txt.