Research Proposal for AIM program
Initial Things
Links
Application: https://www.aim.qmul.ac.uk/apply/
PhD topics: https://www.aim.qmul.ac.uk/phd-topics/
Relevant Topics for Me (from the PhD list):
Modelling and Synthesizing Articulation On Acoustic and Digital Instruments
Machine Learning of Physical Models
Multimodal AI for musical collaboration in immersive environments
Performance Rendering For Music Generation Systems
My own areas and specialization
Singing Physical Modelling Synthesis
Audio DSP (for realtime systems)
Musical Interaction Design
Computer Music Composition
Gesture Synthesis
Areas To Investigate
Deep Learning: GANs, WaveNets, etc.
Articulatory Synthesis (more advanced)
Late Renaissance Counterpoint, Sacred Choral Music (Palestrina, etc.)
Working Title
Computer Augmented Techniques in Musical Articulatory Synthesis
Outline
What is Articulatory Synthesis?
Particular branch of speech synthesis.
Attempts to model speech physiologically: synthesize sound based on physical construction of the vocal apparatus.
The physically based model breaks down into two components: glottal airflow (the source) and the vocal tract (the filter), known as the source-filter model. A glottal source signal goes into the tract filter, and voice-like sounds come out the other side.
The human vocal tract is approximated as a series of cylindrical tubes of varying diameters using a 1-D digital waveguide (see the sketch at the end of this section).
Different tract shapes create different formants and vowels. Account for turbulent airflow, and you get fricatives and sibilance. Coordinate everything together, and you get speech.
Articulatory synthesis is a white-box model: signals go in, and vowel formants emerge implicitly from the tract geometry. Formant synthesis (also source-filter) is a black-box model, using resonators tuned to pre-derived formant frequencies.
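To make the tube model concrete, here is a minimal, unoptimized sketch of a Kelly-Lochbaum-style 1-D waveguide tract in Python: each tube section is one sample of delay in each direction, adjacent sections scatter according to their area ratio, and a source signal injected at the glottal end comes out the lip end shaped by the tract's resonances. This is a toy (lossless, fixed areas, no fricative noise sources), not a proposal-grade implementation; the function and parameter names are my own.

    import numpy as np

    def kelly_lochbaum(source, areas, glottal_reflection=0.75, lip_reflection=-0.85):
        # Toy Kelly-Lochbaum ladder: one sample of delay per tube section,
        # with scattering at each junction between adjacent sections.
        areas = np.asarray(areas, dtype=float)
        n = len(areas)
        # pressure reflection coefficient at the junction between section i and i+1
        k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])
        fwd = np.zeros(n)              # right-going wave (toward the lips)
        bwd = np.zeros(n)              # left-going wave (toward the glottis)
        out = np.zeros(len(source))
        for t, s in enumerate(source):
            new_fwd = np.zeros(n)
            new_bwd = np.zeros(n)
            new_fwd[0] = s + glottal_reflection * bwd[0]   # inject source at the glottis
            for i in range(n - 1):                         # one-multiply scattering junctions
                w = k[i] * (fwd[i] - bwd[i + 1])
                new_fwd[i + 1] = fwd[i] + w
                new_bwd[i] = bwd[i + 1] + w
            new_bwd[n - 1] = lip_reflection * fwd[n - 1]   # partial reflection at the open lip end
            out[t] = (1 + lip_reflection) * fwd[n - 1]     # the rest radiates as output
            fwd, bwd = new_fwd, new_bwd
        return out

Driving it with an impulse train or a glottal pulse and changing the entries of areas moves the formants around; a uniform tract gives neutral, schwa-like resonances.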
The Musical Context
Singing!
Singing synthesis has different priorities than speech synthesis: musical phrasing vs. prosody, what makes a performance "realistic", how much weight to put on intelligibility, etc.
Ensemble
I am interested in exploring ensembles: 3-6 instances of the model, and how they can realistically sing together.
Why Articulatory Synthesis?
Why this over other speech synthesis techniques? It's low fidelity compared to WaveNet, and computationally expensive compared to techniques like concatenative and formant synthesis.
So what is there?
Expressivity. These systems offer a great deal of parametric control over the sound output, including granular low-level control, which makes them much better suited to musical performance and to people who think about shaping and sculpting sound. No other technique is dynamically flexible the way the articulatory vocal tract model is.
When articulatory synthesis was popular, computers were not fast. They are now fast enough to run high-res instances in realtime.
It's a far more dynamic system than the existing singing models out there. With the right training system, it's far less tedious to create new voices or extend existing ones.
Computer Augmentation
Take these decades-old synthesis techniques and apply new AI methods to them.
Areas Of Study
"singing synthesis" in the context of articulatory synthesis has many different entry points where new AI can be introduced to solve problems.
I have organized the major problems by scale: timbre, gesture, and ensemble.
timbre
timbre: using AI to facilitate parametric sound control of the model. Two classic areas of research: finding tract shapes, and synthesizing glottal flow signals (a toy glottal source sketch follows this subsection).
Perhaps a GAN can be used as a sort of correction filter at the end (style transfer, etc)?
Articulatory Inversion: an area of research that analyzes a voice and returns the corresponding tract shapes.
Glottal Excitation Extraction: given a real voice, extract the underlying glottal source signal.
These are common topics in the speech world, but not in the context of singing.
For music: exploring vocal techniques used in music that fall outside the realm of speech. How can we develop models that sound good as a 4-part SATB ensemble?
"Convincing" as a musical performance but not necessarily realistic.
How do we develop and dynamically generate perceptually distinct voices? Or reduce the distinctness until the ensemble sounds like overdubs of one singer?
This level would be most focused on physical modelling.
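As a concrete reference point for the glottal-source side of the timbre problem, here is a toy Rosenberg-style glottal flow pulse in Python: the kind of crude parametric signal an AI system would be asked to improve on, or whose parameters it would learn to drive. The function name, parameter names, and defaults are my own; a serious implementation would use something like the LF model with learned parameter trajectories.

    import numpy as np

    def rosenberg_pulse(f0, sr, duration, open_quotient=0.6, speed_quotient=2.0):
        # Toy Rosenberg-style glottal flow: a raised-cosine opening segment,
        # a cosine closing segment, then a closed phase of zero flow.
        # open_quotient: fraction of each period the glottis is open.
        # speed_quotient: ratio of opening time to closing time.
        n = int(sr * duration)
        out = np.zeros(n)
        period = sr / f0                                        # samples per glottal cycle
        t_open = open_quotient * period
        t_p = t_open * speed_quotient / (1.0 + speed_quotient)  # opening segment length
        t_n = t_open - t_p                                      # closing segment length
        for i in range(n):
            phase = i % period
            if phase < t_p:                                     # glottis opening
                out[i] = 0.5 * (1.0 - np.cos(np.pi * phase / t_p))
            elif phase < t_open:                                # glottis closing
                out[i] = np.cos(0.5 * np.pi * (phase - t_p) / t_n)
            # else: closed phase, flow stays zero
        return out

    # e.g. feed it through the tract sketch from the earlier section:
    # voice = kelly_lochbaum(rosenberg_pulse(110.0, 44100, 2.0), np.ones(20))

Articulatory inversion and glottal excitation extraction run in the other direction: start from recorded singing and recover tract areas and source parameters like these.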
gesture
gesture: the means of articulating and coordinating the model so that it is musically coherent. Moving between vowel/fricative states and pitches in a musically appropriate "singing" way.
Developing "interpretation", "attitude", and "personality" in a musical setting. Lyricism in computer-sequenced works.
Area of ongoing personal research: gesture synthesis. Building systems that melt sequencing and automation curves together. My current research explores a novel set of systems and DSP algorithms that produce audio-rate line signals with timing relative to an external control signal (a periodic ramp). The lines and curves generated in this system can also dynamically adapt to tempo fluctuations in the external signal in realtime, without any preconceived tempo information (a toy sketch of this idea follows this subsection).
This level would be most focused on building smart articulation and performance systems to control the underlying singing model.
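For illustration only, here is a toy Python version of the ramp-clocked line idea, not the actual system from my research: a breakpoint envelope specified in beats is rendered at audio rate by accumulating the increments of an external periodic ramp (one 0-to-1 cycle per beat), so the envelope's timing stretches and shrinks with whatever tempo the ramp implies, with no tempo value ever supplied.

    import numpy as np

    def gesture_line(ext_ramp, breakpoints):
        # Toy ramp-clocked breakpoint line. ext_ramp: audio-rate phasor, one
        # 0->1 cycle per beat. breakpoints: (beat_position, value) pairs with
        # ascending beat positions. Tempo is never given; elapsed beats are
        # recovered by accumulating the ramp's per-sample increments.
        beats = np.array([b for b, _ in breakpoints], dtype=float)
        values = np.array([v for _, v in breakpoints], dtype=float)
        out = np.zeros(len(ext_ramp))
        musical_time = 0.0                      # elapsed time in beats
        prev_phase = ext_ramp[0]
        for i, phase in enumerate(ext_ramp):
            delta = phase - prev_phase
            if delta < -0.5:                    # ramp wrapped: crossed a beat boundary
                delta += 1.0
            musical_time += max(delta, 0.0)
            prev_phase = phase
            # piecewise-linear interpolation of the envelope in beat time
            out[i] = np.interp(musical_time, beats, values)
        return out

    # e.g. a two-beat pitch glide from A3 to E4 that follows the clock's tempo:
    # ramp = (np.arange(88200) * 2.0 / 44100) % 1.0        # 2 beats/sec phasor at 44.1 kHz
    # glide = gesture_line(ramp, [(0.0, 220.0), (2.0, 329.63), (4.0, 329.63)])

If the phasor's rate drifts, the glide's duration in seconds drifts with it, which is the adaptive-timing behavior described above.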
ensemble
ensemble: coordinating many voices together in a highly polyphonic setting, with dynamic adaptation. Things like phrasing with context-based awareness (tempo, micro-timing, timbre, etc.) and procedural generation (a toy sketch follows this subsection).
Handling one-to-many relationships between human performers and synthetic voices.
Creating networked ensembles of many humans performing singing instruments together in VR.
A horde of robots singing Palestrina.
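A toy sketch of the ensemble idea in Python, assuming the voice-model sketches above as stand-ins: each "singer" gets its own entry offset, detune, and tract length so the voices stay perceptually distinct rather than collapsing into identical overdubs. A real system would vary far more (tract shapes, vibrato, phrasing) and coordinate the voices with context awareness rather than random offsets.

    import numpy as np

    def render_ensemble(f0, sr, duration, n_voices=4, seed=0):
        # Toy unison stack: the same pitch sung by several independent voices,
        # each slightly detuned, entering slightly late, and with a different
        # tract length, then mixed. Reuses rosenberg_pulse() and
        # kelly_lochbaum() from the earlier sketches as a stand-in voice model.
        rng = np.random.default_rng(seed)
        n = int(sr * duration)
        mix = np.zeros(n)
        for _ in range(n_voices):
            detune = 2.0 ** (rng.uniform(-10, 10) / 1200.0)   # +/- 10 cents
            entry = int(rng.uniform(0.0, 0.03) * sr)          # up to 30 ms late entry
            n_sections = int(rng.integers(16, 23))            # vary tract length a little
            source = rosenberg_pulse(f0 * detune, sr, duration)
            voice = kelly_lochbaum(source, np.ones(n_sections))
            mix[entry:] += voice[:n - entry]
        return mix / n_voices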
What this isn't
This isn't speech. Words and intelligibility aren't the point. The focus is on approaching the voice as a musical instrument. It's not about the lyrics.
Fidelity. The focus is not on so-called "deepfakes" or voice cloning technology. Analysis techniques utilizing real voice samples, and even building models that approximate distinct singers, are not out of scope, but success is measured in musical expressivity: can we build singing digital instruments and interfaces from the ground up that are intuitive and compelling to play?
This is not NLP or TTS. There is no text-to-speech, and nothing about formal language models. The focus will be on the sound itself, not what it is saying. Text-to-speech is NOT a suitable approach for musical control.
Personal Statement
I teach computers how to sing?
Previous Academic Background
Computer Music At Berklee
More DSP and programming at Stanford
I build Musical Software Ecosystems: custom DSP engines, DSLs, live coding environments, novel interfaces and digital instruments.
Design of quirky and unique interfaces that encourage playfulness and exploration.
Why the Voice? It's the only time computer music has a sense of humor.
Why this Research?
As a composer, I have a deep fascination for getting computers to sing, and for exploring the implications of what exactly that means. I would be very grateful for the opportunity to pursue this through a sustained period of intensive study.
Why QMUL?
I know of QMUL from my networks at CCRMA and Berklee.
A highly regarded institution for music tech at the graduate level, with lots of talented people and creative projects. The crowd is diverse, not just STEM people, and this is very important to me as an artist-musician-researcher.
Working Draft
See QMUL/prop.txt.