During the previous weeks, I've been publishing several posts about my project development, some of the literature review I've done, app examples, etc. In some cases, I had to use concepts that were too complex to explain within a single blog post, so I summarised them a lot while saying "I'll write another post about that". Well, that post has arrived! I hope that, even without those concepts explained in depth, you could follow the idea of each post. However, if you felt you needed to know a bit more about the ambisonics world, today's post is for you.
Obviously, it is impossible to explain all the background theory of ambisonics here but, at least, I'll try to describe the concepts that I think are most important for understanding this particular project. Are you ready to go?
Ambisonics is a method for recording, mixing and playing back spatial audio that was invented in the 70s. It has recently become more widely used with the popularisation of VR content, since it is flexible and format-agnostic. "Too much information in a couple of sentences!" Ok, no worries... to understand it better, it's worth starting from the basic spatial audio configuration: stereo.
Stereo is the most basic spatial audio configuration, since it is a 2D representation of the front image. If we wanted a complete 2D representation, we would have to use a quadraphonic, 5.1 or 7.1 format, which also covers the back and the sides. All these configurations use the same number of channels as speakers to play them back: stereo is 2 channels and uses 2 speakers, 5.1 is 6 channels and uses 6 speakers, and so on. Also, each channel is reproduced by its own speaker. Ambisonics, however, doesn't work like that. Ambisonics uses spherical harmonics (a mathematical concept that I'm not going to explain now, feel free to Google it to know more about it) to store the spatial information of a sound field. The basic ambisonics configuration uses 4 channels, usually called W, X, Y and Z:
Within these four channels, we have all the information necessary to completely recreate a three-dimensional sound field. However, as I said before, having 4 channels doesn't mean that each channel is sent to its own speaker. We would need at least 4 speakers to play it back, but each speaker will reproduce a combination of the four channels. That's why we need a decoder to generate the signal that each speaker has to play back.
So, I've explained how ambisonics works once the sound field is already in ambisonics format but, what about placing a mono source in the ambisonics domain? In that case, we need an encoder, which uses mathematical formulas to add the necessary information about our mono source to each of the four channels. The amount of signal added to each channel always depends on the position of our mono source.
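To give you an idea of what the encoder actually does, this is what first-order encoding of a mono signal S placed at azimuth θ and elevation φ commonly looks like (take it as a sketch: the exact scaling of W depends on the normalisation convention, e.g. FuMa vs. AmbiX, so not every tool uses exactly these formulas):

$$W = \frac{S}{\sqrt{2}} \qquad X = S\cos\theta\cos\phi \qquad Y = S\sin\theta\cos\phi \qquad Z = S\sin\phi$$

As you can see, the further the source moves to one side or above the head, the more of the signal ends up in Y or Z, which is exactly the "amount of information related to the position" that I mentioned above.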
"In a previous post you mention ambisonic orders... what does it mean?" The four-channel example that I just explained is called 1st order ambisonics, and it is the minimum order needed to obtain a 3D representation of the sound field. These four channels are the first spherical harmonics but, as we can see in the image above, there are much more spherical harmonics that we can add. To increase the ambisonic order, a further layer of the pyramidal structure must be added each time. Therefore, 2OA has 8ch, 3OA 16ch, etc. "But... why?" As you can see in the image, the higher the order, the more channels of information there are. Each channel contributes more information about the sound field, meaning that the encoding/decoding process will be more precise.
Do you need a recap? This video from Waves plug-ins explains 1OA in a very easy way:
Virtual loudspeaker technique
Ambisonics can be reproduced using a speaker array, where the speakers must be distributed as evenly as possible, perfectly calibrated, placed in an acoustically controlled room, etc. As you might imagine, this is a really difficult setup to have at home, and only some studios or research centres have it. The other way to listen to an ambisonic sound field is the virtual loudspeaker technique, where a speaker array is simulated so that it can be reproduced over headphones.
"Wait... did you say: virtual speakers through headphones?" Indeed, let's imagine that we have a real speaker array like the one behind me in the picture. Here we can see that there are some speakers almost evenly distributed in a sphere. Using this array, we could reproduce an ambisonics sound field, right? Ok, do you remember last week's post? When I was talking about the Dummy Head? In this picture, I'm with my friend KEMAR, another kind of Dummy Head that also includes the torso. As I said last week, if we reproduce a sound in a certain point of the space, we record it through the dummy head ears, and then we play it back through a pair of headphones, we perceive that the sound is coming from the original position. In addition, we can extract the filter of that position and create a Head Related Transfer Function (HRTF), which we can apply it later to any non-processed sound and place it at that point. The result is a 2ch file that must be listened to using headphones to perceive it correctly, and this is called binaural.
"And then... what relates the Dummy Head, the HRTFs and Ambisonics?" Let's have a look at the following diagram. To reproduce an Ambisonics sound field using a speaker array, we would need: the Ambisonics sound field, a decoder (to create the different speaker signals) and the speaker array to reproduce them. To reproduce an Ambisonics sound field through headphones, on the other hand, the first stages will be the same. We need to decode the Ambisonics sound field to speaker signals but, instead of directly reproduce these signals, we'll place them to the virtual position (where that speaker should be placed in the real world) using the corresponding HRTF. Finally, we'll add all the resulting binaural signals, and we'll get a 2ch file (to be listened to using headphones). While listening to this file, the perception of the sound field will be the same as we were in the middle of the speaker array: a virtual loudspeaker array.
I assume that, if you are entirely new to the topic, the information given in this post might not be enough. Ambisonics is a huge topic, and HRTFs, which I have just mentioned but not explained in depth, are another huge one. Therefore, here are some links that might help you better understand spatial audio and ambisonics:
Apple's ARKit is a development platform that allows developers to create highly detailed Augmented Reality experiences for iPhone and iPad. It was introduced with the release of iOS 11, and it can be used on iPhone 6s and newer, or on iPad Pro models released in 2017 or later.
ARKit uses the iPhone's camera and motion sensors, which are a combination of several sensors like the gyroscope, the accelerometer or the magnetometer. Then, by comparing the images received from the camera with the movements of the phone, it can map your environment, recognise where the walls and floor are, and establish a basic geometry of the space. With this, you can place virtual objects in your room and see them through the camera as if they were real. Do you remember Pokémon Go, where virtual Pokémon appear on your phone's camera, interacting with your environment? That was AR, and it was possible to develop thanks to ARKit (and Google's equivalent for Android, of course...).
In my first tests, I was trying to familiarise myself with the ARKit functions and get some quick results and ideas of what I could do for the project. Specifically, I was testing the world tracking feature, which identifies features of the environment and tracks them between frames to obtain position data. Thanks to this, you can know where your phone is at every moment using a function that gives you its X, Y and Z coordinates (in metres) relative to a starting point.
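To give you an idea of how little code this part needs, here is a minimal sketch (not my exact app code, and with the setup simplified) that reads the camera position from every ARKit frame:

```swift
import UIKit
import ARKit

class TrackingViewController: UIViewController, ARSessionDelegate {
    let session = ARSession()

    override func viewDidLoad() {
        super.viewDidLoad()
        session.delegate = self
        // World tracking gives us the device pose relative to the starting point
        session.run(ARWorldTrackingConfiguration())
    }

    // Called for every new frame (roughly 60 times per second)
    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        // The translation column of the camera transform is the phone's
        // position, in metres, relative to where the session started
        let t = frame.camera.transform.columns.3
        let position = SIMD3<Float>(t.x, t.y, t.z)
        print("Phone position:", position)
    }
}
```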
Just with a simple command to get this information, an OSC framework, and a bit of imagination (that's always necessary...), I created a Virtual Dummy Head app:
virtual dummy head
In order to test the AR capabilities that I explained before, I developed a test app to create a virtual dummy head. "Hold on... what do you mean by a dummy head?" If you aren't familiar with audio terminology, it might sound weird, that's true. A dummy head (image) is a real-size model of a human head with a really accurate reproduction of the pinnae (the external part of the ears). There is a microphone inside each ear and, thus, the resulting recording allows the listener to perceive the sound as the mannequin would perceive it. This includes localisation on the vertical and horizontal planes, as well as distance perception.
What I wanted to do with this test app was to place a virtual dummy head at some point in the room, and then be able to move around while getting the phone's position relative to it. To do it, I downloaded a mannequin head model from SketchFab, and I used some basic trigonometry to get the azimuth and elevation angles. As you can see in the following video, the result is quite precise, even with fast movements.
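The trigonometry itself is nothing fancy. Here is a small sketch of the idea (not my exact code, the names are just for illustration, and the sign conventions depend on how you orient the virtual head in ARKit's coordinate system, where x points right, y up and -z forward):

```swift
import Foundation
import simd

/// Position of the phone relative to the virtual dummy head, in spherical coordinates.
struct RelativePosition {
    let azimuth: Float    // degrees around the head in the horizontal plane
    let elevation: Float  // degrees above (+) or below (-) the horizontal plane
    let distance: Float   // metres
}

/// Convert two ARKit world positions into azimuth, elevation and distance.
func relativePosition(phone: SIMD3<Float>, head: SIMD3<Float>) -> RelativePosition {
    let d = phone - head
    let distance = simd_length(d)
    // Horizontal angle: atan2 handles all four quadrants around the head
    let azimuth = atan2f(-d.x, -d.z) * 180 / .pi
    // Vertical angle measured from the horizontal plane
    let horizontal = simd_length(SIMD2(d.x, d.z))
    let elevation = atan2f(d.y, horizontal) * 180 / .pi
    return RelativePosition(azimuth: azimuth, elevation: elevation, distance: distance)
}
```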
This Virtual Dummy Head looks cool, doesn't it? But how can we use these values to manipulate sound? And... how can we simulate the microphones that would be inside a real Dummy Head? As I said in the last post, a Max MSP package could be really useful for tests like this. Specifically, I used the Ambisonics Externals for Max MSP created by the Zurich University of the Arts.
The iPhone app sends Open Sound Control (OSC) messages over WiFi ten times per second. These messages contain the azimuth, elevation and distance of the phone at that precise moment. A Max MSP patch receives these messages and passes the values to an Ambisonics encoder object, which places a sound source at that position. Finally, an ambisonics decoder renders the sound field to 50 virtual loudspeakers that are almost evenly distributed. "But you said it would be a simulation of a real Dummy Head and, therefore, it would be reproduced through headphones, right?" Yes, continue reading...
After the decoding process, the 50 virtual speaker feeds are convolved with the Head Related Transfer Functions (HRTFs) of a real KU100, measured during the SADIE II project. I would need another post to properly explain what an HRTF is and how the virtual loudspeaker technique works. For now, just imagine that a sound was reproduced from different angles and captured with a real KU100; then, knowing how that sound is perceived by the Dummy Head at a particular angle, we can extract a filter and apply it to another sound to simulate that this new sound is placed at the position of the first one. (No worries if you haven't understood everything in these last paragraphs, I'll publish another post soon explaining it step by step.)
To summarise: the iPhone app sends OSC messages to a Max patch, where an ambisonics encoder encodes a sound source using the received position values. Finally, those ambisonic feeds are decoded and convolved with the HRTFs of a real KU100. Therefore, the resulting sound is virtually the same as if a KU100 were there!
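By the way, the sending side of the chain I've just summarised is little more than a timer. The sketch below is only an illustration: the sendOSC helper and the /source/aed address are hypothetical placeholders (in the real app the OSC framework builds and sends the packet), but it shows the rate and the three values involved:

```swift
import Foundation

// Hypothetical helper: in practice an OSC library (or a hand-built UDP packet)
// encodes the address and arguments and sends them over WiFi to the Max patch.
func sendOSC(address: String, arguments: [Float]) {
    // ... encode and send an OSC packet ...
}

// Latest values coming from the AR tracking
var azimuth: Float = 0      // degrees
var elevation: Float = 0    // degrees
var distance: Float = 1     // metres

// Fire ten times per second with the most recent position of the phone
let timer = Timer.scheduledTimer(withTimeInterval: 0.1, repeats: true) { _ in
    sendOSC(address: "/source/aed", arguments: [azimuth, elevation, distance])
}
```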
Ready for the demo? Put your headphones on and have a look at the following video then!
Since the last post, I've been immersed in the first stage of the project: research and literature review. Controlling spatial audio is not a new topic and, therefore, I wanted to analyse the methods that are already on the market. During these days, I've been reviewing several projects and products, from software to hardware, and even some that use mid-air gesture control or... smartphones!
Let's have a look at what I found:
Controlling spatial audio with software is probably the most widespread method at the moment. There are lots of open-source plug-ins, as well as paid packages, which offer a relatively easy way to create and control spatial audio. As my project aims to create a DAW-based tool, I've been mainly focusing on plug-ins. However, Max MSP packages will also be useful for testing purposes.
I've found almost 20 different products: some of them open source (usually created by universities) and others under a commercial licence. All of them include a similar set of effects. Here I've selected some of them:
As might be expected, the commercial plug-ins have better user interfaces, promotional videos, customer support, etc. Nevertheless, the open-source ones offer the same audio quality (or even better in some cases). For example, ambiX and SPARTA, which are both open source, offer ambisonics encoding/decoding up to 7th order, whereas Waves 360, which is commercial, only offers 1st order (more about orders in a future post, but let's assume that the higher the order, the higher the spatial audio quality). For an inexperienced user looking for something easy to install, easy to use and easy to learn, Waves is probably the best choice but, for an experienced user, ambiX or SPARTA will offer more options and flexibility.
In most cases, hardware options simply control the parameters of a software-based tool, so the spatial audio processing itself is still done in software. However, hardware control can be handy for adjusting several settings at the same time and, for most users, it is always nicer to have physical controls rather than using the mouse.
Exploring new hardware-based options, the BBC carried out research on the use of a haptic feedback device for sound source control in spatial audio systems [1]. In this study, they used a 3D joystick (picture) which had motors to recall its position or to force the user to make some predefined movements. They explored several ways of controlling spatial audio with this device, which spatial audio parameters are best suited to hardware control, and the options that the motors could offer. It is a very interesting project, and I really encourage you to have a look at their paper.
Mid-air gestural control
Where I found the most projects was in mid-air gestural control. This means controlling audio without touching any surface or device, just by moving our hands.
Several papers have researched different techniques to control audio using mid-air gestures [2, 3, 4], using a hand-tracking device called Leap Motion. This device can track the position of each finger and gives six degrees of freedom to the user. One commercial product that includes this technology is the Fairlight 3DAW, which integrates a Leap Motion into a conventional mixing desk. Difficult to picture? It's probably better if you have a look at the following video:
Other approaches, like DearVR, use a VR headset and VR controllers to control sources in a virtual world (as in the video I embedded in the project overview post). In this case, you are immersed in the same virtual space where the mix happens, and you can remotely control the sound sources using the VR controllers.
Last but not least, I found a project that controls spatial audio using smartphones [5]. Their approach is quite similar to my original idea, using the gyroscope and accelerometer sensors to move a sound. However, their project is server-based instead of DAW-based: the phone sends messages to a server, which processes the audio for speaker reproduction.
and my approach?
I'm also going to use smartphones to control spatial audio, but my approach will be slightly different. In my case, I'm going to explore the native AR capabilities that iOS offers, implementing and evaluating several control methods in an iPhone app. In addition, my app will control the parameters of an ambisonics-based DAW plug-in, so the result can be reproduced through speakers as well as through headphones.
This week I've also been developing a simple app to try some of these AR capabilities, and next week, I'm going to write a post sharing my first thoughts and tests. Stay tuned to know more about it!!
After this extensive review (not just what I posted here, believe me, I found much more information than expected...), I realised that the audio processing part (software) is widely covered by lots of companies and university research groups. Thus, it doesn't make sense to try to create a new plug-in and expect to do something better than the already existing ones. At least, not within this four-month project...
However, I can focus my research on the control of spatial audio and evaluate different techniques using smartphones. The only project that uses them relies on just a few sensors and a single type of movement. Therefore, my project could explore more ways to control spatial audio, especially those related to the iOS native AR capabilities.
[1] Melchior, Frank, Chris Pike, Matthew Brooks, and Stuart Grace. 2013. "On the Use of a Haptic Feedback Device for Sound Source Control in Spatial Audio Systems." In Audio Engineering Society Convention 134. Audio Engineering Society.
[2] Gelineck, Steven, and Dannie Korsgaard. 2015. "An Exploratory Evaluation of User Interfaces for 3D Audio Mixing." In Audio Engineering Society Convention 138. Audio Engineering Society.
[3] Quiroz, Diego. 2018. "A Mid-Air Gestural Controller for the Pyramix® 3D Panner." In Audio Engineering Society Conference: 2018 AES International Conference on Spatial Reproduction-Aesthetics and Science. Audio Engineering Society.
[4] Gelineck, Steven, and Dan Overholt. 2015. "Haptic and Visual Feedback in 3D Audio Mixing Interfaces." In Proceedings of the Audio Mostly 2015 on Interaction With Sound, 14. ACM.
[5] Foss, Richard, and Sean Devonport. 2018. "An Immersive Audio Control System Using Mobile Devices and Ethernet AVB-Capable Speakers." Journal of the Audio Engineering Society 66 (9): 724–33.
This project is my Master's thesis for the MSc in Audio and Music Technology at the University of York, and it has Abbey Road Studios as its industry partner. I started thinking about it approximately one month ago, when I was assigned this topic. However, now I'm starting to research full-time, and I will be working on it until the end of August.
If I tell you that the formal title is "DAW based spatial audio creation using smartphones", your reply will probably be something like: "Sounds cool, but... what does it mean?" Let's start from the beginning then:
A Digital Audio Workstation (DAW) is a generic software category within which we can find commercial software like Pro Tools (image), Cubase, Reaper, Logic Pro, etc. With them, the user has an environment to sequence, record, edit, and mix audio or music.
This kind of software can manage the inputs and outputs of an external sound card (to work with analogue equipment, for instance), or use third-party effects (also called plug-ins) to generate or modify sound within the same environment. Therefore, a DAW is the base of every digital studio, from the smallest home studio to Abbey Road's Studio One. A DAW-based tool, thus, is any tool that can interact with a DAW (generally as a plug-in).
"...Spatial audio creation..."
Spatial audio is a really vague name but, for now, just assume that spatial audio means that sound is perceived in space (as in real life). Other names for it are 3D audio or immersive sound. Spatial audio can be reproduced using an array of speakers (like Dolby Atmos), but it is also possible to reproduce it through headphones. Do you want to try it? What about going to the virtual barbershop?
Several commercial brands like Waves, Facebook or SSA, as well as independent developers like Matthias Kronlachner or the Ambisonic Toolkit, have released different tools to work with (and create) spatial audio. Some of these tools are used, for instance, to place sources in space, to add spatial reverberation, or to apply other effects. Moreover, most of these tools are already DAW-based, so why, and what exactly, do I want to research in this area?
Indeed, the last part of the title is the answer. As shown in the image above (from the ambiX plug-in suite), all of the previously presented tools use 2D user interfaces to work with 3D audio. Even when they simulate 3D objects, such as the ambiX encoder (bottom-right), it is sometimes difficult to work with them, especially concerning elevation, width or distance.
"And what's happening with the smartphones?" Well, nowadays almost everyone has at least one smartphone with them; And, among others, it has lots of sensors like the gyroscope, the accelerometer or the camera that could be used to control spatial audio. An app then, can interpret these sensors information, and send OSC messages (a standard network communication for sound) to a plug-in in the DAW. After that, the plugin can translate the received information to place a sound in the 3D space, change the reverberation, the EQ, or whatever else.
Now that we know the background and, hopefully, the title makes more sense, let's talk about the specific aims and objectives of the project:
aims and objectives
The main objective is to develop an iPhone app which lets the user control spatial audio parameters in a DAW. Therefore, a plug-in also needs to be developed to receive the app's messages and process the audio.
With these objectives, the project has two research areas:
Spatial audio workflow
Before controlling spatial audio parameters with a smartphone, we have to know which parameters we want to control in our plug-in. Therefore, the first research area is the spatial audio workflow, where we can find questions like:
Spatial audio control using smartphones
Finally, the second research area is related to gesture control and smartphones. An iPhone has lots of sensors and Augmented Reality capabilities to explore and thus, there are also several questions to ask:
To summarise, the main concept of the project is to enable the user to control spatial audio parameters in a DAW environment, but without using only the mouse and the keyboard. Instead, the idea is to investigate new and easier ways to control them using the different capabilities of smartphone sensors.
A good example of spatial audio creation without the keyboard and the mouse is a product called DearVR Spatial Connect. It uses a VR headset and VR controllers to move sound sources in a VR environment. This is what they call 'the mixing console of the future'.
This blog aims to share the progress of my Master's thesis project with anyone who might be interested. Maybe someone is interested in the whole project, perhaps someone else is interested in a post on a particular topic or, why not, maybe you are a friend of mine who just wants to keep in touch with me during these busy days (sorry about that... I'll be back soon!). Whatever the reason, you are really welcome to read as many posts as you like and, of course, feel free to comment, discuss or ask anything you want!
I'm currently researching new DAW-based tools for spatial audio creation using gesture control with smartphones. This project has Abbey Road as an industry partner, and it is supervised by Dr Gavin Kearney and Prof Andy Hunt.
I am passionate about audio and music technology, and I really love the world of sound, audio and music. After finishing my Bachelor's degree in Sonology and working for two years in the media industry, I decided to move towards a more scientific and engineering-based career. That is why I am currently studying an MSc in Audio and Music Technology at the University of York, where I discovered my passion for academic research, especially in the field of Spatial Audio and Virtual Acoustics.