For the last year and a half, we at Asymptotic have been working with the excellent team at Daily. I’d like to share a little bit about what we’ve learned.
Daily is a real-time calling platform-as-a-service. One standard feature that users have come to expect in their calls is the ability to record them, or to stream their conversations to a larger audience. This involves mixing together all the audio/video from each participant and then storing it, or streaming it live via YouTube, Twitch, or any other third-party service.
As you might expect, GStreamer is a good fit for building this kind of functionality, where we consume a bunch of RTP streams, composite/mix them, and then send them out to one or more external services (Amazon’s S3 for recordings and HLS, or a third-party RTMP server).
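To give a flavour of the shape of such a pipeline, here is a much-simplified gst-launch sketch. Test sources stand in for the decoded participant streams, and the RTMP URL is a placeholder; the real implementation is, of course, considerably more involved.

    gst-launch-1.0 \
      compositor name=vmix sink_1::xpos=640 ! videoconvert ! \
        x264enc tune=zerolatency bitrate=2000 ! flvmux name=mux streamable=true ! \
        rtmpsink location="rtmp://example.invalid/live/stream" \
      videotestsrc is-live=true ! video/x-raw,width=640,height=360 ! vmix. \
      videotestsrc is-live=true pattern=ball ! video/x-raw,width=640,height=360 ! vmix. \
      audiotestsrc is-live=true ! audioconvert ! audiomixer name=amix ! voaacenc ! mux. \
      audiotestsrc is-live=true wave=pink-noise ! amix.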
I’ve written about how we implemented this feature elsewhere, but I’ll summarise briefly.
This is a slightly longer post than usual, so grab a cup of your favourite beverage, or jump straight to the summary section for the tl;dr.
As the third and final day of the PipeWire hackfest draws to a close, I thought I’d summarise some of my thoughts on the goings-on and the future.
Before I get into the details, I want to send out a big thank you to:
Christian Schaller for all the hard work of organising the event and Wim Taymans for the work on PipeWire so far (and in the future)
The GNOME Foundation, for sponsoring the event as a whole
Qualcomm, who are funding my presence at the event
Collabora, for sponsoring dinner on Monday
Everybody who attended and participated, for their time and thoughtful comments
For those of you who are not familiar with it, PipeWire (previously Pinos, previously PulseVideo) was Wim’s effort at providing secure, multi-program access to video devices (like webcams, or the desktop for screen capture). As he went down that rabbit hole, he wrote SPA, a lightweight general-purpose framework for representing a streaming graph, and this led to the idea of expanding the project to include support for low latency audio.
The Linux userspace audio story has, for the longest time, consisted of two top-level components: PulseAudio which handles consumer audio (power efficiency, wide range of arbitrary hardware), and JACK which deals with pro audio (low latency, high performance). Consolidating this into a good out-of-the-box experience for all use-cases has been a long-standing goal for me and for others in the community that I have spoken to.
From a PulseAudio perspective, it has been hard to achieve the 1-to-few millisecond latency numbers that would be absolutely necessary for professional audio use-cases. A lot of work has gone into improving this situation, most recently with David Henningsson’s shared-ringbuffer channels that made client/server communication more efficient.
At the same time, application sandboxing frameworks such as Flatpak have added security requirements that were not accounted for when PulseAudio was written. Examples include choosing which devices an application has access to (or can even know of), or which applications can act as control entities (set routing etc., enable/disable devices). Some work has gone into this — Ahmed Darwish did some key work to get memfd support in PulseAudio, and Wim has prototyped an access-control mechanism module to enable a Flatpak portal for sound.
All this said, there are still fundamental limitations in architectural decisions in PulseAudio that would require significant plumbing to address. With Wim’s work on PipeWire and his extensive background with GStreamer and PulseAudio itself, I think we have an opportunity to revisit some of those decisions with the benefit of a decade’s worth of learning from deploying PulseAudio in various domains, starting from desktops/laptops to phones, cars, robots, home audio, telephony systems and a lot more.
There are some core ideas of PipeWire that I am quite excited about.
The first of these is the graph. Like JACK, the entities that participate in the data flow are represented by PipeWire as nodes in a graph, and routing between nodes is very flexible — you can route applications to playback devices and capture devices to applications, but you can also route applications to other applications, and this is notionally the same thing.
The second idea is a bit more radical — PipeWire itself only “runs” the graph. The actual connections between nodes are created and managed by a “session manager”. This allows us to completely separate the data flow from policy, which means we could write completely separate policy for desktop use cases vs. specific embedded use cases. I’m particularly excited to see this be scriptable in a higher-level language, which is something Bastien has already started work on!
A powerful idea in PulseAudio was rewinding — the ability to send out huge buffers to the device, but the flexibility to rewind that data when things changed (a new stream got added, or the stream moved, or the volume changed). While this is great for power saving, it is a significant amount of complexity in the code. In addition, with some filters in the data path, rewinding can break the algorithm by introducing non-linearity. PipeWire doesn’t support rewinds, and we will need to find a good way to manage latencies to account for low power use cases. One example is that we could have the session manager bump up the device latency when we know latency doesn’t matter (Android does this when the screen is off).
There are a bunch of other things that are in the process of being fleshed out, like being able to represent the hardware as a graph as well, to have a clearer idea of what is going on within a node. More updates as these things become more concrete.
The Way Forward
There is a good summary by Christian of our discussion about what is missing and how we can go about trying to make a smooth transition for PulseAudio users. There is, of course, a lot to do, and my ideal outcome is that we one day flip a switch and nobody knows that we have done so.
In practice, we’ll need to figure out how to make this transition seamless for most people, while folks with custom setups will need to be given a long runway and clear documentation to know what to do. It’s way too early to talk about this in any more specifics, however.
One key thing that PulseAudio does right (I know there are people who disagree!) is having a custom configuration that automagically works on a lot of Intel HDA-based systems. We’ve been wondering how to deal with this in PipeWire, and the path we think makes sense is to transition to ALSA UCM configuration. This is not as flexible as we need it to be, but I’d like to extend it for that purpose if possible. This would ideally also help consolidate the various methods of configuration being used by the various Linux userspaces.
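Since configuration is the topic, here is a heavily trimmed sketch of what a UCM device section looks like. The control and card names here are illustrative; they vary across hardware.

    SectionDevice."Speaker" {
        Comment "Internal speakers"

        EnableSequence [
            cset "name='Speaker Switch' on"
        ]

        DisableSequence [
            cset "name='Speaker Switch' off"
        ]

        Value {
            PlaybackPCM "hw:PCH,0"
            PlaybackVolume "Speaker Playback Volume"
        }
    }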
To that end, I’ve started trying to get a UCM setup on my desktop that PulseAudio can use, and be functionally equivalent to what we do with our existing configuration. There are missing bits and bobs, and I’m currently focusing on the ones related to hardware volume control. I’ll write about this in the future as the effort expands out to other hardware.
Onwards and upwards
The transition to PipeWire is unlikely to be quick or completely painless or free of contention. For those who are worried about the future, know that any switch is still a long way away. In the meantime, however, constructive feedback and comments are welcome.
It’s been a little over a month since I got back from Manchester, and this post should’ve come out earlier but I’ve been swamped.
The conference was absolutely lovely, and the organisation was 110% on point (serious kudos, I know first hand how hard that is). Others on Planet GNOME have written extensively about the talks, the social events, and everything in between that made it a great experience. What I would like to write about is why this year’s GUADEC was special to me.
GNOME turning 20 years old is obviously a large milestone, and one of the main reasons I wanted to make sure I was at Manchester this year. There were many occasions to take stock of how far we had come, where we are, and most importantly, to reaffirm who we are, and why we do what we do.
And all of this made me think of my own history with GNOME. In 2002/2003, Nat and Miguel came down to Bangalore to talk about some of the work they were doing. I know I wasn’t the only one who found their energy infectious, and at Linux Bangalore 2003, they got on stage, just sat down, and started hacking up a GtkMozEmbed-based browser. The idea itself was fun, but what I took away — and I know I wasn’t the only one — is the sheer inclusive joy they shared in creating something and sharing that with their audience.
For all of us working on GNOME in whatever way we choose to contribute, there is the immediate gratification of shaping this project, as well as the larger ideological underpinning of making everyone’s experience talking to their computers better and free-er.
But I think it is also important to remember that all our efforts to make our community an inviting and inclusive space have a deep impact across the world. So much so that complete strangers from around the world are able to feel a sense of belonging to something much larger than themselves.
I am excited about everything we will achieve in the next 20 years.
(thanks go out to the GNOME Foundation for helping me attend GUADEC this year)
I’ve written a bit in my last two blog posts about the work I’ve been doing in inter-device synchronised playback using GStreamer. I introduced the library and then demonstrated its use in building video walls.
The important thing in synchronisation, of course, is how much in-sync are the streams? The video in my previous post gave a glimpse into that, and in this post I’ll expand on that with a more rigorous, quantifiable approach.
Before I start, a quick note: I am currently providing freelance consulting around GStreamer, PulseAudio and open source multimedia in general. If you’re looking for help with any of these, do get in touch.
The sync measurement setup
What is it that we are trying to measure? Let’s look at this in terms of the outcome — I have two computers, on a network. Using the gst-sync-server library, I play a stream on both of them. The ideal outcome is that the same video frame is displayed at exactly the same time, and the audio sample being played out of the respective speakers is also identical at any given instant.
As we saw previously, the video output is not a good way to measure what we want. This is because video displays are updated in sync with the display clock, over which consumer hardware generally does not have control. Besides, our eyes are not that sensitive to minor differences in timing unless images are side-by-side. After all, we’re fooling them with static pictures that change every 16.67ms or so.
Using audio, though, we should be able to do better. Digital audio streams for music/videos typically consist of 44100 or 48000 samples a second, so we have a much finer granularity than video provides us. The human ear is also fairly sensitive to timing when it comes to sound. If you hear the same sound twice, separated by an interval larger than about 10 ms, you will perceive two distinct sounds, and the echo will annoy you to no end.
Measuring audio is also good enough because once you’ve got audio in sync, GStreamer will take care of A/V sync itself.
Okay, so now we know what we want to measure. But how do we measure it? The setup is illustrated below:
Sync measurement setup illustrated
As before, I’ve set up my desktop PC and laptop to play the same stream in sync. The stream being played is a local audio file — I’m keeping the setup simple by not adding network streaming to the equation.
The audio itself is just a tick sound every second. The tick is a simple 440 Hz sine wave (A₄ for the musically inclined) that runs for 1600 samples. It sounds something like this:
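If you’d like to generate a similar tick file to play with, here’s a minimal C sketch that writes the signal out as raw PCM (my actual test file was produced separately, so details may differ):

    /* gcc ticks.c -o ticks -lm
     * Play back with: aplay -f S16_LE -r 44100 -c 1 ticks.raw */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define RATE 44100          /* samples per second */
    #define TICK_SAMPLES 1600   /* length of each tick */
    #define SECONDS 60          /* total duration */

    int main (void)
    {
      FILE *f = fopen ("ticks.raw", "wb");
      if (!f)
        return 1;

      for (int s = 0; s < SECONDS; s++) {
        for (int i = 0; i < RATE; i++) {
          /* A 440 Hz sine burst at the start of every second, silence after */
          double v = (i < TICK_SAMPLES) ?
              0.8 * sin (2.0 * M_PI * 440.0 * i / RATE) : 0.0;
          int16_t sample = (int16_t) (v * 32767.0);
          fwrite (&sample, sizeof (sample), 1, f);
        }
      }

      fclose (f);
      return 0;
    }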
I’ve connected the 3.5mm audio output of both the computers to my faithful digital oscilloscope (a Tektronix TBS 1072B if you wanted to know). So now measuring synchronisation is really a question of seeing how far apart the leading edge of the sine wave on the tick is.
Of course, this assumes we’re not more than 1s out of sync (that’s the periodicity of the tick itself), and I’ve verified that by playing non-periodic sounds (any song or video) and making sure they’re in sync as well. You can trust me on this, or better yet, get the code and try it yourself! :)
The last piece to worry about — the network. How well we can sync the two streams depends on how well we can synchronise the clocks of the pipeline we’re running on each of the two devices. I’ll talk about how this works in a subsequent post, but my measurements are done on both a wired and wireless network.
Before we get into it, we should keep in mind that due to how we synchronise streams — using a network clock — how in-sync our streams are will vary over time depending on the quality of the network connection.
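(If you’re curious about the mechanism in the meantime: GStreamer ships network clock support, and slaving a pipeline to a remote clock is a small amount of code. A minimal sketch, with placeholder addresses and ports:)

    #include <gst/gst.h>
    #include <gst/net/gstnet.h>

    /* Server side: expose the pipeline's clock on the network */
    static GstNetTimeProvider *
    expose_clock (void)
    {
      GstClock *clock = gst_system_clock_obtain ();
      /* NULL address means listen on all interfaces; 12345 is a placeholder */
      return gst_net_time_provider_new (clock, NULL, 12345);
    }

    /* Client side: slave the local pipeline to the server's clock */
    static void
    use_remote_clock (GstPipeline *pipeline, const gchar *server_addr)
    {
      GstClock *net_clock =
          gst_net_client_clock_new ("net-clock", server_addr, 12345, 0);
      gst_pipeline_use_clock (pipeline, net_clock);
    }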
If this variation is small enough, it won’t be noticeable. If it is large (10s of milliseconds), then we may start to notice it as echo, or as glitches when the pipeline tries to correct for the lack of sync.
In the first setup, my laptop and desktop are connected to each other directly via a LAN cable. The result looks something like this:
Sync on LAN, working well
Sync on LAN, working well, up close
Sync on LAN, slightly off
Sync on LAN, slightly off, up close
The first two images show the best case — we need to zoom in real close to see how out of sync the audio is, and it’s roughly 50µs.
The next two images show the “worst case”. This time, the zoomed out (5ms) version shows some out-of-sync-ness, and on zooming in, we see that it’s in the order of 500µs.
So even our bad case is actually quite good — sound travels at about 340 m/s, so 500µs is the equivalent of two speakers about 17cm apart.
Now let’s make things a little more interesting. With both my laptop and desktop connected to a wifi network:
Sync on wifi, okay on average
Sync on wifi, okay on average, up close
Sync on wifi, goes off on bad connection
Sync on wifi, goes off, up close
Sync on wifi, when it’s bad
Sync on wifi, when it’s good
On average, the sync can be quite okay. The first pair of images show sync to be within about 300µs.
However, the wifi on my desktop is flaky, so you can see it go off up to 2.5ms in the next pair. In my setup, it even goes off up to 10-20ms, before returning to the average case. The next two images show it go back and forth.
Why does this happen? Well, let’s take a quick look at what ping statistics from my desktop to my laptop look like:
Ping from desktop to laptop on wifi
That’s not good — you can see that the minimum, average and maximum RTT are very different. Our network clock logic probably needs some tuning to deal with this much jitter.
These measurements show that we can get some (in my opinion) pretty good synchronisation between devices using GStreamer. I wrote the gst-sync-server library to make it easy to build applications on top of this feature.
The obvious area to improve is how we cope with jittery networks. We’ve added some infrastructure to capture and replay clock synchronisation messages offline. What remains is to build a large enough body of good and bad cases, and then tune the sync algorithm to work as well as possible with all of these.
Also, Florent over at Ubicast pointed out a nice tool they’ve written to measure A/V sync on the same device. It would be interesting to modify this to allow for automated measurement of inter-device sync.
In a future post, I’ll write more about how we actually achieve synchronisation between devices, and how we can go about improving it.
Hello again, and I hope you’re having a pleasant end of the year (if you are, maybe don’t check the news until next year).
I’d written about synchronised playback with GStreamer a little while ago, and work on that has been continuing apace. Since I last wrote about it, a bunch of work has gone in:
Landed support for sending a playlist to clients (instead of a single URI)
Added the ability to start/stop playback
The API has been cleaned up considerably to allow us to consider including this upstream
The control protocol implementation was made an interface, so you don’t have to use the built-in TCP server (different use-cases might want different transports)
Made a bunch of robustness fixes and documentation
Introduced API for clients to send the server information about themselves
Also added API for the server to send video transformations for specific clients to apply before rendering
While the other bits are exciting in their own right, in this post I’m going to talk about the last two items.
For those of you who aren’t familiar with the term, a video wall is just an array of displays stacked to make a larger display. These are often used in public installations.
One way to set up a video wall is to have each display connected to a small computer (such as the Raspberry Pi), and have them play a part of the entire video, cropped and scaled for the display that is connected. This might look something like:
A 4×4 video wall
The tricky part, of course, is synchronisation — which is where gst-sync-server comes in. Since we’re able to play a given stream in sync across devices on a network, the only missing piece was the ability to distribute a set of per-client transformations so that clients could apply those, and that is now done.
In order to keep things clean from an API perspective, I took the following approach:
Clients now have the ability to send a client ID and a configuration (which is just a dictionary) when they first connect to the server
The server API emits a signal with the client ID and configuration, which allows you to know when a client connects, what kind of display it’s running, and where it is positioned
The server now has additional fields to send a map of client ID to a set of video transformations
This allows us to do fancy things like having each client manage its own information with the server dynamically adapting the set of transformations based on what is connected. Of course, the simpler case of having a static configuration on the server also works.
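As an example of the kind of thing a client might send, the ID and configuration for one panel of a wall could look something like this. This is a hypothetical shape, purely for illustration; the library treats the configuration as an opaque dictionary.

    {
      "client-id": "wall-panel-3",
      "config": {
        "row": 1,
        "column": 0,
        "display-width": 1920,
        "display-height": 1080,
        "bezel-width-px": 12
      }
    }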
Since seeing is believing, here’s a demo of the synchronised playback in action:
The setup is my laptop, which has an Intel GPU, and my desktop, which has an NVidia GPU. These are connected to two monitors (thanks go out to my good friends from Uncommon for lending me their thin-bezelled displays).
The video resolution is 1920×800, and I’ve adjusted the crop parameters to account for the bezels, so the video actually does look continuous. I’ve uploaded the text configuration if you’re curious about what that looks like.
As I mention in the video, the synchronisation is not as tight as I would like it to be. This is most likely because of the differing device configurations. I’ve been working with Nicolas to try to address this shortcoming by using some timing extensions that the Wayland protocol allows for. More news on this as it breaks.
More generally, I’ve done some work to quantify the degree of sync, but I’m going to leave that for another day.
p.s. the reason I used kmssink in the demo was that it was the quickest way I know of to get a full-screen video going — I’m happy to hear about alternatives, though
Make it real
My demo was implemented quite quickly by allowing the example server code to load and serve up a static configuration. What I would like is to have a proper working application that people can easily package and deploy on the kinds of embedded systems used in real video walls. If you’re interested in taking this up, I’d be happy to help out. Bonus points if we can dynamically calculate transformations based on client configuration (position, display size, bezel size, etc.)
One thing that’s bothering me is that the video transformations are applied in software using GStreamer elements. This works fine(ish) for the hardware I’m developing on, but in real life, we would want to use OpenGL(ES) transformations, or platform specific elements to have hardware-accelerated transformations. My initial thoughts are for this to be either API on playbin or a GstBin that takes a set of transformations as parameters and internally sets up the best method to do this based on whatever sink is available downstream (some sinks provide cropping and other transformations).
Why not audio?
I’ve only written about video transformations here, but we can do the same with audio transformations too. For example, multi-room audio systems allow you to configure the locations of wireless speakers — so you can set which one’s on the left, and which on the right — and the speaker will automatically play the appropriate channel. Implementing this should be quite easy with the infrastructure that’s currently in place.
Merry Happy!
I hope you enjoyed reading that — I’ve had great responses from a lot of people about how they might be able to use this work. If there’s something you’d like to see, leave a comment or file an issue.
A lesser known, but particularly powerful feature of GStreamer is our ability to play media synchronised across devices with fairly good accuracy.
The way things stand right now, though, achieving this requires some amount of fiddling and a reasonably thorough knowledge of how GStreamer’s synchronisation mechanisms work. While we have had some excellent talks about these at previous GStreamer conferences, getting things to work is still a fair amount of effort for someone not well-versed with GStreamer.
As part of my work with the Samsung OSG, I’ve been working on addressing this problem, by wrapping all the complexity in a library. The intention is that anybody who wants to implement the ability for different devices on a network to play the same stream and have them all synchronised should be able to do so with a few lines of code, and the basic know-how for writing GStreamer-based applications.
I’ve started work on this already, and you can find the code in the creatively named gst-sync-server repo.
Design and API
Let’s make this easier by starting with a picture …
Let’s say you’re writing a simple application where you have two or more devices that need to play the same video stream, in sync. Your system would consist of two entities:
A server: this is where you configure what needs to be played. It instantiates a GstSyncServer object on which it can set a URI that needs to be played. There are other controls available here that I’ll get to in a moment.
A client: each device would be running a copy of the client, and would get information from the server telling it what to play, and what clock to use to make sure playback is synchronised. In practical terms, you do this by creating a GstSyncClient object, and giving it a playbin element which you’ve configured appropriately (this usually involves at least setting the appropriate video sink that integrates with your UI).
That’s pretty much it. Your application instantiates these two objects, starts them up, and as long as the clients can access the media URI, you magically have two synchronised streams on your devices.
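In rough pseudo-C, that boils down to something like the following sketch. The constructor names and signatures here are assumptions for illustration; the example code in the repository is the authoritative reference.

    /* Server: publish what should be played (names/signatures assumed) */
    GstSyncServer *server = gst_sync_server_new ("0.0.0.0", 3695);
    g_object_set (server, "uri", "http://example.com/video.mp4", NULL);
    gst_sync_server_start (server);

    /* Client: hand over a playbin you've configured for your UI */
    GstElement *playbin = gst_element_factory_make ("playbin", NULL);
    g_object_set (playbin, "video-sink", my_video_sink, NULL);
    GstSyncClient *client = gst_sync_client_new ("server.local", 3695, playbin);
    gst_sync_client_start (client);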
The keen observers among you would have noticed that there is a control entity in the above diagram that deals with communicating information from the server to clients over the network. While I have currently implemented a simple TCP protocol for this, my goal is to abstract out the control transport interface so that it is easy to drop in a custom transport (Websockets, a REST API, whatever).
The actual sync information is merely a structure marshalled into a JSON string and sent to clients every time something happens. Once your application has some media playing, the next thing you’ll want to do from your server is control playback. This can include:
Changing what media is playing (like after the current media ends)
Pausing/resuming the media
“Trick modes” such as fast forward or reverse playback
The first two of these work already, and seeking is on my short-term to-do list. Trick modes, as the name suggests, can be a bit more tricky, so I’ll likely get to them after other things are done.
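Going back to the sync information for a moment: a message from the server to clients might look vaguely like the following. The field names here are hypothetical, purely for illustration; the actual wire format is defined by the code.

    {
      "uri": "http://example.com/video.mp4",
      "base-time": 1450960800000000000,
      "clock-address": "192.168.1.10",
      "clock-port": 12346,
      "paused": false
    }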
My hope is to see this library being used in a few other interesting use cases:
Video walls: having a number of displays stacked together so you have one giant display — these are all effectively playing different rectangles from the same video
Multiroom audio: you can play the same music across different speakers in a single room, or multiple rooms, or even group sets of speakers and play different media on different groups
Media sharing: being able to play music or videos on your phone and have your friends be able to listen/watch at the same time (a silent disco app?)
At this point, the outline of what I think the API should look like is done. I still need to create the transport abstraction, but that’s pretty much a matter of extracting out the properties and signals that are part of the existing TCP transport.
What I would like is to hear from you, my dear readers who are interested in using this library — does the API look like it would work for you? Does the transport mechanism I describe above cover what you might need? There is example code that should make it easier to understand how this library is meant to be used.
Depending on the feedback I get, my next steps will be to implement the transport interface, refine the API a bit, fix a bunch of FIXMEs, and then see if this is something we can include in gst-plugins-bad.
Feel free to comment either on the Github repository, on this blog, or via email.
And don’t forget to watch this space for some videos and measurements of how GStreamer synchronised playback fares in real life!
This is a quick PSA for those of you using the GStreamer binary builds for Android.
With the Android NDK r12, the default behaviour while building native code changed from building for armeabi to building for all ABIs. So if your app doesn’t specify APP_ABI in its Application.mk, you will now get an error about unsupported architectures. This was tracked as bug 770631.
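If you want to keep building only for specific architectures, setting APP_ABI explicitly in your Application.mk does the trick. For example, to build for everything the GStreamer binaries support:

    # Application.mk
    APP_ABI := armeabi armeabi-v7a arm64-v8a x86 x86_64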
The idea behind this change is that your Android app should ship versions of your native code for all supported architectures as a “universal” build, so it is accessible to as many devices as possible.
To deal with this, we now provide a universal tarball which contains binaries for all architectures that we support. This is currently ARM, ARMv7-A, ARMv8-A (64-bit), x86, and x86-64. That leaves MIPS and MIPS64, which are not currently supported.
If you’ve been using the GStreamer Android binaries before GStreamer 1.9.2, then you should start using the universal tarball rather than the architecture-specific tarball. You will need minor updates to your native build, like we made to the player example. You probably want to put the gstAndroidRoot variable in ~/.gradle/gradle.properties instead, though.
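That is, instead of hard-coding the path in the project itself, you’d have something like this in ~/.gradle/gradle.properties (the path is just wherever you extracted the universal tarball):

    gstAndroidRoot=/home/myuser/android/gstreamer-1.0-android-universal-1.9.2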
As Sebastian announced, assuming all goes well with the universal tarballs, we will stop shipping the per-arch tarballs — they are redundant, and just take up CI and disk resources.
There are some things that I’d like for us to be able to do better. The first is that Android Studio doesn’t pick up native code with our current build approach. This is a limitation of the Android Gradle NDK plugin, which doesn’t support a custom build. This should change with Android Studio 2.2.
I would also like to integrate better with Android Studio — either be able to specify the GStreamer Android binary path in the UI (like you do for the SDK/NDK), or better yet, have it be possible to specify the dependency in Gradle, and have it be automatically pulled from the Internet. If any of you are familiar with how we can do this, please shout out!
In case you missed it — we got PulseAudio 9.0 out the door, with the echo cancellation improvements that I wrote about. Now is probably a good time for me to make good on my promise to expand upon the subject of beamforming.
As with the last post, I’d like to shout out to the wonderful folks at Aldebaran Robotics who made this work possible!
Beamforming as a concept is used in various aspects of signal processing including radio waves, but I’m going to be talking about it only as applied to audio. The basic idea is that if you have a number of microphones (a mic array) in some known arrangement, it is possible to “point” or steer the array in a particular direction, so sounds coming from that direction are made louder, while sounds from other directions are rendered softer (attenuated).
Practically speaking, it should be easy to see the value of this on a laptop, for example, where you might want to focus a mic array to point in front of the laptop, where the user probably is, and suppress sounds that might be coming from other locations. You can see an example of this in the webcam below. Notice the grilles on either side of the camera — there is a microphone behind each of these.
Webcam with 2 mics
This raises the question of how this effect is achieved. The simplest approach is called “delay-sum beamforming”. The key idea in this approach is that if we have an array of microphones steered at a particular angle, sound coming from that direction will reach each microphone at a slightly different time. This is illustrated below. The image is taken from this great article describing the principles and math in a lot more detail.
In this figure, you can see that the sound from the source we want to listen to reaches the top-most microphone slightly before the next one, which in turn captures the audio slightly before the bottom-most microphone. If we know the distance between the microphones and the angle to which we want to steer the array, we can calculate the additional distance the sound has to travel to each microphone.
The speed of sound in air is roughly 340 m/s, and thus we can also calculate how much of a delay occurs between the same sound reaching each microphone. The signal at the first two microphones is delayed using this information, so that we can line up the signal from all three. Then we take the sum of the signal from all three (actually the average, but that’s not too important).
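Here’s a small sketch of the delay calculation for a linear mic array. This is illustrative only, not code from any actual module:

    #include <math.h>
    #include <stdio.h>

    #define SPEED_OF_SOUND 340.0  /* m/s, approximately */

    /* Delay (in seconds) of mic `i` relative to mic 0, for a linear array
     * with `spacing` metres between adjacent mics, steered at `angle`
     * radians from broadside (0 = straight ahead) */
    static double
    mic_delay (int i, double spacing, double angle)
    {
      double extra_distance = i * spacing * sin (angle);
      return extra_distance / SPEED_OF_SOUND;
    }

    int
    main (void)
    {
      /* Three mics 4 cm apart, steered 60 degrees off-centre, at 32 kHz.
       * The fractional part of each result is what needs a
       * fractional-delay filter (e.g. Thiran) to realise. */
      for (int i = 0; i < 3; i++) {
        double samples = mic_delay (i, 0.04, M_PI / 3.0) * 32000.0;
        printf ("mic %d: delay of %.2f samples\n", i, samples);
      }
      return 0;
    }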
The signal from the direction we’re pointing in is going to be strongly correlated, so it will turn out loud and clear. Signals from other directions will end up being attenuated because they will only occur in one of the mics at a given point in time when we’re summing the signals — look at the noise wavefront in the illustration above as an example.
(this section is a bit more technical than the rest of the article, feel free to skim through or skip ahead to the next section if it’s not your cup of tea!)
The devil is, of course, in the details. Given the microphone geometry and steering direction, calculating the expected delays is relatively easy. We capture audio at a fixed sample rate — let’s assume this is 32000 samples per second, or 32 kHz. That translates to one sample every 31.25 µs. So if we want to delay our signal by 125µs, we can just add a buffer of 4 samples (4 × 31.25 = 125). Sound travels about 4.25 cm in that time, so this is not an unrealistic example.
Now, instead, assume the signal needs to be delayed by 80 µs. This translates to 2.56 samples. We’re working in the digital domain — the mic has already converted the analog vibrations in the air into digital samples that have been provided to the CPU. This means that our buffer delay can either be 2 samples or 3, not 2.56. We need another way to add a fractional delay (else we’ll end up with errors in the sum).
There is a fair amount of academic work describing methods to perform filtering on a sample to provide a fractional delay. One common way is to apply an FIR filter. However, to keep things simple, the method I chose was the Thiran approximation — the literature suggests that it performs the task reasonably well, and has the advantage of not having to spend a whole lot of CPU cycles first transforming to the frequency domain (which an FIR filter requires) (edit: converting to the frequency domain isn’t necessary, thanks to the folks who pointed this out).
Now it’s time for a confession. I’m a plumber, not a DSP ninja. My delay-sum beamformer doesn’t do a very good job. I suspect part of it is the limitation of the delay-sum approach, partly the use of an IIR filter (which the Thiran approximation is), and it’s also entirely possible there is a bug in my fractional delay implementation. Reviews and suggestions are welcome!
A Better Implementation
The astute reader has, by now, realised that we are already doing a bunch of processing on incoming audio during voice calls — I’ve written in the previous article about how the webrtc-audio-processing engine provides echo cancellation, automatic gain control, voice activity detection, and a bunch of other features.
Another feature that the library provides is — you guessed it — beamforming. The engineers at Google (who clearly are DSP ninjas) have a pretty good beamformer implementation, and this is also available via module-echo-cancel. You do need to configure the microphone geometry yourself (which means you have to manually load the module at the moment). Details are on our wiki (thanks to Tanu for that!).
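As an example, for two mics 4 cm apart on either side of the webcam, loading the module looks roughly like this. The geometry is specified as x,y,z triplets in metres per mic; do double-check the exact argument syntax on the wiki before copying.

    pactl load-module module-echo-cancel aec_method=webrtc \
        aec_args="beamforming=1 mic_geometry=-0.02,0,0,0.02,0,0"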
How well does this work? Let me show you. The image below is me talking to my laptop, which has two microphones about 4cm apart, on either side of the webcam, above the screen. First I move to the right of the laptop (about 60°, assuming straight ahead is 0°). Then I move to the left by about the same amount (the second speech spike). And finally I speak from the center (a couple of times, since I get distracted by my phone).
The upper section represents the microphone input — you’ll see two channels, one corresponding to each mic. The bottom part is the processed version, with echo cancellation, gain control, noise suppression, etc. and beamforming.
You can also listen to the actual recordings …
… and the processed output.
Feels like black magic, doesn’t it?
The webrtc-audio-processing-based beamforming is already available for you to use. The downside is that you need to load the module manually, rather than have this automatically plugged in when needed (because we don’t have a way to store and retrieve the mic geometry). At some point, I would really like to implement a configuration framework within PulseAudio to allow users to set configuration from some external UI and have that be picked up as needed.
Nicolas Dufresne has done some work to wrap the webrtc-audio-processing library functionality in a GStreamer element (and this is in master now). Adding support for beamforming to the element would also be good to have.
The module-beamformer bits should be a good starting point for folks who want to wrap their own beamforming library and have it used in PulseAudio. Feel free to get in touch with me if you need help with that.
While I did mention a while back (almost two years ago, wow) that I was taking a break, I realised recently that I hadn’t posted an update from when I started again.
For the last year and a half, I’ve been providing freelance consulting around PulseAudio, GStreamer, and various other directly and tangentially related projects. There’s a brief list of the kind of work I’ve been involved in.
If you’re looking for help with PulseAudio, GStreamer, multimedia middleware or anything else you might’ve come across on this blog, do get in touch!
I know it’s been ages, but I am now working on updating the webrtc-audio-processing library. You might remember this as the code that we split off from the webrtc.org code to use in the PulseAudio echo cancellation module.
This is basically just the AudioProcessing module, bundled as a standalone library so that we can use the fantastic AEC, AGC, and noise suppression implementation from that code base. For packaging simplicity, I made a copy of the necessary code, and wrote an autotools-based build system around that.
Now since I last copied the code, the library API has changed a bit — nothing drastic, just a few minor cleanups and removed API. This wouldn’t normally be a big deal since this code isn’t actually published as external API — it’s mostly embedded in the Chromium and Firefox trees, probably other projects too.
Since we are exposing a copy of this code as a standalone library, this means that there are two options — we could (a) just break the API, and all dependent code needs to be updated to be able to use the new version, or (b) write a small wrapper to try to maintain backwards compatibility.
I’m inclined to just break API and release a new version of the library which is not backwards compatible. My rationale for this is that I’d like to keep the code as close to what is upstream as possible, and over time it could become painful to maintain a bunch of backwards-compatibility code.
A nicer solution would be to work with upstream to make it possible to build the AudioProcessing module as a standalone library. While the folks upstream seemed amenable to the idea when this came up a few years ago, nobody has stepped up to actually do the work for this. In the meantime, a number of interesting features have been added to the module, and it would be good to pull these in to use in PulseAudio and any other projects using this code (more about this in a follow-up post).
So if you’re using webrtc-audio-processing, be warned that the next release will probably break API, and you’ll need to update your code. I’ll try to publish a quick update guide when releasing the code, but if you want to look at the current API, take a look at the current audio_processing.h.
p.s.: If you do use webrtc-audio-processing as a dependency, I’d love to hear about it. As far as I know, PulseAudio is the only user of this library at the moment.