GStreamer + PipeWire: A Todo List

I wrote about our time at the GStreamer Conference in October, and one important thing I was able to do is spend some time with all-around great guy George reflecting on where the GStreamer plugins for PipeWire are, and what we need to do to get them to a rock-solid state.

This is a summary of our conversation, in the form of a to-do list of sorts…

Status Quo

Currently, we have two elements: pipewiresrc and pipewiresink. The two plugins work with both audio and video, and instantiate a PipeWire capture and playback stream, respectively. The stream, as with any PipeWire client, appears as a node in the PipeWire.

Buffers are managed in the GStreamer pipeline using bufferpools, and recently Wim re-enabled exposing the stream clock as a GStreamer clock.

There have been a number of issues that have cropped up over time, and we’ve been plugging away at addressing them, but it was worth stepping back and looking at the whole for a bit.

Use Cases

The straightforward uses of these elements might be to represent client streams: pipewiresrc might connect to an audio capture device (like a microphone), or video capture device (like a webcam), and provide the data for downstream elements to consume. Similarly pipewiresink might be used to play audio to the system output (speakers or headphones, perhaps).

Because of the flexibility of the PipeWire API, these elements may also be used to provide a virtual capture or playback device though. So pipewiresrc might provide a virtual audio sink, which applications could connect to to stream audio over the network (like say a WebRTC stream).

Conversely, it is possible to use pipewiresink to provide a virtual capture device – for example, the pipeline might generate a video stream and expose it a virtual camera for other applications to use.

We might even combine the two cases, one might connect to a webcam as a client, apply some custom video processing, and then expose that stream back as a virtual camera source as easily as:

So we have a minor combinatorial explosion across 3 axes, and all combinations are valid:

  • pipewiresrc vs. pipewiresink
  • audio vs. video
  • stream vs. virtual device

For each of these combinations, we might have different behaviour across the various issues below.

Split ’em up?

Before we look at specific issues, it is worth pointing out that the PipeWire elements are unusual in that they support both audio and video with the same code. This seems like a tantalisingly elegant idea, and it’s quite neat that we are able to get this far with this unified approach.

However, as we examine the specific issues we are seeing, it does seem to emerge that the audio and video paths diverge in several ways. It may be time to consider whether the divergence merits just splitting them up into separate audio and video elements.

Linking

The first issue that comes to mind is how we might want PipeWire or WirePlumber to manage linking the nodes from the GStreamer pipeline with other nodes (devices or streams).

For the playback/capture stream use-cases, we would want the nodes to automatically be connected to a sink/source node when the GStreamer pipeline goes to the PAUSED or PLAYING state, and for that link to be torn down when leaving those states. It might be possible for the link to “move” if, for example, the default playback or capture device changes, though a “move” is really the removal of the current link with a new link following.

For the virtual device use-cases, the pipeline state should likely follow the link state. That is, when a node is connected to our virtual device, we want the GStreamer pipeline to start producing/consuming data, and when disconnected, it should go back to “sleep”, possibly running again later.

The latter is something that a GStreamer application using these plugins might have to manage manually, but simplifying this and supporting this via gst-launch-1.0 for easy command-line use would be nice to have.

There are already the beginnings of support for such usage via the provide property on pipewiresink, but more work is needed for this to make this truly usable.

Bufferpools

Closely related to linking are buffers and bufferpools, as the process of linking nodes is what makes buffers for data exchange available to PipeWire nodes.

While bufferpools are a valuable concept for memory efficiency and avoiding unnecessary memcpy()s, they come with some complexity overhead in managing the pipeline. For one, as the number of buffers in a bufferpool is limited, it is possible to exhaust the set of buffers (with a large queue for example).

There are also some lifecycle complexities that arise from links coming and going, as the corresponding buffers also then go away from under us, something that GStreamer bufferpools are not designed for.

A solution to the first problem might be to avoid using bufferpools for some cases (for example, they might not be very valuable for audio). The solution to the lifecycle problem is a trickier one, and no clear answer is apparent yet, at least with the APIs as they stand.

We might also need to support resizing bufferpools for some cases, and that is not something that is easy to support with how buffer management currently happens in PipeWire (the stream API does not really give us much of a handle on this).

Formats

In order to support the various use-cases, we want to be able to support both a fixed format (if we know what we are providing), or a negotiated format (if we can adapt in the GStreamer pipeline based on what PipeWire has/wants).

There is also a large surface area of formats that PipeWire supports that we need to make sure we support well:

  • There are known issues with some planar video formats being presented correctly from pipewiresrc
  • We do not expose planar audio formats, although both GStreamer and PipeWire support them
  • Support for DSD and passthrough audio (e.g. Dolby/DTS over HDMI) needs to be wired up
  • Support for compressed formats (we added PipeWire support for decode + render on a DSP)

Rate matching

While Wim recently added some rate matching code to pipewiresink, there is work to be done to make sure that if there is skew between the GStreamer pipeline’s data rate and the audio device rate, we can use PipeWire’s rate adaptation features to compensate for such skew. This should work in both pipewiresink and pipewiresrc.

For some background on this topic, check out my talk on clock rate matching from a couple of months ago.

Device provider conflicts

While we are improving the out-of-the-box experience of these elements, unfortunately the PipeWire device provider currently supersedes all others (the GStreamer Device Provider API allows for discovering devices on the system and the elements used to access them).

The higher rank might make sense for video (as system integrators likely want to start preferring PipeWire elements over V4L2), but it can lead to a bad experience for audio (the PulseAudio elements work better today via PipeWire’s PulseAudio emulation layer).

We might temporarily drop the rank of PipeWire elements for audio to avoid autoplugging them while we fix the problems we have.

Probing formats

We create a “probe” stream in pipewiresink while getting ready to play audio, in order to discover what formats the device we would play to supports. This is required to detect supported formats and make decisions about whether to decode in GStreamer, what sample rate and format are preferred, etc.

Unfortunately, that also causes a “false” playback device startup sequence, which might manifest as a click or glitch on some hardware. Having a way to set up a probe that does not actually open the device would be a nice improvement.

Player state

There are a couple of areas where policy actions do not always surface well to the application/UI layer. One instance of this is where a stream is “corked” (maybe because only one media player should be active at a time) – we want to let the player know it has been paused, so it can update its state and let the UI know too. There is limited infrastructure for this already, via GST_MESSAGE_REQUEST_STATE.

Also, more of a session management (i.e. WirePlumber / system integration) thing, we do not really have the concept of a system-wide media player state. This would be useful if we want to exercise policy like “don’t let any media player play while we’re on a call”, and have that state be consistent across UI interactions (i.e. hitting play during a call does not trigger playback, maybe even lets the user know why it’s not working / provides an override).

GStreamer Conference 2024

All of us at Asymptotic are back home from the exciting week at GStreamer Conference 2024 in Montréal, Canada last month. It was great to hang out with the community and see all the great work going on in the GStreamer ecosystem.

Montréal sunsets are 😍

There were some visa-related adventures leading up to the conference, but thanks to the organising team (shoutout to Mark Filion and Tim-Philipp Müller), everything was sorted out in time and Sanchayan and Taruntej were able to make it.

This conference was also special because this year marks the 25th anniversary of the GStreamer project!

Happy birthday to us! 🎉

Talks

We had 4 talks at the conference this year.

GStreamer & QUIC (video)

Sancyahan speaking about GStreamer and QUIC

Sanchayan spoke about his work with the various QUIC elements in GStreamer. We already have the quinnquicsrc and quinquicsink upstream, with a couple of plugins to allow (de)multiplexing of raw streams as well as an implementation or RTP-over-QUIC (RoQ). We’ve also started work on Media-over-QUIC (MoQ) elements.

This has been a fun challenge for us, as we’re looking to build out a general-purpose toolkit for building QUIC application-layer protocols in GStreamer. Watch this space for more updates as we build out more functionality, especially around MoQ.

Clock Rate Matching in GStreamer & PipeWire (video)

Arun speaking about PipeWire delay-locked loops
Photo credit: Francisco

My talk was about an interesting corner of GStreamer, namely clock rate matching. This is a part of live pipelines that is often taken for granted, so I wanted to give folks a peek under the hood.

The idea of doing this talk was was born out of some recent work we did to allow splitting up the graph clock in PipeWire from the PTP clock when sending AES67 streams on the network. I found the contrast between the PipeWire and GStreamer approaches thought-provoking, and wanted to share that with the community.

GStreamer for Real-Time Audio on Windows (video)

Next, Taruntej dove into how we optimised our usage of GStreamer in a real-time audio application on Windows. We had some pretty tight performance requirements for this project, and Taruntej spent a lot of time profiling and tuning the pipeline to meet them. He shared some of the lessons learned and the tools he used to get there.

Simplifying HLS playlist generation in GStreamer (video)

Sanchayan also walked us through the work he’s been doing to simplify HLS (HTTP Live Streaming) multivariant playlist generation. This should be a nice feature to round out GStreamer’s already strong support for generating HLS streams. We are also exploring the possibility of reusing the same code for generating DASH (Dynamic Adaptive Streaming over HTTP) manifests.

Hackfest

As usual, the conference was followed by a two-day hackfest. We worked on a few interesting problems:

  • Sanchayan addressed some feedback on the QUIC muxer elements, and then investigated extending the HLS elements for SCTE-35 marker insertion and DASH support

  • Taruntej worked on improvements to the threadshare elements, specifically to bring some ts-udpsrc element features in line with udpsrc

  • I spent some time reviewing a long-pending merge request to add soft-seeking support to the AWS S3 sink (so that it might be possible to upload seekable MP4s, for example, directly to S3). I also had a very productive conversation with George Kiagiadakis about how we should improve the PipeWire GStreamer elements (more on this soon!)

All in all, it was a great time, and I’m looking forward to the spring hackfest and conference in the the latter part next year!

GStreamer and WebRTC HTTP signalling

The WebRTC nerds among us will remember the first thing we learn about WebRTC, which is that it is a specification for peer-to-peer communication of media and data, but it does not specify how signalling is done.

Or put more simply, if you want call someone on the web, WebRTC tells you how you can transfer audio, video and data, but it leaves out the bit about how you make the call itself: how do you locate the person you’re calling, let them know you’d like to call them, and a few following steps before you can see and talk to each other.

WebRTC signalling
WebRTC signalling

While this allows services to provide their own mechanisms to manage how WebRTC calls work, the lack of a standard mechanism means that general-purpose applications need to individually integrate each service that they want to support. For example, GStreamer’s webrtcsrc and webrtcsink elements support various signalling protocols, including Janus Video Rooms, LiveKit, and Amazon Kinesis Video Streams.

However, having a standard way for clients to do signalling would help developers focus on their application and worry less about interoperability with different services.

Standardising Signalling

With this motivation, the IETF WebRTC Ingest Signalling over HTTPS (WISH) workgroup has been working on two specifications:

(author’s note: the puns really do write themselves :))

As the names suggest, the specifications provide a way to perform signalling using HTTP. WHIP gives us a way to send media to a server, to ingest into a WebRTC call or live stream, for example.

Conversely, WHEP gives us a way for a client to use HTTP signalling to consume a WebRTC stream – for example to create a simple web-based consumer of a WebRTC call, or tap into a live streaming pipeline.

WHIP and WHEP
WHIP and WHEP

With this view of the world, WHIP and WHEP can be used both for calling applications, but also as an alternative way to ingest or play back live streams, with lower latency and a near-ubiquitous real-time communication API.

In fact, several services already support this including Dolby Millicast, LiveKit and Cloudflare Stream.

WHIP and WHEP with GStreamer

We know GStreamer already provides developers two ways to work with WebRTC streams:

  • webrtcbin: provides a low-level API, akin to the PeerConnection API that browser-based users of WebRTC will be familiar with

  • webrtcsrc and webrtcsink: provide high-level elements that can respectively produce/consume media from/to a WebRTC endpoint

At Asymptotic, my colleagues Tarun and Sanchayan have been using these building blocks to implement GStreamer elements for both the WHIP and WHEP specifications. You can find these in the GStreamer Rust plugins repository.

Our initial implementations were based on webrtcbin, but have since been moved over to the higher-level APIs to reuse common functionality (such as automatic encoding/decoding and congestion control). Tarun covered our work in a talk at last year’s GStreamer Conference.

Today, we have 4 elements implementing WHIP and WHEP.

Clients

  • whipclientsink: This is a webrtcsink-based implementation of a WHIP client, using which you can send media to a WHIP server. For example, streaming your camera to a WHIP server is as simple as:
  • whepclientsrc: This is work in progress and allows us to build player applications to connect to a WHEP server and consume media from it. The goal is to make playing a WHEP stream as simple as:

The client elements fit quite neatly into how we might imagine GStreamer-based clients could work. You could stream arbitrary stored or live media to a WHIP server, and play back any media a WHEP server provides. Both pipelines implicitly benefit from GStreamer’s ability to use hardware-acceleration capabilities of the platform they are running on.

GStreamer WHIP/WHEP clients
GStreamer WHIP/WHEP clients

Servers

  • whipserversrc: Allows us to create a WHIP server to which clients can connect and provide media, each of which will be exposed as GStreamer pads that can be arbitrarily routed and combined as required. We have an example server that can play all the streams being sent to it.

  • whepserversink: Finally we have ongoing work to publish arbitrary streams over WHEP for web-based clients to consume this media.

The two server elements open up a number of interesting possibilities. We can ingest arbitrary media with WHIP, and then decode and process, or forward it, depending on what the application requires. We expect that the server API will grow over time, based on the different kinds of use-cases we wish to support.

GStreamer WHIP/WHEP server
GStreamer WHIP/WHEP server

This is all pretty exciting, as we have all the pieces to create flexible pipelines for routing media between WebRTC-based endpoints without having to worry about service-specific signalling.

If you’re looking for help realising WHIP/WHEP based endpoints, or other media streaming pipelines, don’t hesitate to reach out to us!

Asymptotic: A 2023 Review

It’s been a busy few several months, but now that we have some breathing room, I wanted to take stock of what we have done over the last year or so.

This is a good thing for most people and companies to do of course, but being a scrappy, (questionably) young organisation, it’s doubly important for us to introspect. This allows us to both recognise our achievements and ensure that we are accomplishing what we have set out to do.

One thing that is clear to me is that we have been lagging in writing about some of the interesting things that we have had the opportunity to work on, so you can expect to see some more posts expanding on what you find below, as well as some of the newer work that we have begun.

(note: I write about our open source contributions below, but needless to say, none of it is possible without the collaboration, input, and reviews of members of the community)

WHIP/WHEP client and server for GStreamer

If you’re in the WebRTC world, you likely have not missed the excitement around standardisation of HTTP-based signalling protocols, culminating in the WHIP and WHEP specifications.

Tarun has been driving our client and server implementations for both these protocols, and in the process has been refactoring some of the webrtcsink and webrtcsrc code to make it easier to add more signaller implementations. You can find out more about this work in his talk at GstConf 2023 and we’ll be writing more about the ongoing effort here as well.

Low-latency embedded audio with PipeWire

Some of our work involves implementing a framework for very low-latency audio processing on an embedded device. PipeWire is a good fit for this sort of application, but we have had to implement a couple of features to make it work.

It turns out that doing timer-based scheduling can be more CPU intensive than ALSA period interrupts at low latencies, so we implemented an IRQ-based scheduling mode for PipeWire. This is now used by default when a pro-audio profile is selected for an ALSA device.

In addition to this, we also implemented rate adaptation for USB gadget devices using the USB Audio Class “feedback control” mechanism. This allows USB gadget devices to adapt their playback/capture rates to the graph’s rate without having to perform resampling on the device, saving valuable CPU and latency.

There is likely still some room to optimise things, so expect to more hear on this front soon.

Compress offload in PipeWire

Sanchayan has written about the work we did to add support in PipeWire for offloading compressed audio. This is something we explored in PulseAudio (there’s even an implementation out there), but it’s a testament to the PipeWire design that we were able to get this done without any protocol changes.

This should be useful in various embedded devices that have both the hardware and firmware to make use of this power-saving feature.

GStreamer LC3 encoder and decoder

Tarun wrote a GStreamer plugin implementing the LC3 codec using the liblc3 library. This is the primary codec for next-generation wireless audio devices implementing the Bluetooth LE Audio specification. The plugin is upstream and can be used to encode and decode LC3 data already, but will likely be more useful when the existing Bluetooth plugins to talk to Bluetooth devices get LE audio support.

QUIC plugins for GStreamer

Sanchayan implemented a QUIC source and sink plugin in Rust, allowing us to start experimenting with the next generation of network transports. For the curious, the plugins sit on top of the Quinn implementation of the QUIC protocol.

There is a merge request open that should land soon, and we’re already seeing folks using these plugins.

AWS S3 plugins

We’ve been fleshing out the AWS S3 plugins over the years, and we’ve added a new awss3putobjectsink. This provides a better way to push small or sparse data to S3 (subtitles, for example), without potentially losing data in case of a pipeline crash.

We’ll also be expecting this to look a little more like multifilesink, allowing us to arbitrary split up data and write to S3 directly as multiple objects.

Update to webrtc-audio-processing

We also updated the webrtc-audio-processing library, based on more recent upstream libwebrtc. This is one of those things that becomes surprisingly hard as you get into it — packaging an API-unstable library correctly, while supporting a plethora of operating system and architecture combinations.

Clients

We can’t always speak publicly of the work we are doing with our clients, but there have been a few interesting developments we can (and have spoken about).

Both Sanchayan and I spoke a bit about our work with WebRTC-as-a-service provider, Daily. My talk at the GStreamer Conference was a summary of the work I wrote about previously about what we learned while building Daily’s live streaming, recording, and other backend services. There were other clients we worked with during the year with similar experiences.

Sanchayan spoke about the interesting approach to building SIP support that we took for Daily. This was a pretty fun project, allowing us to build a modern server-side SIP client with GStreamer and SIP.js.

An ongoing project we are working on is building AES67 support using GStreamer for FreeSWITCH, which essentially allows bridging low-latency network audio equipment with existing SIP and related infrastructure.

As you might have noticed from previous sections, we are also working on a low-latency audio appliance using PipeWire.

Retrospective

All in all, we’ve had a reasonably productive 2023. There are things I know we can do better in our upstream efforts to help move merge requests and issues, and I hope to address this in 2024.

We have ideas for larger projects that we would like to take on. Some of these we might be able to find clients who would be willing to pay for. For the ideas that we think are useful but may not find any funding, we will continue to spend our spare time to push forward.

If you made this this far, thank you, and look out for more updates!

To Conference Organisers Everywhere…

(well, not exactly everywhere …)

This is not an easy post for me to write, being a bit of a criticism / “you can do better” note for organisers of conferences that cater to a global community.

It’s not easy because most of the conferences I attend are community driven, and I have helped organise community conferences in the past. It is a thankless job, a labour of love, and you mostly do not get to enjoy the fruits of that labour as others do.

The problem is that these conferences end up inadvertently excluding members who live in, for lack of a better phrase, the Global South.

Visas

It always surprises me when I meet someone who doesn’t realise that I can’t just book tickets to go anywhere in the world. Not because this is information that everyone should be aware of, but because this is such a basic aspect of travel for someone like me. As a holder of an Indian passport, I need to apply for a visa to travel to … well most countries.

The list of countries that require a visa are clearly defined by post-colonial geopolitics, this is just a fact of life and not something I can do very much about.

Getting a Visa

Applying for a visa is a cumbersome and intrusive process that I am now used to. The process varies from country to country, but it’s usually something like:

  • Get an invitation letter from conference organisers
  • Book all the tickets and accommodation for the trip
  • Provide bank statements for 3-6 months, income tax returns for ~3 years (in India, those statements need attestation by the bank)
  • Maybe provide travel and employment history for the past few years (how many years depends on the country)
  • Get an appointment from the embassy of the country you’re traveling to (or their service provider)
  • Submit your passport and application
  • Maybe provide documentation that was not listed on the provider’s website
  • Wait for your passport with visa (if granted) to be mailed back to you

The duration of visa (that is how long you can stay in the country) depends on the country.

In the EU, I am usually granted a visa for the exact dates of travel (so there is no flexibility to change plans). The UK allows you to pay more for a longer visa.

The US and Canada grant multi-year visas that allow one to visit for up to 6 months by default (in the US, whether you are permitted to enter and how long you may stay are determined by the person at the border).

Timelines

Now we get to the crux of the problem: this process can take anywhere from a few days (if you are very lucky) to a few months (if you are not).

Appointments are granted by the embassy or the third party that countries delegate application collection to, and these may or may not be easily available. Post-pandemic, I’ve seen that several embassies just aren’t accepting visitor visa appointments or have a multi-month wait.

If you do get an appointment, the processing time can vary again. Sometimes, it’s a matter of a few days, sometimes a few weeks. A lot of countries I have applied to recommend submitting your application at least 6 weeks in advance (this is from the date of your visa appointment which might be several weeks in the future).

Conference Schedules

If you’re organising a conference, there are a few important dates:

  • When the conference dates are announced
  • When the call for participation goes out
  • When it ends
  • When speakers are notified
  • The conference itself

These dates are based on a set of complex factors — venue availability and confirmation, literally writing and publishing all the content of the website, paper committee availability, etc.

But if you’re in my position, you need at least 2-3 months between the first and the last step. If your attendance is conditional on speaking at the conference (for example, if your company will only sponsor you if you’re speaking), then you need a minimum of 2-3 months between when speakers are notified and the conference starts.

From what I see, this is not something that is top-of-mind for conference organisers. That may happen for a host of perfectly understandable reasons, but it also has a cost to the community and individuals who might want to participate.

Other Costs

Applying for a visa costs money. This can be anything from a few hundred to over a 1000 US dollars.

It also costs you time — filling in the application, getting all the documentation in place, getting a physical visa photo (must be no older than 6 months), traveling to an appointment, waiting in line, etc. This can easily be a matter of a day if not more.

Finally, there is an emotional cost to all this — there is constant uncertainty during the process, and a visa rejection means every visa you apply for thereafter needs you to document that rejection and reason. And you may find out just days before your planned travel whether you get to travel or not.

What Can One Do?

All of this clearly sucks, but the problem of visas is too big and messy for any of us to have any real impact on, at least in the short term. But if you’re organising a conference, and you want a diverse audience, here are a few things you can do:

  • Announce the dates of the conference as early as possible (allows participants to book travel, visa appointments, maybe club multiple conferences)
  • Provide invitation letters in a timely manner
  • Call for participation as early as possible
  • Notify speakers as soon as you can

I know of conferences that do some if not all of these things — you know who you are and you have my gratitude for it.

If you made it this far, thank you for reading.

GStreamer for your backend services

For the last year and a half, we at Asymptotic have been working with the excellent team at Daily. I’d like to share a little bit about what we’ve learned.

Daily is a real time calling platform as a service. One standard feature that users have come to expect in their calls is the ability to record them, or to stream their conversations to a larger audience. This involves mixing together all the audio/video from each participant and then storing it, or streaming it live via YouTube, Twitch, or any other third-party service.

As you might expect, GStreamer is a good fit for building this kind of functionality, where we consume a bunch of RTP streams, composite/mix them, and then send them out to one or more external services (Amazon’s S3 for recordings and HLS, or a third-party RTMP server).

I’ve written about how we implemented this feature elsewhere, but I’ll summarise briefly.

This is a slightly longer post than usual, so grab a cup of your favourite beverage, or jump straight to the summary section for the tl;dr.

Read More

Update from the PipeWire hackfest

As the third and final day of the PipeWire hackfest draws to a close, I thought I’d summarise some of my thoughts on the goings-on and the future.

Thanks

Before I get into the details, I want to send out a big thank you to:

  • Christian Schaller for all the hard work of organising the event and Wim Taymans for the work on PipeWire so far (and in the future)
  • The GNOME Foundation, for sponsoring the event as a whole
  • Qualcomm, who are funding my presence at the event
  • Collabora, for sponsoring dinner on Monday
  • Everybody who attended and participate for their time and thoughtful comments

Background

For those of you who are not familiar with it, PipeWire (previously Pinos, previously PulseVideo) was Wim’s effort at providing secure, multi-program access to video devices (like webcams, or the desktop for screen capture). As he went down that rabbit hole, he wrote SPA, a lightweight general-purpose framework for representing a streaming graph, and this led to the idea of expanding the project to include support for low latency audio.

The Linux userspace audio story has, for the longest time, consisted of two top-level components: PulseAudio which handles consumer audio (power efficiency, wide range of arbitrary hardware), and JACK which deals with pro audio (low latency, high performance). Consolidating this into a good out-of-the-box experience for all use-cases has been a long-standing goal for myself and others in the community that I have spoken to.

An Opportunity

From a PulseAudio perspective, it has been hard to achieve the 1-to-few millisecond latency numbers that would be absolutely necessary for professional audio use-cases. A lot of work has gone into improving this situation, most recently with David Henningsson’s shared-ringbuffer channels that made client/server communication more efficient.

At the same time, as application sandboxing frameworks such as Flatpak have added security requirements of us that were not accounted for when PulseAudio was written. Examples including choosing which devices an application has access to (or can even know of) or which applications can act as control entities (set routing etc., enable/disable devices). Some work has gone into this — Ahmed Darwish did some key work to get memfd support in PulseAudio, and Wim has prototyped an access-control mechanism module to enable a Flatpak portal for sound.

All this said, there are still fundamental limitations in architectural decisions in PulseAudio that would require significant plumbing to address. With Wim’s work on PipeWire and his extensive background with GStreamer and PulseAudio itself, I think we have an opportunity to revisit some of those decisions with the benefit of a decade’s worth of learning deploying PulseAudio in various domains starting from desktops/laptops to phones, cars, robots, home audio, telephony systems and a lot more.

Key Ideas

There are some core ideas of PipeWire that I am quite excited about.

The first of these is the graph. Like JACK, the entities that participate in the data flow are represented by PipeWire as nodes in a graph, and routing between nodes is very flexible — you can route applications to playback devices and capture devices to applications, but you can also route applications to other applications, and this is notionally the same thing.

The second idea is a bit more radical — PipeWire itself only “runs” the graph. The actual connections between nodes are created and managed by a “session manager”. This allows us to completely separate the data flow from policy, which means we could write completely separate policy for desktop use cases vs. specific embedded use cases. I’m particularly excited to see this be scriptable in a higher-level language, which is something Bastien has already started work on!

A powerful idea in PulseAudio was rewinding — the ability to send out huge buffers to the device, but the flexibility to rewind that data when things changed (a new stream got added, or the stream moved, or the volume changed). While this is great for power saving, it is a significant amount of complexity in the code. In addition, with some filters in the data path, rewinding can break the algorithm by introducing non-linearity. PipeWire doesn’t support rewinds, and we will need to find a good way to manage latencies to account for low power use cases. One example is that we could have the session manager bump up the device latency when we know latency doesn’t matter (Android does this when the screen is off).

There are a bunch of other things that are in the process of being fleshed out, like being able to represent the hardware as a graph as well, to have a clearer idea of what is going on within a node. More updates as these things are more concrete.

The Way Forward

There is a good summary by Christian about our discussion about what is missing and how we can go about trying to make a smooth transition for PulseAudio users. There is, of course, a lot to do, and my ideal outcome is that we one day flip a switch and nobody knows that we have done so.

In practice, we’ll need to figure out how to make this transition seamless for most people, while folks with custom setup will need to be given a long runway and clear documentation to know what to do. It’s way to early to talk about this in more specifics, however.

Configuration

One key thing that PulseAudio does right (I know there are people who disagree!) is having a custom configuration that automagically works on a lot of Intel HDA-based systems. We’ve been wondering how to deal with this in PipeWire, and the path we think makes sense is to transition to ALSA UCM configuration. This is not as flexible as we need it to be, but I’d like to extend it for that purpose if possible. This would ideally also help consolidate the various methods of configuration being used by the various Linux userspaces.

To that end, I’ve started trying to get a UCM setup on my desktop that PulseAudio can use, and be functionally equivalent to what we do with our existing configuration. There are missing bits and bobs, and I’m currently focusing on the ones related to hardware volume control. I’ll write about this in the future as the effort expands out to other hardware.

Onwards and upwards

The transition to PipeWire is unlikely to be quick or completely-painless or free of contention. For those who are worried about the future, know that any switch is still a long way away. In the mean time, however, constructive feedback and comments are welcome.

Applicative Functors for Fun and Parsing

PSA: This post has a bunch of Haskell code, but I’m going to try to make it more broadly accessible. Let’s see how that goes.

I’ve been proceeding apace with my 3rd year in Abhinav’s Haskell classes at Nilenso, and we just got done with the section on Applicative Functors. I’m at that point when I finally “get” it, so I thought I’d document the process, and maybe capture my a-ha moment of Applicatives.

I should point out that the ideas and approach in this post are all based on Abhinav’s class material (and I’ve found them really effective in understanding the underlying concepts). Many thanks are due to him, and any lack of clarity you find ahead is in my own understanding.

Functors and Applicatives

Functors represent a type or a context on which we can meaningfully apply (map) a function. The Functor typeclass is pretty straightforward:

Easy enough. fmap takes a function that transforms something of type a to type b and a value of type a in a context f. It produces a value of type b in the same context.

The Applicative typeclass adds two things to Functor. Firstly, it gives us a means of putting things inside a context (also called lifting). The second is to apply a function within a context.

We can see pure lifts a given value into a context. The apply function (<*>) intuitively looks like fmap, with the difference that the function is within a context. This becomes key when we remember that Haskell functions are curried (and can thus be partially applied). This would then allow us to write something like:

This function takes two numbers in the Maybe context (that is, they either exist, or are Nothing), and adds them. The result will be the sum if both numbers exist, or Nothing if either or both do not.

Go ahead and convince yourself that it is painful to express this generically with just fmap.

Parsers

There are many ways of looking at what a parser is. Let’s work with one definition: A parser,

  • Takes some input
  • Converts some or all of it into something else if it can
  • Returns whatever input was not used in the conversion

How do we represent something that converts something to something else? It’s a function, of course. Let’s write that down as a type:

This more or less directly maps to what we just said. A Parser is a data type which has two type parameters — an input type and an output type. It contains a function that takes one argument of the input type, and produces a tuple of Maybe the output type (signifying if parsing succeeded) and the rest of the input.

We can name the field runParser, so it becomes easier to get a hold of the function inside our Parser type:

Parser combinators

The “rest” part is important for the reason that we would like to be able to chain small parsers together to make bigger parsers. We do this using “parser combinators” — functions that take one or more parsers and return a more complex parser formed by combining them in some way. We’ll see some of those ways as we go along.

Parser instances

Before we proceed, let’s define Functor and Applicative instances for our Parser type.

The intuition here is clear — if I have a parser that takes some input and provides some output, fmaping a function on that parser translates to applying that function on the output of the parser.

The Applicative instance is a bit more involved than Functor. What we’re doing first is “running” the first parser which gives us the function we want to apply (remember that this is a curried function, so rather than parsing out a function, we are most likely parsing out a value and creating a function with that). If we succeed, then we run the second parser to get a value to apply the function to. If this is also successful, we apply the function to the value, and return the result within the parser context (i.e. the result, and the rest of the input).

Implementing some parsers

Now let’s take our new data type and instances for a spin. Before we write a real parser, let’s write a helper function. A common theme while parsing a string is to match a single character on a predicate — for example, “is this character an alphabet”, or “is this character a semi-colon”. We write a function to take a predicate and return the corresponding parser:

Now let’s try to make a parser that takes a string, and if it finds a ASCII digit character, provides the corresponding integer value. We have a function from the Data.Char module to match ASCII digit characters — isDigit. We also have a function to take a digit character and give us an integer — digitToInt. Putting this together with satisfy above.

And that’s it! Note how we used our higher-order satisfy function to match a ASCII digit character and the Functor instance to apply digitToInt to the result of that parser (reminder: <$> is just the infix form of writing fmap — this is the same as fmap digitToInt (satisfy digit).

Another example — a character parser, which succeeds if the next character in the input is a specific character we choose.

Once again, the satisfy function makes this a breeze. I must say I’m pleased with the conciseness of this.

Finally, let’s combine character parsers to create a word parser — a parser that succeeds if the input is a given word.

A match on an empty word always succeeds. For anything else, we just break down the parser to a character parser of the first character and a recursive call to the word parser for the rest. Again, note the use of the Functor and Applicative instance. Let’s look at the type signature of the (:) (list cons) function, which prepends an element to a list:

The function takes two arguments — a single element of type a, and a list of elements of type a. If we expand the types some more, we’ll see that the first argument we give it is a Parser String Char and the second is a Parser String [Char] (String is just an alias for [Char]).

In this way we are able to take the basic list prepend function and use it to construct a list of characters within the Parser context. (a-ha!?)

JSON

JSON is a relatively simple format to parse, and makes for a good example for building a parser. The JSON website has a couple of good depictions of the JSON language grammar front and center.

So that defines our parser problem then — we want to read a string input, and convert it into some sort of in-memory representation of the JSON value. Let’s see what that would look like in Haskell.

The JSON specification does not really tell us what type to use for numbers. We could just use a Double, but to make things interesting, we represent it as an arbitrary precision floating point number.

Note that the JsonArray and JsonObject constructors are recursive, as they should be — a JSON array is an array of JSON values, and a JSON object is a mapping from string keys to JSON values.

Parsing JSON

We now have the pieces we need to start parsing JSON. Let’s start with the easy bits.

null

To parse a null we literally just look for the word “null”.

The $> operator is a flipped shortcut for fmap . const — it evaluates the argument on the left, and then fmaps the argument on the right onto it. If the word "null" parser is successful (Just "null"), we’ll fmap the JsonValue representing null to replace the string "null" (i.e. we’ll get a (Just JsonNull, <rest of the input>)).

true and false

First a quick detour:

The Alternative instance is easy to follow once you understand Applicative. We define an empty parser that matches nothing. Then we define the alternative operator (<|>) as we might intuitively imagine.

We run the parser given as the first argument first, if it succeeds we are done. If it fails, we run the second parser on the whole input again, if it succeeds, we return that value. If both fail, we return Nothing.

Parsing true and false with this in our belt looks like:

We are easily able express the idea of trying to parse for the string “true”, and if that fails, trying again for the string “false”. If either matches, we have a boolean value, if not, Nothing. Again, nice and concise.

String

This is only slightly more complex. We need a couple of helper functions first:

hexDigit is easy to follow. It just matches anything from 0-9 and a-f or A-F.

digitsToNumber is a pure function that takes a list of digits, and interprets it as a number in the given base. We do some jumping through hoops with fromIntegral to take Int digits (mapping to a normal word-sized integer) and produce an Integer (arbitrary sized integer).

Now follow along one line at a time:

A string is a valid JSON character, surrounded by quotes. The *> and <* operators allow us to chain parsers whose output we wish to discard (since the quotes are not part of the actual string itself). The many function comes from the Alternative typeclass. It represents zero or more instances of context. In our case, it tries to match zero or more jsonChar parsers.

So what does jsonChar do? Following the definition of a character in the JSON spec, first we try to match something that is not a quote ("), a backslash (\) or a control character. If that doesn’t match, we try to match the various escape characters that the specification mentions.

Finally, if we get a \u followed by 4 hexadecimal characters, we put them in a list (replicateM 4 hexDigit chains 4 hexDigit parsers and provides the output as a list), convert that list into a base 16 integer (digitsToNumber), and then convert that to a Unicode character (chr).

The order of chaining these parsers does matter for performance. The first parser in our <|> chain is the one that is most likely (most characters are not escaped). This follows from our definition of the Alternative instance. We run the first parser, then the second, and so on. We want this to succeed as early as possible so we don’t run more parsers than necessary.

Arrays

Arrays and objects have something in common — they have items which are separated by some value (commas for array values, commas for each key-value pair in an object, and colons separating keys and values). Let’s just factor this commonality out:

We take a parser for our values (v), and a parser for our separator (s). We try to parse one or more v separated by s, and or just return an empty list in the parser context if there are none.

Now we write our JSON array parser as:

Nice, that’s really succinct. But wait! What is json?

Putting it all together

We know that arrays contain JSON values. And we know how to parse some JSON values. Let’s try to put those together for our recursive definition:

And that’s it!

The JSON object and number parsers follow the same pattern. So far we’ve ignored spaces in the input, but those can be consumed and ignored easily enough based on what we’ve learned.

You can find the complete code for this exercise on Github.

Some examples of what this looks like in the REPL:

Concluding thoughts

If you’ve made it this far, thank you! I realise this is long and somewhat dense, but I am very excited by how elegantly Haskell allows us to express these ideas, using fundamental aspects of its type(class) system.

A nice real world example of how you might use this is the optparse-applicative package which uses these ideas to greatly simplify the otherwise dreary task of parsing command line arguments.

I hope this post generates at least some of the excitement in you that it has in me. Feel free to leave your comments and thoughts below.

A Late GUADEC 2017 Post

It’s been a little over a month since I got back from Manchester, and this post should’ve come out earlier but I’ve been swamped.

The conference was absolutely lovely, the organisation was a 110% on point (serious kudos, I know first hand how hard that is). Others on Planet GNOME have written extensively about the talks, the social events, and everything in between that made it a great experience. What I would like to write about is about why this year’s GUADEC was special to me.

GNOME turning 20 years old is obviously a large milestone, and one of the main reasons I wanted to make sure I was at Manchester this year. There were many occasions to take stock of how far we had come, where we are, and most importantly, to reaffirm who we are, and why we do what we do.

And all of this made me think of my own history with GNOME. In 2002/2003, Nat and Miguel came down to Bangalore to talk about some of the work they were doing. I know I wasn’t the only one who found their energy infectious, and at Linux Bangalore 2003, they got on stage, just sat down, and started hacking up a GtkMozEmbed-based browser. The idea itself was fun, but what I took away — and I know I wasn’t the only one — is the sheer inclusive joy they shared in creating something and sharing that with their audience.

For all of us working on GNOME in whatever way we choose to contribute, there is the immediate gratification of shaping this project, as well as the larger ideological underpinning of making everyone’s experience talking to their computers better and free-er.

But I think it is also important to remember that all our efforts to make our community an inviting and inclusive space have a deep impact across the world. So much so that complete strangers from around the world are able to feel a sense of belonging to something much larger than themselves.

I am excited about everything we will achieve in the next 20 years.

(thanks go out to the GNOME Foundation for helping me attend GUADEC this year)

Sponsored by GNOME!

Stricter JSON parsing with Haskell and Aeson

I’ve been having fun recently, writing a RESTful service using Haskell and Servant. I did run into a problem that I couldn’t easily find a solution to on the magical bounty of knowledge that is the Internet, so I thought I’d share my findings and solution.

While writing this service (and practically any Haskell code), step 1 is of course defining our core types. Our REST endpoint is basically a CRUD app which exchanges these with the outside world as JSON objects. Doing this is delightfully simple:

That’s all it takes to get the basic type up with free serialization using Aeson and Haskell Generics. This is followed by a few more lines to hook up GET and POST handlers, we instantiate the server using warp, and we’re good to go. All standard stuff, right out of the Servant tutorial.

The POST request accepts a new object in the form of a JSON object, which is then used to create the corresponding object on the server. Standard operating procedure again, as far as RESTful APIs go.

The nice part about doing it like this is that the input is automatically validated based on types. So input like:

will result in:

Error in $: expected String, encountered Number

However, as this nice tour of how Aeson works demonstrate, if the input has keys that we don’t recognise, no error will be raised:

This behaviour would not be undesirable in use-cases such as mine — if the client is sending fields we don’t understand, I’d like for the server to signal an error so the underlying problem can be caught early.

As it turns out, making the JSON parsing stricter and catch missing fields is just a little more involved. I didn’t find how this could be done in a single place on the Internet, so here’s the best I could do:

The idea is quite straightforward, and likely very easy to make generic. The Data.Data module lets us extract the constructor for the Job type, and the list of fields in that constructor. We just make sure that’s an exact match for the list of keys in the JSON object we parsed, and that’s it.

Of course, I’m quite new to the Haskell world so it’s likely there are better ways to do this. Feel free to drop a comment with suggestions! In the mean time, maybe this will be useful to others facing a similar problem.

Update: I’ve fixed parseJSON to properly use fieldLabelModifier from the default options, so that comparison actually works when you’re not using Aeson‘s default options. Thanks to /u/tathougies for catching that.

I’m also hoping to rewrite this in generic form using Generics, so watch this space for more updates.