New web-based control surface and remote collaboration for Ardour

Hi everyone! I'm a long-time lurker here (since 2008 or so) and a big-time Ardour fan; I've used it as my sole replacement for Nuendo and Cubase ever since I learned about it. Apologies in advance for the novel - normally I am much more terse in my communication, but I have a lot to share today!

So the thing with the recent leap in capabilities of generative AI is that it is super easy to nerd-snipe yourself - this started two weekends ago with a self-imposed challenge of making a polished auto-vocoder LV2 plugin with Claude Code, because I was having trouble finding one that I liked. About 30 minutes later I had one working, I was super excited, and then the challenge escalated into "can I wire an AI agent directly into Ardour, and can I make Ardour kid-friendly, since my children love playing with the audio effects on a live-monitored mic?"

The original challenge of putting an AI agent into Ardour was back-burnered (I do this for my day job, so it isn't all that novel), as was making Ardour kid-friendly, but all of the plumbing is now in place to make either of these relatively low lifts. And what came out of this week-long AI-induced manic episode is:

Foyer Studio

Named for the little studio I built in the foyer for my kids to learn audio engineering (and also an etymological link: ardour and foyer have common Latin roots that literally mean flame/fire/heat/burning and hearth/fireplace, respectively).

This would not have been possible without all of the work Paul Davis, Robin Gareus and all of the others put in over the last 20+ years on JACK and Ardour, so go donate!

So what is this?

It's a web-native control surface that remotely projects a DAW, with real-time collaborative tooling (e.g. multiple people patched into the same session from anywhere in the world), and it currently has one back end: Ardour. The user interface runs in a web browser and can be shared with session collaborators over Cloudflare tunnels for low-ceremony, real-time remote collaboration. The use cases I envisioned are:

  • Sharing a mix in progress with a performer and being able to make changes in real time with their feedback, without them being physically present (yes, I know screen shares with audio exist, but this was more fun!)
  • A performer laying a track remotely
  • Real-time collaboration on the mix, effects, timeline, instruments, etc.
  • Providing accessible hackability for the DAW's user interface - by reprojecting the state of a DAW like Ardour into a web-based UI, tons of possibilities open up for satisfying use cases you just couldn't in a pro-grade DAW UI, like making Ardour a MIDI-only application a la Cakewalk 3, making a transport remote for your phone so you can engineer your own performances with ease, making a kid-friendly, touch-friendly interface, or:
  • Extending the core DAW with new feature compositions - for example, I created a simple beat and piano-roll sequencer that generates MIDI from the sequencer and saves the sequencer data in the region's data extension in the .ardour file, because my 8-year-old loves playing with Hydrogen and frankly so do I

Here are some videos of it in action:

Installation from scratch + tracking from browser in a new project
Loading an existing project, adding beats and music with sequencer, instrument selection
Remote collaboration over Cloudflare tunnels (free, no account required!), plugins, floating and tiled window management, right dock with tear-out FABs

How is it built?

Let me be straight up - this project was vibe coded AF - no way does one person with minimal domain knowledge of the inner workings of a DAW build something like this in their free time, from ideation to fully functioning prototype over the course of 9 days, without having Claude Opus (and Kimi K2.6 too) do the heavy lifting. Let me rephrase that - this was agentically engineered. I do have strong opinions on the tech stack used, the architecture, the design, the look and feel, and the features, so those all came from me. The implementation and execution of those opinions is all clankers.

The core app consists of a Rust/axum API, foyer, that speaks to the Ardour back end via an IPC channel (MessagePack over a Unix socket). This IPC carries a DAW-agnostic payload consisting of commands, state events, mutations, lifecycle events, audio streams, object properties (e.g. mixer channels, automation, MIDI rolls), session management, and more from a "shim" - basically a C++ control surface plugin compiled against libardour.so. The shim translates between foyer's agnostic format and Ardour's internal data structures, function calls, etc., and taps into Ardour's internal audio streams so we can get audio in and out.
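
To make the shape of that IPC concrete, here is a minimal sketch of a length-prefixed MessagePack frame carrying a DAW-agnostic message, written in Rust with serde and rmp-serde. Every type and field name below is an illustrative assumption on my part, not foyer's actual wire format:

// Hypothetical sketch of the shim IPC: length-prefixed MessagePack
// frames over a Unix socket. Names are illustrative, not foyer's.
use serde::{Deserialize, Serialize};
use std::io::Write;
use std::os::unix::net::UnixStream;

#[derive(Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "snake_case")]
enum Message {
    // UI -> DAW: mutate a mixer channel property
    SetChannelGain { channel_id: u32, gain_db: f32 },
    // DAW -> UI: a state event broadcast to connected clients
    TransportMoved { sample_position: u64 },
}

fn send(sock: &mut UnixStream, msg: &Message) -> std::io::Result<()> {
    // Serialize the payload to MessagePack...
    let payload = rmp_serde::to_vec_named(msg)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))?;
    // ...then prefix it with a 4-byte big-endian length so the reader
    // knows where each frame ends.
    sock.write_all(&(payload.len() as u32).to_be_bytes())?;
    sock.write_all(&payload)
}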

AI-generated text diagram coming in 3…2…1:

 Ardour (or any DAW)
    │
    ▼
 ┌────────────────────────┐   Unix socket, MessagePack framing
 │ C++ shim               │  ───────────────────────────────┐
 │  (libfoyer_shim.so)    │                                 │
 │  · event translation   │                                 ▼
 │    only - no UX logic  │                    ┌──────────────────────┐
 └────────────────────────┘                    │ Rust sidecar         │
                                               │  · foyer-server      │
 ┌────────────────────────┐                    │  · foyer-backend-*   │
 │ Web UI (Lit + Tailwind)│ ◄── WS / HTTP ────►│  · foyer-schema      │
 │  · three-tier split    │                    │  · foyer-ipc         │
 │  · no shipping bundler │                    └──────────────────────┘
 └────────────────────────┘                                 │
                                                            ▼
                                               ┌──────────────────────┐
                                               │ Optional:            │
                                               │  foyer-desktop (wry) │
                                               └──────────────────────┘

The same architecture could apply to any DAW provided it has a rich enough SDK or is hackable enough, but Ardour has my :heart: so that was my target; if anyone wants to try making a shim for Bitwig or REAPER, I'll be accepting PRs.

The UI lives on the other side of the foyer API and speaks to it over standard HTTP for static assets (the UI's JavaScript, HTML, CSS, and images) and over WebSockets for streaming state and audio I/O to the UI, using the same serialized structures the shim speaks, but as JSON payloads. Audio is sent as an Opus-compressed stream by default, with the option to switch to uncompressed streaming.
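
As a rough sketch of that transport split (using axum and tower-http; the route paths and handler names here are my assumptions, not foyer's actual API surface):

// Sketch of the browser-facing split described above: plain HTTP for
// static assets, one WebSocket for JSON state, another for audio.
// Route paths and handlers are assumptions, not foyer's real API.
use axum::extract::ws::{Message, WebSocket, WebSocketUpgrade};
use axum::{routing::get, Router};
use tower_http::services::ServeDir;

fn app() -> Router {
    Router::new()
        // Static assets (the UI's JS/HTML/CSS) over plain HTTP.
        .nest_service("/", ServeDir::new("web"))
        // JSON state events and commands.
        .route("/ws/state", get(|u: WebSocketUpgrade| async move {
            u.on_upgrade(state_socket)
        }))
        // Audio I/O as binary frames (Opus packets by default).
        .route("/ws/audio", get(|u: WebSocketUpgrade| async move {
            u.on_upgrade(audio_socket)
        }))
}

async fn state_socket(mut ws: WebSocket) {
    // The JSON payloads reuse the same serde structures as the shim IPC.
    let _ = ws.send(Message::Text(r#"{"kind":"hello"}"#.into())).await;
}

async fn audio_socket(mut ws: WebSocket) {
    // Each binary message would carry one encoded audio packet.
    let packet: Vec<u8> = Vec::new();
    let _ = ws.send(Message::Binary(packet.into())).await;
}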

The UI was made with zero-build JavaScript using Lit as the primary framework (i.e. standard web components and shadow DOM), Tailwind CSS, and a few vendored helper libraries, and that's really it. It is intentionally super hackable: the core logic for audio and event I/O over WebSockets lives in a stand-alone library with no UI/DOM code at all, and the rest of the UI lives in downstream projects, totally separated. The UI is written to ~/.local/share/foyer/web on first launch and can be updated and served out in real time.

There are a handful of other architecturally novel things: built-in tunnels (Cloudflare and ngrok, with ngrok untested) for instantly sharing a session, using RBAC-based single-use user accounts with scoped permissions (e.g. viewer, performer, session controller, admin) that can be shared in 3 clicks for the first collaborator and 1 click for each additional one; relay chat and push-to-talk audio conferencing; remote audio ingress so a performer can lay down a track from afar (with self-monitoring disabled because of the additional latency); an advanced tiling window manager in the shipping UI; and some other things that are escaping me at the moment.
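
For a feel of what those scoped, single-use grants could look like, here is a hypothetical sketch in Rust; the role names come from the list above, but everything else is my invention, not foyer's actual model:

// Hypothetical sketch of RBAC-scoped, single-use share grants.
// Role names match the post; the rest is invented for illustration.
use std::time::Instant;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Role {
    Viewer,            // read-only projection of the session
    Performer,         // may stream audio ingress and lay down tracks
    SessionController, // may edit the mix, regions, and transport
    Admin,             // may mint and revoke grants
}

struct ShareGrant {
    token: String,   // random secret embedded in the shared URL
    role: Role,
    expires: Instant,
    used: bool,      // single-use: consumed on the first successful join
}

impl ShareGrant {
    /// Redeem the grant, handing out the granted role exactly once.
    fn redeem(&mut self, presented: &str) -> Option<Role> {
        if self.used || presented != self.token || Instant::now() >= self.expires {
            return None;
        }
        self.used = true;
        Some(self.role)
    }
}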

Licensing: the C++ shim links libardour and is GPL-2.0+ as required. The Rust sidecar and web UI sit above the IPC boundary and ship under Apache-2.0.

What actually works?

Anything you saw in the demo videos was legit - not production-grade, but usable as seen ("it works on my machine", haha). Unfortunately I wasn't able to figure out how to record the browser audio with the screen recordings on a Mac, but it was mostly me singing into a Dell speakerphone, so you didn't miss much. I will open issues on GitHub to track things that are broken and fix them as I can. There are rough edges, weird quirks, and other things that will need to be sorted over time.

There are a ton of features in Ardour, and I am targeting the most common/visible subset to make something compelling but pragmatic that taps into that ecosystem; it would be insane to try to chase 20+ years of features off the bat. My goal isn't to recreate Ardour in a web browser, but to complement it with new collaboration abilities and unlock a whole world of UX possibilities and feature extensibility.

What is left to do?

The DAW-specific agentic harness was a big one I wanted to do; I just had to build all of this other stuff around it to make it useful and practical. There is a nice set of MCP tools being added to Ardour's core which will definitely be useful for an agent, and I have additional tools I'd want to add, like spectrography, waveform visualizations and audio snippets (for multimodal models), plain-English descriptions of mixer states, plugin states, track states and more, and UI integration.

Others include:

  • A touch-friendly UI, both big and small (e.g. transport remote from your phone)
  • A kid-friendly UI
  • Exposing additional Ardour features that were omitted (punches, cross-fades, z-indexes for track stacking, markers, dynamic tempos and time signatures, splicing, snapping, video integration, tons of stuff on the instrumentation and plugins side)
  • Test the Linux installation (macOS is thoroughly tested)
  • MIDI instrumentation is missing some critical functionality like a functioning patch selector, and there are a few bugs with the piano roll, synth instruments and region resizing that need a once-over to get this usable; drum kits seem to work well though
  • Make sure this works on real hardware over a remote connection, not just the 48 kHz audio from my MBP
  • Keyboard-first session navigation (limited to a handful of functions now)
  • Multi-window support for multiple monitors
  • Tighter session recovery/reattachment logic
  • Reliable attachment to running GUI instances of Ardour
  • More advanced UI-side presets management, templates, power tools for reducing clicks
  • Front-end visualizations - with WebGL and WebGPU as first-class APIs we can create rich accelerated visualizations that look and feel awesome
  • Building, packaging, and supporting more platforms (currently just Linux and macOS arm64)
  • Temporally accurate syncing of audio and controls (e.g. delay control movement by the latency incurred from the audio stack)
  • Temporally accurate offsetting for regions recorded remotely via foyer audio ingress

So how do I run it?

curl -fsSL https://raw.githubusercontent.com/hotspoons/foyer-studio/main/install.sh | bash -s -- --latest-ci

(Currently Mac arm64 and Linux arm64/amd64 are supported; this pulls the latest successful build from the main branch)

Then source your shell's rc file or spawn a new shell, and run foyer serve to launch it on http://localhost:3838. The config file, web UI, and more live under ~/.local/share/foyer and will be present after the first launch.

NOTE: the shim is currently built against the Ardour 9.2 ABI and will not be compatible with other versions of Ardour. If you want to build for a different tag, branch or ref, follow these instructions, though code changes may be required in the shim if the interfaces the shim calls differ in the target ref.

For development, clone the repository and open the dev container.

Sounds like a fun project.

Please see Ardour Development | Ardour DAW - It is unlikely that you can license AI/LLM-generated code in terms of the GPL, and hence you cannot use an AI/LLM to write a control surface for Ardour…

On the upside, recent Ardour/git has an MCP server, which will be included with the 9.3 release, and it looks like it can replace your shim.

Roger that, and I kept the shim as an out-of-tree downstream library instead of trying to put this in a PR because of the uncertainty around copyleft requirements and LLM-generated code.

I've been tracking the MCP tools in Ardour for a bit - are you planning on adding visualizations (e.g. spectrograms, heat maps, waveforms) and audio snippets for multimodal models?

It seems like a great project and I could use those remote collaboration features with my band member.

The biggest concern for a vibe-coded project is whether it can be maintained. The dependencies for your project will change in unexpected ways and you have to keep the project updated all the time. It's a long-time commitment and sometimes a lot of work. I know this 'cause I have maintained an open source project for 14 years now.

Also, one should make a project like this hardened against attacks. It can be as simple as asking the AI to play the role of a pen tester and find vulnerabilities in the code. Then just ask it to fix what it finds.

I'm finding your project very interesting and will keep an eye on it. I definitely can see some use cases for it.

Good work :slight_smile:

Unlikely. How would that even work? Waveforms are zoom dependent, and not generally useful for an LLM. I also don't know what a heat map is in a DAW context.

But yes, raw audio data, or the output of data analysis (spectrum, loudness, etc.) can be sent via JSON for the remote end to interpret or visualize.

I say "vibe coded" tongue-in-cheek - the entire back-end codebase (C++ shim and Rust back end) was constantly reviewed, refactored, and rearchitected as it was generated (the front end not so much) - the future where someone who isn't working inside a frontier AI lab can prompt a project like this in one shot is still a ways off. And good call on the AI security audit; I normally do this as I develop, but I'll give it a once-over RN.

Maintenance is always a pitfall with any project; I originally made this for my kids to have a streamlined experience so they could have fun learning audio engineering without needing to learn how to operate a pro-grade DAW. If other folks find it useful I would definitely maintain it, as it was rewarding to make.

Re. waveforms, you'd expose the range and track selection as tool call parameters to let the LLM get the framing it needs for iterative project exploration. By heat map (or flame graph) I just mean categorized events over time, visualized with context like a legend, so an LLM could easily navigate a project. The utility of any visualization or audio data depends on how much representation exists in a model's training data, plus whether it had any SFT or RL specific to or adjacent to a specialty like audio engineering. Multimodal models do see millions of spectrograms and waveform visualizations during training runs, so I wouldn't discount the potential benefits here.
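
For instance, a waveform tool's parameters might look something like this (a hypothetical schema sketched in Rust, not Ardour's actual MCP tool surface):

// Hypothetical parameters for a waveform-rendering tool call,
// sketching the "range and track selection" framing described above.
// Not Ardour's actual MCP tool surface.
use serde::Deserialize;

#[derive(Deserialize)]
struct WaveformToolParams {
    track_ids: Vec<u32>,       // which tracks to render
    start_sample: u64,         // start of the requested range
    end_sample: u64,           // end of the requested range
    width_px: u32,             // raster width, letting the model pick its own zoom
    include_spectrogram: bool, // also return a spectrogram image for multimodal input
}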

Have you tried anything like promptfoo to evaluate tooling prompts, tooling surface and schemas for efficacy?

I'll be honest, I am pretty damn impressed (not that that means much). There is a huge amount of work likely left to be done on things like plugins, custom UIs, etc. for true remote collaboration, but the amount of work already done is very impressive as well, as this is essentially a new UI for Ardour built using web technologies, with frankly some niceties for basic pattern making etc. already in there.

I also note this is all over a local connection in the videos. I am curious how performant this is over the internet, and I am suspecting some aspects may not work quite as well, like synchronized cursors, etc., but then again I am not 100% certain they need to work either. Of course, things going through my head at the moment are things like independent playback timelines, which I believe would require backend work in Ardour itself to support, so that people could be working on, auditioning, etc. different parts of the timeline simultaneously for larger projects.

Got the wheels spinning in my head at least in terms of what could be possible. Don't get me wrong, I am still a bit cynical about the maintenance of vibe-coded projects, the safety of their operations, etc., but I also recognize that the chances of something like this existing without it are much smaller as well, licensing being an entirely different topic to contend with on that front.

True, but what are the peak files if not representations of the waveforms? Maybe the zoom doesn't need to be worried about; just send a representation of the waveform itself and let the client figure out how to display it. Then again, some form of this is apparently already possible, if I am watching those YouTube videos correctly.

The same might be said of things like spectrograms, etc., but yea, I also don't know what heat maps are in this context.

Seablade

So essentially a bit more organized Undo/Redo stack that could be navigated?

Seablade

Re. over the internet, the last demo video was going through a Cloudflare tunnel (probably in Ashburn or Manassas) in the client on the right, so usually about a ~10-15ms hop from my house over gig fiber. I also experimented over 5G with my phone and it seemed more like a ~30ms delay. I still need to work out how to sync the surface state and audio streams (two separate WebSockets; audio will have higher latency) on the client, but I figure this is pretty well sorted for things like Bluetooth audio devices syncing with video, so I was going to explore a solution later (probably in Ardour's core too).

Re. independent timelines, by design there is one and only one timeline per session. Multiple sessions can be opened in one foyer instance, but not multiple timelines per session (unless you open the same project multiple times, but then you would need to reconcile states later, which would be gnarly to figure out). Multiple people can work on different parts (e.g. mixes, effects, MIDI, timeline) of the same session simultaneously though! You probably just want one person controlling the transport.

Re. heat map, I was actually thinking of things like region starts/ends, various markers, punch-ins, automation points, things like that. The undo/redo stack would also be interesting!

Ardour's latency compensation already handles this case just fine, if playback latency is correctly reported to Ardour. Ardour's clocks and timecode generators align to the output (when audio reaches the speakers). This works well with Ardour's video timeline and externally synced devices.

Playback latency is the part I'd need to figure out, because I have two separate sockets connected to the web UI (one for state, one for audio I/O). The audio I/O will have higher latency than the surface state because by default it goes through an Opus codec (with a raw bypass), and when I wire in support for audio stream resampling in the back end, that will add additional latency as well.

I can probably figure this out by sticking regular sync frames in each stream and calculating offsets when they arrive on the client side, then reporting that as the playback latency (plus whatever else is stacked up in local buffers).
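
A sketch of that idea in Rust (hypothetical, not what foyer implements today): both sockets periodically carry a marker stamped with the same session sample position, the client records when each marker arrives, and the measured skew plus local buffering gets reported as playback latency:

// Hypothetical sync-frame latency estimation: both WebSockets carry
// markers stamped with the same session sample position; the client
// derives the audio stream's extra latency from the arrival skew.
use std::time::Instant;

struct SyncFrame {
    session_sample: u64, // DAW-side position when the frame was emitted
    arrived: Instant,    // client-side arrival time
}

fn playback_latency_ms(
    state: &SyncFrame,
    audio: &SyncFrame,
    local_buffer_ms: f64,
) -> Option<f64> {
    // Frames are only comparable if they mark the same emission point.
    if state.session_sample != audio.session_sample {
        return None;
    }
    // The audio frame arrives later by the codec/transport skew;
    // whatever sits in local playback buffers stacks on top.
    let skew = audio.arrived.duration_since(state.arrived);
    Some(skew.as_secs_f64() * 1000.0 + local_buffer_ms)
}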

The problem is that I am thinking of things like SFX vs. Music vs. Dialog, etc. for film. In those cases, yes, they may be interested in completely separate sections of the timeline.

Again, pie-in-the-sky dreaming I haven't put much thought into, and it would probably require a huge rework of the audio engine to handle, so I am not expecting anything; it just got me thinking is all.

Seablade