Is it possible to connect the Ardour GUI to a headless Ardour engine through a socket?

I am trying to build a turn-key mixer around Ardour, with a cheap laptop as a control surface and the processing done in a separate box containing a USB interface and a single-board computer (in practice, something like the Behringer XR18).

I read in the manual that Ardour has support for OSC and experimental support for websockets, but I was wondering if the stock Ardour UI could be hooked up through a UNIX or network socket? That would give me all of the flexibility of running Ardour locally with minimal glue. I read here that Ardour’s engine and UI are fully separate, so I was inclined to believe that this is doable, but I know reality is always more complicated.

I’m not very familiar with Ardour’s internals. Would this be practical for me to implement? Or, is there a show-stopper that would make this a non-starter?

Thanks,
–Tyler

No, the “stock” Ardour GUI is written as an application, not as a front end for a server.

They are “separable” at the code level in the sense that the backend does not know about the GUI (or any UI), which is what allows the headless version to exist. But they are not separable in the sense of running the GUI separately.

Otherwise, you could use some of the usual Unix tricks, like launching Ardour in an X11 VNC server (or using x11vnc to export an existing X11 session) and interacting with it through a VNC client from your local machine, your laptop, your tablet, phone…
The interaction is a little bit awkward because it is a remote session with the window manager of the machine running Ardour, but that is better than nothing.

Ah that clears it up, thanks. A man can dream I guess :​P

I had considered this! Was going to use waypipe until I realized that Ardour doesn’t support Wayland. My concern with VNC is mainly latency, but maybe X forwarding will offer good results. Will give it a shot and report back.

You don’t need VNC. Ardour is an X Window app, and can be run remotely just using X Window. I used to do this over WANs as a test sometimes. It works fine.

The bigger issue is that you tend to need a way to route both the GUI and the audio, and X Window doesn’t help with that (not sure if VNC does; a detached GUI process wouldn’t either)

In my case, I don’t need the audio on the laptop, just the controls, a la Behringer X-32 Edit. The box does the audio on its own. I’m not sure if I’m even taking the right approach, so I’ll elaborate on my goals in case anyone here knows a better way.

My primary goals are:

  1. Provide FoH and foldbacks for the performers
  2. Provide AES67 through Pipewire so I can run a quiet mix to other areas of the venue
    • most cheap laptops don’t support the hardware time-stamping needed for PTPv2 and there are no USB NICs available that do either (although the brand-new ASIX AX88279A gives me hope for the future!), so this is a non-starter without a separate box.
  3. Live multi-track recording
  4. Be physically small and portable, possibly battery powered (might be unrealistic)
  5. Be built out of junk I have lying around, and therefore be cheaper than an X32/XR18

This is just a “for-fun” project that’s mostly gonna be used for dirt-poor punk bands with zero budget, so cost is the most important part.

In practice, because of the tiny buffer sizes required to provide foldback monitoring with low enough latency (esp. for vocalists), using the laptop for other things (or even just touching the touchpad) causes XRUNs. (I am starting to realize how spoiled the mixer mfgs. are with their microcontrollers, RTOSes, and hardware DSPs lol)

In an attempt to solve this, I built a highly optimized Linux system that provides the bare minimum needed to run Ardour and nothing else, and I’ve found I can get the buffer size down to 32 with Pipewire before I get XRUNs (compared to 256 on my much more powerful workstation). This gave me an end-to-end latency of ~2.4 ms at 192 kHz with a Behringer UMC404HD, which I hope is low enough. On paper this should just mean the monitors feel ~2.5 ft further away than they really are (compared to the ~15 ft on my workstation), right?
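
As a rough check of the "monitors feel further away" estimate, latency can be converted to the distance sound travels in air over the same time. This is a minimal sketch assuming a speed of sound of 343 m/s (dry air at about 20 °C); the function name is made up for illustration.

```python
# Assumption: speed of sound in dry air at about 20 degrees C.
SPEED_OF_SOUND_M_PER_S = 343.0
FEET_PER_METRE = 3.28084


def latency_to_feet(latency_ms: float) -> float:
    """Distance (in feet) that sound travels through air in latency_ms milliseconds."""
    metres = SPEED_OF_SOUND_M_PER_S * (latency_ms / 1000.0)
    return metres * FEET_PER_METRE


if __name__ == "__main__":
    # 2.4 ms is the end-to-end figure measured above.
    print(f"2.4 ms is roughly {latency_to_feet(2.4):.1f} ft of extra distance")
```

Under that assumption, 2.4 ms works out to about 2.7 ft, which is in the same ballpark as the ~2.5 ft estimate.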

To avoid unnecessary load (especially CPU interrupts), I don’t want to do any graphical rendering or control on the device. Things are so tight that touching the touchpad causes dozens of XRUNs on my laptop. I’d imagine I could implement basic controls with OSC, but this feels like re-inventing the wheel so I wanted to avoid it if possible.

On a sidenote, I am worried this may be an X-Y problem. I’m concerned I’m over-estimating how much latency I can have before it becomes distracting to the performers. I have no one to test on except myself, so I don’t have a frame of reference. If there was a way that I could have a few more ms of wiggle room, I’d imagine this would be significantly easier. Am I making things harder than I need to? I would love to have room for plugins like a compressor/limiter/EQ, etc. but I don’t know how practical that is given my constraints.

For now, I’ll have to give X-Forwarding a shot. It’s a bit clunky but I think it’s my best option. I would greatly appreciate any other advice if you have any, because this seems like thoroughly uncharted territory.

Thanks,
–Tyler

This is a sign of either poor hardware or a misconfigured system. With the right hardware and a properly configured system, interacting with e.g. the trackpad cannot interfere with realtime threads handling audio. That said, finding the right hardware in generic PCs let alone laptops is quite a challenge, and fully configuring it can also be quite a lot of work. Which is another reason why Apple systems are popular for this sort of thing, because they Just Work ™.

On the laptop (Dell Latitude 5440), I have done the following, with only mild improvement:

  • Disabled SMT
  • Added threadirqs and preempt=full to the kernel cmdline (using linux-lts from Arch)
  • Ensured kernel was built with CONFIG_NO_HZ=y (etc.) and CONFIG_HIGH_RES_TIMERS=y
  • Ensured the scaling governor is set to performance
  • Ensured Pipewire is running with realtime privileges
  • Decreased swappiness (although I am not memory constrained)
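
Several of the settings in this list can be read back from `/proc` and `/sys`, which is handy after a reboot or a kernel update. The sketch below is a rough checker, not a replacement for rtcqs; the paths are the standard Linux ones as far as I know, and any value that cannot be read is reported as `None`.

```python
from pathlib import Path
from typing import Optional


def cmdline_flags(cmdline: str) -> dict:
    """Report whether the two kernel parameters from the list are on the command line."""
    tokens = cmdline.split()
    return {
        "threadirqs": "threadirqs" in tokens,
        "preempt=full": "preempt=full" in tokens,
    }


def read_text(path: str) -> Optional[str]:
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def report() -> dict:
    """Collect the current values of a few latency-related settings."""
    status = dict(cmdline_flags(read_text("/proc/cmdline") or ""))
    status["smt"] = read_text("/sys/devices/system/cpu/smt/control")
    status["governor (cpu0)"] = read_text(
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
    )
    status["swappiness"] = read_text("/proc/sys/vm/swappiness")
    return status


if __name__ == "__main__":
    for name, value in report().items():
        print(f"{name}: {value}")
```

It only inspects CPU 0's governor, so on a system with per-core governors the loop would need extending.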

I wasn’t able to use linux-rt as it made my system unstable (crashed repeatedly and blew up my btrfs) and I can’t track down why.

If it is just crummy hardware, unfortunately, despite having dozens of laptops lying around here, this is the only one I have that supports PTPv2 for AES67. The Pi 5 I was going to use for the box was going to be my workaround for that if I could figure out a control surface.

Outside of the touchpad, there are also random XRUNs that I assume are caused by the desktop environment (KDE) etc. but I haven’t been able to track those down. I wouldn’t be surprised if the issue was mediocre hardware but better stuff isn’t in the budget unfortunately or I’d buy a hardware mixer. On the bright side, I am willing to invest tremendous amounts of time into fixing this, because as a Linux user, my time has very little value to me. (As the old saying goes; Linux is only free if your time has no value :​P) Even if I can’t make it work in the end, I appreciate the learning experience.

I guess my remaining questions are:

  1. Is there anything I missed given what I’ve tried so far?
  2. Is my latency target lower than strictly necessary for my application?

I would say “it depends”. X11 is an old protocol from the '80s (but I even remember the previous version X10 when I was a student) and it is super verbose, very fragile and sensitive to latency or network interruption.
What is cool about X11 with VNC rendering is that it decouples things.
X11 is rendered super-fast locally on the VNC server, while the remote VNC client gets the content asynchronously at its own speed, according to the network quality. Clients can be disconnected and reconnected without crashing the application, and multiple clients can connect at the same time.
I use this on some VMs in France while being in California and it works pretty well. There is quite some latency, and mileage varies with network quality. But I have not tried Ardour in that configuration. :wink:

  1. I didn’t see in your post mention of prioritizing your UMC404HD audio device. Doing so should make a noticeable difference in reducing xruns. There are different methods to do this, but I think the most modern way is to utilize rtcirqus:

2(a). A buffer size of 32 is very aggressive and not something I have achieved on any computer I’ve owned. Personally, I don’t find it necessary to go below 64 to monitor myself without latency affecting my performance. I can make a 128 buffer work, but latency is nearly imperceptible at 64 per my experience.

2(b). A sample rate of 192 kHz also seems unnecessary. I would recommend using 44.1 or 48 kHz. It will save you significant disk space and be less taxing on your CPU, freeing up resources for plugins. If you are recording live performances, there should be no detectable difference in audio quality. My understanding is that sample rates higher than 44.1 kHz (or 48 kHz if the audio may someday be synced to video) are only useful in edge cases involving sound synthesis using plugins that make use of oversampling. Humans cannot hear anything above 22 kHz, which is why 44.1 kHz was chosen as the sampling rate for compact discs.

  1. I didn’t see in your post mention of prioritizing your UMC404HD audio device. Doing so should make a noticeable difference in reducing xruns. There are different methods to do this, but I think the most modern way is to utilize rtcirqus:

I didn’t know that was something I could do, thanks! I’ll give it a shot ASAP.

2(a). A buffer size of 32 is very aggressive and not something I have achieved on any computer I’ve owned. Personally, I don’t find it necessary to go below 64 to monitor myself without latency affecting my performance. I can make a 128 buffer work, but latency is nearly imperceptible at 64 per my experience.

After further experimentation, I concur, I may bump it up a bit. Monitoring vocals was distracting through a headset at larger buffer sizes, but I tried a speaker and I found it significantly easier to tolerate. Someone on linuxmusicians explained that this is likely due to comb filtering in my head between my head voice and the monitor. This makes sense in hindsight, should’ve tried it sooner.

2(b). A sample rate of 192 kHz also seems unnecessary. I would recommend using 44.1 or 48 kHz. It will save you significant disk space and be less taxing on your CPU, freeing up resources for plugins. If you are recording live performances, there should be no detectable difference in audio quality. My understanding is that sample rates higher than 44.1 kHz (or 48 kHz if the audio may someday be synced to video) are only useful in edge cases involving sound synthesis using plugins that make use of oversampling. Humans cannot hear anything above 22 kHz, which is why 44.1 kHz was chosen as the sampling rate for compact discs.

I agree with you in principle. I know the sampling theorem says that 44.1 kHz will give me all of the bandwidth I need, but I discovered that I achieve significantly better latency at 192 kHz. I get ~2.4 ms end-to-end with 32 samples/buffer @ 192 kHz, versus ~4.1 ms with 8 samples/buffer @ 44.1 kHz. Luckily I have no shortage of disk space, so I can tolerate this trade-off. Not sure if this will pose a problem with plugins however, given that ~4.4x as much data has to be processed in the same amount of time. Guess I’m gonna have to do more tests and see.
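
The arithmetic behind those two configurations can be sketched as below. It only computes the duration of one buffer and the ratio of sample rates; the function names are made up, and the measured end-to-end figures quoted above are not derived from it.

```python
def buffer_period_ms(frames: int, rate_hz: int) -> float:
    """Duration of one buffer of `frames` samples at `rate_hz`, in milliseconds."""
    return 1000.0 * frames / rate_hz


def data_rate_ratio(rate_a_hz: int, rate_b_hz: int) -> float:
    """How many times more samples per second rate_a produces than rate_b."""
    return rate_a_hz / rate_b_hz


if __name__ == "__main__":
    for frames, rate in ((32, 192_000), (8, 44_100)):
        print(f"{frames} frames @ {rate} Hz = {buffer_period_ms(frames, rate):.3f} ms per buffer")
    print(f"192 kHz vs 44.1 kHz = {data_rate_ratio(192_000, 44_100):.2f}x the samples")
```

One buffer lasts about 0.167 ms in the first case and about 0.181 ms in the second, so the two buffer periods are nearly identical. That suggests most of the gap between ~2.4 ms and ~4.1 ms comes from somewhere other than the buffer itself. The sample-rate ratio is about 4.35.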

Well then, I’m glad I mentioned it. The threadirqs and preempt=full kernel boot parameters are not being fully utilized for low latency audio if your sound card is not prioritized over other hardware, so get on that. :slight_smile:
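
To make "prioritizing the sound card" a bit more concrete: the first step is finding which IRQ the USB controller uses, which can be read from `/proc/interrupts`. The sketch below only does that lookup and says nothing about how rtcirqus works internally. The sample text is made up, and `xhci_hcd` is assumed to be the driver name of the USB host controller the interface hangs off.

```python
# Made-up excerpt in the layout of /proc/interrupts on a two-CPU machine.
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1
  1:          9          0   IO-APIC    1-edge      i8042
128:      51234          0   PCI-MSI 327680-edge      xhci_hcd
129:          0        777   PCI-MSI 512000-edge      nvme0q0
NMI:          0          0   Non-maskable interrupts
"""


def irqs_matching(interrupts_text: str, fragment: str) -> list:
    """Return the numeric IRQs whose line mentions `fragment` (e.g. a driver name)."""
    matches = []
    for line in interrupts_text.splitlines():
        head, _, rest = line.partition(":")
        # Skip the CPU header and named entries such as NMI.
        if head.strip().isdigit() and fragment in rest:
            matches.append(int(head.strip()))
    return matches


if __name__ == "__main__":
    print(irqs_matching(SAMPLE_INTERRUPTS, "xhci_hcd"))
```

On a real system you would pass the contents of `/proc/interrupts` instead of the sample. With `threadirqs` active, the matching kernel thread is normally named along the lines of `irq/128-xhci_hcd`, and that thread is the one whose realtime priority gets raised.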

I didn’t consider the reduction in latency using a higher sample rate, good point. I personally haven’t found the trade off to be beneficial, but that is my take. YMMV.

Make sure you connect your USB audio interface to a USB port that has as few other peripherals connected to it as possible. You can check that with rtcqs, a tool related to rtcirqus. rtcqs (rtcqs/rtcqs on Codeberg.org) is a Python utility that analyzes your system and detects possible bottlenecks that could have a negative impact on its performance when working with Linux audio.

The only thing you can’t determine with rtcqs is which physical USB port is which. But unplugging and plugging in your mouse and checking every time with lsusb should give you a clue which USB port corresponds to which USB port rtcqs mentions.

They’re not spoiled. They have a whole 'nother set of problems.

There are countless ways to do the same thing, and economy of scale is HUGE for chip manufacturers. So it seems that every chip can do everything, and the developers have to wade through all of that.

Every standard has its tradeoffs:

  • One has near-zero latency but requires a TON of separate wires that all have to be exactly the same length, hence the squiggles that you might have seen on a complex circuit board. (yes, the speed of light matters, even across that short distance) Another eliminates that problem by only using a couple of wires, but it also creates an entire sample time of latency, just for that one communication link.
  • One allows a bunch of channels to be carried across the same set of wires, which pushes the signals farther up into the radio frequency spectrum with corresponding EMI issues. Another only allows two channels, which allows a cheaper design for that link.
  • One allows some additional information to be sent across the same link. Another only allows the samples themselves.
  • Etc.

It’s not as simple as saying, “I want this named standard,” even though the chip manufacturers try to make it that way with their software libraries. The hardware configuration itself knows nothing about any standard, and is simply told the raw details instead, in a format that is itself designed to be easy for a set of logic gates to interpret. Different combinations of details result in different standards being followed, and it’s just as easy to mis-configure and violate every standard. This could be developer-error, OR a bug in the chip manufacturer’s library.

And there may even be cases where you’d want to modify a standard for some reason. Technically in violation of the official spec, but it works better for what you’re specifically doing right here and now. Specifying details to the hardware instead of a name, allows that.

Even when the hardware is “right”, it often doesn’t cover the entire standard. There’s still software involved to finish that standard. Again, manufacturer’s library vs. user code, with the possibility for bugs either way, and for specific-application tweaks.

All of that is only for communication between chips: [ADC] → [DSP] → [DAC]. Inside the DSP box, there could be a single chip that does everything, or multiple chips that have their own communication between them, which creates another round of the above with possibly different decisions. And of course, there’s still the tradeoff between using a licensed DSP library that may or may not be cost prohibitive, and rolling your own that may or may not have some performance issues with no outside license.

And it’s not quite as simple as just [ADC] → [DSP] → [DAC] either. Some ADC’s and DAC’s are just the converter and that’s it, while others include their own small DSP. Do you want to use that? Or do you want to keep everything inside of the main DSP chip(s)? Different projects with different goals have different answers to that question.

And of course, there’s the choice of DSP chip itself. How much do you want done for you, with possibly non-ideal decisions locked in? And how much do you want to do yourself, with the amount of direct understanding required to do that?


So yes, once you work through all of THAT, it’s easy to have just 3 samples of DSP latency (1 in, 1 for processing, 1 out) plus the 10 or so that each converter adds by itself, but you do have to work through all of that!



In case you’re wondering, the 10 or so samples of latency that each converter adds, comes from the high-order digital lowpass that is a necessary part of how the conversion works in the first place.

There is no direct conversion from analog to 24-bit digital at 48kHz, or vice-versa. The technology to do that is nowhere close to existing, and it would require a stupidly expensive analog lowpass anyway, to keep it from “aliasing”, or “accordion folding” higher frequencies up to infinity, back down into the audible range, after which they cannot be removed.

Instead, it physically samples in the mid-MHz range, which allows a much cheaper analog filter, with fewer bits because that is possible, plus some intentional high-frequency wiggle added to the analog signal to guarantee that the reading is always changing. That insanely-fast low-resolution signal is then digitally lowpassed to do three things simultaneously:

  • Anti-alias to the final sample rate
  • Remove the high-frequency wiggle
  • Fill in the lower bits, as a sort of average of that wiggle, biased by how close the real value was to a transition between physical readings

A side-effect of that digital lowpass is to add the conversion latency (no free lunch), and that is where most of a commercial console’s latency spec comes from. It’s not from the DSP.

Then once a mid-MHz stream of full-resolution samples exists, the ADC simply picks some out at the final rate to send on, and throws away the rest. Of course there are optimizations, like only running the lowpass when its output will actually be used, with a block of inputs that was buffered up in the meantime, but you get the idea.
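
The "fill in the lower bits" idea can be shown with a toy model. This is only an illustration under heavy simplifications: the converter rounds to whole steps, the high-frequency wiggle is a deterministic ramp covering exactly one step, and the digital lowpass is a plain average. A real converter uses a far higher-order filter.

```python
import math


def coarse_adc(value: float) -> int:
    """A deliberately low-resolution converter: rounds to the nearest whole step."""
    return math.floor(value + 0.5)


def oversample_and_average(true_value: float, n: int = 1000) -> float:
    """Take n coarse readings with a known wiggle added, then average them.

    The average acts as a crude lowpass followed by decimation: n fast,
    low-resolution readings go in and one slow, higher-resolution sample comes out.
    """
    total = 0
    for k in range(n):
        wiggle = (k + 0.5) / n - 0.5  # evenly spans one step, from -0.5 to +0.5
        total += coarse_adc(true_value + wiggle)
    return total / n


if __name__ == "__main__":
    # Each individual reading is 0 or 1, yet the average recovers the fraction.
    print(oversample_and_average(0.3))
```

Every single reading here is a whole number, but the share of readings that land on the upper step tracks how close the true value is to it, so the average recovers 0.3. The output is only available once all n readings are in, which is the same latency cost described above.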

All of this is done in the converter chip, with dedicated single-function hardware, not in the DSP. It also sheds a different light on the concept of sample rates in general, and why it doesn’t make as much sense as people claim, to use a higher sample rate like 96kHz or 192kHz.

When you do that, you’re actually changing modes in the converter chip itself. The mid-MHz physical sample rate does not change. Likewise, the analog lowpass does not change. What changes is the digital lowpass. But it’s not the corner frequency that changes. It’s the rolloff rate. It’s still attenuating just above audible! Just not by as much, and that less-aggressive rolloff produces fewer samples of latency in addition to the time per sample being less.

That - latency - is the reason to use a higher sample rate, not ultrasonics. For a recording that is intended for human ears with no speed changes, a high sample rate makes no difference whatsoever.

If you’re doing some scientific analysis, like bat calls or whatever, you ideally need a specialized converter that doesn’t insist on attenuating all non-human frequencies no matter what you tell it. Or at the very least, understand how your human-audio converter actually responds up there, and EQ it back to somewhat flat.

Wow, okay, tons of extremely useful information here! An embedded solution was gonna be what I attempted next if I couldn’t get this to work, but I had a feeling I’d be in way over my head (which you’ve just confirmed). I’m far more comfortable in embedded-land (and even if I could afford it, I refuse to pay $600+ for all of the Dante gear I’d need for minimum functionality :​P). I’m no DSP engineer though, so that’s all uncharted territory for me. Good to know I’m at the top of the Dunning-Kruger curve on that one (and soon to be in the valley of despair lol).

Just in case I wasn’t clear, though: I know the sampling theorem says sample rates above 44.1 kHz are pointless for humans. I am using high sample rates specifically for latency, nothing else. I don’t intend to include the ultrasonic communication among the local rat population in my mix lol

Anyways, thanks for taking the time to write this for me!
–Tyler

Okay, good to know! I did use rtcqs for my initial tuning and it was very helpful. It says that my USB controller isn’t sharing any IRQs, so I think that’s a good sign. My prototype system only has the one bus I think, but I’ll have to see about the Pi 5 once I get to that point.

Okay, I just got to try this and that bought me an extra 0.2 ms! Think I’m getting pretty close to optimum performance now. Thanks!

Ah yeah, that’s true, rtcqs only displays whether the IRQ is shared or not. It doesn’t show if there are any other devices connected to it, you’ll have to check that with lsusb and see if your USB interface sits on the same bus with any other device. Sorry for the confusion…

No shared IRQs in your case is fine, but bear in mind that systems with MSI(-X) interrupts (most modern systems) don’t have shared IRQs anymore (unless you boot the kernel with the nomsi option, which is quite educational).

Pi 5 has 4 separate USB buses so just pick one that doesn’t have any other devices connected to it.
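
Checking which devices share a bus with the audio interface can be scripted by parsing `lsusb` output. The sketch below works on a made-up sample: the device names and the `1234:5678` / `abcd:0001` IDs are placeholders, and the `"Audio"` name fragment is only an example of how you might pick out your interface.

```python
import re
from collections import defaultdict

# Matches lines such as "Bus 001 Device 003: ID 1234:5678 Some Device Name".
LSUSB_LINE = re.compile(
    r"^Bus (\d+) Device (\d+): ID ([0-9a-fA-F]{4}:[0-9a-fA-F]{4})\s*(.*)$"
)

# Made-up sample output; the non-hub IDs and names are placeholders.
SAMPLE = """\
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 003: ID 1234:5678 Example Audio Interface
Bus 001 Device 004: ID abcd:0001 Example Mouse
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
"""


def devices_by_bus(lsusb_output: str) -> dict:
    """Group (id, name) pairs by USB bus number."""
    buses = defaultdict(list)
    for line in lsusb_output.splitlines():
        match = LSUSB_LINE.match(line.strip())
        if match:
            bus, _device, usb_id, name = match.groups()
            buses[int(bus)].append((usb_id, name))
    return dict(buses)


def bus_neighbours(lsusb_output: str, name_fragment: str) -> list:
    """Devices on the same bus as the first device matching name_fragment, minus root hubs."""
    for devices in devices_by_bus(lsusb_output).values():
        if any(name_fragment in name for _usb_id, name in devices):
            return [
                (usb_id, name)
                for usb_id, name in devices
                if name_fragment not in name and "root hub" not in name
            ]
    return []


if __name__ == "__main__":
    print(bus_neighbours(SAMPLE, "Audio"))
```

For real use, replace `SAMPLE` with the text captured from running `lsusb`. An empty list means the interface has the bus to itself apart from the root hub.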

If you do end up going embedded, I might recommend this:
https://www.xmos.com/usb-multichannel-audio/
https://www.xmos.com/audio-dsp
I have a project in the works that’s exhausted a Raspberry Pi Pico (RP2040) and a Teensy 4.1 (i.MX RT1062), and that’s my next in line to try. It’s weird, as embedded things go - practically no peripherals, but a ton of parallel cores to bit-bang your own, or do whatever you want with 'cause they’re all the same - but they do seem to document it well, which is a big plus compared to most others. :angry:

It’s definitely off-topic for this forum though, so I’ll leave it at that. :slight_smile:
