GPU Computing again....

Hi all,
I've been following the news on GPU computing around OpenCL, NVIDIA's CUDA and AMD's stream computing for a while now, and it sounds very exciting for the realm of audio processing. There have been quite a number of posts here of the kind "hey, we need GPU computing", but I was wondering where exactly the problems lie, and what the possible solutions and opportunities for using GPU computing would be.
The obvious advantage of GPUs, as has been stated everywhere, is their enormous processing power when it comes to data-parallel tasks. That power should be very applicable to digital audio processing, if I am not mistaken, which made me think about getting into a little coding in this respect (once I finish my thesis)…
Now Paul wrote the following in a post on GPU-Computing:
“GPGPU’s are very powerful but also very latency-inducing. They are not designed for use in realtime, low latency contexts. If you were doing “offline rendering” they would offer huge amounts of power, but their current design adds many milliseconds to delays to signal processing.”
I was wondering if someone with the knowledge could maybe elaborate a little further on that.
Ardour does have full latency compensation (again, if I am not mistaken), meaning that at least some audio effects could well be processed on a GPU, as long as that compensation works correctly. Why would this not be viable?
I am not aware of how Ardour (or other programs, for that matter) handles effect processing, and I am also not very familiar with LV2 or LADSPA (although I did read up on them a little).
Judging from the complexities being talked about, I suspect that the effect plugins do not open their own threads on the CPU, or do they? If they do, wouldn't it be possible to code them in OpenCL and rely on latency compensation?
Just to clarify: this isn't meant as a "please do this" post; rather, I am wondering whether there is any sense to my idea of trying this as a personal project…
Hope this all doesn’t sound too ridiculous, given that my knowledge of these things at the moment is still very rudimentary…
Thanks in advance for a reply,
Michael

Oh yeah.
And maybe someone could point me in the right direction to read up a little on these things…

http://www.gpgpu.org/index.php?s=audio would be a nice place to start I think.
As to the LADSPA and LV2 specs, they can be found at http://www.ladspa.org/ and http://lv2plug.in/ respectively.
They are both very simple to use, and have pretty good documentation (use the source Luke :wink: ).

Hi.
Thanks for the reply. I checked out http://www.gpgpu.org/index.php?s=audio, but unfortunately most of the links there seem to be dead. I also briefly looked into some LV2 sample code (again). The thing I don't understand (and that was my main question before) is how the plugins work and interact with the host software. Can an LV2 plug-in open its own thread (i.e. does it have its own process), or does it somehow run "through" the host? (I remember, for example, Paul saying that the Ardour audio processing section is not yet multi-threading capable. Does that only relate to the actual audio routing, or also to the effect plugins…)
I stumbled into this very interesting link:
http://www.kvraudio.com/forum/printview.php?t=222978&page=1
So at least with VST it seems to be possible to do GPU audio processing using CUDA. I haven’t tried out the plug-in yet, but I will, once I get around to it.
So if someone with insight into the deeper workings of LV2 (or the compilation process, for that matter) could comment on this, I'd be grateful.

Couldn’t anyone just give a short reply on this?
I’d be very thankful,
Michael

For what it's worth, this is my understanding of LV2:

The LV2 plugin is a shared library. The shared library contains a few functions to instantiate the plugin (set up various data structures etc.) and also a 'run' function that is called by the host every time it wants to process a block of audio samples. (The host calls the function with pointers to the buffers containing the audio and also specifies the number of samples to process.) The function does its thing with the audio samples and returns. Essentially the shared library just provides a few functions that the host will call as and when it needs them. When you load a plugin into the host, the host loads the shared library into memory and calls the various functions to instantiate the plugin.

There is nothing to prevent you from starting up a new thread from within the plugin when it is instantiated - or even fork/exec-ing another process - but there are various rules governing what should/can execute in which thread. Have a look in the lv2.h header, which you need in order to compile an LV2 plugin - it pretty much encapsulates the API.

I'm not entirely convinced yet about LV2's ability to handle GUI extensions etc. across all hosts - but this seems to be a topic of some debate. There is a FAQ covering some of the issues I've encountered while trying to develop LV2 plugins on my site:

http://www.linuxdsp.co.uk

You can use the contact info on my site and I can give a more detailed description; the LV2 devs may be able to shed some light on things - it looks to me as though LV2 is still evolving…
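To make the shared-library picture above concrete, here is a rough skeleton of a trivial LV2 gain plugin in C. It is only a sketch of the API described above: the URI and port indices are invented for illustration, and the accompanying .ttl metadata that declares the ports is omitted.

```
#include <stdlib.h>
#include <stdint.h>
#include <lv2.h>   /* the core LV2 header mentioned above */

#define GAIN_URI "http://example.org/hypothetical-gain"   /* made-up URI */

/* Port indices; these must match the plugin's .ttl metadata (not shown). */
enum { PORT_GAIN = 0, PORT_IN = 1, PORT_OUT = 2 };

typedef struct {
    const float* gain;
    const float* in;
    float*       out;
} Gain;

/* Called once when the host loads the plugin: allocate per-instance state. */
static LV2_Handle
instantiate(const LV2_Descriptor* desc, double rate,
            const char* bundle_path, const LV2_Feature* const* features)
{
    return (LV2_Handle)calloc(1, sizeof(Gain));
}

/* The host hands us pointers to its own buffers, one port at a time. */
static void
connect_port(LV2_Handle instance, uint32_t port, void* data)
{
    Gain* g = (Gain*)instance;
    switch (port) {
    case PORT_GAIN: g->gain = (const float*)data; break;
    case PORT_IN:   g->in   = (const float*)data; break;
    case PORT_OUT:  g->out  = (float*)data;       break;
    }
}

/* Called by the host for every block of n_samples. */
static void
run(LV2_Handle instance, uint32_t n_samples)
{
    Gain* g = (Gain*)instance;
    for (uint32_t i = 0; i < n_samples; ++i)
        g->out[i] = g->in[i] * *g->gain;
}

static void
cleanup(LV2_Handle instance) { free(instance); }

static const LV2_Descriptor descriptor = {
    GAIN_URI, instantiate, connect_port,
    NULL /* activate */, run, NULL /* deactivate */, cleanup,
    NULL /* extension_data */
};

/* Entry point the host looks up in the shared library. */
const LV2_Descriptor*
lv2_descriptor(uint32_t index)
{
    return index == 0 ? &descriptor : NULL;
}
```

Note that run() is called from the host's realtime audio thread, so anything slow or blocking inside it (such as waiting on a GPU round trip) is exactly where the latency problems discussed earlier in this thread come from.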

Quoted from: http://www.zamaudio.com/?p=380 “I have been working on a jack client that uses the GPU to process audio. The benefit of this is that it lightens the load off the CPU to process audio. I used the NVIDIA CUDA toolchain to create a jack-cuda client. Currently I have made a gain plugin that uses 256 parallel threads to amplify a jack audio stream in realtime. There is a bit of overhead because it copies the stream to video ram first, then processes the audio and copies it back to main ram, but the PCI-e bus is pretty fast so it’s still overall pretty fast.”
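As a rough sketch of what the gain kernel and the two copies described in that quote might look like in CUDA (this is not zamaudio's actual code; the kernel name and the helper function are invented, and the 256-thread launch just mirrors the description):

```
#include <cuda_runtime.h>

/* One thread per sample: multiply by a gain factor. */
__global__ void gain_kernel(const float* in, float* out, float gain, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * gain;
}

/* Hypothetical helper called once per block from the jack process callback.
 * dev_in/dev_out are device buffers allocated up front with cudaMalloc(). */
void process_block_on_gpu(const float* host_in, float* host_out,
                          float* dev_in, float* dev_out,
                          float gain, int nframes)
{
    size_t bytes = nframes * sizeof(float);

    /* Copy the jack buffer into video RAM ... */
    cudaMemcpy(dev_in, host_in, bytes, cudaMemcpyHostToDevice);

    /* ... process it with 256 threads per block ... */
    int threads = 256;
    int blocks  = (nframes + threads - 1) / threads;
    gain_kernel<<<blocks, threads>>>(dev_in, dev_out, gain, nframes);

    /* ... and copy the result back to main RAM (this call also synchronizes). */
    cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);
}
```

The two cudaMemcpy calls are the overhead the quote mentions: the audio crosses the PCIe bus twice for every processed block.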

Doesn't sound like a great solution. The newer AMD APUs offer a way of sharing RAM between the CPU part of the chip and the GPU part of the chip; that would solve the latency issues, but it would be limited to AMD only.

I seem to remember reading that there is some sort of effort to get CPU/GPU concurrency going in the Linux kernel. I think, on a very basic level, the idea is that the resources available from both the CPU and the GPU are placed into a single pool, and as resources are required they are drawn from that pool.
Combining concurrency and cgroups has the potential to let hardware be used more efficiently and to give access to previously unavailable resources under heavy load - if, for example, a GPU expansion card is used in a machine where the GPU would otherwise just be idling. The effects of concurrency on CPUs with hyper-threading, and the combination of a GPU expansion card with a hyper-threading CPU, will be particularly interesting.
In relation to audio, I believe that cgroups will be useful in combination with concurrency to achieve low latencies.

GPUs are still largely irrelevant for audio work. The latencies involved in moving data to and from the device are too great for real-time work under most conditions.

CGroups are already part of the Linux kernel and have been for many years. They do not provide any features to improve latency or concurrency; they just provide mechanisms to manage resource allocation, and they are already active and in use. CGroups are the reason that SCHED_FIFO scheduling is no longer dangerous on Linux kernels - they prevent even a SCHED_FIFO process from stealing all the CPU time.

“I seem to remember reading that there is some sort of effort to get CPU GPU concurrency going in the Linux kernel.”

What you are probably remembering is heterogeneous system architecture (HSA), which allows the CPU and GPU to share a single address space. That would in principle allow passing a pointer to a buffer from the CPU to the GPU, rather than copying the data from CPU address space to GPU address space as is currently done. It is available now on some CPUs from AMD.
It should improve latency a lot compared to the current segregated address spaces, but it is probably still only marginally useful for true real-time operation.
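On the NVIDIA/CUDA side, the closest analogue is "managed" (unified) memory, where a single pointer is valid on both the CPU and the GPU. Whether the driver still copies data behind the scenes depends on the hardware, so treat this only as a sketch of the programming model, not a latency claim:

```
#include <cuda_runtime.h>

__global__ void gain_kernel(float* buf, float gain, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] *= gain;
}

int main(void)
{
    const int n = 256;
    float* buf = NULL;

    /* One allocation, visible to both CPU and GPU through the same pointer. */
    cudaMallocManaged(&buf, n * sizeof(float));

    for (int i = 0; i < n; ++i)            /* fill on the CPU ...             */
        buf[i] = 1.0f;

    gain_kernel<<<1, 256>>>(buf, 0.5f, n); /* ... process on the GPU ...      */
    cudaDeviceSynchronize();               /* ... then read back on the CPU,  */
    float first = buf[0];                  /* with no explicit cudaMemcpy.    */

    cudaFree(buf);
    return (first == 0.5f) ? 0 : 1;
}
```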

A first design for GPU-processed audio would probably be something like an external jack client rather than a plugin, so that you could hide the specifics of passing data to the GPU and could report the latency through the jack API.
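Very roughly, the jack side of such a client might look like the sketch below. The client name, port names and the fixed gpu_latency value are all invented, and the GPU processing itself is left as a placeholder; the point is where the added delay gets reported through jack's latency callback so hosts can compensate.

```
#include <jack/jack.h>
#include <unistd.h>

static jack_client_t* client;
static jack_port_t*   in_port;
static jack_port_t*   out_port;
static jack_nframes_t gpu_latency = 1024;  /* assumed extra frames of GPU round-trip delay */

static int process(jack_nframes_t nframes, void* arg)
{
    float* in  = (float*)jack_port_get_buffer(in_port,  nframes);
    float* out = (float*)jack_port_get_buffer(out_port, nframes);
    /* hand 'in' to the GPU pipeline here; write the (delayed) result to 'out' */
    for (jack_nframes_t i = 0; i < nframes; ++i)
        out[i] = in[i];                    /* placeholder pass-through */
    return 0;
}

/* Tell jack how much delay this client adds, in both latency directions. */
static void latency(jack_latency_callback_mode_t mode, void* arg)
{
    jack_latency_range_t range;
    if (mode == JackCaptureLatency) {
        jack_port_get_latency_range(in_port, JackCaptureLatency, &range);
        range.min += gpu_latency;
        range.max += gpu_latency;
        jack_port_set_latency_range(out_port, JackCaptureLatency, &range);
    } else {
        jack_port_get_latency_range(out_port, JackPlaybackLatency, &range);
        range.min += gpu_latency;
        range.max += gpu_latency;
        jack_port_set_latency_range(in_port, JackPlaybackLatency, &range);
    }
}

int main(void)
{
    client   = jack_client_open("gpu_fx", JackNullOption, NULL);
    in_port  = jack_port_register(client, "in",  JACK_DEFAULT_AUDIO_TYPE, JackPortIsInput,  0);
    out_port = jack_port_register(client, "out", JACK_DEFAULT_AUDIO_TYPE, JackPortIsOutput, 0);
    jack_set_process_callback(client, process, NULL);
    jack_set_latency_callback(client, latency, NULL);
    jack_activate(client);
    for (;;) sleep(1);
}
```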

Perhaps a useful first question would be: in what situations do you currently run out of processing power, even when increasing the latency? It would not be fair to compare, for example, CPU capabilities at 5 ms, 10 ms or 15 ms latency with a GPU processing solution at 50 ms or 100 ms latency. So if you have a situation where you max out the CPU, the first step would be to increase the latency and see whether you still hit the limit, or whether you are really just seeing the limits of low-latency real-time operation on that system.
Only after clearly stating the problem could you determine if GPGPU is a solution to that problem (if in fact there is a problem to be solved).

Latencies tend to be a bit misunderstood whenever somebody brings up the possibility of using general-purpose GPU programming (GPGPU). The latencies inherent to CPU<->GPU communication and transfer of control may be very driver/implementation dependent, but I was able to dig into GPGPU programming a bit last winter for some audio-related tasks, so I’ll share my findings.

Using a Windows 7 desktop system and a fairly new NVIDIA graphics card (sorry, I don't know which model it was, but I think it was from 2013-2014), launching a CUDA kernel function that did nothing but fill a 256-entry float array on the GPU side and then copying that data back to the CPU took somewhere around 600-700 microseconds.
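A minimal sketch of that kind of round-trip measurement (a do-nothing "fill" kernel plus the copy back, timed from the host side; the exact numbers will of course vary with driver, OS and hardware):

```
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void fill_kernel(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (float)i;   /* nothing interesting, just fill the array */
}

int main()
{
    const int n = 256;
    float host[n];
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));

    /* warm-up launch so we don't time one-off driver/context setup */
    fill_kernel<<<1, n>>>(dev, n);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    fill_kernel<<<1, n>>>(dev, n);                                    /* launch           */
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost); /* copy back (syncs) */
    auto t1 = std::chrono::high_resolution_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("kernel launch + copy-back of %d floats: %lld us\n", n, (long long)us);

    cudaFree(dev);
    return 0;
}
```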

I was using this to experiment with realtime additive sound synthesis. I found that the major issue with offloading the calculations to the GPU was extracting enough parallelism. Batching 512 samples to the GPU and allocating 128 GPU threads to fill this buffer with 128 sine waves of varying amplitude, phase and frequency was only utilizing an average of 1% of the device’s cores at any time and was actually performing worse than the CPU version (and could only be done in real-time if I lowered the sine-wave count to 32).

The main explanation for this is that individual GPU operations do have latencies, and the actual computational units depend on having a large pool of threads to switch to during the time that they’re waiting for these latencies. For example, if one thread is waiting for uncached data to arrive from ram, instead of blocking for 100 cycles, that core will find another thread whose next operation has no unmet dependencies and will run it. This happens at a very fine level - if you call the sin/cos intrinsic, which might have 4 cycles of latency on a given card, and you attempt to immediately use the computed value in the next instruction, then there are 3 cycles during which your thread is stalled. If you don’t have a large enough thread pool for the GPU core to draw from, then that specific core is idle 75% of the time.

So the solution is to have thousands of threads running simultaneously. Let’s say you want to try applying GPGPU to the mixer. You’ll take N stereo input tracks, multiply/pan them by a different constant for each track, and then sum their outputs. Perhaps the most natural way to partition this task is to operate over a large sample buffer and have each thread sum the outputs at a single point in that buffer. If you go this route, you’ll need at least a thousand threads to keep the GPU cores saturated and to actually see a performance increase over CPU code. This means you do the computation over buffers of size 1000 (23 ms at a 44.1 kHz sample rate).
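In code, that one-thread-per-output-sample partitioning might look roughly like this (interleaved stereo output and a flat, track-major input layout are assumptions made for the sketch):

```
/* inputs: n_tracks buffers of n_frames stereo frames, laid out track-major and
 * interleaved: inputs[(t * n_frames + f) * 2 + ch].
 * gains:  per-track, per-channel gain/pan coefficients (n_tracks * 2 entries).
 * One GPU thread computes one output frame. */
__global__ void mix_kernel(const float* inputs, const float* gains,
                           float* out, int n_tracks, int n_frames)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= n_frames)
        return;

    float left = 0.0f, right = 0.0f;
    for (int t = 0; t < n_tracks; ++t) {
        const float* frame = inputs + (t * n_frames + f) * 2;
        left  += frame[0] * gains[t * 2 + 0];
        right += frame[1] * gains[t * 2 + 1];
    }
    out[f * 2 + 0] = left;
    out[f * 2 + 1] = right;
}

/* With ~1000 frames per batch you get ~1000 threads in flight, e.g.:
 *   mix_kernel<<<(n_frames + 255) / 256, 256>>>(d_in, d_gains, d_out, n_tracks, n_frames);
 */
```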

Maybe you see an alternative way of extracting parallelism here that might decrease the necessary buffer size. For example, if you have 16 tracks, why not allocate 8 threads per sample and take multiple passes? In the first pass, thread #1 will sum the output of tracks 1 and 2, thread #2 will take care of tracks 3 and 4, and so forth. At the end of that pass, mix the 8 outputs down into 4 (this time using only threads #1-4 - the rest will be inactive), then 4 into 2 and finally 2 into 1 (these types of algorithms are called "reduction algorithms", by the way, since they take N inputs and reduce them down to M < N outputs). This is an improvement. There will be a bit more synchronization complexity, since you'll have to make sure that thread 1 waits for thread 2 to compute its output from the first pass before it starts the second pass, etc., but modern versions of CUDA have mechanisms for this. On average, there will be about 4 threads active per sample at any time, so you should be able to lower your buffer size down to maybe 250 samples (5.7 ms latency), which is a bit more acceptable.
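A rough sketch of that per-sample tree reduction, using mono tracks for brevity, the 16-track / 8-threads-per-sample split from the example above, and shared memory plus __syncthreads() for the "wait for the previous pass" synchronization:

```
#define TRACKS            16
#define REDUCERS          (TRACKS / 2)      /* 8 threads cooperate on one sample */
#define SAMPLES_PER_BLOCK 32

/* tracks: TRACKS mono buffers of n_samples each, track-major: tracks[t * n_samples + s].
 * gains:  one gain per track. Launch with dim3 block(SAMPLES_PER_BLOCK, REDUCERS). */
__global__ void reduce_mix_kernel(const float* tracks, const float* gains,
                                  float* out, int n_samples)
{
    __shared__ float partial[SAMPLES_PER_BLOCK][REDUCERS];

    int  s    = blockIdx.x * SAMPLES_PER_BLOCK + threadIdx.x;  /* which sample        */
    int  r    = threadIdx.y;                                   /* which reducer, 0..7 */
    bool live = (s < n_samples);

    /* First pass: reducer r sums tracks 2r and 2r+1 for its sample. */
    if (live) {
        int t = 2 * r;
        partial[threadIdx.x][r] = tracks[t * n_samples + s] * gains[t]
                                + tracks[(t + 1) * n_samples + s] * gains[t + 1];
    }
    __syncthreads();

    /* Tree passes: 8 -> 4 -> 2 -> 1 partial sums; half the reducers go idle each pass. */
    for (int stride = REDUCERS / 2; stride > 0; stride /= 2) {
        if (live && r < stride)
            partial[threadIdx.x][r] += partial[threadIdx.x][r + stride];
        __syncthreads();
    }

    if (live && r == 0)
        out[s] = partial[threadIdx.x][0];
}

/* Example launch:
 *   dim3 block(SAMPLES_PER_BLOCK, REDUCERS);
 *   int  grid = (n_samples + SAMPLES_PER_BLOCK - 1) / SAMPLES_PER_BLOCK;
 *   reduce_mix_kernel<<<grid, block>>>(d_tracks, d_gains, d_out, n_samples);
 */
```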

Anyways, hopefully this will clear up any misunderstandings of where most of the latency comes from in GPGPU programming (mainly batching of data in order to extract meaningful parallelism). The typical GPGPU implementation copies a sample buffer from CPU -> GPU, launches N identical threads to do the processing on the GPU, and then copies the output buffer from GPU -> CPU. Newer versions of CUDA support streaming mechanisms, whereby it’s possible to provide the threads with more data after they’ve already been launched, which might allow for latencies closer to that of two memory copies. I haven’t experimented with that feature, so I can’t say much useful about it. I also have no idea how the numbers compare with OpenCL or operating systems other than Windows, as I was unfortunately restricted to working with Visual Studio/Windows during my project.