How I lowered my plugin's CPU by 66% with one fix
- Niccolo Abate
- Oct 4, 2023
- 4 min read
Updated: Nov 7, 2023
In another article, I discussed how to write performant realtime audio code. In this article, I want to explain how I followed some of those tips to improve the performance of one of my projects, DelayCat, lowering my baseline CPU footprint by upwards of 66% with just one fix.
Breaking up Work Properly Between Threads
As I discussed in the other article, we want to make sure work is broken up properly between the different threads in our program, namely the audio thread and the GUI / message thread, such that the amount of work on the audio thread is minimized. In this example, I will show how I didn’t do this properly early in my development process, and then lowered my baseline CPU by 66% when I later identified and fixed the problem.
Measurements
The measurements were taken by watching the CPU monitor in my DAW (Reaper, specifically). As I have discussed before, this essentially measures how much of the time budget for any given audio block is being used. The measurements were taken with the plugin under typical, baseline parameters. We will be looking at the single-core CPU load, measured on my laptop (i7-6700HQ).
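To put those percentages in concrete terms, here is a quick back-of-the-envelope calculation. The block size and sample rate below are hypothetical (the article's numbers come from Reaper's monitor, not from this arithmetic):

```cpp
// Time budget per audio block, in milliseconds: the processing callback
// must finish within this window or the audio will glitch.
double blockBudgetMs(int blockSize, double sampleRate)
{
    return 1000.0 * blockSize / sampleRate;
}

// Usage (hypothetical settings, not stated in the article):
//   blockBudgetMs(512, 44100.0)  -> ~11.61 ms per block
//   13% load -> ~1.51 ms of processing per block before the fix
//    4% load -> ~0.46 ms per block after the fix
```

The absolute numbers depend on the host's buffer settings, but the ratio between the before and after figures is what matters here.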
Before this improvement, DelayCat was using 13-14% of one CPU core (i.e. running on one core for 13-14% of the time allowed for an audio processing block). This is a lot of CPU for a plugin. As a research project and analysis-based plugin, it wasn’t a deal breaker, but for a plugin I like to use (and hope others will like to use), often with multiple instances, it was annoying. After this improvement, DelayCat is using 4-5% of one CPU core. This is a HUGE difference in the performance cost for the user, and a large increase in the number of instances that can be used before things become hairy. It also allows the overlap and resolution of the FFT to be changed without blowing up the performance as it once did.
Background
DelayCat is an advanced plugin which leverages FFT based audio analysis as part of the DSP processing, while also using the FFT to display a spectrogram in the GUI (as well as a number of other analysis features). I had narrowed down the performance footprint of my plugin and seen that the FFT was by far the largest contributor to the baseline audio thread CPU usage, but I didn’t realize just how much of that was due to graphical processing code that could be factored out and moved off the audio thread.
Problem
The problem was introduced when I was first implementing my spectrogram code, inspired by some examples online. The examples I was looking at used the FFT only for the spectrogram, meaning the FFT and all the graphics processing were (correctly) handled on the GUI thread; all the audio thread did was copy samples into a buffer and handle some synchronization. For DelayCat, however, the FFT data is the foundation of the audio processing, so the FFT processing had to move onto the audio thread. I adapted the example code for my solution, but ended up bringing the graphics processing onto the audio thread as well, favoring quick prototyping over a more sophisticated threading solution, and not noticing or worrying about the performance cost at that stage of development. This graphics processing included translating (non-linearly) between frequency bins and spectrogram pixels and, expensively, shifting the spectrogram image every FFT frame, as well as plotting feature extraction data in a similar manner.
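To give a feel for the kind of per-frame work involved, here is a hedged sketch of a non-linear bin-to-pixel mapping. The article does not show DelayCat's actual mapping; this sketch assumes a logarithmic frequency scale (the usual choice for spectrograms), and the function name is illustrative:

```cpp
#include <algorithm>
#include <cmath>

// Illustrative assumption: map an FFT bin index onto a spectrogram pixel
// row using a log-frequency scale, so low frequencies get more vertical
// resolution. Row 0 is the top of the image (highest frequency).
int binToPixelRow(int bin, int numBins, int imageHeight)
{
    // Map bin (clamped to >= 1, since log(0) is undefined) onto [0, 1].
    const double t = std::log((double) std::max(bin, 1))
                   / std::log((double) (numBins - 1));

    return imageHeight - 1 - (int) std::lround(t * (imageHeight - 1));
}

// Usage: binToPixelRow(1, 1024, 256) -> 255 (bottom row, lowest bin)
//        binToPixelRow(1023, 1024, 256) -> 0 (top row, highest bin)
```

Each FFT frame, every bin goes through a mapping like this before being drawn; doing that on the audio thread, on top of shifting the whole image, is where the cost piles up.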
The performance issue became more drastic later as I increased the number of FFT frames by adding an overlap of 50% and added more features to the feature set, and more noticeable as other parts of the program were optimized, until I finally located the issue. Once correctly identified (half the optimization battle), it was only a few hours of work to properly factor out the graphics code, resulting in the upwards-of-66% performance reduction I flaunted above.
Solution
In order to factor out the unnecessary graphics processing, I had to disentangle parts of the existing classes, stripping the audio thread down to just the FFT and analysis, plus a short function copying the relevant data into a buffer synchronized with the GUI thread. The audio thread is then free to proceed, having the analysis it needs, while the GUI thread can process the copied data at its discretion. If the graphics thread happens not to be finished with a frame by the time the next frame is ready, the audio thread will just move on.
Simple solution! But huge performance gain. This illustrates the importance of properly breaking up work between your threads, and hopefully helps you recognize a typical problem and solution you might encounter in your own plugin / audio software.
Code
I am going to show some (abstracted) copies of my code below to illustrate the changes I made, working with the Spectrogram class (audio thread) and the SpectrogramComponent class (GUI thread). The classes are boiled down extensively for illustrative purposes.
Before
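The original "before" listing is not reproduced here, but the structure described above can be sketched roughly as follows. The `Spectrogram` class name comes from the article; every member below is an illustrative assumption, and a flat float array stands in for the real GUI image:

```cpp
#include <algorithm>
#include <vector>

// Hedged sketch of the "before" structure: the audio-thread class performs
// the analysis copy *and* the per-frame graphics work. All member names are
// illustrative assumptions, not DelayCat's actual code.
class Spectrogram
{
public:
    Spectrogram(int numBins, int width, int height)
        : magnitudes((size_t) numBins),
          image((size_t) (width * height), 0.0f),
          imageWidth(width), imageHeight(height) {}

    // Called on the audio thread once per FFT frame.
    void processFrame(const std::vector<float>& fftMagnitudes)
    {
        magnitudes = fftMagnitudes;  // the analysis data the DSP needs

        // --- graphics work that should NOT be on the audio thread ---
        // 1) Shift the whole spectrogram image left by one column.
        for (int y = 0; y < imageHeight; ++y)
            for (int x = 0; x < imageWidth - 1; ++x)
                image[(size_t) (y * imageWidth + x)] =
                    image[(size_t) (y * imageWidth + x + 1)];

        // 2) Plot the new frame into the rightmost column (a simple linear
        //    bin-to-row mapping here; the real mapping was non-linear).
        for (int y = 0; y < imageHeight; ++y)
        {
            const int bin = std::min((int) magnitudes.size() - 1,
                                     y * (int) magnitudes.size() / imageHeight);
            image[(size_t) (y * imageWidth + imageWidth - 1)] = magnitudes[(size_t) bin];
        }
    }

    std::vector<float> magnitudes;
    std::vector<float> image;   // stand-in for a GUI image buffer
    int imageWidth, imageHeight;
};
```

The image shift touches every pixel on every FFT frame, inside the audio callback: exactly the cost described in the Problem section.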
After
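Again as a hedged sketch rather than the actual DelayCat code: in the "after" structure, the audio thread only copies the frame into a shared buffer guarded by an atomic flag, and silently drops the frame (never blocking) if the GUI has not consumed the previous one. The class names `Spectrogram` and `SpectrogramComponent` come from the article; the members and the `SharedFrame` helper are assumptions:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

// Buffer shared between the two threads. Pre-sized up front so the audio
// thread never allocates. The atomic flag is the only synchronization.
struct SharedFrame
{
    explicit SharedFrame(size_t numBins) : buffer(numBins) {}
    std::vector<float> buffer;
    std::atomic<bool>  ready { false };
};

class Spectrogram               // audio thread
{
public:
    explicit Spectrogram(SharedFrame& f) : frame(f) {}

    void processFrame(const std::vector<float>& fftMagnitudes)
    {
        // ...the FFT and analysis the DSP actually needs happen here...

        // Hand data to the GUI only if it finished the previous frame.
        if (! frame.ready.load(std::memory_order_acquire))
        {
            // Assumes fftMagnitudes.size() matches the pre-sized buffer.
            std::copy(fftMagnitudes.begin(), fftMagnitudes.end(),
                      frame.buffer.begin());
            frame.ready.store(true, std::memory_order_release);
        }
        // else: GUI still busy -- drop this frame, never block.
    }

    SharedFrame& frame;
};

class SpectrogramComponent      // GUI / message thread
{
public:
    explicit SpectrogramComponent(SharedFrame& f) : frame(f) {}

    // Called from a GUI timer; the expensive image shifting and plotting
    // now happen here, off the audio thread, before releasing the buffer.
    bool drawPendingFrame()
    {
        if (! frame.ready.load(std::memory_order_acquire))
            return false;               // nothing new to draw

        lastDrawn = frame.buffer;       // image shift/plot would go here
        frame.ready.store(false, std::memory_order_release);
        return true;
    }

    SharedFrame& frame;
    std::vector<float> lastDrawn;
};
```

With one producer (audio) and one consumer (GUI), the acquire/release flag is enough: the buffer is only written while the flag is false and only read while it is true, so the two threads never touch it concurrently.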
You will now see that the Spectrogram class just copies over the necessary data and then lets the component class handle all the data and image processing. An extra atomic is also added to protect the buffer used to pass the data between the two classes.