Requirements for DAWs for a patent-unencumbered stem file format

I have played around with Matroska audio files (.mka) with cool results. I had multiple recordings of Pink Floyd - Animals: the original LP (which I recorded from a nice original copy), the first CD release, and the 2018 remix. I combined them all into one .mka file and was able to closely sync them and easily switch between the streams for comparison. All the streams were FLAC files. Pretty cool stuff!

The LP stream is 24/96, the CD is of course 16/44 and the 2018 release is in 24/192. Having several versions of the same album and being able to switch between them is really nice.

I’ve gotten some feedback on a recent draft that I haven’t implemented yet, but I’m curious if it’s something Ardour has a way to do (assuming it were to support something like this in the future):

Instead of storing the individual stems as completely separate tracks capable of being played independently, it was suggested that I store them the same way some stereo audio is stored: the mono audio is stored alongside a diff, and the left and right channels are reconstructed by adding or subtracting the diff from the mono channel. In the stems format this would mean the final mixdown track is stored as it already is, and each stem is stored as a diff from that mixdown track.
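To make the mid/side analogy concrete, here's a tiny sketch in plain Python (the sample values are made up; real audio would be sample arrays from a decoder):

```python
# Mid/side-style storage: keep a mono "sum" track plus a diff track,
# from which the left and right channels can be reconstructed exactly.
left  = [0.5, -0.25, 1.0, 0.0]
right = [0.5,  0.75, -1.0, 0.25]

mid  = [(l + r) / 2 for l, r in zip(left, right)]   # stored "mono" track
side = [(l - r) / 2 for l, r in zip(left, right)]   # stored diff track

# Reconstruction: L = mid + side, R = mid - side
rec_left  = [m + s for m, s in zip(mid, side)]
rec_right = [m - s for m, s in zip(mid, side)]

assert rec_left == left and rec_right == right
```

The stems proposal is the same shape: the mixdown plays the role of the mid channel, and each stem diff plays the role of the side channel.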

I like this idea and have been playing with it in a recent draft, but I’m curious if this would be harder or require significantly more work to export for DAWs like Ardour? It simplifies a lot of issues for playback, but if it comes at significant cost when exporting and creating the stem file I’d love to know.

Thanks all!

Isn’t that just multitrack but complicated?

It is certainly more complicated to generate, but it would solve a lot of problems with playback (e.g. making fading between the mixdown and stem tracks seamless, whereas right now it's hard not to have a noticeable audio difference when you switch). Similarly, if you're mixing with stems and each stem track is completely separate, and you want to e.g. raise the pitch or change the tempo, you have to do that for each stem, requiring up to 4 times the CPU (this is currently one of the big problems with the Native Instruments format). Doing it this way would save a significant amount of processing power and make it just as easy as pitch shifting a normal stereo track.

Wouldn’t the diff files just be the tracks but phase inverted? Removing a single track would just mean subtracting a track from the master? So when you want to listen to a single track you’d subtract all others? And applying pitch correction you’d need to apply it to all tracks anyways? Sorry, I really can’t follow you, I think I’m missing a piece of information here :dizzy_face:

Which file format stores stereo that way? Almost every stereo format has two independent audio channels. Some perceptual encoders have a mode which uses M/S encoding as a way of saving space, but that works for closely correlated material like a final stereo mix, it doesn’t seem like it would be an efficient way of saving e.g. a drum track and a guitar track that barely have any signal correlation.

It would require a signal processing flow that doesn’t exist today, whereas stem export is something already available.

If you had difference files you would have to process the original file and each difference file, so it doesn’t save any processing requirements.

My thought process is this: right now, in the software I was thinking of, if we want to play 3 of the stem tracks (removing vocals) and then do a pitch shift, we have to pitch shift each track and then mix them, i.e. 3x the CPU usage compared to just processing the mixdown track. With this format you would scale the mixdown track, scale the inverted vocals track, and mix the two, so you'd wind up with only 2x. In the pathological case where you have all 4 stem tracks enabled, right now we have to use the stems and not the mixdown track, because switching between the two can cause audible issues (since the mixdown track has mastering applied and the stem tracks don't). So we have 4x CPU usage. With this system, however, we can actually just play and scale the mixdown track.

Yeah, I don't know how well this would actually work; it's just an idea we'd have to try out first. I'm not sure off the top of my head which formats do and don't use this. I was under the impression that it was fairly common, but that may just have been an assumption based on the fact that I'd heard of it at all. There was a Microsoft stereo layout a few years ago (possibly defunct now?) that I thought did this, as well as some of the surround sound stuff, but I can never keep that straight. Anyway, I could be very wrong and it could be less common.

EDIT: looked it up; apparently Dolby Surround does this using a proprietary matrix. Not stereo, which is what I was saying, but it's the same idea.

That was for encoding a 4-channel or 5-channel surround mix down to 2 channels. Again, that works for closely correlated material like ambience; it didn't have enough separation for independent tracks, which is what you want for separate stems.

If you have any links to more info on this, I trust that you all probably know more about it than me, but I'd be curious to read up on it more. It's not just used for ambience and whatnot, so I'd be curious if anyone has written about the limitations they ran into with this.

EDIT: actually, thinking about this more, I'm describing something a bit different from what Dolby Surround is doing. Instead of doing a matrix mixdown to fewer channels, you were correct: I think we'd just be storing a mix of all stems, plus the inverse of each stem. This means that in the common case (transition is done, all stems are playing) you can pitch shift just the mixed track instead of all 4 stem tracks. If you want to mute a single stem, you'd play the inverse stem track as well, so e.g. muting vocals means you'd play the main track and the inverted vocals track (2x CPU usage to do a pitch shift instead of the 3x you'd expect if you were just playing back all 4 stems).
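A quick sketch of that muting scheme in plain Python (stem names and values are hypothetical; real stems would be decoded sample buffers):

```python
# Store the full stem mix plus a phase-inverted copy of each stem.
# Muting one stem = play the mix plus that stem's inverse: two streams
# to process instead of (N - 1) individual stems.
drums  = [0.1, 0.2, -0.1]
bass   = [0.3, -0.1, 0.2]
synths = [0.0, 0.1, 0.1]
vocals = [0.2, 0.2, -0.3]

main = [d + b + s + v for d, b, s, v in zip(drums, bass, synths, vocals)]
inv_vocals = [-v for v in vocals]          # the stored inverse stem

# "Mute vocals" playback: main + inverted vocals
no_vocals = [m + iv for m, iv in zip(main, inv_vocals)]
expected  = [d + b + s for d, b, s in zip(drums, bass, synths)]

assert all(abs(a - b) < 1e-12 for a, b in zip(no_vocals, expected))
```

The cancellation is exact only because both the mix and the inverse were made from the same unprocessed stems; any processing applied to one but not the other breaks it.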

You could also save some space by leaving out one of the inverse stem tracks, so you're still storing the same number of tracks. E.g. if you had a synths stem and you left out its inverse, you could still subtract all the other stems from the main track to get just the synths, or mute the synths by playing a mix of all the other stems and muting the main track. This would involve more CPU usage to pitch scale in that case, but at the trade-off of smaller file sizes. I think for DJ software this might be much better for playback in the common case, but I'll keep discussing this with the Mixxx folks and make sure the system actually works for them, since it would add more complexity on the export side that Ardour would probably handle.
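The space-saving variant can be sketched the same way (again with made-up stem values): recover the omitted stem by cancelling all the stored inverses against the main mix.

```python
# Omit the synths inverse; recover the synths stem by adding every
# stored inverse stem to the main mix.
drums  = [0.1, 0.2, -0.1]
bass   = [0.3, -0.1, 0.2]
synths = [0.0, 0.1, 0.1]
vocals = [0.2, 0.2, -0.3]
main = [sum(t) for t in zip(drums, bass, synths, vocals)]

inv_drums  = [-x for x in drums]   # stored
inv_bass   = [-x for x in bass]    # stored
inv_vocals = [-x for x in vocals]  # stored; inv_synths is NOT stored

# Solo the synths: main + inv_drums + inv_bass + inv_vocals
solo_synths = [m + d + b + v for m, d, b, v in
               zip(main, inv_drums, inv_bass, inv_vocals)]

assert all(abs(a - b) < 1e-12 for a, b in zip(solo_synths, synths))
```

Note the CPU cost: soloing the omitted stem now needs four streams mixed (and pitch-shifted) instead of one, which is the trade-off mentioned above.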

Wiki has some pretty good articles.

If your mixdown track has mastering applied, wouldn't you need to apply the exact same mastering to each inverted stem as well? Otherwise you'd never get an accurate diff that completely removes the individual track. Without the mastering, the stem is no longer a perfect diff.

Also, you could always sum all stems and then pitch shift the single track?

Edit: I kinda forgot you're talking about DJing, not sound engineering. Though the math and processing rules are the same, the scale (track count) is smaller, I suppose. For accurate diffs you'd still need to be very careful and thoughtful when creating them. You might save processing power in some cases; others are identical, I guess.

Removing an inaccurate diff from a track, you’d get artifacts, ghosting, noise and whatnot. Adding an inaccurate track to a mix, you won’t get the perfect final mix again (as you said earlier), but it would at least sound natural and complete without unwanted artifacts.

At least, that's what I think.

Indeed; right now as I experiment with this I've got the original final mix as the first track (for regular media players; e.g. if you open it in VLC, this track gets selected and you can listen to the original song), and then the stem tracks include a mixed stems track and three of the other stems, phase inverted. I don't think you want the mastering applied to the individual stems; it's just going to sound odd (at least, as far as I can tell). At the end of the day, though, the existing stems formats don't do mastering either (or, like NI's format, have a sort of lesser mastering step where you can license a specific DSP from them and use that).

I’m honestly not sure why in the DJ software I’ve been talking to they don’t do the stem mix, pitch shift that, then do the final mix. I’ll ask.

Thanks all, this has been very helpful as I think through whether this new scheme is actually worth it or just unnecessary complexity.

EDIT: the reason the software doesn’t do the mix and then pitch shift is because sometimes individual effects need to be applied to each stem and this has to happen post pitch-shift so that the effect is not also shifted, so you are stuck shifting all four stems.


It would never work as expected with most mastering chains. Compression or limiting reacts much differently to a summed complete mix than to individual instruments or small groups of instruments. Similarly any kind of multi-band or dynamic EQ is going to react much differently to the final mix than to stems.
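This is easy to demonstrate with a toy example: using hard clipping as a stand-in for a limiter (any nonlinear mastering step behaves the same way), subtracting an unprocessed stem from the mastered mix does not isolate the other stems.

```python
# A nonlinear "mastering" step breaks the diff:
# limit(a + b) - a  !=  b  in general, because the limiter acted on
# the sum, not on the parts.
def limit(samples, ceiling=0.8):
    """Hard clip to +/- ceiling, a crude stand-in for a limiter."""
    return [max(-ceiling, min(ceiling, v)) for v in samples]

stem_a = [0.6, 0.5, -0.7]
stem_b = [0.5, -0.2, -0.4]
mix = [a + b for a, b in zip(stem_a, stem_b)]   # [1.1, 0.3, -1.1]

mastered_mix = limit(mix)                        # [0.8, 0.3, -0.8]
residue = [m - a for m, a in zip(mastered_mix, stem_a)]

# The residue is not stem_b: clipped peaks leave artifacts behind.
assert residue != stem_b
```

Mixing and inversion are linear, so they commute with each other; compression, limiting, and dynamic EQ are not, so they don't.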

Simple addition (mixing) is very low processing power, and inversion is completely trivial processing power. Having a mixed stems track along with a few phase inverted tracks seems pretty pointless when you could just have all the stems and mix them together as desired.
This seems overly complicated for no benefit.

My general approach to something like this is to work out the math involved, and then show from the math why it is an improvement over the straightforward approach. Until you can do that, you don't really have a deep understanding of what you are suggesting. Having the math written out will also make it easier to explain to other people without a lot of hand waving and imprecise verbiage.

Exactly.

Correct. To be clear, I'm not super concerned with how it's actually stored (i.e. whether the tracks are stored inverted or we do one additional multiplication will never matter). The point is whether we should store a cached version of the pre-mixed stems or not.

Anyways, it’s just an idea that a few of us were toying with so I mentioned it here, you’re correct that I don’t know if it actually makes a big enough difference to matter yet.