I was using Ardour for spoken-word editing recently. Even though it is audio editing, one thinks in terms of phrases, words, and perhaps speakers, just as with music one naturally thinks in beats, bars, phrases, and instruments.

Besides the actual editing, I spent most of the time navigating the waveform, playing a second or two, and then locating that spot in the transcript.

It would be immensely useful to be able to import speech-recognition output (e.g. from Whisper), which includes timestamps, and have the text displayed along the waveform. I was tempted to try a Lua script to place range cue markers (that should be possible, right?), but I realized the display would be too crowded when zoomed out, and only the beginning of each word/phrase would be positioned correctly.
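
To make the import side concrete, here is a rough sketch of reading timed segments out of speech-recognition output. It assumes the JSON layout that the Whisper command-line tool writes (a top-level `segments` list whose entries carry `start`, `end`, and `text`); the function name `load_segments` and the sample data are made up for illustration, and nothing here touches Ardour itself.

```python
import json

def load_segments(whisper_json_text):
    """Return a list of (start_sec, end_sec, text) tuples.

    Assumes Whisper-style JSON: {"segments": [{"start", "end", "text"}, ...]}.
    """
    data = json.loads(whisper_json_text)
    segments = []
    for seg in data.get("segments", []):
        text = seg["text"].strip()  # segment text usually has a leading space
        if text:                    # skip empty segments
            segments.append((float(seg["start"]), float(seg["end"]), text))
    return segments

# Hypothetical sample input, not real recognizer output.
sample = (
    '{"segments": ['
    '{"start": 0.0, "end": 2.4, "text": " Hello there."},'
    '{"start": 2.4, "end": 5.1, "text": " Welcome back."}'
    ']}'
)

for start, end, text in load_segments(sample):
    print(f"{start:7.2f} - {end:7.2f}  {text}")
```

Each tuple carries both a start and an end time, which is what a range (rather than a single-point marker) would need.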

Having cue ranges (with some level-of-detail intelligence, such as ellipsizing or completely hiding the text) would be awesome.
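
The level-of-detail idea could be as simple as the sketch below: given how wide a range is on screen at the current zoom, show the full text, an ellipsized version, or nothing. The function name `label_for_width`, the fixed `char_px` character width, and the `min_chars` threshold are all assumptions for illustration; a real implementation would measure the text with the actual font.

```python
def label_for_width(text, width_px, char_px=7, min_chars=4):
    """Pick what to draw for a range that is width_px wide on screen.

    char_px is an assumed average character width in pixels;
    min_chars is the assumed minimum below which the label is hidden.
    """
    max_chars = int(width_px // char_px)
    if max_chars < min_chars:
        return ""                # too narrow: hide the label entirely
    if len(text) <= max_chars:
        return text              # fits: show it in full
    # Otherwise cut the text and mark the cut with an ellipsis.
    return text[:max_chars - 1].rstrip() + "…"

print(repr(label_for_width("Hello there.", 200)))  # zoomed in
print(repr(label_for_width("Hello there.", 56)))   # mid zoom
print(repr(label_for_width("Hello there.", 20)))   # zoomed out
```

Hiding the label below a minimum width is what would keep zoomed-out views from getting crowded.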

