Deciphering Life: One Bit at a Time - Transcribing A File Using carelesswhisper

TL;DR

You can iterate through an audio file and transcribe it using the new carelesswhisper package (coolbutuseless 2023).

carelesswhisper

If you don’t follow coolbutuseless on Mastodon, you are missing out, especially if you want to know about really intriguing R packages (coolbutuseless, n.d.). Just this week, they released a new R package built upon the whisper C library (coolbutuseless 2023). I don’t do any actual work with audio, least of all human speech that one might want to run automatic speech recognition or transcription on. Still, I dug into the package to see whether it could transcribe a full audio file.

Setup

We are going to need a few packages besides carelesswhisper.

# for transcription
library(carelesswhisper)
# for converting from m4a
library(av)
# for audio processing
library(tuneR)


Attaching package: 'tuneR'

The following object is masked from 'package:av':

    sine

We also need an audio file. I decided to download an episode of Holderness Family Laughs (Family 2023), and see if we can transcribe it. I used yt-dlp to download and convert to audio (“YT-DLP” 2023).

yt-dlp https://youtu.be/DFM4Ey5oyhI -x --audio-format=m4a

Conversion

We don’t want to convert the whole file at once, because that creates a giant piece of audio that then has to be loaded into memory. Our 5 minute Holderness episode alone is 69 MB on disk, and longer recordings would be even worse. Instead, we can convert and transcribe piece by piece over the length of the file.

When we do the conversion, we have to keep in mind what carelesswhisper expects. It wants single-channel, 16 kHz audio, which we can arrange during conversion. It also expects sample values between -1 and 1, which we can take care of after conversion.
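As a rough illustration of the range requirement, here is a minimal base R sketch. It assumes the samples arrive as 16-bit integers, which is the form tuneR::readWave() gives for 16-bit PCM files, and it uses a synthetic 440 Hz tone rather than real audio. All variable names are made up for this example.

```r
# One second of synthetic mono audio at 16 kHz, stored as 16-bit style
# integer samples (an assumption standing in for a real recording).
sample_rate = 16000
t_sec = (seq_len(sample_rate) - 1) / sample_rate
samples_int = round(20000 * sin(2 * pi * 440 * t_sec))

# Peak normalisation: divide by the largest absolute sample value,
# so every value ends up between -1 and 1.
peak = max(abs(samples_int))
samples_scaled = samples_int / peak

range(samples_int)
range(samples_scaled)
```

Dividing by the peak is only one way to get into range. It also changes the loudness of each piece, which seems to be fine for transcription here.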

So we will write functions to split the audio into 30 second pieces, convert each piece, read it in, rescale it, and transcribe it. Consecutive pieces overlap by 2 seconds, so that a word cut off at the end of one piece should appear in full in the next.
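To make the overlap concrete, here is a small base R sketch of the window arithmetic on its own. The helper window_times() is hypothetical and only for illustration. It assumes 30 second windows whose starts are 28 seconds apart, and a made-up audio length of 70 seconds.

```r
# Hypothetical helper: start and end times (in seconds) for fixed-length
# windows that overlap their neighbours by `overlap` seconds.
window_times = function(audio_length, window = 30, overlap = 2)
{
  step = window - overlap
  starts = seq(0, audio_length, by = step)
  # drop a window that would start exactly at the end of the audio
  starts = starts[starts < audio_length]
  # the last window is clipped to the end of the audio
  data.frame(start = starts, end = pmin(starts + window, audio_length))
}

window_times(70)
#   start end
# 1     0  30
# 2    28  58
# 3    56  70
```

Each window starts 2 seconds before the previous one ends, and the final window is simply shorter than the rest.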

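# generate time indexes for transcription:
# 30 second pieces whose starts are 28 seconds apart (2 second overlap)
split_time = function(audio_length)
{
  # starts advance by 28 seconds, so this many pieces cover the whole file
  # (ceiling(audio_length / 30) + 2 falls short for long recordings)
  n_piece = ceiling(audio_length / 28)
  out_time = vector("list", n_piece)
  start_time = 0
  for (i_piece in seq_len(n_piece)) {
    # guard: never emit a zero-length piece at the very end
    if (start_time >= audio_length) {
      break
    }
    out_time[[i_piece]] = c(start = start_time,
                            end = min(start_time + 30, audio_length))
    start_time = start_time + 28
  }
  
  # keep only the slots that were actually filled
  not_null = purrr::map_lgl(out_time, \(x){!is.null(x)})
  out_time[not_null]
}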

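# rescale the values in the wav object to lie between -1 and 1
rescale_wav = function(wav_object)
{
  # single channel audio is stored in the left channel
  wav_values = wav_object@left
  wav_range = max(abs(wav_values))
  # a completely silent piece has a peak of 0; avoid dividing by zero
  if (wav_range == 0) {
    return(as.numeric(wav_values))
  }
  # dividing by the peak keeps the sign and maps the values into [-1, 1]
  wav_values / wav_range
}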

# do the actual conversion and transcription
inner_convert_transcribe = function(time_index, audio_file, whisper_model)
{
  out_wav = av::av_audio_convert(audio_file, output = "tmp.wav", channels = 1,
                                 sample_rate = 16000, start_time = time_index["start"],
                                 total_time = time_index["end"] - time_index["start"],
                                 verbose = FALSE)
  in_audio = tuneR::readWave("tmp.wav")
  scaled_audio = rescale_wav(in_audio)
  transcribed = carelesswhisper::whisper(whisper_model, scaled_audio)
  transcribed
}

# convert an actual file
convert_transcribe = function(audio_file)
{
  # audio_file = "posts/2023-06-10-transcribing-a-file-using-carelesswhisper/20s vs 40s Vacation [DFM4Ey5oyhI].m4a"
  audio_info = av::av_media_info(audio_file)
  audio_length = audio_info$duration
  
  time_splits = split_time(audio_length)
  whisper_model = carelesswhisper::whisper_init()
  transcribed_data = purrr::map(time_splits,
                                inner_convert_transcribe,
                                audio_file,
                                whisper_model)
  file.remove("tmp.wav")
  transcribed_data
}

OK, so we have our four functions. Let’s see if we can run through the entire 5 minutes, and how long it takes.

tictoc::tic()
audio_file = "20s vs 40s Vacation [DFM4Ey5oyhI].m4a"
holderness_transcribed = convert_transcribe(audio_file)
tictoc::toc()

292.956 sec elapsed

holderness_transcribed[c(1:3)]

[[1]]
[1] "I've accrue 3.4 vacation days after working at my job for an entire year. If we stay up the entire time, that's like a 10 day vacation. Okay, so we have the whole week off, but really it's 2.5 days. Who are we kidding? It takes me a couple of days to unwind and I'd like to come back by Friday so I can go grocery shopping at the laundry done. I mean, I'm not going to just jump back into my workway. After being around all you people, I'm going to need a day alone on my phone just in the dark. Okay, three bikinis, a cover up, and some heels. I'll pack, to know how to eat!"

[[2]]
[1] "heels, I'll pack to know how to, I need this bag. These are my good hats, I can't just put my good hats in with clothes, it'll ruin. This is my supplement, that's the bag full of stuff I know the kids are gonna forget because they pack like it is. It's like 12 pair of underwear, one pair of socks, and a sweater. We're going to the beat. This is the snack bag, we're not buying snacks when we get to the beat. I'm at pain, $10 for a bag of chips. Yeah. Gotta bring my good shampoo, that travel shampoo does not cut it."

[[3]]
[1] "who does not cut it. We need a shampoo bag. Yes, I do. I price line this hotel. If we split this one room between the 12 of us, it'll be super affordable. Let's go girls. Let's just get the same house we went to last year. I mean, I know that it costs more than a car, but everyone will get their own bathroom. Everyone needs their own space. For their special morning time, right babe? If we sleep, shoulder to shoulder, we can at least get four people on the bed. And easily eight on the floor, right here. Cook on Housekeeping and get"
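Since the result is a list with one string per piece, a short base R sketch like the following could collect it into a single text file. The transcribed_chunks list here is a placeholder standing in for the real output, and the temporary file path is only for illustration.

```r
# Stand-in for the list returned by convert_transcribe(); these strings
# are placeholders, not real transcription output.
transcribed_chunks = list(
  "first chunk of text",
  "second chunk of text",
  "third chunk of text"
)

# One chunk per paragraph, separated by blank lines.
full_text = paste(unlist(transcribed_chunks), collapse = "\n\n")

# Write to a temporary file; swap in a real path to keep the transcript.
out_file = tempfile(fileext = ".txt")
writeLines(full_text, out_file)
```

Because the pieces overlap by 2 seconds, text near each boundary shows up in two chunks. This sketch makes no attempt to remove that repetition.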

So this took about 293 seconds, just longer than the video itself, which honestly isn’t too bad. Converting small pieces of the file and then rescaling them to work with whisper doesn’t seem to impose much overhead. Is the transcription perfect? No. Is it pretty decent? I think so, especially given that we can do it all locally, from our own audio file. The only other free way I know to do this is “Voice Typing” in Google Docs, which is only accessible from the Chrome browser and means putting your audio through Google’s servers.

Also, carelesswhisper’s default is to use the smallest possible model. Other models may work on your machine, depending on how much RAM you have available. The carelesswhisper README mentions how to use other speech recognition models from the whisper website (coolbutuseless 2023).

References

coolbutuseless. 2023. “Carelesswhisper.” https://github.com/coolbutuseless/carelesswhisper.

———. n.d. https://fosstodon.org/@coolbutuseless.

Family, Holderness. 2023. “20s Vs 40s Vacation.” https://youtu.be/DFM4Ey5oyhI.

“YT-DLP.” 2023. https://github.com/yt-dlp/yt-dlp.

Reuse

https://creativecommons.org/licenses/by/4.0/

Citation

BibTeX citation:

@online{mflight2023,
  author = {Robert M Flight},
  title = {Transcribing {A} {File} {Using} Carelesswhisper},
  date = {2023-06-11},
  url = {https://rmflight.github.io/posts/2023-06-11-transcribing-a-file-using-carelesswhisper},
  langid = {en}
}

For attribution, please cite this work as:

Robert M Flight. 2023. “Transcribing A File Using Carelesswhisper.” June 11, 2023. https://rmflight.github.io/posts/2023-06-11-transcribing-a-file-using-carelesswhisper.