Transcribing A File Using carelesswhisper Pt II

Author

Robert M Flight

Published

June 11, 2023

TL;DR

You can really improve the quality of the transcription you get from {whisper} if you use a bigger, language-specific model.

Previously On …

In my last post, I showed how you can use {carelesswhisper} with the {av} and {tuneR} packages to quickly chunk up an audio file and transcribe each chunk, without having a large WAV file sitting on disk or reading a large amount of WAV data into memory.

Improving Results

I did hypothesize that using a bigger model would improve things, and coolbutuseless also commented on Mastodon that bigger, or more language-specific, models rapidly improve the quality of the audio transcription. So I thought I would try my hand at it and see what we can do.
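The GGML-format model files that {carelesswhisper} loads are the same ones whisper.cpp uses, so we can grab bigger or language-specific models directly. A minimal sketch of fetching one, assuming the ggerganov/whisper.cpp Hugging Face repository and an illustrative file name:

# a sketch: download a bigger, English-only model (file name illustrative)
base_url = "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/"
model_name = "ggml-base.en.bin"
download.file(paste0(base_url, model_name), destfile = model_name, mode = "wb")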

targets

Some of the models take a long time to run, so this document is actually built as its own {targets} workflow, using the Target Markdown guidelines from the {targets} package manual (Landau 2023).

library(targets)
tar_unscript()
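The function definitions below are declared as pipeline globals using the targets knitr engine. A sketch of what the chunk header looks like in the source document (options mirror the cell metadata that Target Markdown records):

```{targets functions, tar_globals = TRUE}
# function definitions go here; tar_globals = TRUE makes them
# available to every target in the pipeline
```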


# split the audio duration into 30-second windows with a 2-second overlap
split_time = function(audio_length)
{
  n_piece = ceiling(audio_length / 30) + 2
  out_time = vector("list", n_piece)
  start_time = 0
  for (i_piece in seq_len(n_piece)) {
    if (start_time > audio_length) {
      break
    }
    out_time[[i_piece]] = c(start = start_time,
                            end = min(c(start_time + 30, audio_length)))
    # advance by 28 seconds so consecutive windows overlap by 2 seconds
    start_time = start_time + 28
  }
  
  # drop the pre-allocated slots we never filled
  not_null = purrr::map_lgl(out_time, \(x){!is.null(x)})
  out_time[not_null]
}
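As a quick check, here is what split_time returns for a hypothetical 75-second file:

# a hypothetical 75-second recording: three overlapping windows
split_time(75)
#> [[1]]
#> start   end
#>     0    30
#>
#> [[2]]
#> start   end
#>    28    58
#>
#> [[3]]
#> start   end
#>    56    75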

# rescale the integer samples in the wav object to floating point
# values in [-1, 1], the range whisper expects
rescale_wav = function(wav_object)
{
  wav_values = wav_object@left
  wav_range = max(abs(wav_values))
  wav_sign = sign(wav_values)
  wav_fraction = abs(wav_values) / wav_range
  wav_convert = wav_sign * wav_fraction
  wav_convert
}
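A toy example (the sample values here are made up) shows the effect, with the largest magnitude mapping to +/- 1:

# hypothetical 16-bit samples in a tuneR Wave object
toy_wave = tuneR::Wave(left = c(-16384, 0, 8192, 16384),
                       samp.rate = 16000, bit = 16)
rescale_wav(toy_wave)
#> [1] -1.0  0.0  0.5  1.0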

# do the actual conversion and transcription
inner_convert_transcribe = function(time_index, audio_file, whisper_model)
{
  out_wav = av::av_audio_convert(audio_file, output = "tmp.wav", channels = 1,
                                 sample_rate = 16000, start_time = time_index["start"],
                                 total_time = time_index["end"] - time_index["start"],
                                 verbose = FALSE)
  in_audio = tuneR::readWave("tmp.wav")
  scaled_audio = rescale_wav(in_audio)
  transcribed = carelesswhisper::whisper(whisper_model, scaled_audio)
  transcribed
}
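To see it in action on a single chunk, a sketch (the model file name is illustrative):

# transcribe just the first 30 seconds of the file
model = carelesswhisper::whisper_init("ggml-base.en.bin")
inner_convert_transcribe(c(start = 0, end = 30),
                         "20s vs 40s Vacation [DFM4Ey5oyhI].m4a",
                         model)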
Run code and assign objects to the environment.


tar_target(test_output, message("Hi!"))
Run code and assign objects to the environment.
<tar_stem> 
  name: test_output 
  command:
    message("Hi!") 
  format: rds 
  repository: local 
  iteration method: vector 
  error mode: stop 
  memory mode: persistent 
  storage mode: main 
  retrieval mode: main 
  deployment mode: worker 
  priority: 0 
  resources:
    list() 
  cue:
    mode: thorough
    command: TRUE
    depend: TRUE
    format: TRUE
    repository: TRUE
    iteration: TRUE
    file: TRUE
    seed: TRUE 
  packages:
    targets
    stats
    graphics
    grDevices
    datasets
    utils
    methods
    base 
  library:
    NULL
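With the pipeline scripted to _targets.R, it is built and inspected with the standard {targets} calls; a minimal sketch:

# build the pipeline defined in _targets.R, then read a target's value back
targets::tar_make()
targets::tar_read(test_output)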

Original Functions

The functions defined above (split_time, rescale_wav, and inner_convert_transcribe) are the same ones we used previously.

Updated To Use Different Models

We modify our wrapper function slightly so that it can take a different model file.

# convert an actual file: split it into chunks, load the requested model,
# transcribe each chunk, and time the whole run
convert_transcribe = function(audio_file, model_file)
{
  t_start = Sys.time()
  audio_info = av::av_media_info(audio_file)
  audio_length = audio_info$duration
  
  time_splits = split_time(audio_length)
  whisper_model = carelesswhisper::whisper_init(model_file)
  transcribed_data = purrr::map(time_splits,
                                inner_convert_transcribe,
                                audio_file,
                                whisper_model)
  # clean up the temporary chunk file left by the last conversion
  file.remove("tmp.wav")
  t_end = Sys.time()
  list(transcription = transcribed_data,
       elapsed = as.numeric(difftime(t_end, t_start, units = "secs")))
}
audio_file = "20s vs 40s Vacation [DFM4Ey5oyhI].m4a"
model_files = dir(".", pattern = "bin$")

model_results = purrr::map(model_files, \(x){
  message(x)
  convert_transcribe(audio_file, x)
})
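With the results in hand, we can quickly compare how long each model took, a sketch assuming all of the runs above completed:

# name the results by model file and pull out the elapsed times (seconds)
names(model_results) = model_files
purrr::map_dbl(model_results, "elapsed")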

References

Landau, William. 2023. "The {targets} R Package User Manual." https://books.ropensci.org/targets/markdown.html.


Citation

BibTeX citation:
@online{mflight2023,
  author = {Robert M Flight},
  title = {Transcribing {A} {File} {Using} Carelesswhisper {Pt} {II}},
  date = {2023-06-11},
  url = {https://rmflight.github.io/posts/2023-06-11-transcribing-a-file-using-carelesswhisper-pt-ii},
  langid = {en}
}
For attribution, please cite this work as:
Robert M Flight. 2023. “Transcribing A File Using Carelesswhisper Pt II.” June 11, 2023. https://rmflight.github.io/posts/2023-06-11-transcribing-a-file-using-carelesswhisper-pt-ii.