library(targets)
tar_unscript()
TL;DR
You can substantially improve the quality of the transcription you get from {whisper} if you use a bigger, language-specific model.
Previously On …
In my last post, I showed how you can use the {carelesswhisper}, {av}, and {tuneR} packages to quickly chunk up an audio file and transcribe each chunk, without having a large WAV file sitting on disk or reading a large amount of WAV data into memory.
Improving Results
I hypothesized that using a bigger model would improve things, and coolbutuseless also commented on Mastodon that bigger or more language-specific models rapidly improve the quality of the audio transcription. So I thought I would try my hand at it and see what we can do.
targets
Some of the models take a long time to run, so this document is actually built as its own {targets} workflow, using the guidelines from the supplement of the {targets} package manual (Landau 2023).
::: {.cell tar_globals=‘true’ tar_interactive=‘true’} ::: {.cell tar_globals=‘true’ tar_interactive=‘true’ tar_name=‘functions’ tar_script=’_targets.R’}
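# split the audio into 30 second windows that overlap by 2 seconds
split_time = function(audio_length)
{
  # windows advance by 28 s, so size the list by the step, not the window length;
  # otherwise the tail of a long file is never reached
  n_piece = ceiling(audio_length / 28) + 1
  out_time = vector("list", n_piece)
  start_time = 0
  for (i_piece in seq_len(n_piece)) {
    if (start_time > audio_length) {
      break
    }
    out_time[[i_piece]] = c(start = start_time,
                            end = min(c(start_time + 30, audio_length)))
    start_time = start_time + 28
  }
  # drop any slots that were never filled
  not_null = purrr::map_lgl(out_time, \(x){!is.null(x)})
  out_time[not_null]
}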
# rescale the values in the wav object
rescale_wav = function(wav_object)
{
  wav_values = wav_object@left
  wav_range = max(abs(wav_values))
  wav_sign = sign(wav_values)
  wav_fraction = abs(wav_values) / wav_range
  wav_convert = wav_sign * wav_fraction
  wav_convert
}
# do the actual conversion and transcription
inner_convert_transcribe = function(time_index, audio_file, whisper_model)
{
  out_wav = av::av_audio_convert(audio_file, output = "tmp.wav", channels = 1,
                                 sample_rate = 16000, start_time = time_index["start"],
                                 total_time = time_index["end"] - time_index["start"],
                                 verbose = FALSE)
  in_audio = tuneR::readWave("tmp.wav")
  scaled_audio = rescale_wav(in_audio)
  transcribed = carelesswhisper::whisper(whisper_model, scaled_audio)
  transcribed
}
Run code and assign objects to the environment.
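To make the chunking easier to picture, here is a minimal base R sketch of the same overlapping-window idea: 30 second windows that advance by 28 seconds, so neighbouring chunks share 2 seconds of audio. The function name `window_times` and its `window` and `step` arguments are made up for this illustration and are not part of the workflow above; unlike `split_time`, it returns a data frame and skips a window that would start exactly at the end of the audio.

```r
# Hypothetical helper, for illustration only: compute overlapping windows.
window_times = function(audio_length, window = 30, step = 28)
{
  # candidate start times: 0, step, 2 * step, ...
  starts = seq(0, audio_length, by = step)
  # drop a start that falls exactly on the end of the audio (zero-length chunk)
  starts = starts[starts < audio_length]
  # each window is `window` seconds long, clipped to the end of the audio
  ends = pmin(starts + window, audio_length)
  data.frame(start = starts, end = ends)
}

# a 65 second file gives windows 0-30, 28-58 and 56-65
window_times(65)
```

The 2 second overlap should make it less likely that a word straddling a chunk boundary is lost, at the cost of some repeated text between neighbouring chunks.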
::: {.cell tar_interactive=‘true’ tar_globals=‘true’} ::: {.cell tar_interactive=‘true’ tar_globals=‘true’ tar_name=‘test-target’ tar_script=’_targets.R’}
tar_target(test_output, message("Hi!"))
Run code and assign objects to the environment.
<tar_stem>
name: test_output
command:
message("Hi!")
format: rds
repository: local
iteration method: vector
error mode: stop
memory mode: persistent
storage mode: main
retrieval mode: main
deployment mode: worker
priority: 0
resources:
list()
cue:
mode: thorough
command: TRUE
depend: TRUE
format: TRUE
repository: TRUE
iteration: TRUE
file: TRUE
seed: TRUE
packages:
targets
stats
graphics
grDevices
datasets
utils
methods
base
library:
NULL
Original Functions
Here are the original functions we used.
Updated To Use Different Models
We are going to modify our wrapper function slightly so that it can use a different model.
# convert an actual file
convert_transcribe = function(audio_file, model_file)
{
  t_start = Sys.time()
  audio_info = av::av_media_info(audio_file)
  audio_length = audio_info$duration

  time_splits = split_time(audio_length)
  whisper_model = carelesswhisper::whisper_init(model_file)
  transcribed_data = purrr::map(time_splits,
                                inner_convert_transcribe,
                                audio_file,
                                whisper_model)
  file.remove("tmp.wav")
  t_end = Sys.time()
  list(transcription = transcribed_data,
       elapsed = as.numeric(difftime(t_end, t_start, units = "secs")))
}

audio_file = "20s vs 40s Vacation [DFM4Ey5oyhI].m4a"
model_files = dir(".", pattern = "bin$")

model_results = purrr::map(model_files, \(x){
  message(x)
  convert_transcribe(audio_file, x)
})
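Since each element of `model_results` carries an `elapsed` value in seconds, one way to compare the models is to pull those values into a small table. The sketch below uses made-up stand-ins (`mock_files`, `mock_results`, with invented file names and timings) so that it runs on its own; the placeholder transcription strings are an assumption about the shape of the output, not actual results.

```r
# Hypothetical stand-ins for `model_files` and `model_results`;
# the names and numbers here are invented for illustration.
mock_files = c("model-a.bin", "model-b.bin")
mock_results = list(
  list(transcription = list("chunk one", "chunk two"), elapsed = 12.5),
  list(transcription = list("chunk one", "chunk two"), elapsed = 80.25)
)

# one row per model, with the elapsed time in seconds
timing = data.frame(
  model = mock_files,
  elapsed_secs = vapply(mock_results, \(x) x$elapsed, numeric(1))
)

# fastest model first
timing[order(timing$elapsed_secs), ]
```

With the real objects, swapping `mock_files` for `model_files` and `mock_results` for `model_results` should give the same kind of table.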
References
Citation
@online{mflight2023,
author = {Robert M Flight},
title = {Transcribing {A} {File} {Using} Carelesswhisper {Pt} {II}},
date = {2023-06-11},
url = {https://rmflight.github.io/posts/2023-06-11-transcribing-a-file-using-carelesswhisper-pt-ii},
langid = {en}
}