Olaf Janssen, Wikimedia coordinator of the KB national library of the Netherlands
Latest update: 5 November 2024
https://doi.org/10.5281/zenodo.14047913
I used to think that ‘doing things with AI’ was equivalant to smoking data centers, overheated servers, and massive cloud computing power. But this month, I had a jaw-dropping WTF OMG tech discovery: realizing that some AI tasks can run smoothly on a modest laptop, and even offline!
I was searching for a solid solution to convert speech from a video file into text (also known as audio transcription, speech-to-text, or Automatic Speech Recognition, ASR) and found that this can all happen right on my own machine.
Using a recent video presentation I recorded, I wanted to apply ASR for several reasons:
Of course there are all kinds of existing audio-to-text cloud services, but they come with various downsides, including:
For my little ASR project, I wanted to avoid these disadvantages as much as possible.
As I work with ChatGPT regularly, I had heard of Whisper, OpenAI’s speech-to-text model, but I never actually looked into it or used it. So I thought I’d give it a try!
After some research to see if Whisper would suit my ASR needs, I found out that this model excels in Dutch, but it also performs very well in English.
So that sounded very promising. But (as far as I know) Whisper doesn’t offer a user-friendly front end, so I had to work with the API and Python. Fortunately, I found this short blog post to help me get started, and, combined with the documentation, it was quite straightforward to set things up.
Later in this article, you’ll see what I ultimately created with it, along with ready-to-use Python code so you can try it out for yourself.
To use the Whisper API with Python, you’ll need to install FFmpeg on your laptop. This WikiHow guide provides clear, step-by-step instructions for setup. I followed it on a laptop running Windows 10 Pro, and here’s what the setup looked like once completed.
When you run this piece of Python code for the first time,
the ‘large’ model is downloaded to your machine once. (See here for the available models.) To my great surprise, this turned out to be just a single 3GB file, handling all speech-to-text tasks, without needing any further internet connection. So no smoking data centers, overheated servers, or massive cloud computing power, but just a file on your own computer that you can use offline. Best of all, it’s great for privacy, as all processing happens entirely on your own device, ensuring your data stays private and secure.
Here’s a screenshot of the model on my home laptop. What happens inside that .pt file is pure magic!
Does transcription run at a reasonable speed? With the ‘large-v2’ model I’m using, transcription operates at roughly real-time, so a 15-minute audio file takes about 15-20 minutes to process. Smaller models, like ‘base’ and ‘medium,’ are faster but typically produce lower-quality transcriptions.
Besides Whisper’s offline capabilities, I am utterly amazed by the quality of the generated text. I can show this best through this (rather dull and quite lengthy) test video in which I used myself as the test subject:
The unformatted text block in the file description was generated entirely by Whisper, with only minimal human post-corrections. Take note of how accurately it handles named entities, technical terms, and proper capitalization, truly impressive!
In the video, you can tell I wasn’t making an effort to speak clearly, loudly, enthusiastically, or fluently. Yet, despite these less-than-ideal inputs, Whisper still managed to produce a fantastic transcription using just that 3GB .pt file (and FFmpeg). Absolutely amazing!
And the subtitles (closed captions) you see in the video were also completely generated by Whisper, in which all timings are spot-on as well.
To share my knowledge and code, I created the GitHub repo https://github.com/KBNLresearch/videotools
The relevant module is transcribe_audio.py, which is run from runtools.py, the main function of this repo.
If you want, you can have the audio transcript corrected by ChatGPT, for which I made an initial setup in ai_correct_audiotranscripts.py. To use this, you’ll need an OpenAI API key. But please note that you’ll lose the privacy advantage and offline use, as the ChatGPT models are far too large to run on a personal laptop.
As a side product, I also created a few other video and audio tools that only require FFmpeg, without a need for Whisper or ChatGPT.
Since this was just a first experiment with this new piece of AI for me, I’d love to hear your questions, feedback, tips, etc. You can find my contact details below.
The Videotools repo is developed and maintained by Olaf Janssen, Wikimedia coordinator @KB, national library of the Netherlands. You can find his contact details on his KB expert page or via his Wikimedia user page.
All original materials in this repo, expect for the blog article header, are released under the CC0 1.0 Universal license, effectively donating all original content to the public domain.