Speech to Text is finally ready – Whisper review.

Many years ago I developed and published an Android app called “Made Up Stories” which I used to record the bedtime stories I told to my young son. Many of the stories were sanitized versions of movies and novels, but I also had a lot of fun creating my own characters and plots on the fly. Years later, I have a collection of over 300 stories saved, and I started to think that it would be nice to turn some of the original ones into illustrated children’s books – but the time it would take to transcribe these audio files was daunting. I am a reasonably fast typist, but not quite up to the speed of the spoken word.

Transcription software

In 2019/2020 I decided to try out transcription software, so I fed a sample audio file into Mozilla DeepSpeech and Google Speech-to-Text, among others. Unfortunately, due to my South African accent, none of the transcribers I tried was accurate enough to be useful: literally every second word was incorrect, and I shelved the idea. Also, DeepSpeech required a very specific audio sample rate, and Google Speech-to-Text isn’t free (I had a free trial). *They may have improved by now; I didn’t check.

Discovering Whisper

Last week I saw an article on The Verge singing the praises of “Whisper”, the new open-source transcription engine. It was incredibly easy to set up and ran entirely on my laptop, for free. After installation, it takes a simple one-liner on the command line to turn an audio file into text. Best of all, the thing is really accurate – I read in the Whisper paper that the model was trained on a more diverse range of speech and uses a different approach from other engines. Well, for me at least, it works!
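With over 300 recordings, running that one-liner by hand gets old quickly. Here is a rough sketch of how the same command could be looped over a whole folder from Python. It assumes the `whisper` command is installed and on the PATH, and that the recordings are `.mp3` files; the folder names and the helper functions are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_whisper_command(audio_path, model="tiny", output_dir="transcripts"):
    """Return the argument list for one run of the Whisper command-line tool."""
    return [
        "whisper",
        str(audio_path),
        "--model", model,                  # e.g. "tiny" or "large"
        "--output_dir", str(output_dir),   # where the transcript files are written
    ]


def transcribe_folder(folder, model="tiny", output_dir="transcripts"):
    """Run Whisper on every .mp3 file in a folder, one file at a time."""
    if shutil.which("whisper") is None:
        raise RuntimeError("the whisper command was not found on the PATH")
    for audio_path in sorted(Path(folder).glob("*.mp3")):
        command = build_whisper_command(audio_path, model, output_dir)
        # check=True stops the loop if Whisper fails on a file.
        subprocess.run(command, check=True)
```

Calling `transcribe_folder("stories")` would then work through a hypothetical `stories` folder and leave the transcripts in `transcripts`.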

There is still a lot to do with the transcriptions, which come out as one long line of text with barely any punctuation. And don’t get me wrong, there are still some errors. (Admittedly I was using the “tiny” model, which is only 350 MB; the “large” model is apparently even more accurate but takes up 5-6 GB of space.)
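One idea for breaking up that long line, which I have not tried on the full collection: Whisper's Python API returns a list of timed segments alongside the text, so a pause in the storytelling could be treated as a paragraph break. The sketch below assumes each segment is a dictionary with `start`, `end` and `text` keys, as in the `segments` list returned by `model.transcribe(...)`; the function name and the 1.5 second threshold are my own guesses.

```python
import textwrap


def segments_to_paragraphs(segments, pause_seconds=1.5, width=80):
    """Group timed segments into paragraphs, splitting wherever the speaker paused."""
    paragraphs = []
    current = []
    previous_end = None
    for segment in segments:
        text = segment["text"].strip()
        if not text:
            continue
        # A long enough gap since the previous segment starts a new paragraph.
        if current and previous_end is not None:
            if segment["start"] - previous_end >= pause_seconds:
                paragraphs.append(" ".join(current))
                current = []
        current.append(text)
        previous_end = segment["end"]
    if current:
        paragraphs.append(" ".join(current))
    # Wrap each paragraph to a readable width and separate them with blank lines.
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)
```

This does nothing for the missing punctuation, but it should at least give the text some shape before the manual clean-up starts.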

Finishing touches

Well, I have cleaned up the first story, “Pinky the Amazon river dolphin visits the Mesozoic”. My wife is an amazing artist and has agreed to illustrate it, but I wonder: could AI help with that as well?

Dinosaurs on an ancient river bank – Craiyon generator
Underwater cave – Stable Diffusion

Maybe not... (The dolphin images that I have managed to generate so far have only succeeded in making me feel ill.) I have signed up for “Midjourney” (which is DOWN at the time of writing). Apparently it is the best of the bunch – so who knows, maybe I can automate some backgrounds to save time illustrating the stories.

*Update: I signed up for DALL-E 2.

DALL-E 2 “Dinosaurs on an ancient river bank”
DALL-E 2 “Underwater cave scene”

I’m definitely going to do a more in-depth comparison of all of these image generators in future!
