A peek under the hood of Codemill’s speech-to-text

By Vidispine - May 3, 2015 (Last updated: April 3, 2017)

Codemill have integrated a couple of speech-to-text technologies into Vidispine transcoders and tested them, so that audio can be transformed into metadata. Here are a few insights from that work.

Speech recognition, how hard could it be? Speech consists of a bunch of sounds uttered one after the other, basically corresponding to written characters, and together they form words. What more could there be to it?

Quite a lot, it turns out. To be sure, the basic idea outlined above is not wrong, but the complications are vast. A word or part of a word can be pronounced very differently depending on context (the words said before and after it, if you will). Conversely, different words and parts of words may sound the same depending on context. It turns out that understanding the meaning of the words is very important for identifying which words are spoken, both for humans and machines – otherwise, we wouldn’t be able to hear the difference between “I speak English” and “Eyes peek English”.

Speech recognition – decoding the speech signal – is usually done in two steps: acoustic and language modelling. Both the acoustic and the language model are unique to a specific language and accent. The acoustic model is responsible for decoding the audio signal, turning it into a set of phonemes – parts of words. The language model tells how to map a succession of phonemes to a sentence: it encodes how sentences are built up in the language. Building an acoustic and a language model – training – requires data sets consisting of hundreds of hours of recorded speech, divided into short files, with corresponding transcripts. It’s computationally intense, but luckily it only has to be done once, and good open-source models are available on the web for download.
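
To make the division of labour concrete, here is a minimal sketch in Python. The hypotheses, probabilities and names are all invented for illustration – a real decoder searches over vast numbers of hypotheses rather than two hard-coded strings:

```python
import math

# Toy illustration of combining the two models during decoding.
# All probabilities here are invented; real systems estimate them from
# hours of audio (acoustic model) and large text corpora (language model).

# Acoustic model: how well each hypothesis matches the audio, P(audio | words).
acoustic_log_prob = {
    "i speak english":   math.log(0.020),
    "eyes peek english": math.log(0.025),  # acoustically a slightly *better* match
}

# Language model: how plausible the word sequence is on its own, P(words).
language_log_prob = {
    "i speak english":   math.log(1e-6),
    "eyes peek english": math.log(1e-11),
}

def score(hypothesis: str) -> float:
    """Combined score: log P(audio | words) + log P(words)."""
    return acoustic_log_prob[hypothesis] + language_log_prob[hypothesis]

best = max(acoustic_log_prob, key=score)
print(best)  # "i speak english" – the language model settles the tie
```

Note how the acoustic scores alone would pick the wrong reading; it is the language model’s vote that tips the balance.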

It’s all probabilistic: for a given set of phonemes, there is more than one possible sequence of words, given all the ambiguities in each step of the decoding. The program has to work out the most probable sequence of words with the help of the language model, and if needed (for example for keyword search), the second-best, third-best or further alternatives can be listed as well.
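
As a rough illustration (with made-up transcripts and scores), listing the n best alternatives is just a matter of sorting the scored hypotheses:

```python
# Hypothetical combined scores (log-probabilities) for candidate transcripts
# of one utterance; the texts and numbers are invented for illustration.
candidates = {
    "i speak english":   -15.0,
    "i speak in glitch": -19.5,
    "eyes peek english": -29.1,
}

def n_best(scored: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Return the n highest-scoring hypotheses, best first."""
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:n]

# A keyword search could scan all of these, not just the winner.
for text, log_prob in n_best(candidates, n=2):
    print(f"{log_prob:8.1f}  {text}")
```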

A common type of language model is based on so-called trigrams. This is what Codemill’s basic speech-to-text engine is built on. Trigram means, basically, three words: instead of specifying the grammar of the language, which is not an easy task (though research is active in that area), one specifies the probability that three given words occur after one another: “I speak English” is much more probable than “Eyes peek English”. This makes for a relatively simple and fast algorithm, and a language-model file that in principle contains an enumeration of three-word combinations with associated probabilities. The downside is that the output is not guaranteed to be grammatically correct. The reader is invited to try to think of a sentence that is not grammatically correct, even though every three-word sequence in it is valid.
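
Here is a hedged sketch of the idea in Python: a toy trigram table with invented probabilities, and a function that scores a sentence by summing, in log space, the probabilities of its three-word windows:

```python
import math

# A toy trigram language model: P(w3 | w1, w2). The probabilities are
# invented; real models are estimated from large text corpora and use
# smoothing/backoff for unseen trigrams – here they just get a floor value.
TRIGRAMS = {
    ("<s>", "<s>", "i"):         0.050,
    ("<s>", "i", "speak"):       0.010,
    ("i", "speak", "english"):   0.200,
    ("<s>", "<s>", "eyes"):      0.001,
    ("<s>", "eyes", "peek"):     0.00001,
    ("eyes", "peek", "english"): 0.0001,
}
FLOOR = 1e-9  # crude stand-in for proper smoothing

def sentence_log_prob(words: list[str]) -> float:
    """Sum log P(w_i | w_{i-2}, w_{i-1}) over the sentence, padding with <s>."""
    padded = ["<s>", "<s>"] + words
    return sum(
        math.log(TRIGRAMS.get(tuple(padded[i - 2 : i + 1]), FLOOR))
        for i in range(2, len(padded))
    )

print(sentence_log_prob("i speak english".split()))    # comparatively high
print(sentence_log_prob("eyes peek english".split()))  # much lower
```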


The more general you want your speech engine to be, the harder it is to get good accuracy. Learning a limited set of commands, with a vocabulary of, say, 100 words, can be done almost perfectly. Learning a single voice is significantly easier than handling any voice. Short commands are easier than continuous speech. But if you want to subtitle a film, you want to do what experts call LVCSR – large-vocabulary continuous speech recognition – and this is still largely an unsolved problem (especially with noise on top of it!). Luckily, it is possible to work backwards, starting from a general acoustic model and adapting it to, for example, a single voice, making sure it works better for that particular speaker.

There is much more to this story: instead of trigrams, there are – of course – 4-grams and higher. A current trend is to use deep neural networks for the acoustic model, which increases accuracy significantly but at present requires too much disk space and memory to be used in smaller devices.
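
For completeness, the trigram scorer sketched above generalizes directly to any order n. Again, this is only an illustrative sketch, not Codemill’s implementation:

```python
import math

def sentence_log_prob_ngram(words: list[str],
                            ngrams: dict[tuple[str, ...], float],
                            n: int,
                            floor: float = 1e-9) -> float:
    """Sum log P(w_i | previous n-1 words), padding the start with <s>.

    `ngrams` maps n-word tuples to probabilities, like TRIGRAMS above but
    for an arbitrary order; all values are hypothetical.
    """
    padded = ["<s>"] * (n - 1) + words
    return sum(
        math.log(ngrams.get(tuple(padded[i - n + 1 : i + 1]), floor))
        for i in range(n - 1, len(padded))
    )
```

The trade-off is that the table of n-word combinations grows rapidly with n, which is one reason trigrams remain a popular sweet spot.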

The future is, as you know, ahead of us: We are watching closely as speech technology is evolving, and trying to figure out how to best put it to use for you.

If you want to know more, contact them at Codemill, and if you want to try it out for free, do sign up at VidiXplore.