Less than five months after Microsoft announced that its speech recognition technology had achieved a 5.9 percent word error rate (WER), bringing the technology closer to human performance, IBM announced that it has achieved a 5.5 percent WER, setting a new record for machines. The term word error rate (WER) refers to how often a speech recognition or translation system gets words wrong compared with a reference transcript, so a lower figure is better. Machine results are usually measured against the human record, which currently stands at 5.1 percent.
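To make the metric concrete, here is a minimal sketch of how WER is commonly computed: count the word substitutions, deletions and insertions needed to turn the system's output into the reference transcript, then divide by the number of words in the reference. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not anything taken from IBM's or Microsoft's evaluation setup, and real benchmarks also normalize text (case, punctuation, filler words) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in the output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word ("sit") and one dropped word ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.1%}")
```

On this scale, a 5.5 percent WER means roughly one word in eighteen is wrong, missing or spurious.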
IBM achieved the feat by combining Long Short-Term Memory (LSTM) and WaveNet language models. Remember WaveNet? Let me refresh your memory a bit. WaveNet was created by Google’s DeepMind and is able to generate speech that closely mimics the human voice. The LSTM unit is a recurrent neural network unit that excels at remembering values over either long or short spans of time, and by learning from experience it can be used to predict time series.
IBM says that by combining both technologies, it was able to achieve a lower word error rate than Microsoft. The big challenge for both giants, though, is human parity, and this is where they disagree. Microsoft has said its 5.9 percent WER is on par with how an average human would perform in a speech recognition task, but IBM says the human benchmark is 5.1 percent, which is where it would like to get. That would match the error rate of two humans speaking, and it is the level the industry thinks speech recognition technology should aim for.
“Reaching human parity – meaning an error rate on par with that of two humans speaking – has long been the ultimate industry goal. Others in the industry are chasing this milestone alongside us, and some have recently claimed reaching 5.9 percent as equivalent to human parity…but we’re not popping the champagne yet. As part of our process in reaching today’s milestone, we determined human parity is actually lower than what anyone has yet achieved — at 5.1 percent.”