Speech-to-text applications still fail to deliver

By Sibahle Malinga, ITWeb senior news journalist.
Johannesburg, 14 Sep 2022

Despite progress over the last three decades in automatic speech recognition (ASR) systems such as speech-to-text, many shortcomings remain, which limit the full potential of these natural language processing applications.

This was the word from Professor Hermann Ney, director of science at global artificial intelligence (AI) and machine learning (ML) firm AppTek, speaking this week at the second Global AI Summit in Riyadh, Saudi Arabia.

Ney delivered a keynote titled: “What is next for automatic speech recognition and machine translation?” He pointed out that ML, when applied to speech and language processing, is still marred by many inaccuracies, which pose dilemmas for users.

Before moving towards the next frontier of these software systems, which were first introduced in 1990, Ney noted voice technology developers still need to pay close attention to several unresolved challenges: limited accuracy, limited robustness, support for too few languages, difficulty distinguishing voice from background noise, and difficulty identifying different accents.

ASR is an AI-based technology that allows people to use their voices to speak with a computer interface in a way that resembles normal human conversation. ASR can also be used to generate live captions for video, or to transcribe audio into text.

Over the last few years, ASR applications such as Amazon Echo, Google Home, Cortana and Siri have become increasingly popular.

“While it’s true there has been huge progress with ASRs – due to algorithms and deep learning over the last 30 years – we still face two types of challenges. The first is that we still need to counteract the propagation of errors from ASR components to machine translation components,” he explained.
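The cascaded design Ney refers to can be sketched in a few lines. The snippet below is a minimal illustration using the open-source Hugging Face transformers pipeline API, with arbitrary public checkpoints standing in for the recognition and translation components (it is not AppTek's system): whatever the recogniser mishears is handed on, uncorrected, to the translator.

```python
# Minimal cascaded speech-translation sketch (illustrative only).
# The model names below are arbitrary public checkpoints, chosen
# here for the example; they are not AppTek's systems.
from transformers import pipeline

# Stage 1: automatic speech recognition (speech -> source-language text).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Stage 2: machine translation (source-language text -> target-language text).
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def speech_to_target_text(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]          # any misrecognised word...
    return mt(transcript)[0]["translation_text"]  # ...is translated as-is

# e.g. speech_to_target_text("meeting.wav")
# If the ASR stage hears "ship" as "sheep", the MT stage faithfully
# translates "sheep": the error propagates through the cascade.
```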

Secondly, he went on to say, only three types of data are available to train the algorithms and ML software that enable ASR systems to function fully.

Feeding the AI

To resolve these issues, the speech engine needs to be trained on larger datasets of good quality data, he pointed out.

AI is data-hungry, and ASR models require training and re-training on large datasets over long periods of time in order to learn a function, he added.

“Automatic speech recognition entails a fair amount of data, while a text-to-text translation machine component has a large amount of data. Speech-to-text translation, on the other hand, only has a small amount of data. The solution is really to design algorithms that are able to exploit these three types of data as much as possible – which is not easy.”
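To make the three data types concrete, here is a toy sketch (all class and field names are invented for illustration): transcribed speech for recognition, parallel text for translation, and the comparatively scarce speech paired directly with target-language text.

```python
# Toy data model for the three corpus types Ney describes.
# All class and field names are invented for illustration.
from dataclasses import dataclass

@dataclass
class AsrExample:                # "fair amount": audio + same-language transcript
    audio: bytes
    transcript: str

@dataclass
class ParallelTextExample:       # "large amount": source text + target text
    source_text: str
    target_text: str

@dataclass
class SpeechTranslationExample:  # "small amount": audio + target-language text
    audio: bytes
    target_text: str

# A multi-task training scheme would draw batches from all three pools,
# sharing an acoustic encoder between the first and third tasks and a
# text decoder between the second and third, so the scarce end-to-end
# data is supplemented by the two abundant corpora.
```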

Text-in-target-language tools, which produce written text translated into the final language, also face numerous challenges, he added.

The primary sources of error in these systems usually include linguistic mistakes, the expressive voice of the speaker (emotions, vocal characteristics, accents) and misspellings.

“We need more real-life data with large variables that take into consideration speakers’ dialects, acoustic sounds and emotions – the algorithms must be improved to handle these conditions. This means neural networks should model probabilistic input dependency. Currently, they provide only one out of many possible variables, limiting the accuracy rate.”
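The “one out of many possible variables” problem can be pictured with a toy example: a recogniser that emits only its single best hypothesis discards the rest of the probability distribution, while an n-best output preserves competing readings for downstream components. The hypotheses and scores below are invented for illustration.

```python
# Toy illustration of single-best vs. n-best decoding.
# Scores are invented; a real system would get them from the model.
import math

hypotheses = {
    "recognise speech": -1.2,    # log-probabilities (toy values)
    "wreck a nice beach": -1.4,
    "recognised speech": -2.9,
}

# Single-best output: everything but the top hypothesis is thrown away.
best = max(hypotheses, key=hypotheses.get)
print("1-best:", best)

# N-best output: keep the ranked alternatives with their probabilities,
# so downstream components (e.g. translation) can weigh competing
# readings instead of trusting one possibly wrong transcript.
total = sum(math.exp(score) for score in hypotheses.values())
for text, score in sorted(hypotheses.items(), key=lambda kv: -kv[1]):
    print(f"{math.exp(score) / total:.2f}  {text}")
```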

Once perfected, text-in-target-language technology could be applied on a wide scale in many areas, including dubbing movies, creating voice-overs for documentaries and facilitating human-to-human communication, he concluded.
