Speech-to-Text Technology


In today’s world, there is more voice-based communication and collaboration than ever. People are no longer only typing: speech-to-text applications on their smart devices let them transcribe everything they say. While speech recognition and transcription aren’t a new phenomenon, they have been transformed over the years, and the companies working in this domain have recently made the technology markedly more accurate.

There are several systems available that differ in capabilities, with some only able to recognize a selection of words and phrases. But the most advanced transcription software can understand natural speech and also provide its own accuracy measure. Yes, we’re talking about the speech-to-text capabilities of four big players: IBM Watson, Google Speech-to-Text, Amazon Transcribe, and Microsoft Azure Speech-to-Text.

Speech-to-text transcription technology has allowed developers to power voice response systems virtually everywhere, from call centers to financial institutions and from hospitals to educational institutions. Developers can also enable Internet of Things (IoT) devices to talk back to users and convert text-based media into a spoken format. Businesses have started finding the best use cases for speech-to-text technology in their own scenarios, leveraging the capabilities of these giants, who are increasingly interested in bringing their artificial intelligence (AI)-powered tools to the enterprise.

At Beyond Key, our voice technology expert Vishal Gupte dove deep to find out which of the four players has carved out a niche and offers the best version of this technology. To understand it better, let’s look at the matrix below, which gives a quick glimpse of the features offered by each of these companies:

Features compared across IBM Watson, Google Speech-to-Text, Amazon Transcribe, and Azure Speech-to-Text API

Speech Recognition
Real-time Speech Recognition: IBM Watson, Google Speech-to-Text, Azure Speech-to-Text API
Keyword Spotting in Audio: IBM Watson
Speaker Identification: IBM Watson, Google Speech-to-Text (Beta), Amazon Transcribe
Languages Supported:
  IBM Watson: English (US), English (UK), Japanese, Arabic, Mandarin, Portuguese (Brazil), Spanish, French, Korean
  Google Speech-to-Text: 120 languages and variants
  Amazon Transcribe: English (US) and Spanish
  Azure Speech-to-Text API: different languages are supported for different Speech service functions; 30 languages in total as of now
Multichannel Recognition: Google Speech-to-Text, Amazon Transcribe

Accuracy
Confidence Score: IBM Watson (word wise), Google Speech-to-Text (word wise), Amazon Transcribe (word wise), Azure Speech-to-Text API (transcription wise)
Start/End Time per Word: three of the four services
Noise Robustness: Google Speech-to-Text

Vocabulary
Custom Vocabulary: IBM Watson, Google Speech-to-Text (known as phrase hints), Amazon Transcribe, Azure Speech-to-Text API
Profanity Filtering: IBM Watson, Google Speech-to-Text
Smart Formatting: IBM Watson, Google Speech-to-Text, Amazon Transcribe, Azure Speech-to-Text API
Punctuation: IBM Watson, Google Speech-to-Text (Beta), Amazon Transcribe, Azure Speech-to-Text API

Support
Video Support: two of the four services
Audio Formats Supported:
  IBM Watson: WebM with the Opus or Vorbis codec, MP3, WAV, FLAC, PCM, u-law audio, and basic audio
  Google Speech-to-Text: all audio formats
  Amazon Transcribe: FLAC, MP3, MP4, or WAV file format
  Azure Speech-to-Text API: WAV
REST API Support: IBM Watson, Google Speech-to-Text, Amazon Transcribe, Azure Speech-to-Text API

Account & Pricing
New Account Needed:
  IBM Watson: IBM Cloud Account
  Google Speech-to-Text: Cloud Platform Account
  Amazon Transcribe: AWS Account
  Azure Speech-to-Text API: Azure Account
Free Trial:
  IBM Watson: 30 minutes for 30 days (Lite Plan)
  Google Speech-to-Text: 0-60 minutes free
  Amazon Transcribe: 60 minutes per month for 12 months
  Azure Speech-to-Text API: 5 hours per month
Pricing:
  IBM Watson: $0.02 USD per minute
  Google Speech-to-Text: over 60 minutes, up to 1 million minutes: audio $0.006 USD per 15 seconds, video $0.012 USD per 15 seconds
  Amazon Transcribe: $0.0004 USD per second
  Azure Speech-to-Text API: $1 USD per hour
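To make the list prices in the comparison above easier to compare, the short Python sketch below converts them to a rough cost for a single audio file. It is only an illustration: the function name is ours, free tiers are ignored, and we assume Google usage is rounded up to whole 15-second blocks.

```python
import math

# Per-unit list prices copied from the comparison above (USD).
IBM_PER_MINUTE = 0.02
GOOGLE_AUDIO_PER_15_SECONDS = 0.006
AMAZON_PER_SECOND = 0.0004
AZURE_PER_HOUR = 1.0


def estimate_costs(duration_seconds):
    """Return a rough USD estimate per provider for one audio file.

    Assumptions (ours, not stated in the comparison): free tiers are
    ignored and Google usage is rounded up to whole 15-second blocks.
    """
    google_blocks = math.ceil(duration_seconds / 15)
    return {
        "IBM Watson": round(duration_seconds / 60 * IBM_PER_MINUTE, 4),
        "Google Speech-to-Text": round(google_blocks * GOOGLE_AUDIO_PER_15_SECONDS, 4),
        "Amazon Transcribe": round(duration_seconds * AMAZON_PER_SECOND, 4),
        "Azure Speech-to-Text": round(duration_seconds / 3600 * AZURE_PER_HOUR, 4),
    }


if __name__ == "__main__":
    # One hour of audio.
    for provider, cost in estimate_costs(3600).items():
        print(f"{provider}: ${cost:.2f}")
```

For one hour of audio, these list prices work out to about $1.20 for IBM Watson, $1.44 for both Google and Amazon, and $1.00 for Azure.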

Real-Time Speech Recognition


IBM Watson, Google Speech-to-Text, and Azure Speech-to-Text proved the strongest at recognizing speech in real time. Amazon Transcribe instead works on stored audio: using its API, you analyze audio files kept in Amazon S3 and have the service return a text file of the transcribed speech as a whole.

Google Cloud Speech-to-Text conversion is powered by machine learning. Its Automatic Speech Recognition (ASR) is built on deep-learning neural networks, which makes it more accurate in real time.

IBM Watson is also good at keyword spotting when working in real time. Its highly accurate speech engine recognizes different speakers in your audio and spots specified keywords in real time with high accuracy and confidence. The service leverages artificial intelligence to transcribe the human voice accurately, using information about grammar and language structure to identify the composition of the audio signal. Thanks to its embedded machine learning, it also continuously updates the transcription as more speech is heard.

Multiple Speakers and Channel Recognition


IBM Watson and Transcribe both respond well when multiple speakers are involved in a conversation. While Cloud Speech-to-Text has included this feature in its Beta release, Watson already labels the different speakers. Amazon Transcribe also recognizes when the speaker changes and attributes the transcribed text appropriately. This can significantly reduce the amount of work needed to transcribe audio with multiple speakers, such as telephone calls, meetings, and television shows.

Amazon Transcribe and Cloud Speech-to-Text are also built to recognize multiple channels. Whether speakers are connected to the conference through audio, video, or any other channel, the system identifies each channel and produces a single transcript with annotated channel labels. Both IBM Watson and Google Cloud Speech-to-Text are well trained in understanding audio from phone calls and videos.
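To illustrate what speaker labels make possible, here is a minimal Python sketch that merges consecutive words from the same speaker into readable turns. The input shape, a list of (speaker label, word) pairs, and the labels themselves are hypothetical simplifications of ours; each provider returns speaker information in its own response format.

```python
from itertools import groupby


def group_by_speaker(words):
    """Merge consecutive (speaker, word) pairs into (speaker, text) turns."""
    turns = []
    for speaker, run in groupby(words, key=lambda pair: pair[0]):
        turns.append((speaker, " ".join(word for _, word in run)))
    return turns


if __name__ == "__main__":
    # Hypothetical word-level output with made-up speaker labels.
    sample = [
        ("spk_0", "hello"), ("spk_0", "there"),
        ("spk_1", "hi"),
        ("spk_0", "bye"),
    ]
    for speaker, text in group_by_speaker(sample):
        print(f"{speaker}: {text}")
```

Because only adjacent words are merged, a speaker who talks twice appears as two separate turns, which preserves the order of the conversation.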

Transcription: How it Works


Understanding Languages


Google Cloud Speech-to-Text leads when it comes to language support: it recognizes 120 languages and variants, and it can filter inappropriate content in text results for all of them. IBM Watson is trained to understand 9 languages, Azure Speech-to-Text supports 30, and Transcribe is still learning, with just two. Like Google Cloud Speech-to-Text, IBM Watson can also identify and flag profanity.

Building Custom Vocabulary


Another important aspect of speech-to-text transcription systems is training them to support a particular business domain. Luckily, all four systems can be trained with custom vocabulary.

Google Cloud Speech-to-Text has a feature called Phrase Hints. This feature allows you to train your speech-to-text engine to understand custom words and phrases that are likely to be spoken. This is especially useful when the application is used for a technical use-case such as in hospitals, courtrooms, call centers, research labs, and more.
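As a rough illustration of phrase hints, the Python sketch below builds the JSON body for a recognition request that carries a list of expected phrases. The field names (speechContexts, phrases, languageCode, uri) reflect our understanding of the v1 REST API and should be checked against the current reference; the helper name, bucket path, and sample phrases are made up for the example.

```python
import json


def build_recognize_request(gcs_uri, phrases, language_code="en-US"):
    """Build a JSON-serialisable request body that includes phrase hints.

    Field names are our reading of the v1 REST API; verify them against
    the current documentation before relying on this.
    """
    return {
        "config": {
            "languageCode": language_code,
            # Phrase hints: words and phrases likely to be spoken.
            "speechContexts": [{"phrases": list(phrases)}],
        },
        "audio": {"uri": gcs_uri},
    }


if __name__ == "__main__":
    body = build_recognize_request(
        "gs://example-bucket/consultation.flac",  # hypothetical bucket and file
        ["tachycardia", "metoprolol", "echocardiogram"],
    )
    print(json.dumps(body, indent=2))
```

The sketch only builds the request; sending it requires an authenticated call, which is left out here.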

Transcribe also gives you the ability to expand the base vocabulary of the application with new words and generate highly-accurate transcriptions specific to your use cases, including product names, domain-specific terminology, and names of individuals.

IBM Watson’s Speech-to-Text service helps you go deeper than out-of-the-box solutions by providing the tooling and functionality to train Watson to learn the language of your business. New language model customization, customization weighting, and acoustic model customization features provide the flexibility you need to create effective solutions for your unique domain needs.

Azure Speech-to-Text also allows for the creation of custom language models tailored to users’ speaking styles, industry expressions, and technical, geographical, or market terms. Through integration with Language Understanding (LUIS), you can derive intents and entities from speech. Users don’t have to know your app’s vocabulary but can describe what they want in their own words.

Punctuation and Formatting


While all the systems claim to have developed an accurate punctuation ability, Google Cloud Speech-to-Text has recently worked on a new punctuation model. Google promises that its new model results in far more readable transcriptions that feature fewer run-on sentences and more commas, periods and question marks.

Amazon Transcribe uses deep learning to add punctuation and formatting automatically so that the output is more intelligible and can be used without any further editing. The system improves its accuracy with each use and training.

IBM Watson has worked on punctuation, as well as formatting needs. It can easily convert dates, times, numbers, phone numbers, and currency values into more readable, conventional forms in final transcripts of U.S. English audio.

Azure Speech-to-Text is tailored to work well with real-life speech and can accurately transcribe proper nouns (such as Sundar Pichai) and appropriately format language (such as dates and phone numbers).

Accuracy


A great feature all of these systems share is that they can measure the accuracy of their own transcriptions. IBM Watson, Cloud Speech-to-Text, and Amazon Transcribe can generate a Confidence Score for every single word, while Azure returns a Confidence Score for the transcription if you specify a detailed output on the speech configuration object.
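One practical use of word-level confidence scores is to flag words that a human reviewer should double-check. The Python sketch below assumes a simplified, hypothetical structure (a list of dictionaries with "word" and "confidence" keys) and an arbitrary threshold of 0.8; real responses differ by provider and would need to be mapped into this shape first.

```python
def low_confidence_words(words, threshold=0.8):
    """Return the words whose confidence score falls below the threshold."""
    return [item["word"] for item in words if item["confidence"] < threshold]


if __name__ == "__main__":
    # Hypothetical word-level results; the scores are invented for illustration.
    transcript = [
        {"word": "please", "confidence": 0.98},
        {"word": "refill", "confidence": 0.91},
        {"word": "metoprolol", "confidence": 0.54},
    ]
    print(low_confidence_words(transcript))
```

The threshold is a judgment call: a lower value sends fewer words to review, while a higher one catches more potential errors.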

Other than this, the new and improved Cloud Speech-to-Text API promises significantly improved voice recognition performance. The new API promises a reduction in word errors of about 54 percent across all of Google’s tests, but in some areas, the results are even better.

Amazon Transcribe is designed to provide accurate and automated transcripts for a wide range of audio quality. You can generate subtitles for any video or audio file, and even transcribe low-quality telephone recordings, such as customer service calls.

Noise Robustness


Google Cloud Speech-to-Text is the only system developed to handle noisy audio or background noise from many environments without requiring additional noise cancellation.

IBM Watson, by comparison, offers some control over background noise through custom acoustic models that you can match to your users’ expected environments.

Big Picture Benefit


The biggest benefit of all of these cloud providers is that they offer platform-independent APIs that can be integrated directly with whatever platform, tools, and services you run. All four players provide cloud APIs. Developers have already started building these capabilities into time-saving apps for a range of uses, including call center analytics, video indexing services, web conference indexing, and business transcription workflows.

Development Support

Amazon Transcribe enables developers to submit transcription requests via a standard REST interface that supports several formats, including WAV, MP3, MP4, and FLAC. It does, however, require an Amazon S3 URL, so all audio and video files must be saved to S3 first. You then call the “StartTranscriptionJob” API with the S3 URL in the “MediaFileUri” parameter.
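The StartTranscriptionJob flow described above can be sketched in Python with the AWS SDK (boto3). This is a minimal illustration, not a production setup: the job name, bucket, and helper function names are hypothetical, the language code default is our choice, and the actual call needs boto3 installed and AWS credentials configured.

```python
# Formats listed in the article for Amazon Transcribe.
SUPPORTED_FORMATS = {"wav", "mp3", "mp4", "flac"}


def build_transcription_job(job_name, s3_uri, media_format="mp3", language_code="en-US"):
    """Build the keyword arguments for a StartTranscriptionJob call."""
    if media_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported media format: {media_format}")
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "MediaFormat": media_format,
        # The file must already be stored in S3.
        "Media": {"MediaFileUri": s3_uri},
    }


def start_job(params):
    """Submit the job; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder works without the SDK

    client = boto3.client("transcribe")
    return client.start_transcription_job(**params)


if __name__ == "__main__":
    params = build_transcription_job(
        "support-call-001",                       # hypothetical job name
        "s3://example-bucket/support-call.mp3",   # hypothetical S3 location
    )
    print(params)  # pass to start_job(params) once credentials are set up
```

Keeping the request builder separate from the network call makes it easy to validate parameters before anything is sent to AWS.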

Developers can access the Azure Speech-to-Text API from any app using a REST API. In addition, Microsoft has developed several client libraries to improve integration with apps written in C#, Java, JavaScript, and Objective-C. In some cases, client apps use the WebSocket protocol to improve performance. Currently, the service supports 29 languages, as well as the WAV and Opus audio formats.

Google Cloud supports audio formats such as FLAC, AMR, PCMU and WAV files. Also, SDKs are available for C#, Go, Java, Node.js, PHP, Python and Ruby.

The IBM Watson Speech-to-Text service provides APIs that use IBM’s speech-recognition capabilities to produce transcripts of spoken audio. The supported audio formats are WebM with the Opus or Vorbis codec, MP3, WAV, FLAC, PCM, u-law audio, and basic audio. For most languages, the service supports two sampling rates: broadband and narrowband. It returns all JSON response content in the UTF-8 character set.

For speech recognition, the service supports synchronous and asynchronous HTTP Representational State Transfer (REST) interfaces. It also supports a WebSocket interface that provides a full-duplex, low-latency communication channel (clients send requests and audio to the service and receive results over a single connection asynchronously).

Who’s the Best?

Although speech-to-text technology has not been perfected yet, we still see a great opportunity for all four players. It’s a close call when it comes to developing custom vocabulary and training these applications with machine learning and deep learning. All the players have done a great job and are improving day by day. Still, if we had to choose one of the four, Google Cloud Speech-to-Text is clearly our first choice. Some features, such as “Speaker Detection” and “Multichannel Recognition”, are still in Beta with Cloud Speech-to-Text, whereas they already work perfectly with IBM Watson and Amazon Transcribe. As far as accuracy is concerned, however, the Google Speech API produces quite remarkable transcriptions for audio fragments even with considerable background noise. Its language support, punctuation, phrase hints, and many other features have definitely surpassed the others in the race.

This conclusion is based on the information available and the testing done to date, and it covers speech-to-text technology only. It can change in the coming weeks, months, or years, as every provider (Microsoft, Amazon, IBM, Google, and others) is working extensively on improving its service with advances in AI and Natural Language Processing (NLP). The intense research and development by each of these companies brings value to the developers working with the technology and also helps create more convenient tools and applications for the masses. We are open to feedback from anyone, and we would be glad to update our findings and release new ones soon.

To conclude, developers have a huge benefit here. These AI-infused speech-to-text apps have created a complete path for developers to build dictation applications that can automatically generate transcriptions for audio files, as well as captions for video files. They should incorporate these tools into workflows that complement human transcribers.

Looking for a transcription solution?

Have a business use case which you want to implement on Voice technology? You are at the right place! Contact us now to discuss your project.
