ReconVox is our high-performance speech recognition product. Because it recognizes both isolated words and continuous speech from any speaker without speaker-specific training, it fits a wide range of applications, from controlling electronic devices by voice to accessing telephone-based automatic services driven by full, complex sentences.
This level of technology is called speaker-independent continuous speech recognition: it allows applications to understand full sentences, close to natural language, with large vocabularies and from any speaker.
Like BioVox, ReconVox is not a closed application with a predefined graphical user interface, but an open development platform that exposes all its functionality through a powerful API (Application Programming Interface), designed to be easily integrated into any application or target hardware.
ReconVox has been fully developed in-house by DTec.
ReconVox can work both in live mode, processing utterances as they arrive, and in batch mode, analyzing recordings stored in audio files.
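The two working modes can be pictured with a minimal sketch. Note that the class and method names below (`Recognizer`, `feed`, `recognize_file`) are illustrative assumptions, not the actual ReconVox API:

```python
# Illustrative sketch only: class and method names are hypothetical,
# not the real ReconVox API.

class Recognizer:
    def __init__(self):
        self._buffer = []

    # Live mode: feed audio chunks as they arrive from the microphone.
    def feed(self, chunk):
        self._buffer.append(chunk)
        # A real engine would emit partial results here; we just echo.
        return f"partial after {len(self._buffer)} chunks"

    # Batch mode: analyze a complete, stored recording in one call.
    def recognize_file(self, chunks):
        return f"final transcript from {len(chunks)} chunks"

rec = Recognizer()
for chunk in (b"\x00" * 160, b"\x00" * 160):  # fake audio frames
    print(rec.feed(chunk))
print(rec.recognize_file([b"\x00" * 160] * 3))
```

The key design difference is that live mode must return partial results incrementally, while batch mode sees the whole recording at once.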
In addition, ReconVox provides advanced features that add value and bring you to the state of the art in speech technology. One of them is AutoLearn: the recognition engine learns by itself, dynamically, as it is being used, automatically adapting to the specific features of a given speaker's voice, a dialectal region, or even an acoustic environment with characteristic ambient noise. This way, the more the system is used, the more the recognition accuracy improves.
If maximum accuracy is paramount, it is also possible to work in supervised mode, tutoring the learning process by giving AutoLearn known utterances along with their transcriptions. This accelerates learning and maximizes recognition accuracy.
Another of ReconVox's special features is what we call ConfScore. With this functionality you get, along with the transcription of the utterance, a confidence score for every word, as well as a global score for the whole sentence. These scores indicate whether there has been a high level of uncertainty in the recognition process.
This feature is therefore especially useful when out-of-vocabulary words must be taken into account, or when the acoustic conditions in the final recognition environment are expected to be remarkably noisy and thus more prone to recognition errors.
When the recognition task is meant to extract keywords or specific sentences from free-form utterances with no vocabulary restrictions, where any number of out-of-vocabulary words can arise, a WordSpotting strategy may be the best fit. It makes possible a flexible recognition strategy, not bound to a fixed syntax, capable of listening to a continuous audio stream and spotting just the words of interest. This functionality combines well with ConfScore, although the two are independent and can be used separately.
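The joint use of WordSpotting and ConfScore can be sketched as a post-processing step over recognized words. The output format below, `(word, confidence)` pairs, is an assumption for illustration; the real engine's output may differ:

```python
# Hypothetical post-processing sketch: the engine is assumed to emit
# (word, confidence) pairs; the real ReconVox output format may differ.

def spot_keywords(stream, keywords, min_confidence=0.6):
    """Return the keywords found in the stream with enough confidence."""
    hits = []
    for word, score in stream:
        if word in keywords and score >= min_confidence:
            hits.append((word, score))
    return hits

# A free-form utterance full of out-of-vocabulary words; only the
# keywords "transfer" and "balance" are of interest.
stream = [("uh", 0.3), ("please", 0.8), ("transfer", 0.9),
          ("mumble", 0.2), ("balance", 0.4)]
print(spot_keywords(stream, {"transfer", "balance"}))
# "balance" is dropped: its confidence (0.4) is below the threshold.
```

Filtering spotted keywords by confidence is exactly where the two features reinforce each other: WordSpotting narrows *which* words matter, ConfScore tells you *how much* to trust each hit.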
- Recognition task can be fine-tuned: isolated words or continuous speech.
- Speaker independent: doesn’t need to be retrained for every speaker.
- AutoLearn: automatic adaptation to a specific speaker, dialectal region or noisy environment.
- ConfScore: confidence scoring for recognition results, both word and sentence level.
- WordSpotting: detection of keywords or special sentences among out of vocabulary words.
- Vocabulary can be customized: from a few commands to thousands of words.
- Two different types of language models: fixed syntax or flexible grammar.
- Efficient recognition engine: can be integrated into embedded systems.
- Available in Spanish, US English and UK English. New languages can be incorporated upon request.
- Available both in DLL (Dynamic Link Library) format for Windows and in shared object format for UNIX/Linux. Please ask for other platforms.
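Since the engine ships as a DLL on Windows and a shared object on UNIX/Linux, it can be loaded from Python via the standard `ctypes` module. The library name `"reconvox"` below is a guess; check the actual binary name shipped with the product:

```python
import ctypes
import ctypes.util

# "reconvox" is a guessed library name, used only for illustration;
# the actual binary name is not documented here.
path = ctypes.util.find_library("reconvox")
if path is not None:
    engine = ctypes.CDLL(path)  # exported C functions become attributes
else:
    print("ReconVox library not found on this system")
```

The same pattern works unchanged on Windows and UNIX/Linux, since `ctypes` abstracts over DLLs and shared objects.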
AutoLearn is an exciting new technology that allows ReconVox to learn and improve its accuracy as it is being used. It supports two working modes:
- Dynamic: in this mode AutoLearn manages the learning process entirely by itself. It internally stores the utterances provided by the user, together with the transcriptions produced by recognition. Whenever enough adaptation data have accumulated, it automatically adjusts the acoustic models in a periodic, incremental fashion. As soon as this improvement takes place, the new models are used to recognize the next utterances given to the system, and the process starts again in a new learning iteration that improves on the previous one. All these operations are carried out automatically, without the user's explicit participation: he or she just activates AutoLearn and keeps using ReconVox as usual, because the whole learning process is transparent to the user.
- Supervised: if accuracy must be maximized and the learning process accelerated, it is possible to guide AutoLearn through the process. To do so, the user provides known, clean utterances along with their transcriptions. The utterances used for learning can then be guaranteed to be error-free and phonetically rich, maximizing the accuracy gain, because the user has total control over the number of utterances, their length, vocabulary and recording channel.
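The incremental, periodic adaptation idea behind both modes can be illustrated with a toy model. This is emphatically not the real AutoLearn algorithm: a single stand-in "parameter" is simply nudged toward the average of the adaptation data, one utterance at a time:

```python
# Toy illustration of incremental adaptation, NOT the real AutoLearn
# algorithm: one model "parameter" drifts toward the average feature
# value of the adaptation utterances, update by update.

class ToyAdaptiveModel:
    def __init__(self, mean=0.0):
        self.mean = mean   # stand-in for an acoustic model parameter
        self.count = 0

    def adapt(self, feature_value):
        """Incremental mean update after each adaptation utterance."""
        self.count += 1
        self.mean += (feature_value - self.mean) / self.count

model = ToyAdaptiveModel()
for value in [1.0, 2.0, 3.0]:  # features from transcribed utterances
    model.adapt(value)
print(model.mean)  # running mean after three updates -> 2.0
```

The point of the sketch is the shape of the process: each new utterance refines the model a little, and the refined model is immediately in effect for the next one.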
There are situations where it is extremely useful to get feedback about the confidence ReconVox has in some of the words just recognized in the last utterance; they may be key to understanding the sentence, or they may be surrounded by many unknown, out-of-vocabulary words. In these situations ConfScore can help by providing additional information.
When this feature is enabled, ConfScore returns, along with every recognized word, a confidence score that indicates the estimated probability that the word is actually present in the utterance. In addition, a global confidence score for the whole sentence is returned too.
These confidence scores must be calculated separately, on top of the actual recognition, so there is an associated performance cost. For this reason ConfScore can be enabled or disabled for each individual utterance, which matters in applications where response times are critical.
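The document does not specify how ReconVox combines per-word scores into a sentence score; a geometric mean is one common, assumed choice, sketched here for illustration:

```python
import math

# Assumption: the sentence score is the geometric mean of the word
# scores. The actual ReconVox formula is not documented here.

def sentence_score(word_scores):
    """Combine per-word confidence scores into one sentence score."""
    log_sum = sum(math.log(s) for s in word_scores)
    return math.exp(log_sum / len(word_scores))

scores = [0.9, 0.8, 0.95]           # per-word confidences
print(round(sentence_score(scores), 3))
```

A geometric mean penalizes a single very low-confidence word more than an arithmetic mean would, which matches the intuition that one badly misrecognized word can invalidate a whole sentence.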
- IVR (Interactive Voice Response): conversations close to natural language in automatic call centers.
- Alarms and home automation: electronic devices controlled by voice commands, (de)activation of alarms…
- Assistance for people with disabilities: electronic devices driven by voice commands from the authorized speaker only.
- Voice commands in cars: GPS, hands-free phone calls…
- Automatic search by content: spotting of keywords or sentences in audio/video recordings or streaming audio.
- Education: per-word pronunciation scoring for language learning, or for speech pathologies such as dyslexia or aphasia.
Speech recognition is the technology that automatically provides the transcription of utterances pronounced by a speaker. While voice biometrics systems like BioVox answer the "who?" question, speech recognition answers the "what?" question. Speech recognizers can be classified according to the size of the accepted vocabulary and the word rate in the audio stream:
- Isolated words, where every word is pronounced one at a time, pausing between them.
- Short sentences, for command and control applications, limited to specific sentences that include connected words, without pauses between words.
- Large vocabularies close to natural language, capable of recognizing thousands of words in naturally pronounced sentences.
Focusing on the speaker, there are speaker-independent speech recognizers in contrast to speaker-dependent ones. The latter, typically used in dictation tasks, need to be trained specifically for a given speaker before they can be used, and therefore cannot switch to new, unknown users on the fly as speaker-independent ones can. In exchange, speaker dependence usually yields higher recognition rates.
The underlying Automatic Speech Recognition (ASR) technology in ReconVox is the most widely used and tested in the field, and the one that currently provides the best recognition rates for continuous speech. This technology is based on a stochastic framework called Hidden Markov Models (HMM). These models reflect the internals of speech production by representing the fundamental sounds that make up words as connected states, with transitions between them governed by probabilities. To estimate the optimum values of these probabilities accurately, large numbers of recordings from many different speakers with different vocal features are used. These utterances have to be phonetically balanced and ideally recorded through the same channel that will be used at recognition time, so that the parameters are representative of the task being modeled and can be estimated reliably.
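The classic evaluation step for an HMM, computing the probability of an observation sequence given the model, is the forward algorithm. A minimal version for a discrete two-state toy model (the states could stand for two sub-phone units; the numbers are made up for illustration):

```python
# Minimal forward algorithm for a discrete HMM: computes the total
# probability of an observation sequence under the model.

def forward(pi, A, B, obs):
    """pi: initial state probabilities, A: state transition matrix,
    B: emission probabilities, obs: observation indices."""
    n_states = len(pi)
    # Initialization: probability of starting in each state and
    # emitting the first observation.
    alpha = [pi[s] * B[s][obs[0]] for s in range(n_states)]
    # Induction: sum over all paths reaching each state at time t.
    for t in range(1, len(obs)):
        alpha = [sum(alpha[r] * A[r][s] for r in range(n_states))
                 * B[s][obs[t]]
                 for s in range(n_states)]
    return sum(alpha)

# Two-state toy model with two observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
print(forward(pi, A, B, [0, 1]))  # -> 0.2156
```

In a real recognizer the observations are acoustic feature vectors rather than discrete symbols, and the per-model probabilities computed this way are what let the engine pick the most likely word or sentence.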