Accessibility is a priority in computing when developing new programs and components for our computers. It broadens computer use to everyone, including people with physical disabilities: people with visual impairments, for whom using a peripheral as essential as the screen is practically impossible, and people with motor disabilities, for whom the mouse or keyboard can be very difficult to use.
Speech synthesis, also known as text-to-speech conversion (CTV), gives a system the ability to convert a given text into speech. One way to do this is with recordings previously made by people: the computer voice is generated by joining those recordings, whether of whole words or of phonemes, always trying to make the result sound as natural and intelligible as possible by chaining the sounds correctly within the speech. In addition, the system must be able to synthesize any arbitrary text, not just a predefined set.
Concatenative synthesis is based on joining recorded speech segments. It produces natural-sounding output, although some naturalness can be lost at the joins because of natural variations in speech. There are three methods of concatenative synthesis. The first, unit selection synthesis, uses a database containing voice recordings of phonemes, syllables, words, phrases, and sentences. This method produces the most natural sound, but the databases can be very large.
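As a minimal sketch of the unit-selection idea, the snippet below greedily picks the longest recorded unit in a toy database that covers the next words of the text, then concatenates the stored samples. The database contents and sample values are hypothetical placeholders, not a real corpus.

```python
# Minimal sketch of unit-selection concatenative synthesis (illustrative only).
# The "database" maps recorded units (words/phrases) to their audio samples;
# the unit names and sample values here are hypothetical placeholders.

UNIT_DATABASE = {
    "good morning": [0.1, 0.2, 0.3],   # a whole recorded phrase
    "good": [0.1, 0.2],
    "morning": [0.2, 0.3],
    "everyone": [0.4, 0.5],
}

def select_units(text):
    """Greedily pick the longest recorded unit that matches the next words."""
    words = text.lower().split()
    units = []
    i = 0
    while i < len(words):
        # Try the longest span first so whole phrases beat single words.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in UNIT_DATABASE:
                units.append(candidate)
                i = j
                break
        else:
            raise KeyError(f"no recorded unit covers {words[i]!r}")
    return units

def synthesize(text):
    """Concatenate the samples of the selected units into one waveform."""
    samples = []
    for unit in select_units(text):
        samples.extend(UNIT_DATABASE[unit])
    return samples
```

Note how `select_units("good morning everyone")` prefers the whole recorded phrase "good morning" over the two individual words, which is what makes unit selection sound natural: longer recorded spans mean fewer audible joins.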
But this is not the only concatenative method. Diphone synthesis uses a minimal database that stores a single example of each diphone (Spanish has approximately 800 different diphones), but it produces a robotic voice, so it has largely fallen into disuse.
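A diphone spans the transition from the middle of one phoneme to the middle of the next, so a word of N phonemes needs N+1 diphones once silence is added at the boundaries. The sketch below shows that decomposition; the phoneme labels are illustrative, not a real phonetic alphabet.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone sequence for a phoneme list.

    A diphone covers the transition between two adjacent phonemes, so a
    word of N phonemes yields N+1 diphones once silence is padded at the
    boundaries (illustrative; real phoneme labels vary by language).
    """
    seq = [silence] + list(phonemes) + [silence]
    return [(seq[i], seq[i + 1]) for i in range(len(seq) - 1)]
```

For example, the three phonemes of Spanish "ola" decompose into four diphones: silence-to-o, o-to-l, l-to-a, and a-to-silence. Each of those transitions is looked up once in the recorded database, which is why a single example per diphone suffices.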
Another concatenative method is domain-specific synthesis, which joins prerecorded phrases and words to create complete utterances. It is used in very limited domains, such as announcements at gas stations.
Formant synthesis, unlike the previous methods, does not use human speech samples at runtime; instead, it uses an acoustic model to create an artificial speech wave. It produces a robotic sound that could never be confused with a human voice, but it has the advantage of yielding smaller programs, since it does not require a database of recorded samples the way concatenative methods do.
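A crude illustration of building a wave from an acoustic model rather than from recordings: sum sinusoids at a vowel's formant frequencies. The frequencies and amplitude weights below are rough textbook-style values chosen for illustration, not a real synthesizer's parameters.

```python
import math

def formant_wave(formants, duration=0.01, sample_rate=8000):
    """Sum sinusoids at the given formant frequencies (Hz) to build an
    artificial vowel-like wave. Values are illustrative, not a real model."""
    n = int(duration * sample_rate)
    wave = []
    for i in range(n):
        t = i / sample_rate
        # Each formant contributes a sinusoid; higher formants are weaker.
        sample = sum(math.sin(2 * math.pi * f * t) / (k + 1)
                     for k, f in enumerate(formants))
        wave.append(sample)
    return wave

# Approximate first two formants of the vowel /a/.
vowel_a = formant_wave([700, 1200])
```

No recorded sample is stored anywhere: the whole "voice" is a handful of numbers plus the generation code, which is exactly why this approach yields small programs at the cost of a synthetic timbre.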
Speech recognition allows a human being to communicate with a computer.
Broadly speaking, the computer captures the voice signal a person emits through a microphone and converts it into digital information. The speech engine must then recognize, from the set of phonemes it has received, the syllables they form, and combine them into the words the user has spoken.
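The final step described above, turning a recognized phoneme stream back into words, can be sketched as a greedy longest-match segmentation against a pronunciation lexicon. The lexicon entries below are toy placeholders; real engines layer statistical language models on top of this lookup.

```python
# Sketch of the last stage of a speech engine: segmenting a recognized
# phoneme stream into dictionary words. Lexicon entries are hypothetical.

LEXICON = {
    ("h", "e", "l", "o"): "hello",
    ("w", "er", "l", "d"): "world",
}

def phonemes_to_words(phonemes):
    """Greedy longest-match segmentation of a phoneme sequence into words."""
    words = []
    i = 0
    while i < len(phonemes):
        # Try the longest remaining span first, shrinking until a word fits.
        for j in range(len(phonemes), i, -1):
            chunk = tuple(phonemes[i:j])
            if chunk in LEXICON:
                words.append(LEXICON[chunk])
                i = j
                break
        else:
            raise ValueError(f"unrecognized phonemes at position {i}")
    return words
```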
There are two major fields within speech recognition:
- Automatic Speaker Recognition (RAL)
- Automatic Speech Recognition (RAH)
Automatic Speaker Recognition
An automatic speaker recognition system checks whether the person who emitted the voice signal really is who they claim to be. To do this, the system must first be trained: the speaker provides voice samples from which the system builds a set of patterns. Once this is done, the system is said to be trained and ready to recognize the speaker.
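The enroll-then-verify flow above can be sketched as follows: training averages feature vectors from several utterances into a stored pattern, and verification accepts a claimed identity only if a new utterance's features fall within a distance threshold of that pattern. The fixed-length feature vectors and the threshold value are hypothetical simplifications; real systems use richer acoustic features.

```python
import math

def enroll(samples):
    """Average feature vectors from training utterances into a speaker
    pattern (toy fixed-length features stand in for real acoustics)."""
    dims = len(samples[0])
    return [sum(s[d] for s in samples) / len(samples) for d in range(dims)]

def verify(pattern, features, threshold=1.0):
    """Accept the claimed identity if the utterance's features lie within
    a Euclidean distance threshold of the stored pattern."""
    dist = math.sqrt(sum((p - f) ** 2 for p, f in zip(pattern, features)))
    return dist <= threshold

# Training phase: three utterances from the claimed speaker.
pattern = enroll([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])
```

With this pattern, an utterance close to the training samples (e.g. `[1.1, 2.1]`) is accepted, while a distant one (e.g. `[5.0, 5.0]`) is rejected, which is the essence of the identity check described above.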
This method is widely used in criminology to identify criminals from recordings of their voices. Another application in this field is security: each person generates different parameters when speaking, and it is very unlikely that two users will produce the same patterns, even if they sound like the same person to the human ear.
Automatic Speech Recognition
Automatic speech recognition comes in several modalities, the result of restrictions imposed on the recognition task to simplify it. The main parameters used to characterize this type of recognition are speech modality, speaking style, training, vocabulary size, and language model, among others.
An ideal automatic speech recognition system would work in environments with very high background noise, recognize the speech of any speaker, correct errors caused by mispronunciation, and be insensitive to variations induced by communication channels; such a system has not yet been achieved.
Some of the modalities of RAH are:
- Isolated word recognition (RPA): vocal input is performed word by word, detecting the edges that mark the beginning and end of each word and comparing it against a database, built during training, to choose the closest match.
- Keyword detection (DPC): the system detects the utterance of a specific word embedded within continuous speech.
- Connected word recognition (RPC): sequences of words drawn from a finite vocabulary of small or moderate size are supported as vocal input.
- Continuous automatic speech recognition (RAHC): the system must be able, minor nuances aside, to recognize the continuous, spontaneous speech used naturally in everyday life. This is not yet fully achieved.
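The "closest match" comparison in the RPA modality above is classically done with dynamic time warping (DTW), which tolerates differences in speaking rate between the input and the stored templates. The 1-D feature sequences below are toy placeholders for real acoustic features.

```python
# Sketch of isolated-word recognition (RPA) by template matching with
# dynamic time warping (DTW). Feature values are illustrative only.

def dtw_distance(a, b):
    """Classic DTW over two 1-D sequences with |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match step.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def recognize(features, templates):
    """Return the vocabulary word whose template is nearest under DTW."""
    return min(templates, key=lambda w: dtw_distance(features, templates[w]))

TEMPLATES = {                     # one training template per vocabulary word
    "yes": [1.0, 3.0, 1.0],
    "no":  [2.0, 2.0, 5.0],
}
```

An input such as `[1.0, 1.0, 3.0, 1.0]`, a slower rendition of the "yes" template, still matches it perfectly because DTW warps the time axis; a plain sample-by-sample comparison would fail on the length mismatch alone.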