Speech recognition and synthesis with simple JavaScript

Arjun Mahishi

7 years ago

Speech is how human beings usually interact with other human beings, and even with their pets. With computers, though, we click, type, drag and drop. For years people have tried to talk to computers in various ways, and they have undoubtedly succeeded. Today, it is also possible to do speech recognition from a web page with nothing but the browser’s built-in API.

Speech recognition is also called speech-to-text, and speech synthesis is also called text-to-speech. These are the terms we will use in this post. Both are easy to implement in just a few lines of code, and the recognition is surprisingly accurate for something that runs in a browser.

To make the concepts clear, we will build a simple demo that recognizes the user’s speech and repeats it back in a synthesized voice. Let’s get started.

Text-to-speech

Converting text to speech is the easier of the two. There is a built-in API and we just need to call it. Let’s see how it works step by step with code.

SpeechSynthesisUtterance() is the class we will be working with to generate speech.
Let’s make a function that takes the text as an argument and produces the voice as output.
```
const speak = (text) => {
}
```
Make an object called msg. Pass text as an argument to the constructor of the class.
```
const speak = (text) => {
    var msg = new SpeechSynthesisUtterance(text);
}
```
Now we call the built-in speak function and pass in our msg object as an argument.
```
const speak = (text) => {
    var msg = new SpeechSynthesisUtterance(text);
    window.speechSynthesis.speak(msg);
}
```
Now the web page is capable of speaking any text. Test it out by attaching this script to a web page and calling our speak() function. You should hear the default voice, typically a female one.
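Some browsers refuse to speak until the user has interacted with the page, so calling speak() from a click handler is a safe way to test. The sketch below assumes a page with a button whose id is say; that id and the speakOnClick helper are made up for illustration and are not part of any API.

```javascript
// Hypothetical helper: speak whatever getText() returns each time the button is clicked.
const speakOnClick = (button, getText, speakFn) => {
    button.addEventListener("click", () => speakFn(getText()));
};

// Browser-only wiring; skipped when there is no DOM.
// Assumes <button id="say"> exists and the speak() function from above is loaded.
if (typeof document !== "undefined") {
    const button = document.querySelector("#say");
    if (button) speakOnClick(button, () => "Hello from the browser", speak);
}
```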
We can change the voice by modifying the voice property of msg like below.
```
const speak = (text) => {
    var msg = new SpeechSynthesisUtterance(text);
    // The voice list and its order depend on the browser and operating system
    msg.voice = window.speechSynthesis.getVoices()[3];
    window.speechSynthesis.speak(msg);
}
```
This should give you a male British voice, although the voices available and their order depend on the browser and operating system. This one is my personal favourite because it sounds like Alfred from the Batman comics. You can play around with the index to try different voices.
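A numeric index may select a different voice on another machine, and getVoices() can return an empty list until the browser has loaded its voices (it fires a voiceschanged event on speechSynthesis when the list is ready). Here is a minimal sketch that selects a voice by language tag instead. pickVoice and speakWithVoice are hypothetical helper names, not part of the API.

```javascript
// Return the first voice whose language tag starts with langPrefix (e.g. "en-GB"),
// else the first voice in the list, else null when the list is empty.
const pickVoice = (voices, langPrefix) =>
    voices.find((v) => v.lang.startsWith(langPrefix)) || voices[0] || null;

// Browser-only: speak text with the chosen voice, or the default if none is loaded yet.
const speakWithVoice = (text, langPrefix) => {
    const synth = window.speechSynthesis;
    const msg = new SpeechSynthesisUtterance(text);
    const voice = pickVoice(synth.getVoices(), langPrefix);
    if (voice) msg.voice = voice;
    synth.speak(msg);
};
```

In the browser, speakWithVoice("Good evening, sir", "en-GB") would then ask for a British English voice if one is installed.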

So, this is how you generate speech on the web. Now, let’s look at the more fun part of the post.

[Figure: text-to-speech browser support]

Speech-to-text

This is slightly trickier because, just like in human beings, listening is always harder than speaking.

To get started with speech-to-text, we need to do some setting up.
```
var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
var recognition = new SpeechRecognition();
recognition.continuous = true;
```
We create an object called recognition, which we will work with from here on. We set its continuous property to true. This way the recognizer does not treat everything we say as one big speech: it keeps listening and delivers a result for each phrase, so results appear faster instead of waiting for a long pause that marks the end of what we are saying.
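Not every browser exposes a speech recognition constructor. When it is missing, the first line above leaves SpeechRecognition undefined and new SpeechRecognition() throws. A minimal guard, written as a sketch: createRecognition is a hypothetical helper that takes the global object as a parameter so it can be tried outside a browser too.

```javascript
// Returns a configured recognition object, or null when the browser has no support.
const createRecognition = (globalObj) => {
    const Ctor = globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition;
    if (!Ctor) return null;
    const recognition = new Ctor();
    recognition.continuous = true; // keep listening after each result
    return recognition;
};

// Browser usage:
// const recognition = createRecognition(window);
// if (!recognition) console.log("Speech recognition is not supported here");
```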
The recognition object has several event handlers. An event handler is a function that fires when a particular event occurs. Right now, we are only concerned with two of them: onend and onresult. As the names suggest, onend is fired when speech recognition ends. When that occurs, we need to restart recognition, as we want it to listen to us indefinitely. onresult is fired when some speech is recognized. It receives an event object that holds the speech in its text form.
```
var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
var recognition = new SpeechRecognition();

recognition.continuous = true;

recognition.onend = function() {
    recognition.start();
}

recognition.onresult = function(event) {
    var current = event.resultIndex;
    var transcript = event.results[current][0].transcript;
}
```
In the above code,
```
var current = event.resultIndex;
var transcript = event.results[current][0].transcript;
```
these two lines read the most recent result from the event object and store its text in a variable called transcript.
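Each entry in event.results also says whether it is final, and each alternative carries a confidence score between 0 and 1. A small sketch that pulls these out alongside the text; readResult is a hypothetical helper name.

```javascript
// Hypothetical helper: summarize the newest result of an onresult event.
const readResult = (event) => {
    const result = event.results[event.resultIndex]; // the newest result
    const best = result[0];                          // the top alternative
    return {
        transcript: best.transcript.trim(), // drop surrounding whitespace
        confidence: best.confidence,
        isFinal: result.isFinal,
    };
};
```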
Our very own speech recognition system is almost ready. All we have to do now is call recognition.start() and we are good to go.
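One caveat with restarting in onend: if the user denies microphone access, recognition ends immediately and the handler would restart it in a loop. A sketch of one way to guard against that, assuming the error codes not-allowed and service-not-allowed reported on the error event; keepListening is a hypothetical helper name.

```javascript
// Hypothetical helper: start recognition and restart it whenever it ends,
// unless a permission error has been reported.
const keepListening = (recognition) => {
    let blocked = false;
    recognition.onerror = (event) => {
        // Permission problems will not fix themselves, so stop restarting.
        if (event.error === "not-allowed" || event.error === "service-not-allowed") {
            blocked = true;
        }
    };
    recognition.onend = () => {
        if (!blocked) recognition.start();
    };
    recognition.start();
};
```

If you use it, call keepListening(recognition) in place of the plain onend handler and the recognition.start() call.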

[Figure: speech-to-text browser support]

And finally the Demo

Quickly set up a project with three files in it: index.html, tts.js and stt.js. Put the text-to-speech code in tts.js and the speech-to-text code in stt.js. In index.html, just include the two scripts.

```
<script src="tts.js"></script>
<script src="stt.js"></script>
```

Now, we need to call the speak() function we wrote from inside the onresult handler of the recognition object.

```
recognition.onresult = function(event) {
    var current = event.resultIndex;
    var transcript = event.results[current][0].transcript;
    speak(transcript);
}
```

Now open index.html in a browser and try it out. You will need to serve it from a local HTTP server rather than opening the file directly. It’s not a big deal. This is needed because browsers only allow microphone access from a secure context, which means HTTPS or localhost. When you open the page, it should ask for permission the first time. Accept it and say something. It should repeat it right back.
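Because the microphone can pick up the page’s own voice, the demo may end up repeating itself. One way to reduce this is to ignore results that arrive while the page is talking. The sketch below assumes that checking speechSynthesis.speaking is good enough for a demo. makeEchoHandler is a hypothetical helper name.

```javascript
// Hypothetical helper: build an onresult handler that stays quiet while the page speaks.
const makeEchoHandler = (synth, speakFn) => (event) => {
    if (synth.speaking) return; // the page is talking; ignore what it hears
    const transcript = event.results[event.resultIndex][0].transcript;
    speakFn(transcript);
};

// Browser usage, with speak() from tts.js and recognition from stt.js:
// recognition.onresult = makeEchoHandler(window.speechSynthesis, speak);
```

A result that arrives just after the voice stops can still slip through, so treat this as a mitigation rather than a fix.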

The full code for this can be found here
