This post discusses bringing speech recognition to the web. Speech recognition is, in short, a technology that converts spoken words to text. It has long been popular in desktop software. Well-known examples include the speech recognition system in Windows XP, Vista and 7, which is used for giving voice commands and controlling the system, and the speech recognition feature in Microsoft Office, which lets users write text simply by dictating it to the computer.
With the new draft specification of the HTML5 SPEECH INPUT API, this facility will be made available on the web so that speech recognition can be carried out there with ease.
* To read the full specification of the HTML5 SPEECH INPUT API, visit: http://www.w3.org/2005/Incubator/htmlspeech/2010/10/google-api-draft.html
The API itself is agnostic of the underlying speech recognition implementation and can support both server-based and embedded recognizers. With an embedded recognizer, the browser itself is capable of speech recognition, much like current desktop software. The browser records the voice from the microphone, runs the recognition locally and produces the resulting text. This is fast and can also work offline.
With a server-based recognizer, the browser records the voice from the microphone and streams the audio data to a server, which performs the recognition and sends the resulting text back to the browser. The advantage of the server-based approach is that recognition tends to be more accurate than with the local approach, because the large amount of training data collected at the central servers helps improve accuracy.
The API is designed to enable both one-off and continuous speech input requests. Speech recognition results are provided to the web page as a list of hypotheses, along with other relevant information for each hypothesis.
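Since the page receives a list of hypotheses rather than a single string, it usually has to pick one. Here is a minimal sketch of that step. It assumes each hypothesis is an object with an utterance (the recognized text) and a confidence between 0 and 1; that shape and the function name bestHypothesis are my assumptions for illustration, so check the draft for the exact interface.

```javascript
// Assumed (hypothetical) hypothesis shape: { utterance: string, confidence: number }.
// Returns the hypothesis with the highest confidence, or null for an empty list.
function bestHypothesis(hypotheses) {
  if (!Array.isArray(hypotheses) || hypotheses.length === 0) {
    return null;
  }
  return hypotheses.reduce((best, current) =>
    current.confidence > best.confidence ? current : best
  );
}

// Made-up sample data, only to show the call.
const sampleHypotheses = [
  { utterance: "open contact page", confidence: 0.62 },
  { utterance: "open contacts page", confidence: 0.91 },
  { utterance: "oven contact page", confidence: 0.17 },
];

console.log(bestHypothesis(sampleHypotheses).utterance); // prints "open contacts page"
```

When two hypotheses have the same confidence, this sketch keeps the earlier one, which fits a list that is already ordered from most to least likely.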
In my demonstration, Chrome is the browser: it captures the audio and streams it to Google’s servers for speech recognition, and the servers send the resulting text back to Chrome. The component responsible for capturing the audio and streaming it to the servers is embedded directly in the Chrome web browser.
* For further research, you can look at the speech-related part of the Chrome web browser source code: http://src.chromium.org/viewvc/chrome/trunk/src/content/browser/speech/ . Audio is collected from the microphone and sent to a Google web service using an HTTPS POST, which returns a JSON object with the results. Check out the source code or visit http://mikepultz.com/2011/03/accessing-google-speech-api-chrome-11/ for a little more information.
* Clearly, unless you have your own browser product, as Google has Chrome, you will have to build an extension that attaches to the browser and handles the audio capture and streaming. You will also need servers that do the speech recognition for you. Alternatively, you can opt for the first approach and embed a recognizer in the extension you build.
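If you build the server round trip yourself, the client ends up with a JSON string that it has to turn into usable text. The sketch below shows one defensive way to do that. The response shape (an object with a hypotheses array of utterance/confidence pairs) is an assumption for illustration and not a documented format, and parseRecognitionResponse is a hypothetical helper name.

```javascript
// Assumed (hypothetical) response body:
//   {"status": 0, "hypotheses": [{"utterance": "hello world", "confidence": 0.9}]}
// Returns a clean array of hypotheses, or an empty array if the body is unusable.
function parseRecognitionResponse(jsonText) {
  let payload;
  try {
    payload = JSON.parse(jsonText);
  } catch (err) {
    return []; // not valid JSON
  }
  if (payload === null || typeof payload !== "object" ||
      !Array.isArray(payload.hypotheses)) {
    return []; // valid JSON, but not the shape we assumed
  }
  return payload.hypotheses
    .filter((h) => h !== null && typeof h === "object" &&
                   typeof h.utterance === "string")
    .map((h) => ({
      utterance: h.utterance,
      // Treat a missing or non-numeric confidence as 0.
      confidence: typeof h.confidence === "number" ? h.confidence : 0,
    }));
}

const body = '{"status":0,"hypotheses":[{"utterance":"hello world","confidence":0.9}]}';
console.log(parseRecognitionResponse(body)[0].utterance); // prints "hello world"
```

Returning an empty array instead of throwing keeps the calling code simple: "nothing recognized" and "bad response" can be handled the same way.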
There are other approaches as well that do not involve HTML5 or the SPEECH INPUT API. They bring speech recognition to the web through different implementations, but they rely on the same server-based technique discussed above: a Flash-based component on the web page captures the audio, streams it to the provider's servers and gets the result back from them.
* Note that there can be, and will be, many more ways to implement speech recognition on the web. These are the ones I came across.
HTML5 is a language for structuring and presenting content for the World Wide Web, a core technology of the Internet. It is the fifth revision of the HTML standard. HTML5 adds many new syntactical features and introduces a number of new elements and attributes that reflect typical usage on modern websites. In addition to specifying markup, HTML5 specifies scripting application programming interfaces (APIs). Many of HTML5's new features cannot be used without CSS and JS, so, put bluntly, HTML5 = HTML + CSS + JS. Knowing a little about HTML will help us better understand the details of SPEECH INPUT.
The <input> HTML element (<input type="text" name="text 1">) is extended in the HTML5 SPEECH INPUT specification to allow speech recognition and input. The input element is the one extended because the aim of the API is to allow data to be entered by voice, which also explains the name "SPEECH INPUT API".
* So a basic knowledge of the <input> element is needed. Full practical details will be discussed at a later stage.
For a working demonstration of the SPEECH API, visit: http://www.robinrizvi.info/speechapidemo/
In this demonstration I present an example of navigating a website by voice commands, i.e. the user speaks the name of a link into the microphone to navigate to that link.
The main line in the above HTML code that does the magic is:
<input id="speech" type="text" speech="speech" x-webkit-speech="x-webkit-speech" onspeechchange="processspeech();" onwebkitspeechchange="processspeech();" />
Here the <input> HTML element is extended with additional attributes and events so that the speech functionality can be achieved. The new attributes and events are:
speech="speech": tells the browser that this is not a normal <input> element but one that can take input by speech. This adds a small microphone icon to the right of the <input> element, which can be clicked so that the browser captures voice from the microphone.
x-webkit-speech="x-webkit-speech": a redundant attribute that will possibly be removed. It is not in the draft specification, but it is necessary for the demonstration to work because Google Chrome recognizes the x-webkit-speech attribute instead of the speech attribute. It is simply speech prefixed with x-webkit. It's just a difference in the name used by the browser's engine, nothing very special.
* For extra background, WebKit is the browser engine (also called the layout engine or rendering engine) of the Google Chrome web browser. Each browser has an underlying engine that interprets HTML, CSS and JS and lays out the elements on the screen. For instance, Gecko is the layout engine of Firefox, Trident is the layout engine of Internet Explorer and Presto is the engine of Opera. These layout engines are the core of any web browser, and several of them, including Gecko and WebKit, are open source.
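Because the attribute name differs between the draft and Chrome, a page may want to check which one, if either, the browser understands before relying on it. The sketch below is one way to do that. It assumes the attributes are reflected as the DOM properties speech and webkitSpeech on an input element; the first is my guess from the draft's attribute name, the second is how I understand Chrome to expose it. detectSpeechAttribute is a hypothetical helper name.

```javascript
// Returns the attribute name to use, or null if speech input looks unavailable.
// Assumption: the HTML attributes are reflected as the DOM properties
// `speech` (draft name) and `webkitSpeech` (Chrome's prefixed name).
function detectSpeechAttribute(doc) {
  const probe = doc.createElement("input");
  if ("speech" in probe) {
    return "speech";
  }
  if ("webkitSpeech" in probe) {
    return "x-webkit-speech";
  }
  return null;
}

// In a browser, pass the real document; outside a browser there is nothing to probe.
if (typeof document !== "undefined") {
  console.log("Speech attribute:", detectSpeechAttribute(document));
}
```

Passing the document in as a parameter, instead of reading the global directly, makes the function easy to try out with a stand-in object.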
onspeechchange="processspeech();": subscribes the processspeech() event handler to the speechchange event, which occurs when speech input changes the value of the <input> element. processspeech() is just a function name and could have been anything else.
onwebkitspeechchange="processspeech();": a redundant event, just like the redundant attribute discussed above. It is necessary for the demonstration to work because Google Chrome recognizes the onwebkitspeechchange event instead of the onspeechchange event.
* This pattern of a redundant attribute and event may seem familiar if you have worked with CSS properties that are prefixed with -moz and work only in Mozilla/Gecko browsers, such as -moz-transform.
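The same double subscription can also be done from script instead of inline attributes. This is a minimal sketch, not the demo's code: attachSpeechHandler is a hypothetical helper, and it assumes the event types are the attribute names without the "on" prefix (speechchange and webkitspeechchange).

```javascript
// Subscribes one handler to both the draft event and Chrome's prefixed event.
// Assumption: event type names are the on* attribute names without "on".
function attachSpeechHandler(input, handler) {
  const eventNames = ["speechchange", "webkitspeechchange"];
  eventNames.forEach((eventName) => {
    input.addEventListener(eventName, handler, false);
  });
  return eventNames;
}

// Browser usage: find the demo's input (id="speech") and log its new value.
if (typeof document !== "undefined") {
  const input = document.getElementById("speech");
  if (input) {
    attachSpeechHandler(input, (event) => {
      console.log("Recognized text:", event.target.value);
    });
  }
}
```

If a browser ever fired both events for one utterance, the handler would run twice, so a handler with side effects such as navigation may want to guard against that.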
The second section of the code is the processspeech() event handler for the speechchange event. After the speech is converted to text and stored in the input text box, the event handler is executed. The rest of the code is quite easy to understand, so I will not discuss it here.
* To keep the content concise, I will not discuss the various animations of the interface that I built.
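To give an idea of what a handler like processspeech() has to do, here is a minimal sketch of mapping the recognized text to a link. It is not the demo's actual code: the command table, the URLs and the names normalize, resolveCommand and handleSpeech are all hypothetical and chosen only for illustration.

```javascript
// Hypothetical command table (spoken phrase -> URL); not taken from the demo.
const COMMANDS = {
  "home": "/index.html",
  "about": "/about.html",
  "contact us": "/contact.html",
};

// Lower-cases the text and collapses punctuation and whitespace to single spaces.
function normalize(text) {
  return String(text).toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Returns the URL of the first phrase found as whole words in the spoken text,
// or null when nothing matches.
function resolveCommand(spokenText, commandTable) {
  const padded = " " + normalize(spokenText) + " ";
  for (const phrase of Object.keys(commandTable)) {
    if (padded.includes(" " + normalize(phrase) + " ")) {
      return commandTable[phrase];
    }
  }
  return null;
}

// Sketch of a handler body: read the recognized text from the input and navigate.
// `navigate` is injected so the function can be exercised outside a browser.
function handleSpeech(input, navigate) {
  const url = resolveCommand(input.value, COMMANDS);
  if (url !== null) {
    navigate(url);
  }
  return url;
}

console.log(resolveCommand("Go to Contact Us, please", COMMANDS)); // prints "/contact.html"
```

In a page, navigate could simply set window.location.href. Matching on whole words avoids false hits such as "homework" triggering the "home" command.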
CSS does not play any significant role in the SPEECH INPUT API. Speech input is all about the HTML <input> element and the JS that handles the events triggered by that element. I have used CSS here only to hide the textbox associated with the <input> element and show just the microphone icon to its right. I also scale the microphone so that it looks bigger, and replace the text cursor that appears when hovering over the microphone with a hand cursor.
* I have not discussed all the CSS that was used to style and position the speech input element. Just take a look at the source while you are viewing the demonstration.
Please drop me an email at firstname.lastname@example.org. It would be appreciated.