Tuesday, 5 April 2011

GSoC idea: Ubiquitous Speech Recognition

The Google Summer of Code application period for students closes in a couple of days and I still have one last idea for simon for any student still looking for a project: Ubiquitous Speech Recognition.

Some of you might know that simon already supports recording (and recognizing) from multiple microphones simultaneously. Sound cards and microphones are comparatively cheap, and the server / client architecture of simon would even allow for input from mobile phones, other PCs, etc.

We also have gadgets and home appliances getting smarter and smarter every year. KNX is getting increasingly popular, is already included in many new electrical installations and allows home automation for a very fair price.

Voice control is an intuitive way to interact with all kinds of devices and - compared to alternatives like touch screens - also quite cheap. simon already has more than enough interfaces to connect to your favorite home automation controllers / hardware interfaces, which is something people are already doing.

However, speech recognition has traditionally relied on controlled environments. False positives are still a major issue, and recognition accuracy depends on being optimized for a certain situation.

Still: Adapting the recognition to certain situations is already part of another GSoC idea (that fortunately already has a very promising student attached to it) so that leaves the voice activity detection part as the remaining hassle.

The voice activity detection (in short: VAD) tells the system when to listen to the user and tries to distinguish between background noise and user input. Normally this is just one comparatively minor part of a speech recognition system, but when your whole apartment (or at least part of it) is listening for voice input, this becomes kind of important :).

The current system in simon just compares the current "loudness" to a configurable threshold. This is fine for headset users but almost useless in the above scenario.
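To make the limitation concrete, here is a minimal sketch of a loudness-threshold VAD. It is not simon's actual code: the function names, the frame representation (a list of samples between -1.0 and 1.0) and the threshold value are assumptions made for illustration.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_voice(frame, threshold=0.1):
    """Treat a frame as user input if its loudness exceeds the threshold."""
    return rms(frame) > threshold


# Hypothetical frames: a quiet room, a spoken command, and loud music.
quiet = [0.01, -0.02, 0.015, -0.01]
speech = [0.4, -0.5, 0.45, -0.35]
music = [0.3, -0.3, 0.3, -0.3]

for label, frame in (("quiet", quiet), ("speech", speech), ("music", music)):
    print(label, is_voice(frame))
```

The music frame passes the check just like the speech frame does, which is exactly why a pure loudness threshold breaks down once the microphone is no longer right in front of the user's mouth.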

And here is where it's your turn to get creative: Try to find a novel approach to separate voice commands from background noise.

For example: Use webcams and computer vision algorithms to determine if the user is even near a microphone at the time of the heard "command".

You could also define "eye contact" with a camera as the signal to activate the recognition. Or maybe you could deactivate the system unless the user raises a hand before speaking?

Another idea would be to let different microphones work together and subtract the similarities (to filter out global noise).
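A rough sketch of that idea, under strong assumptions: all microphones are perfectly time-aligned and have the same gain, and the "global" noise reaches every microphone equally. Real rooms are not that kind, so take this as an illustration of the principle only. All names here are made up for the example.

```python
def remove_common(channels):
    """Subtract the per-sample average of all channels from each channel."""
    count = len(channels)
    length = min(len(c) for c in channels)
    common = [sum(c[i] for c in channels) / count for i in range(length)]
    return [[c[i] - common[i] for i in range(length)] for c in channels]


def energy(samples):
    """Sum of squared samples, used as a crude activity measure."""
    return sum(s * s for s in samples)


# Hypothetical signals: noise that every microphone hears, and a voice
# that only the kitchen microphone picks up.
noise = [0.2, -0.2, 0.2, -0.2]
voice = [0.5, 0.4, -0.5, -0.4]

kitchen = [n + v for n, v in zip(noise, voice)]
living_room = list(noise)
hallway = list(noise)

residuals = remove_common([kitchen, living_room, hallway])
energies = [energy(r) for r in residuals]
active = energies.index(max(energies))
print(active)  # index of the microphone with the most local activity
```

With three or more microphones the one closest to the speaker keeps the largest residual after the shared part is removed. With only two microphones the residuals would be mirror images of each other, so this simple averaging could not tell them apart.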

You can also use noise conditioning to remove the music playing over the PC speakers automatically from the input signal.
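One common way to do this kind of cancellation is an adaptive filter that learns how the known playback signal shows up in the microphone and subtracts its estimate. Below is a small sketch using the normalized LMS (NLMS) algorithm. This is my choice of technique for the example, not something simon implements; the tap count, step size and the simulated "room" are assumptions.

```python
import random


def nlms_cancel(reference, microphone, taps=4, mu=0.5, eps=1e-8):
    """Remove the part of `microphone` that is predictable from `reference`."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    cleaned = []
    for ref, mic in zip(reference, microphone):
        history = [ref] + history[:-1]
        estimate = sum(w * x for w, x in zip(weights, history))
        error = mic - estimate  # what is left after removing the music
        norm = eps + sum(x * x for x in history)
        weights = [w + mu * error * x / norm
                   for w, x in zip(weights, history)]
        cleaned.append(error)
    return cleaned


# Simulated setup: the PC plays `music`, and the microphone hears it
# slightly attenuated plus one delayed reflection (a made-up room).
rng = random.Random(0)
music = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.6 * music[n] + 0.3 * (music[n - 1] if n > 0 else 0.0)
       for n in range(2000)]

cleaned = nlms_cancel(music, mic)
before = sum(m * m for m in mic[-200:])
after = sum(c * c for c in cleaned[-200:])
print(after < before)
```

Once the filter has adapted, whatever remains in the cleaned signal is what the playback cannot explain, which in a real setup would ideally be the user's voice.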

Or why not use the reception strength of the user's Bluetooth phone to determine which room they are currently in?

Bonus points for coming up with other ideas in the comment section!

Monday, 4 April 2011

GSoC idea: Voice Control for the Linux Desktop

As this has worked so perfectly the last time, I want to use this blog post to present another idea for the Google Summer of Code 2011 that has not yet found an interested student.

The simon system currently has plugins to trigger shortcuts, simulate clicks and interact directly with applications through IPC technology like DBus and JSON. This makes simon perfect for interacting with a vast variety of applications as long as it is configured for each application beforehand.

To counteract this, we have the scenario system that allows users to exchange such configurations online. This repository already covers many of the "standard" applications.
Still: The user has to actively pick which applications to control. If there is no scenario available for an application, things get a bit more complicated.

So how could we create dynamic scenarios that allow the user to control new applications without configuring anything?

Well let's look at what's needed to voice control an application.

First of all, we need to know what options are currently available.

Let's look at KWrite as an example application:

Just looking at the screenshot a human can quickly tell that there are at least the following commands: "New", "Open", "Save", "Save As", "File", "Edit", etc.

Well if screenreaders can read those options to the user, why shouldn't simon parse them automatically as well?

With the upcoming AT-SPI-2 and the Qt accessibility bridge, the user interface (including buttons, menu items, etc.) is exported over DBus.

As elements can also be triggered (clicked / selected) over this interface, simon can easily "read" running applications and create appropriate commands.
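The following sketch shows what "reading" an application could look like. It deliberately does not use the real AT-SPI bindings: the `Widget` class, the role strings and the example tree are stand-ins invented for this post, so that the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Widget:
    """Hypothetical stand-in for an accessible UI element."""
    name: str
    role: str
    children: list = field(default_factory=list)


# Roles we assume can be triggered by voice (illustrative, not exhaustive).
ACTIONABLE_ROLES = {"menu", "menu item", "push button"}


def collect_commands(widget, path=()):
    """Map each spoken trigger to the path of the widget it activates."""
    commands = {}
    here = path + (widget.name,)
    if widget.role in ACTIONABLE_ROLES and widget.name:
        commands.setdefault(widget.name, here)
    for child in widget.children:
        for trigger, target in collect_commands(child, here).items():
            commands.setdefault(trigger, target)  # first match wins
    return commands


# A made-up slice of what a KWrite window might expose.
window = Widget("KWrite", "frame", [
    Widget("", "menu bar", [
        Widget("File", "menu", [
            Widget("New", "menu item"),
            Widget("Open", "menu item"),
            Widget("Save", "menu item"),
            Widget("Save As", "menu item"),
        ]),
        Widget("Edit", "menu"),
    ]),
    Widget("", "tool bar", [Widget("New", "push button")]),
])

commands = collect_commands(window)
print(sorted(commands))
```

Here "New" appears both in the menu and on the toolbar; the sketch simply keeps the first one it finds. A real implementation would have to decide how to handle such duplicates and would trigger the element through the accessibility interface instead of just storing a path.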

Best of all: Because screenreaders are well established, many applications already make sure that this will work properly.

Vocabulary and Grammar
Now that we have our commands in place, simon still needs to recognize all those words ("New", "Save", etc.) that are probably not in the user's active vocabulary.

As speech recognition systems need a phonetic description of each word, that is not trivial.

...if it weren't for Sequitur. Sequitur is a grapheme-to-phoneme converter that translates any given text to a phonetic description.

The system can be compared to a native speaker: Even if you have never heard a word spoken out loud you still have at least a rough idea about how to pronounce it. That's because there are certain rules in any language that you know even if you aren't aware of them.
Sequitur works in much the same way: it learns those rules by reading large dictionaries. With the generated model it can transcribe even words that were not in the input dictionary.

In our tests, Sequitur proved to be very reliable, accurate and quite fast.

simon already allows the user to specify a dictionary large enough to act as the information source for sequitur: The shadow dictionary. Because there are already import mechanisms for most major pronunciation dictionary formats, there is more than enough raw material to "feed" to sequitur already available.

Now that we have the vocabulary, setting up an appropriate grammar is very easy. Just make sure that all the sentences of the created commands are allowed.
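As a rough sketch of what "allowing all the sentences" could mean: put every command word into one word category and allow every sentence length that occurs among the commands. The single "Command" category and the data layout are assumptions for this example and not simon's actual grammar format.

```python
def build_grammar(commands, category="Command"):
    """Derive a vocabulary and the allowed sentence structures."""
    vocabulary = {}
    structures = set()
    for sentence in commands:
        words = sentence.lower().split()
        for word in words:
            vocabulary[word] = category
        structures.add(tuple(category for _ in words))
    return vocabulary, structures


def is_allowed(sentence, vocabulary, structures):
    """Check whether the grammar accepts the given sentence."""
    words = sentence.lower().split()
    if any(word not in vocabulary for word in words):
        return False
    return tuple(vocabulary[word] for word in words) in structures


vocabulary, structures = build_grammar(["New", "Open", "Save", "Save As"])
print(is_allowed("Save As", vocabulary, structures))
print(is_allowed("Close", vocabulary, structures))
```

Such a simple grammar also accepts word combinations that are not real commands (for example "Open New"). That is harmless as long as only known commands trigger an action, but a finer set of categories would reduce false positives.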

For static models no training data is required so that's all that'd be needed.

With a combination of AT-SPI-2 and Sequitur one could quite easily extend the current simon version to automatically create working voice commands for all standard widgets of running applications.

This allows the user of a static model to comfortably use any application without any application-specific configuration at all.

Because AT-SPI-2 is a freedesktop.org standard, the resulting system would automatically work with Qt and KDE applications as well as GNOME applications.

If you are interested in working on this idea, please send me an email.

Saturday, 2 April 2011

GSoC idea: Crowdsourcing Speech Model Training

There is still a week left for students to apply for Google's annual Summer of Code.

Following Lydia's recommendation on the mailing list, I've decided to use this blog for the remainder of the application period to showcase some ideas for simon that are not yet taken by any student. If you'd like to implement one of those ideas, please feel free to send me a mail at grasch ate simon-listens ° org.

The first idea that is still up for grabs is simon's Voxforge integration. Voxforge is an ambitious project to create free (GPL) speech models for everyone. With the current Voxforge models, simon can already be used without any training at all. Just download simon and the appropriate model for your language from the Voxforge website and start talking to your computer.

This works because the Voxforge models have been trained with lots and lots of voice recordings from people around the world. The resulting model is speaker-independent and works quite well for most people. If you need even more accuracy, just adapt the general model to your voice with a couple of training sessions and you are ready to go.

The current Voxforge model for English is quite good for command and control but nowhere near powerful enough for dictation. The models for other languages consist of even fewer samples. In the last five years, 624 identified users submitted voice recordings for the English model. Only 50 identified people submitted recordings for the German Voxforge model.

I think this is primarily because donating voice recordings (through the Java applet on the Voxforge homepage) is only done by those who are actively searching for ways to improve open source speech recognition. There is also no immediate payoff for the donors.

simon, on the other hand, reaches a wide array of people interested in open source speech recognition: more than 24,000 in the past 12 months.

Many of those users train simon to get the most out of their system. But those training samples never get submitted to Voxforge to improve the general model because there is no easy way to do that.

I propose to implement an easy-to-use uploading system that allows users to submit their training samples directly to the Voxforge corpus with the press of a button.

Together with an automatic download of the Voxforge model for a selected language when simon is launched for the first time, this means that simon users can:
1. Get started with the general model even more easily because they don't have to download it manually
2. Train their model locally if the recognition rate is too low (and in our experience they often will)
By submitting the recorded samples for the local training back to Voxforge, they not only submit valuable recordings - more often than not they would even submit exactly those recordings that train words that couldn't be recognized with the previous Voxforge model.

And because users can immediately see if their samples are helping or hurting (by checking if the recognition rate improves locally), the generated submissions should be fairly high quality. There is even an immediate advantage for the end-user (their recognition rate improves).

If you are interested in working on this proposal, please contact me at grasch ate simon-listens ° org.