Tuesday, April 5, 2011

GSoC idea: Ubiquitous Speech Recognition

The Google Summer of Code application period for students closes in a couple of days and I still have one last idea for simon for any student still looking for a project: Ubiquitous Speech Recognition.

Some of you might know that simon already supports recording (and recognizing) from multiple microphones simultaneously. Sound cards and microphones are comparatively cheap, and the server / client architecture of simon would even allow for input from mobile phones, other PCs, etc.

We also have gadgets and home appliances getting smarter and smarter every year. KNX is getting increasingly popular, is already included in many new electrical installations and allows home automation for a very fair price.

Voice control is an intuitive way to interact with all kinds of devices and, compared to alternatives like touch screens, also quite cheap. simon already has more than enough interfaces to connect to your favorite home automation controllers / hardware interfaces, and people are already doing exactly that.

However, speech recognition has traditionally relied on controlled environments. False positives are still a major issue, and recognition accuracy depends on the system being optimized for a certain situation.

Still: Adapting the recognition to certain situations is already part of another GSoC idea (that fortunately already has a very promising student attached to it), so that leaves the voice activity detection part as the remaining hassle.

The voice activity detection (in short: VAD) tells the system when to listen to the user and tries to distinguish between background noise and user input. Normally this is just one comparatively minor part of a speech recognition system, but when your whole apartment (or at least parts of it) is listening for voice input, this becomes kind of important :).

The current system in simon just compares the current "loudness" to a configurable threshold. This is fine for headset users but almost useless in the above scenario.
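To make the limitation concrete, here is a minimal sketch of a loudness-threshold VAD. It is not simon's actual code; the function names, the frame length and the threshold value are assumptions chosen for illustration.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def threshold_vad(samples, frame_len=160, threshold=0.1):
    """Return one boolean per full frame: True where the RMS level
    exceeds the (hypothetical) configured threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        flags.append(rms(samples[start:start + frame_len]) > threshold)
    return flags

# Synthetic input at 16 kHz: a quiet hum followed by a louder tone.
quiet = [0.01 * math.sin(2 * math.pi * 50 * n / 16000) for n in range(320)]
loud = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
vad_flags = threshold_vad(quiet + loud)
```

Note that the "loud" part here is a plain tone, not speech. Anything above the threshold is treated as user input, which is exactly why this approach breaks down once the microphone sits in a living room instead of on a headset.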

And here is where it's your turn to get creative: Try to find a novel approach to separate voice commands from background noise.

For example: Use webcams and computer vision algorithms to determine if the user is even near a microphone at the time of the heard "command".

You could also define "eye contact" with a camera as the signal to activate the recognition. Or maybe you could deactivate the system unless the user raises a hand before speaking?

Another idea would be to let different microphones work together and subtract the similarities (to filter out global noise).

You can also use noise conditioning to remove the music playing over the PC speakers automatically from the input signal.

Or why not use the reception strength of the user's Bluetooth phone to determine which room they are currently in?

Bonus points for coming up with other ideas in the comment section!

Comments:

toddrme2178 said…

There are already standard and less standard algorithms for determining the intelligibility of a certain segment of speech. They aren't perfect, but running the speech through such an algorithm and setting a threshold for intelligibility would be an option.

Humans usually extract speech from complex environments using sound localization cues. In other words, they separate sounds coming from a particular location from sounds coming from other locations. With your two-microphone idea, you could use the arrival time of the sounds at the two microphones and then check whether it matches the direction of a face seen by a webcam (this would need to be calibrated so that it knows the angle of the webcam relative to the microphone positions). This would need timing information with millisecond to tens-of-milliseconds accuracy, depending on the microphone separation. A military robot is already using this technique for locating the direction of sniper fire. This becomes more difficult in a highly reflective environment, though.
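The arrival-time idea can be sketched with a brute-force cross-correlation between two microphone signals. This is only an illustration under simple assumptions (a single far-field source, no reflections, a speed of sound of about 343 m/s); `best_lag` and `arrival_angle` are hypothetical helper names, not part of simon or Julius.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, approximate value for room-temperature air

def best_lag(a, b, max_lag):
    """Lag in samples by which signal b trails signal a, found by
    trying every lag in [-max_lag, max_lag] and keeping the best match."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(a)):
            m = n + lag
            if 0 <= m < len(b):
                score += a[n] * b[m]
        if score > best_score:
            best, best_score = lag, score
    return best

def arrival_angle(lag, sample_rate, mic_distance):
    """Angle in degrees away from straight ahead (broadside) of the
    microphone pair, assuming a distant source."""
    ratio = lag / sample_rate * SPEED_OF_SOUND / mic_distance
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding
    return math.degrees(math.asin(ratio))

# Synthetic test: microphone B hears the same noise 5 samples later.
rng = random.Random(0)
mic_a = [rng.uniform(-1.0, 1.0) for _ in range(400)]
mic_b = [0.0] * 5 + mic_a[:-5]
lag = best_lag(mic_a, mic_b, max_lag=10)
angle = arrival_angle(lag, sample_rate=16000, mic_distance=0.2)
```

With the two microphones 0.2 m apart, the largest possible delay is roughly 0.6 ms, so the required timing resolution shrinks as the microphones move closer together. The resulting angle could then be compared with the direction of a face reported by the webcam.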

Speech also tends to have different statistics than most noise, so you could just look for those statistics.

Many types of background noise are fairly constant and repetitive (fluorescent lights, air conditioners, refrigerators). You could break sound into several-second chunks and only even begin to analyze the results when the statistics of a chunk differs significantly from those of the average of the last several chunks. It would continue analyzing results until it reaches a chunk that was similar to the chunks before it was triggered.
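A rough sketch of this chunk-based trigger follows. Mean energy stands in for "the statistics" purely as an assumption; the function names, the history length and the ratio are likewise made up for the example.

```python
import math

def chunk_energy(chunk):
    """Mean squared amplitude of one chunk of samples."""
    return sum(s * s for s in chunk) / len(chunk)

def differential_trigger(chunks, history=3, ratio=2.0):
    """Flag chunks whose energy differs from the average of the last
    `history` untriggered chunks by more than a factor of `ratio`.
    The background estimate is frozen while triggered, so flagging
    stops once a chunk resembles the pre-trigger background again."""
    background = []
    flags = []
    for chunk in chunks:
        e = chunk_energy(chunk)
        if len(background) < history:
            # Still learning what the background sounds like.
            background.append(e)
            flags.append(False)
            continue
        avg = sum(background) / len(background)
        triggered = e > avg * ratio or e < avg / ratio
        flags.append(triggered)
        if not triggered:
            background = background[1:] + [e]
    return flags

def tone(amplitude, length=100, cycles=5):
    """Synthetic chunk: a sine wave with a whole number of cycles."""
    return [amplitude * math.sin(2 * math.pi * cycles * n / length)
            for n in range(length)]

hum, voice = tone(0.1), tone(0.5)
trigger_flags = differential_trigger([hum] * 4 + [voice] * 2 + [hum] * 2)
```

Only the flagged chunks would be handed on to the more expensive analysis. A real version would likely compare spectral statistics rather than plain energy.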

Anonymous said…

What about using Mel Frequency Cepstral Coefficients with a Gaussian Mixture Model / k-Nearest Neighbour classification system?
For the distinction speech <-> non-speech, the correct classification rates should be pretty good.

Peter Grasch said…

@toddrme2178: Good idea, but I wonder if the speech intelligibility algorithms work when there is music playing in the background. Analytically, many types of music should probably provide "fingerprints" fairly similar to those of human speech. But that's just a guess, not having looked at the algorithms yet.

The localization part is interesting, but I wouldn't raise it to military levels of precision :). The recognition should be robust enough to cope with a little bit of background noise, so this could be simplified: subtract the signal from microphones farther away, phase shifted with a pre-determined time delay for different "zones" of the room. Using a simple tracking algorithm, it should be fairly easy to detect which zone the user is currently in if there is only one user.

Noise that is completely different from voice is generally not the problem (spectral subtraction is already supported by Julius).

The differential VAD is a good idea. It would be interesting to see how it'd cope with changing background noises, though (again, music could be a problem).

@Anonymous: Yes, that is something that has already been suggested to us, but we haven't gotten around to implementing it yet.

@Everyone: I'll add that I'd offer to mentor anyone interested in this or any of the other ideas posted on this blog not only as part of a GSoC project. If you are interested just drop me a line at grasch ate simon-listens°org.

toddrme2178 said…

"Analytically, many types of music should probably provide "fingerprints" fairly similar to those of human speech. But that's just a guess, not having looked at the algorithms yet."

I wouldn't assume this personally. The point is to determine how easy it is to understand speech, so if the algorithms are working properly music shouldn't score highly (but that depends on how effective the algorithms are).

As for localization, I would look into the Equalization-Cancellation model, which is a multi-microphone sound localization and noise reduction model.

"It would be interesting to see how it'd cope with changing background noises, though (again, music could be a problem)."

It wouldn't; this would just be a lightweight first pass to determine whether the system should begin using the other, more complex and likely more costly algorithms. It would have lots of false positives, but it shouldn't have many, if any, misses.

Anonymous said…

Hi, I'm currently looking to set up Simon to do something similar (full-room voice recognition).

I was wondering already if it would be useful to use the post-processing step to filter out certain "constant" noises such as running fans. When you say "spectral subtraction is already supported by Julius", does that mean you're already doing this?

About music: one of the things I want to control with Simon is a music server. If Simon and the music player are running on the same machine, would it be possible to send the music stream to Simon as well, so it can subtract it from the microphone stream? It might need a delay of some milliseconds, or adaptations to the frequency spectrum if the music is partly blocked/reflected before it reaches the microphone, but I guess those are problems that have already been solved by echo cancellation algorithms.

Another more generic option would be to put microphones close to noise sources (speakers, washing machine, ...) and subtract those streams from the command stream. It would potentially need a LOT more microphones, and I don't know how much CPU power this would require...
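For readers wondering what "subtracting the music stream" could look like, below is a toy sketch using a normalized LMS (NLMS) adaptive filter, a common building block of echo cancellers. It is not something simon provides; `nlms_cancel`, the tap count and the step size are assumptions, and a real room would need far more taps and care.

```python
import random

def nlms_cancel(mic, reference, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively filtered copy of `reference` (the music
    sent to the speakers) from `mic` and return what is left over."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference sample first
    residual = []
    for m, r in zip(mic, reference):
        history = [r] + history[:-1]
        estimate = sum(w * x for w, x in zip(weights, history))
        error = m - estimate  # microphone signal minus estimated music
        norm = sum(x * x for x in history) + eps
        weights = [w + mu * error * x / norm
                   for w, x in zip(weights, history)]
        residual.append(error)
    return residual

def energy(signal):
    """Sum of squared samples."""
    return sum(s * s for s in signal)

# Synthetic test: the microphone hears the music 3 samples late and
# attenuated to 60 %, with nobody speaking.
rng = random.Random(1)
music = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.0] * 3 + [0.6 * s for s in music[:-3]]
residual = nlms_cancel(mic, music)
```

In this idealized case the filter learns the delay and attenuation by itself, so the residual fades to almost nothing. Speech from the user would not correlate with the music and would therefore remain in the residual.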

Greets, seven

Peter Grasch said…

Hi,

Spectral subtraction is not activated by default. To activate it, open the file ~/.kde/share/apps/simond/models/default/active/julius.jconf and uncomment the switches -sscalc and -sscalclen 300.

> If Simon and the music player are running on the same machine,
> would it be possible to send the music stream to Simon as well,
> so it can subtract it from the microphone stream?
No such mechanism has been implemented in simon yet, but it might be doable on a lower level (using a pseudo-input device that does this internally). You might want to check out LADSPA, though I don't know if they provide echo cancellation (http://www.ladspa.org/). Note, however, that this is really not as easy as it usually sounds :)

> Another more generic option would be to put microphones close to noise sources (speakers, washing machine, ...)...
Generally, there are (AFAIK) two approaches: Measure sounds where they happen or measure them near your microphone. I could imagine two identical microphones right next to each other: one pointing to the speaker, one pointing away. This is usually also the tactic that e.g. noise canceling headsets use.

This brings me to my final point: You might find a noise cancelling mic that already includes such technology on the hardware side...

Best regards,
Peter

Anonymous said…

Hi,

you wrote: "KNX is getting increasingly popular, is already included in many new electrical installations and allows home automation for a very fair price"

As far as I can see, KNX is anything but a cheap solution. Have a look at the annual membership fee at: http://www.knx.org/knx-members/joining-fees/

Have a nice day,

Miguipda ;-)

Peter Grasch said…

You are pointing to the membership fees for joining the KNX organization. This organization is to the standard about what the W3C is to HTML.

KNX _products_ are fairly affordable, though.

Best regards,
Peter