Saturday, 2 April 2011

GSoC idea: Crowdsourcing Speech Model Training

There is still a week left for students to apply for Google's annual Summer of Code.

Following Lydia's recommendation on the mailing list, I've decided to use the remainder of the application period to showcase some ideas for simon on this blog that have not yet been taken by any student. If you'd like to implement one of those ideas, please feel free to send me a mail at grasch ate simon-listens ° org.

The first idea that is still up for grabs is simon's Voxforge integration. Voxforge is an ambitious project to create free (GPL) speech models for everyone. With the current Voxforge models, simon can already be used without any training at all. Just download simon and the appropriate model for your language from the Voxforge website and start talking to your computer.

This works because the Voxforge models have been trained with lots and lots of voice recordings from people around the world. The resulting model is speaker-independent and works quite well for most people. If you need even more accuracy, just adapt the general model to your voice with a couple of training sessions and you are ready to go.

The current Voxforge model for English is quite good for command and control but nowhere near powerful enough for dictation. The models for other languages consist of even fewer samples. In the last five years, 624 identified users submitted voice recordings for the English model. Only 50 identified people submitted recordings for the German Voxforge model.

I think this is primarily because voice is only donated (through the Java applet on the Voxforge homepage) by those who are actively searching for ways to improve open source speech recognition. There is also no immediate payoff for the donors.

simon, on the other hand, reaches a wide array of people interested in open source speech recognition: more than 24,000 in the past 12 months.

Many of those users train simon to get the most out of their system. But those training samples never get submitted to Voxforge to improve the general model because there is no easy way to do that.

I propose to implement an easy-to-use upload system that allows users to submit their training samples directly to the Voxforge corpus with the press of a button.

Together with an automatic download of the Voxforge model for the selected language when simon is launched for the first time, this means that simon users can:
1. Get started with the general model even more easily, because they don't have to download it manually.
2. Train their model locally if the recognition rate is too low (and in our experience they often will).
By submitting the samples recorded for local training back to Voxforge, they not only contribute valuable recordings; more often than not, they contribute exactly those recordings that train the words the previous Voxforge model couldn't recognize.

And because users can immediately see whether their samples are helping or hurting (by checking if the recognition rate improves locally), the generated submissions should be of fairly high quality. There is even an immediate advantage for the end user: their recognition rate improves.

If you are interested in working on this proposal, please contact me at grasch ate simon-listens ° org.


Petr said…

I get it! It first asks for the gender and age, and then says it won't send personal information. Subtle!
Hopefully that's an April Fools' Day only feature :)

jospoortvliet said…

Peter: that is such an AWESOME idea! Crowdsourcing those voices in such a way would help voxforge enormously and get us much better out-of-the-box results. Way to go, I hope you find a student!

Peter Grasch said…

@jospoortvliet: Thanks!
We already have an interested student who has started digging into the KDE documentation (it's his first KDE project).
This is going to be a great summer :)

Anonymous said…

I once talked to the maintainer of Voxforge, and one of the most important things to do is to review all the voice submissions. To raise the quality, there needs to be a system to flag unintelligible, noisy or wrong submissions. Until that happens, more submissions will not result in a better model. Still, many submissions are of course good for the project, the more the better, and with more diversity, different models could be built (male, female, young, dialect, etc.).

Peter Grasch said…

Yes, of course there needs to be a process in place to ensure the quality of the submitted samples.

However, there are two reasons why this isn't just a dump of a lot of data:
1. Normally, users don't train simon for training's sake. As long as the model doesn't work for them, I doubt many would share their samples by uploading them to the internet, so we shouldn't get any completely wrong samples.
2. During training, simon actually checks the recorded samples for an appropriate signal-to-noise ratio, clipping, etc., so there is another layer of quality control already in place.

The recorded samples should therefore be at least as good as the ones collected from the web interface.
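To give a rough idea of what such a check on a recording can look like, here is a minimal Python sketch. It is not simon's actual implementation: the function names, the frame size and the thresholds are all assumptions made for illustration, and the signal-to-noise estimate simply compares the loudest frame of the recording with the quietest one.

```python
import math


def clipping_ratio(samples, limit=0.99):
    """Fraction of samples at or beyond the clipping limit.

    Assumes samples are floats normalized to the range [-1.0, 1.0].
    """
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)


def estimate_snr_db(samples, frame_size=160):
    """Crude SNR estimate in dB: loudest frame energy vs. quietest frame energy.

    This is a simplification chosen for the sketch; it treats the quietest
    frame as background noise and the loudest frame as speech.
    """
    frames = [
        samples[i:i + frame_size]
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    if len(frames) < 2:
        return 0.0
    energies = sorted(sum(s * s for s in frame) / len(frame) for frame in frames)
    noise = max(energies[0], 1e-12)    # guard against log of zero
    signal = max(energies[-1], 1e-12)
    return 10.0 * math.log10(signal / noise)


def sample_is_acceptable(samples, min_snr_db=15.0, max_clipping=0.01):
    """Accept a recording only if it is neither clipped nor too noisy.

    Both thresholds are hypothetical values, not the ones simon uses.
    """
    return (
        clipping_ratio(samples) <= max_clipping
        and estimate_snr_db(samples) >= min_snr_db
    )
```

A recording that fails either test would be rejected (or re-recorded) before it could ever be offered for upload.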

To check the gathered samples the simon application suite already contains a little utility (haven't blogged about it yet) called "afaras" ("automatically find and remove amiss samples"). It provides a simple way to blacklist wrong samples.
Afaras can also read the build log of a model generation to automatically sort the samples you are checking according to the segmentation score they received during training. That way the recordings that the system thinks are wrong are first in your checking queue. Even if you only listen to the first 10% of all samples you should catch most odd ones.
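The idea of the sorted checking queue can be sketched in a few lines of Python. This is only an illustration, not afaras' code: it assumes the scores have already been extracted from the build log into (path, score) pairs, that a lower score means a more suspicious recording, and the function name and file names are made up.

```python
import math


def review_queue(scored_samples, fraction=0.1):
    """Return the most suspicious recordings, worst first.

    scored_samples: list of (path, segmentation_score) pairs.
    Assumption for this sketch: a lower score means a worse match.
    """
    if not scored_samples:
        return []
    ordered = sorted(scored_samples, key=lambda item: item[1])
    # Always review at least one sample, otherwise the given fraction.
    count = max(1, math.ceil(len(ordered) * fraction))
    return [path for path, _score in ordered[:count]]


# Hypothetical example data: ten recordings with made-up scores.
samples = [("sample_%d.wav" % i, score)
           for i, score in enumerate([0.9, 0.8, 0.2, 0.95, 0.7,
                                      0.85, 0.6, 0.99, 0.75, 0.88])]
to_check = review_queue(samples)
```

With ten recordings and the default fraction of 10%, only the single lowest-scoring file ends up in the queue.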

Best regards,

United against said…

How is this coming along? I would love to use it but see that simon has not been updated in a long time. My mother wants to use the computer and would feel better using her voice for everything rather than the keyboard and mouse. I know that Windows has this built in but I do not want to load Windows if I can help it.

Peter Grasch said…

I'm sorry, but simon cannot yet compete with Windows' built-in speech recognition and won't be able to in the near future.

For technical reasons, we can't provide free dictation (large vocabulary recognition).

Best regards,