Simon: Open-Source Speech Recognition: (Simond Model Caching)²

During last year's Google Summer of Code, Adam Nash developed a system for letting Simon react to changes in the computing context: Simon can, for example, change its scenario selection depending on the currently running applications or the name of the active window.

The system has been designed with extensibility in mind so that new conditions can be added easily.

Apparently this idea is interesting enough for Apple to try to patent it at the moment.

An easy way to implement context dependence would be to simply deactivate commands when they are not applicable. However, by dynamically creating speech models tailored to the current situation, the recognition rate can be improved considerably.

But creating context-dependent speech models leads to a problem: Building models is very time-consuming. As the context usually changes very often, the switch between speech models has to be fast.

To compensate, Adam developed a simple caching solution for Simond.
While it worked okay for most use cases, it was a bit buggy and the design had some issues. Because of that, it would have been very hard to switch the model compilation backend (e.g. to swap the HTK for SPHINX).

So during the recent refactoring, I also rewrote the context adaptation and caching system in Simond.

"So isn't this, like, really easy?"

The premise seems quite straightforward: Whenever the situation changes, try to find the new situation in a cache. If it is found, use the old model; if not, build a new one and add it to the cache.

However, it's not quite as simple: Input files may change very often, yet for many of those changes it's entirely predictable that the resulting model won't change. Architecturally speaking, which changes matter depends on the model creation backend (in this case the HTK), so an independent caching system can't really identify those situations.

The input files may even change during the model creation process.
An example: Someone with a user-generated model has two transcriptions for a single word but only training samples for one of them. Because the training data is transcribed on a word level, this can only be identified during the model creation. If a (tri)phone of the alternate (unused) transcription is now undefined (untrained), it needs to be removed from the training corpus. Associated grammar structures might now be invalid, etc. Again, this would mean that the caching system has to be integrated with the model creation backend.

But moving the model caching system to the backend isn't a nice solution either, as that would mean that each backend would need to implement its own cache.

"Oh..."

So to enable sensible caching with multiple backends, I ended up with a slightly more complicated, two-layered approach:

Model input files are assigned a unique fingerprint. Source files with the same fingerprint are guaranteed to produce the same speech model. The fingerprint is calculated by the model creation backend. This way, the calculation can take into account just those parts of the input files that affect the produced speech model.
In practice this means, for example, that changing command triggers or adding a grammar sentence with no associated words will produce the same fingerprint and therefore not trigger the costly re-creation of all associated models.
The current context is represented through "situations". The cache contains an association between situations and the fingerprint they will produce. Multiple situations might share the same fingerprint (the same speech model). Once a cached model no longer has any situations assigned to it, it will be removed from the cache.
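
To make the two layers a bit more concrete, here is a minimal sketch in Python. It is not Simond's actual code: the names `fingerprint`, `ModelCache`, `assign` and the layout of `model_input` are all made up for illustration, and the choice of which fields count as "relevant" is just an assumption that mirrors the command trigger example above.

```python
import hashlib
import json


def fingerprint(model_input):
    """Hash only the parts of the input assumed to affect the compiled model.

    Hypothetical layout: a dict with "words", "grammar", "training" and
    "command_triggers". The triggers are deliberately left out of the hash.
    """
    relevant = {
        "words": sorted(model_input["words"]),
        "grammar": sorted(model_input["grammar"]),
        "training": sorted(model_input["training"]),
    }
    blob = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha1(blob).hexdigest()


class ModelCache:
    """Layer 1: fingerprint -> model. Layer 2: situation -> fingerprint."""

    def __init__(self):
        self.models = {}      # fingerprint -> compiled model
        self.situations = {}  # situation -> fingerprint

    def assign(self, situation, fp, build):
        """Point a situation at a fingerprint; build only if the model is new."""
        old = self.situations.get(situation)
        if fp not in self.models:
            self.models[fp] = build()
        self.situations[situation] = fp
        if old is not None and old != fp:
            self._drop_if_unused(old)
        return self.models[fp]

    def _drop_if_unused(self, fp):
        # A model without any situation pointing at it leaves the cache.
        if fp not in self.situations.values():
            del self.models[fp]
```

In this sketch, two situations that resolve to the same fingerprint share one cache entry, so `build` runs only once for both of them.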

The resulting workflow looks something like this:

To ensure maximum responsiveness, Simond will try to update cached models when the associated input files change. So if you have three situations for your model and add some training data, all three models will be re-evaluated in a separate thread.

The model creation itself uses a thread pool to take advantage of multi-core systems and actually scales very well.
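
As a rough illustration of re-evaluating all cached situations in the background, a sketch with Python's standard thread pool might look like this. `rebuild_all` and `compile_model` are hypothetical names, and the compile step is only a stand-in for the real backend.

```python
from concurrent.futures import ThreadPoolExecutor


def rebuild_all(situations, compile_model, max_workers=4):
    """Recompile the model of every cached situation using a thread pool.

    compile_model is a placeholder for the backend call that turns a
    situation into a compiled speech model.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the builds can overlap.
        futures = {s: pool.submit(compile_model, s) for s in situations}
        # Then collect the results; this blocks until each build is done.
        return {s: f.result() for s, f in futures.items()}
```

Note that in Python, threads only run in parallel on several cores if the work releases the interpreter lock, for example by calling external tools; the sketch is about the structure, not about Python's performance.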

Still, the model creation process can take minutes if you have a lot of training data - even on a decent CPU.

"But what about entirely new situations?"

Creating and maintaining a model cache for all possible situations wouldn't be feasible, as the cache would of course grow exponentially with the number of conditions to consider (n independent on/off conditions already allow for up to 2^n situations).

To avoid having to wait for the creation of a model for the new situation, the context system was designed to create and maintain the most permissive model available as a fallback.

Let's consider an example: Suppose you have a setup with three scenarios - Firefox, Window management, Amarok - and you configure Simon to activate the Firefox and Amarok scenarios only when the respective applications are running.
The created fallback model would have all three scenarios activated.
Suppose you open and close Firefox quite frequently so those two situations are covered with an up-to-date model. You are currently in the situation that both Firefox and Amarok are closed. Again, there's a model for that. Then you open Amarok for the first time: The correct model would have a disabled Firefox scenario and an activated Amarok scenario.
As the requested model is not available, Simond will now start to compile it. In the meantime, Simond will switch to the fallback model: The one with all scenarios (Firefox, Amarok and the Window Management scenario) activated.

When picking a model to build, the fallback model is given higher priority to ensure that it's (almost) always available.
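
The fallback logic and the build priority can be sketched as follows. Again, this is only an illustration under assumptions: `ContextScheduler`, `request_build`, `build_next` and `model_for` are invented names, a situation is modelled as the set of active scenarios, and builds are triggered by hand instead of running in a background thread.

```python
import heapq

FALLBACK_PRIORITY = 0  # lower number is built first
NORMAL_PRIORITY = 1


class ContextScheduler:
    def __init__(self, all_scenarios):
        # The most permissive situation: every scenario activated.
        self.fallback = frozenset(all_scenarios)
        self.ready = {}    # situation -> compiled model
        self.queue = []    # heap of (priority, sequence number, situation)
        self.counter = 0   # tie-breaker that keeps insertion order
        self.request_build(self.fallback)

    def request_build(self, situation):
        """Queue a build unless the situation is already waiting."""
        if any(s == situation for _, _, s in self.queue):
            return
        if situation == self.fallback:
            priority = FALLBACK_PRIORITY
        else:
            priority = NORMAL_PRIORITY
        heapq.heappush(self.queue, (priority, self.counter, situation))
        self.counter += 1

    def build_next(self, compile_model):
        """Build the most urgent queued model; return its situation."""
        if not self.queue:
            return None
        _, _, situation = heapq.heappop(self.queue)
        self.ready[situation] = compile_model(situation)
        return situation

    def model_for(self, situation):
        """Return the exact model if cached, else queue it and fall back."""
        if situation in self.ready:
            return self.ready[situation]
        self.request_build(situation)
        return self.ready.get(self.fallback)
```

With the three scenarios from the example, asking for the "Amarok running, Firefox closed" situation for the first time returns the fallback model and queues the targeted one; once that has been built, the same call returns the targeted model.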

By the way: Simond sends the compiled speech model back to Simon during synchronization. This is done both to shorten the time it takes the recognition to start in a multi-server environment (think of mobile clients) and to ensure the last successfully compiled model is available in case the current input files cannot be compiled and the client connects to a "fresh" server. Of course, only the fallback (most permissive) model is synchronized to keep the network overhead low.

"But what about ambiguous commands?"

There might be setups where commands have different meanings depending on the context. For example, "Find" might issue "Ctrl+F" in LibreOffice but open Google when issued while browsing the web.

To avoid undefined behavior while the targeted model is compiling, deactivated scenarios are not only removed from the speech model on the server side; their commands are also disabled on the client side.

That means the only drawback of the more permissive model is a lower recognition rate for the time it takes Simond to create the new model - ambiguous commands will still be handled correctly.
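
A small sketch of the client-side part, using the "Find" example from above. The function `resolve` and the tuple layout of `commands` are assumptions made for illustration, not Simon's real data structures.

```python
def resolve(phrase, commands, active_scenarios):
    """Return the action for a recognized phrase, ignoring inactive scenarios.

    commands is a list of (scenario, trigger phrase, action) tuples.
    """
    for scenario, trigger, action in commands:
        if trigger == phrase and scenario in active_scenarios:
            return action
    return None  # recognized by the permissive model, but not applicable now


commands = [
    ("LibreOffice", "Find", "Ctrl+F"),
    ("Firefox", "Find", "open Google"),
]
```

Even if the permissive fallback model recognizes "Find" from both scenarios, `resolve("Find", commands, {"Firefox"})` only ever yields the browser action.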

As soon as the more targeted model is finished building, the recognition will switch automatically.

"Isn't this post getting too long?"

Yes, definitely.

So to sum up: Simon 0.4 will feature a sophisticated model caching and context adaptation mechanism.

The code involved is very young, and even though everything works fine on my machine, I of course expect there to be problems. If you are running Simon 0.3.80 or above, please report any issues you might have on the bug tracker. Thanks!

Monday, 7 May 2012
