Simon: Open-Source Speech Recognition: September 2012

As some of you might already know, I spent last week at KDE's annual Randa Meetings in the beautiful city of Randa, Switzerland.
It was my first sprint and I was genuinely surprised at how unbelievably productive it really was. It's amazing what a couple of committed developers can achieve in just a couple of days.
The awesome food and plentiful supply of Swiss chocolate don't hurt, of course.

This sprint was the first time I got to meet José Millán Soto and Amandeep Singh, who are working on AT-SPI with the help of Frederik Gladhorn, who was also there, and Sebastian Sauer, who sadly could not make it. Alejandro Piñeiro from the GNOME Accessibility team also joined the hackfest to provide valuable insights into the GNOME a11y stack, and Yash Shah, one of Simon's GSoC students this year, flew in once more to work on the computer vision aspects of Simon's context layer.

Watch the Planet(s) for updates about all the great work those guys have been doing.

DBus Context Conditions

To warm up, on the first day I tackled a couple of items that had been on my to-do list for some time. One of these tasks was to finish the implementation of the DBus context condition plugin.

Through the DBus condition, the Simon context layer can more accurately reflect the current state of other applications.

The main benefit of this feature is that developers who write software specifically meant to be voice controlled can dynamically configure Simon to their software's needs. When they expose the state of their system over DBus, Simon can react by activating and deactivating commands, vocabulary, grammar, microphones or sample groups as needed.

However, custom-written solutions are not the only ones to benefit from DBus conditions: The screenshot above, for example, configures Simon to deactivate itself while VLC is playing something. Ever wanted to disable Simon automatically while listening to music? Now you can.
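To make the idea more concrete, here is a minimal Python sketch of how such a condition could be evaluated. It is not Simon's actual plugin code: the `DBusCondition` class and the `make_fake_bus` helper are hypothetical, and the bus is replaced by a plain function so the example runs without a DBus session. The service, path, interface and property names follow the MPRIS2 specification that VLC implements; whether the configuration in the screenshot uses exactly these values is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for a DBus property read:
# (service, path, interface, property) -> current value.
PropertyReader = Callable[[str, str, str, str], str]


@dataclass(frozen=True)
class DBusCondition:
    """Illustrative model of a context condition backed by a DBus property."""
    service: str
    path: str
    interface: str
    prop: str
    expected: str
    inverted: bool = False  # True: satisfied while the value does NOT match

    def is_satisfied(self, read_property: PropertyReader) -> bool:
        try:
            value = read_property(self.service, self.path, self.interface, self.prop)
        except LookupError:
            # Service not on the bus: the value cannot match.
            matches = False
        else:
            matches = value == self.expected
        return matches != self.inverted


def make_fake_bus(state: Dict[Tuple[str, str], str]) -> PropertyReader:
    """Builds a fake property reader from a {(service, property): value} dict."""
    def read(service: str, path: str, interface: str, prop: str) -> str:
        return state[(service, prop)]  # KeyError is a LookupError
    return read


# "Keep Simon active unless VLC is playing" (MPRIS2 names).
vlc_not_playing = DBusCondition(
    service="org.mpris.MediaPlayer2.vlc",
    path="/org/mpris/MediaPlayer2",
    interface="org.mpris.MediaPlayer2.Player",
    prop="PlaybackStatus",
    expected="Playing",
    inverted=True,
)
```

Using an inverted condition means that a VLC instance that is paused, stopped or not running at all leaves Simon active, which matches the behaviour described above.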

AT-SPI

The big topic of the week was of course AT-SPI: Through AT-SPI, assistive technologies like screen readers can "see" running applications, follow the focus and react to changes. Traditionally, KDE 4 provided no real support for this and was therefore largely inaccessible to the large group of users who rely on such technology.

In recent years, however, there has been a lot of work to complete the AT-SPI support in Qt and KDE, and thanks to the relentless work of people like Frederik Gladhorn, the situation is already much improved. Screen readers are starting to work with KDE software to some extent, and overall the AT-SPI framework (qt-atspi) is becoming more complete and stable every day.

In Simon, AT-SPI can be used to automatically parse and control applications without any prior configuration by the end-user. A prototype was already implemented last summer.

While writing this plugin, I used the AT-SPI bus directly and noticed significant differences between Qt and GTK in the way they represented widgets in AT-SPI. The plugin therefore needed a lot of code just to maintain the internal view of the focused application - a problem that is shared with other a11y clients as well.
With the introduction of QAccessibilityClient, a new client library to aid developers of AT-SPI clients (assistive software), this simply didn't make sense anymore and a rewrite was in order.

Actions

Besides exposing information about widgets, AT-SPI also provides a way to interact with them through Actions. In Simon these will be associated with saying the name (e.g. text of a button) of the widget in question.

Because Simon's AT-SPI plugin is the first real beneficiary of this technology, many popular widgets don't yet expose proper actions - but Amandeep and José are fixing those problems left and right.

Selecting one of two available actions for an activated tab

At the Randa sprint, we also had a very productive meeting to discuss broader issues like the handling of default actions and custom actions at a toolkit level.

Performance

The AT-SPI plugin parses the currently focused window, builds vocabulary and grammar and then triggers the synchronization to build a new, active model to reflect the changes. The problem is that in practice this might happen every other second, because the user opens context menus or dialogs, button texts change, etc.

This poses major performance problems - especially because users don't want to wait a couple of seconds after opening a popup menu to say the next command.

While many comparatively simple performance improvements over the old prototype were implemented - like moving the AT-SPI watcher to a separate thread - some changes are not limited to the AT-SPI plugin but also improve Simon's performance in general.

Simond Synchronization 2.0

Simon communicates with the Simond server over a custom TCP protocol. The speech model components are synchronized over the network. As soon as the input data changes, a new model is generated (or loaded from the cache).

This synchronization protocol was originally developed for Simon 0.2 and introduced a significant bottleneck: Each data element (individual scenarios, training data, language model files, etc.) would be synchronized separately. This involved the server querying the client for the modification date and then either requesting the client's version, sending its own version to the client or moving on to the next component if they were already up to date. This took at least one full round trip per component - even if all components were already up to date.

To make the synchronization more efficient, a new synchronization protocol was defined: The client now announces the modification date of all components in its initial synchronization offering. The server then builds the synchronization strategy based on that information and, because all requests are now essentially stateless, requests or sends the components that need updating on either side asynchronously.
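The server-side planning step can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual Simond protocol: the `build_sync_plan` function, the component names and the use of integer timestamps are hypothetical, and "the newer copy wins" is an assumption about how modification dates are compared.

```python
from typing import Dict, List, NamedTuple


class SyncPlan(NamedTuple):
    request_from_client: List[str]  # client copy is newer
    send_to_client: List[str]       # server copy is newer


def build_sync_plan(client_offer: Dict[str, int],
                    server_state: Dict[str, int]) -> SyncPlan:
    """Derives the whole plan from one offer of {component: modification time}.

    A component missing on one side is treated as older than any existing copy.
    """
    request: List[str] = []
    send: List[str] = []
    for name in sorted(set(client_offer) | set(server_state)):
        client_ts = client_offer.get(name, -1)
        server_ts = server_state.get(name, -1)
        if client_ts > server_ts:
            request.append(name)
        elif server_ts > client_ts:
            send.append(name)
        # equal timestamps: already up to date, nothing to transfer
    return SyncPlan(request, send)


# Hypothetical component names and timestamps.
offer = {"scenario:a": 10, "training": 5, "languagemodel": 7}
server = {"scenario:a": 10, "training": 8, "languagemodel": 3, "scenario:b": 1}
plan = build_sync_plan(offer, server)
```

Because the plan depends only on the initial offer, components that are already up to date (`scenario:a` here) cost no additional round trips, and the remaining transfers do not have to wait for each other.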

Caching³

Simond already had a powerful model cache, but it only kept previously built models around as long as they could be the result of a specific (context) situation. The AT-SPI plugin instead modifies the scenario itself.

Simon has no way of knowing if the same set of components will ever be shown again, but it is in general a safe assumption (e.g. showing a menu and closing it again returns to the same state as before opening the menu). To address this, the scenario cache was modified to also keep "abandoned" (unreachable) models available for some time. Right now, the 15 most recently abandoned models are kept in this cache.
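A bounded "most recently abandoned" cache of this kind could look roughly like the following Python sketch. The class and method names are hypothetical and Simond's real cache is certainly more involved; only the limit of 15 entries is taken from the description above.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class AbandonedModelCache:
    """Keeps the N most recently abandoned models, dropping the oldest first."""

    def __init__(self, capacity: int = 15) -> None:
        self._capacity = capacity
        self._models: "OrderedDict[Hashable, Any]" = OrderedDict()

    def abandon(self, fingerprint: Hashable, model: Any) -> None:
        """Stores a model that is no longer reachable from any context state."""
        self._models.pop(fingerprint, None)  # re-abandoning refreshes its age
        self._models[fingerprint] = model
        while len(self._models) > self._capacity:
            self._models.popitem(last=False)  # evict the oldest entry

    def revive(self, fingerprint: Hashable) -> Optional[Any]:
        """Returns and removes a cached model if the same input shows up again."""
        return self._models.pop(fingerprint, None)

    def __len__(self) -> int:
        return len(self._models)


# Small capacity so the eviction is easy to see.
cache = AbandonedModelCache(capacity=2)
cache.abandon("main-window", "model-a")
cache.abandon("file-menu", "model-b")
cache.abandon("edit-menu", "model-c")   # evicts "main-window"
```

Here the fingerprint stands for whatever identifies a set of input data (for example a hash over the scenario contents); closing a menu and returning to a previous state then becomes a cache lookup instead of a model rebuild.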

Error handling

Automatically building the active vocabulary depending on visible controls posed another problem: Even the English Voxforge model, one of the most complete open source speech models, sadly does not cover all triphones that crop up when used with such a diverse and dynamic dictionary.

Missing triphone: Before

So more often than not, users would be presented with the dreaded Julius error that is probably familiar to most people who have tried to build a scenario for an existing base model.

The only proper fix for this issue is of course to improve the Voxforge base model to cover all available triphones.

Until this is possible, though, we now work around this issue more gracefully by analyzing the used base model in the adaption layer. Uncovered triphones are then automatically blacklisted and offending words removed from the active vocabulary. That way Simon can still be activated in such situations and only the blocked word(s) cannot be recognized.

Missing triphone: Now

To make sure this is transparent to the users, the blacklisted triphones are relayed to the client, which will mark such blocked words with a red background in the vocabulary view.
This replaces the previous simple mechanism that marked all words red that had fewer than two training samples - something that became obsolete with the introduction of base models.
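The blacklisting step can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the code in Simon: the function names and the example data are hypothetical, triphones are written in HTK-style `left-center+right` notation, and only word-internal triphones are considered, whereas triphones that span word boundaries are ignored here.

```python
from typing import Dict, List, Set, Tuple


def word_triphones(phonemes: List[str]) -> Set[str]:
    """Word-internal triphones of a transcription, e.g. k ae t -> k+ae, k-ae+t, ae-t."""
    result: Set[str] = set()
    for i, center in enumerate(phonemes):
        name = center
        if i > 0:
            name = f"{phonemes[i - 1]}-{name}"
        if i < len(phonemes) - 1:
            name = f"{name}+{phonemes[i + 1]}"
        result.add(name)
    return result


def filter_vocabulary(
    vocabulary: Dict[str, List[str]],
    covered: Set[str],
) -> Tuple[Dict[str, List[str]], Dict[str, Set[str]]]:
    """Splits the vocabulary into usable words and blocked words.

    Returns (active, blocked); blocked maps each removed word to the
    triphones that the base model does not cover.
    """
    active: Dict[str, List[str]] = {}
    blocked: Dict[str, Set[str]] = {}
    for word, phonemes in vocabulary.items():
        missing = word_triphones(phonemes) - covered
        if missing:
            blocked[word] = missing
        else:
            active[word] = phonemes
    return active, blocked


# Hypothetical vocabulary and base model coverage.
vocabulary = {"cat": ["k", "ae", "t"], "zip": ["z", "ih", "p"]}
covered = {"k+ae", "k-ae+t", "ae-t", "z+ih", "ih-p"}
active, blocked = filter_vocabulary(vocabulary, covered)
```

In this example "zip" is blocked because `z-ih+p` is not covered, while "cat" stays in the active vocabulary. The `blocked` mapping is the kind of information a client would need in order to highlight the affected words.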

Conclusion

As I mentioned before - and the avid reader might also have guessed by the length of this post - the Randa sprint was indeed very productive.

I want to thank Mario Fux et al. for organizing this fantastic event and all the sponsors who help make it happen. You guys rock!

Sunday, 30 September 2012

Simon at Randa 2012