Challenges of Speech Recognition

“True simplicity does not reveal the tremendous effort it requires.” (attributed to Somerset Maugham)

During the research phase of our Version 3 language solution, we posed what at first seemed like a straightforward challenge: we wanted learners to talk out loud more. Since 1995, our software had included speaking practice, but we found that many learners, either by choice or oversight, did not routinely use the feature. We wanted an order-of-magnitude increase in how much our learners talked. Of course, one can simply turn up the amount of speaking required in the curriculum, but as we tested a series of prototype experiences with lots of speech included, we ran into a number of challenges.

First, people remain self-conscious about speaking to computers; despite much progress in improving multi-modal interfaces and reducing speech recognition error rates, most people still do not talk to their personal computers. This is slowly changing as speech interfaces are being featured in everyday devices like cars and phones.  (Part of the reason they get mentioned in advertising is their novelty to many observers.) For most of our learners, their speaking experience with Rosetta Stone is the most they’ve ever talked to a computer in any language, much less the one they’re learning. So we needed a speech experience that was easy and consistent, that made speaking feel like clicking – a simple, intuitive response to a context presented in the curriculum.

Second, by default, learners seem to expect detailed feedback each time they are asked to talk. This is perhaps a natural consequence of the novelty of a speech interaction, but if not addressed it greatly constrains how much we can use speech in the product design. Imagine a helpful native speaker practicing a conversation with you who, after every phrase you say, points out every flaw in detail and asks you to repeat the phrase several times. While this would undoubtedly improve your pronunciation, it would quickly become so tedious that it would likely limit how much of this kind of practice you’d want to do in the future. We needed to support a variety of levels of feedback during the learning process, from lightweight, nodding encouragement to the kind of detailed analysis described above.

Finally, we are constrained by the limitations of the underlying speech recognition technology. As noted above, much progress has been made in reducing error rates over the past 50 years, yet significant challenges remain, especially in a context such as language learning. Early learners make a constantly evolving set of mistakes as they develop an ear for the language and get used to putting phonemes together in combinations they’ve never mouthed before. We needed a solution that would work for learners of any background through these stages of learning, and that would work across 25 languages.

To find solutions within these constraints, we applied a variety of techniques, from design to technology to organizational changes. We founded a speech group within Rosetta Stone Labs, focused solely on delivering world-class speech recognition for language learning. This group is now a dozen people, including five Ph.D. researchers, and growing. They give Rosetta Stone our own core, proprietary speech recognition technology, which has allowed us to build models across many languages, tuned and tested with both native and non-native speech.

We also found that by tuning the visual prominence of the speaking task and its feedback up or down, we could set expectations that let learners proceed through practice in the core curriculum without slowing down to perfect each phrase, while still providing other places where they could get rich feedback on their pronunciation.

Finally, we worked hard to keep the interface simple and consistent throughout all modes of interaction. Whether the learner is clicking the mouse, typing, selecting, or speaking, in every case the “game” is the same: completing a visual puzzle that has several missing pieces. By establishing a consistent metaphor across these modes, we were able to remove some of the oddity that learners initially felt when talking to their computers.

In external testing prior to launch, we were gratified to find that learners talked dramatically more often than in our prior courses, and more importantly, they found it completely natural to speak their new language to a computer.  We continue to search for new ways for learners to engage with speech through technology.  In future RVoice posts, members of the Rosetta Stone Labs speech team will explain how we’re continuously raising the bar to give our learners the best speech experience possible.

Greg Keim

Greg Keim is the head of our technology labs and is responsible for innovation across all products. Greg graduated from Swarthmore College in 1992 with a double major in mathematics and computer science. He joined Rosetta Stone in early 1992, spearheading technical design and development leading up to the first award-winning Rosetta Stone release. In the fall of 1994, Greg resumed his education at Duke University, where he entered a Ph.D. program in computer science and focused his studies on artificial intelligence, natural language processing and voice dialog systems. In 1999, Greg returned to Rosetta Stone, becoming the company’s chief technical officer.