A dive into the scientific, user-experience, and product-based issues around voice user interfaces.
Hey Siri! What does the fox say? Voice user interfaces (VUIs) and virtual assistants such as Siri make it possible to communicate with computers through speech recognition, with text-to-speech as the reply. Aside from spoken language understanding (SLU) products and features being relatively new to the market, the tangible obstacles that stand between consumers and widespread adoption of VUIs stem from current technical limitations coupled with use cases that are incongruent with those limitations. The technology behind voice user interfaces, typical product development processes, problems with existing VUI-based products, and possible future developments all have an impact on the adoption of VUI-based products.
The first reason voice user interfaces matter in computing is the tremendous research progress behind them. Shawn DuBravac, the chief economist at the Consumer Technology Association, says that the “improvements in natural language processing have set the stage for a revolution in how we interact with tech”. Natural language processing (NLP) is a subfield of linguistics and computer science concerned with reading, deciphering, understanding, and making sense of human languages. Since the dawn of natural language processing, error rates have dropped sharply, from 100% in 1994 to 23% in 2012 and just 6% in 2017. Just as advancements in semiconductor technology made personal computing feasible, the rapid pace of NLP research indicates that voice is likely to become a commonplace way for people to interact with their computers. In short, the optimism about NLP's future is a result of the vigorous research into it.
Moreover, VUI is critical to the future of computing because it makes computing more accessible. Jason Amunwa, a product consultant, describes voice as the “new pinnacle of intuitive interfaces that democratize the use of technology”. VUI’s faceless human-computer interaction (HCI) mimics the way humans interact with each other. In turn, that has the potential to make computing accessible to a wide array of people, for example, people who are blind. VUI is promising both because it is likely to make computing more intuitive and because it can reach groups that wouldn’t ordinarily have easy access to computers.
Another reason why VUI is crucial lies in how simple it could potentially make computing. As a result of the voice-based human-computer interaction, “VUI can be used to improve the user experience by shortening the operation process chain”. The operation process chain refers to the steps required to complete an action. For example, if a person wanted to send an email to a classmate, they would first have to find their classmate’s email address, open the email application, compose the email, and then send it. With a VUI, a single spoken command could cover all of those steps. Voice user interfaces can leverage their interface to make computing simpler by flattening the process chain.
Some Technical Stuff
Next, the technical aspects behind voice user interfaces deserve consideration. The technology behind VUI is composed of two layers of machine learning under the umbrella of Spoken Language Understanding (SLU). SLU is “usually broken down into two parts: Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU)”. The ASR uses an acoustic model, which is trained to recognize patterns and structures in the audio fed into it, to derive phonetic representations from a voice recording. Those phonetic representations are then analyzed by a language model that turns the spoken utterance into text. After that, the NLU uses the derived text to discern intents and slots. NLUs can be observed even in non-voice user interfaces such as chatbots or a Google search, e.g. “find me the closest restaurant” or “cheap flights to DC”, which results in a specialized component typically shown on top of the search results. In addition to using machine learning models, NLUs work by binding specified words or phrases to specific 'intents', which are then set up to trigger functions related to that intent. The slots, or variables, are the other parts of the user input, which are used as parameters of that function. So in the first example, 'find me' would trigger an intent, which would then call a function designed specifically for looking through, perhaps, a list of addresses (a series of predefined actions, sometimes with data as an input), and slots such as 'closest' and 'restaurant' would narrow down the addresses to consider. Once that's complete, the information is sent back to the user and displayed on their device. In short, this combination of ASR and NLU serves as the spine of voice user interfaces.
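To make the intent-and-slot idea concrete, here is a minimal rule-based sketch in Python. It assumes the ASR step has already produced text, and it skips the machine learning models entirely. The trigger phrases, slot vocabularies, place data, and function names are all hypothetical and chosen only for illustration; a real NLU would be far more flexible.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical trigger phrases bound to intent names.
INTENT_TRIGGERS = {
    "find me": "find_place",
    "cheap flights to": "search_flights",
}

# Hypothetical slot vocabularies: slot name -> words that can fill it.
SLOT_VALUES = {
    "sort_by": {"closest", "cheapest"},
    "place_type": {"restaurant", "cafe", "pharmacy"},
}

# Made-up data the "find_place" function searches through.
PLACES = [
    {"name": "Luigi's", "type": "restaurant", "distance_km": 1.2},
    {"name": "Corner Bistro", "type": "restaurant", "distance_km": 0.4},
    {"name": "Bean Cafe", "type": "cafe", "distance_km": 0.2},
]


@dataclass
class ParsedUtterance:
    intent: Optional[str]
    slots: Dict[str, str] = field(default_factory=dict)


def parse(text: str) -> ParsedUtterance:
    """Match a trigger phrase to an intent and collect known slot words."""
    normalized = text.lower().strip()
    intent = next(
        (name for phrase, name in INTENT_TRIGGERS.items() if phrase in normalized),
        None,
    )
    slots: Dict[str, str] = {}
    for word in normalized.split():
        for slot_name, values in SLOT_VALUES.items():
            if word in values:
                slots[slot_name] = word
    return ParsedUtterance(intent, slots)


def find_place(slots: Dict[str, str]) -> Optional[str]:
    """Use the slots as parameters to narrow down the list of places."""
    matches = [p for p in PLACES if p["type"] == slots.get("place_type")]
    if slots.get("sort_by") == "closest":
        matches.sort(key=lambda p: p["distance_km"])
    return matches[0]["name"] if matches else None


# Each intent triggers the function registered for it.
HANDLERS = {"find_place": find_place}


def respond(text: str) -> Optional[str]:
    parsed = parse(text)
    handler = HANDLERS.get(parsed.intent)
    return handler(parsed.slots) if handler else None


print(respond("Find me the closest restaurant"))
```

In this sketch, 'find me' selects the intent, the intent selects the function, and 'closest' and 'restaurant' become the function's parameters. An utterance with no known trigger phrase simply gets no response.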
It's important to realize that spoken language understanding still requires large amounts of computing resources. The “model achieving human parity in  is a combination of several neural networks, each containing several hundreds of millions of parameters, and large-vocabulary language models made of several millions of n-grams”. As Coucke describes, a consequence of the size of the models and the number of parameters is that it takes a significant amount of computing resources to ultimately discern intents and slots. Because of the computing resources needed, the processing is not handled by smartphones, tablets, computers, or other end-user devices. Instead, it runs on specialized computers, usually housed in data centers (also known as 'the cloud'), which the end user reaches over the internet. Until end-user device processing develops further, one downside is that VUI functionality depends on an ongoing internet connection. It could be assumed that this dependence leaves users unsure of when VUI will be available. In short, the magnitude of computing resources required is one of the limitations of voice user interfaces.
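A rough back-of-the-envelope calculation shows why models of this size are hard to fit on end-user devices. The sketch below assumes each parameter is stored as a 32-bit float and uses a hypothetical ensemble of three networks with 300 million parameters each. These counts are illustrative stand-ins for "several" networks of "several hundreds of millions" of parameters, not figures from the quoted source.

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters stored as 32-bit floats


def model_size_gb(num_parameters: int, bytes_per_parameter: int = BYTES_PER_FLOAT32) -> float:
    """Approximate memory needed just to hold the model weights, in gigabytes."""
    return num_parameters * bytes_per_parameter / 1e9


# Hypothetical ensemble: three networks of 300 million parameters each.
network_parameter_counts = [300_000_000, 300_000_000, 300_000_000]

total_gb = sum(model_size_gb(count) for count in network_parameter_counts)
print(f"Approximate weight storage: {total_gb:.1f} GB")
```

Under these assumptions the weights alone take a few gigabytes, before counting the language model or the compute needed to run the networks in real time.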
Where VUI Falls Short
Equally important are the aspects of voice user interfaces that prevent them from reaching widespread use. One commenter in a Hacker News thread warns that “Voice as a user interface doesn't scale. If you have a lot of people in a small space (an office, a coffeeshop, a metro train, etc) then voice just won't work”. If a user is in a crowded space, it's difficult for the natural language processing system to extract usable language data from a voice recording, which limits VUI's functionality to quiet areas. Consequently, while it's possible for a large number of people in a dense area to use the touch screens of their phones, it's unlikely that the same crowd could interact with their devices through VUI. In short, one limitation of voice user interfaces is that they can't readily be used in compact, noisy areas.
In addition, there are psychological reasons why voice user interfaces are difficult to use. Daniel Westlund, a startup mentor based in Berlin, describes how “the timeframe within which the interaction takes place, and the required cognitive load, is compressed. During the interaction, attention is high as the user can not control the speed of information flow”. Consequently, cognitive load is also high, which can lead to a poor experience and increased user errors. While voice user interfaces allow multiple steps to be completed at once, the user has to think of what to say and then say it in a way they believe the computer will understand well enough to complete the intended action. That heavy cognitive load and the significant attention required make completing tasks with VUI difficult and ultimately unappealing. As Westlund's quote suggests, VUIs may be more painless to use if products are built around use cases that require a smaller cognitive load.
One hindrance is whether an operating system's APIs are public or private. Application programming interfaces (APIs) allow different pieces of software to communicate with each other. Private APIs are accessible only to a specific set of applications, whereas public ones are open to third-party developers. Depending on which layer the VUI is implemented in (operating system, application, or page) and which data and APIs it is allowed to access, a VUI may not be able to compress an entire process chain. However, even within a narrow scope, such as being implemented only in a single application as opposed to an entire operating system, it's highly achievable for a VUI to turn what would have been at least two steps into one voice command.
The next issue that plagues voice user interfaces involves the discoverability of features. Ewa Luger and Abigail Sellen, researchers at Microsoft, describe that "half of users explicitly stated that they did not know what their CA could do. This resulted in them either feeling overwhelmed by the unknown potential, or led them to assume that the tasks they could accomplished were highly limited." (CA here stands for conversational agent.) Graphical user interfaces lay out all the possible features and the paths to more features. Voice user interfaces, by contrast, take one intent as an input and return one result as an output. Because a VUI never presents its features the way a graphical interface would, it's difficult for the user to know what a voice-based application can do.
On the other hand, there are examples of VUI success. One example is how "Amazon went from a handful of skills to hundreds and thousands and tens of thousands. Apple hasn’t really built a developer ecosystem". By nurturing its third-party developer ecosystem, Amazon gained a robust developer community that allowed for more creative applications and contributed to the success of its Alexa platform. Indeed, some experts have concluded that Amazon's effort to build a developer ecosystem may be the reason behind the Alexa devices' high sales numbers.
Another key aspect of how VUIs could become widely used is the product development process. One such process is the product-driven process, which "may be used for a market that is well known, where the customer needs may be predictable, and where the competition is understood". In this approach to developing software, a product group (designers, engineers, managers) starts with an initial idea, builds a robust product around it, and then finds a market for that product. The problem with the product-driven approach is that user feedback isn't taken into account until after the product is released, so useless features may be developed, or key problems may go unsolved. On the other hand, there's the agile process, which uses "an incremental process that is used to facilitate an iterative interactive learning approach between an organization and its customers to gather insights in order to develop a successful product". The agile, or customer-driven, process allows customers’ needs to be taken into account before development begins, which helps eliminate useless features and steers the product team toward developing a product that solves a user's problem. In terms of voice user interfaces, because spoken language understanding is only now becoming stable, it's hard to imagine the technology is mature enough to yield solutions to every user problem. While it's difficult to determine which development process existing VUI-based products and features have used, product teams that adopt the agile process will be better placed to find a good application for VUI, as the newer, iterative agile process typically results in higher product success rates than product-driven, waterfall-style development.
Next, in terms of possible improvements to VUI, it's important to consider future research developments. Jinyu Li, an IEEE member and scientist at Microsoft, believes that "the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied to advance the field further". As touched on earlier, one problem with VUIs is that their functionality depends on how quiet or noisy an area is. With ASR just emerging from its infancy, as Li describes, there hasn't been much research into even the advantages and disadvantages of specific methods for making ASR work in noisy areas. Ultimately, beyond the possibility of VUI no longer being restricted to quiet areas, Li’s quote indicates that since VUI is at its dawn, more research could bring solutions to the various technical problems impeding the adoption of VUIs.
Additionally, for VUIs to reach widespread adoption, their APIs must be opened up. With “SiriKit for audio apps, Spotify can now behave like Apple Music when it comes to Siri commands on iOS”. Before SiriKit for audio apps was released, the only music app Siri could interact with was Apple Music. While SiriKit for audio apps is one example, making other facets of the Siri API public would allow Siri to become a platform on which third-party developers could build their own VUI applications. A parallel could be drawn between the current state of VUI services like Siri and that of mobile phones before the widespread use of app stores. Apple's very own App Store is an example of a product becoming a platform: it resulted in the 'app boom' over a decade ago and differentiated the iPhone from its competitors. In short, opening the APIs of virtual assistants such as Siri or Amazon Alexa could be one way to usher in the widespread adoption of VUIs.
In conclusion, aside from SLU products and features being relatively new to the market, the tangible obstacles that stand between us and widespread adoption of VUIs stem from current technical limitations coupled with use cases that are incongruent with the current technical capabilities of VUI. However, the adoption of VUIs could be improved by applying ongoing VUI research, opening up private APIs, and using the agile development process.
Possible Areas to Explore
Another reason why VUI is so promising is its potential to unlock the full power of ML and big data. My assumption is that graphical user interfaces aren't able to instantly and dynamically generate interfaces to functions in a way that's familiar to the user, while an audio-based approach possibly can.