Voice interfaces have been around for quite a while, but in the last few years they have become widely used in consumer products including mobile devices and home assistants.
For anyone who remembers the first generations of this technology, it is truly stunning to see the quality of the speech recognition, and how well the systems perform even in noisy conditions with multiple speakers present.
Still, there is one problem that remains troubling: these systems are vulnerable to spoofing and counterfeiting of the master’s voice. Even a simple attack, such as playing back a recording, might fool the system. For that matter, it isn’t even clear what sort of authentication should be used.
Huan Feng and colleagues at the University of Michigan describe a method for Continuous Authentication for Voice Assistants.
They point out that “voice as an input mechanism is inherently insecure as it is prone to replay, sensitive to noise, and easy to impersonate” (p. 1). Some systems use characteristics of speech as a form of biometric authentication (i.e., the system learns to identify individuals by their voice), but this is subject to replay, and attacks can even be hidden in noise in ways that a human cannot easily detect.
The authentication problem has several aspects. The system needs to know not only the purported identity of the speaker, but also whether the speaker is actually present and commanding, and whether the detected message is what the speaker actually said. Furthermore, it is important to authenticate all of the speech, not just an initial connection.
(If you think about it, these challenges stem from the very advantages that make voice commanding attractive. The system is hands-free (and everything-else-free), the interaction is fluid and natural, without an obvious “log in” or “log out”, and the messages resemble natural language, without metadata or “packets” that might carry authentication information.)
Feng’s wolverines prototyped a system that uses a wearable accelerometer to sense the movement of the speaker, and to continuously match that motion against the voice signal received. This approach is a sort of two-factor authentication, and it also assures that the signal is authentically from a specific speaker.
One of the tricky parts is the matching algorithm, mapping the movements of the speaker to the sound picked up by the remote microphone. Their paper explains their methods and results.
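To make the idea concrete, here is a minimal sketch of one way such a matching check might work: compare the short-time energy envelopes of the two signals and accept only if they correlate strongly. This is my own illustrative simplification, not the paper's actual algorithm (which is considerably more sophisticated); the function names, frame length, and threshold are all assumptions.

```python
import numpy as np

def envelope(signal, frame_len):
    """Short-time energy envelope: mean squared amplitude per frame."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1)

def signals_match(accel, audio, frame_len=100, threshold=0.7):
    """Decide whether the accelerometer and microphone signals were
    plausibly produced by the same speech, by correlating their
    energy envelopes over time (hypothetical threshold)."""
    a = envelope(accel, frame_len)
    b = envelope(audio, frame_len)
    n = min(len(a), len(b))
    # Pearson correlation of the two envelopes
    r = np.corrcoef(a[:n], b[:n])[0, 1]
    return r >= threshold
```

A replayed recording would produce a strong audio envelope with no corresponding motion at the wearable, so the correlation stays low and the command is rejected.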
This approach has a number of advantages. Obviously, a wearable sensor is a simple, cheap device that is closely associated with a specific person, and the motion will not be easy to fake. The method works with any language without specific adaptation, and it automatically adjusts to any changes in the user’s behavior, such as fatigue or illness that might distort their voice. They also point out that the system detects when the user is not speaking, which should lock out any commands.
This is pretty cool!
Of course, one could question the advantage of the voice interface, if one has to wear a device in order to safely use it. Why not just put the microphone in the wearable itself? In fact, I can see that you might want to put a version of this authentication into any wearable microphone, including phone headsets.
This would have the additional benefit of eliminating the creepy “always listening” behavior of assistants. If they only listen to someone who is wearing the right sensors and is properly authenticated, then they can eliminate the need for continuous listening.
- Huan Feng, Kassem Fawaz, and Kang G. Shin, Continuous Authentication for Voice Assistants. CoRR, abs/1701.04507, 2017. http://arxiv.org/abs/1701.04507