I’m not a huge fan of speech interfaces, or of Internet-connected home assistants à la Alexa in general.
I have already complained that these devices are potentially nasty invaders of privacy, and likely to have complicated failure modes, not least due to a plethora of security issues. (Possibly a plethora of plethoras.) I’ve also complained about the psychology of surveillance and instant gratification inherent in the design of these things, especially for children.
Pretty much exactly what you don’t want in your home.
This fall a group at Zhejiang University reported on yet another potential issue: inaudible voice commands.
Contemporary mobile devices have pretty good speakers and microphones, good enough to be dangerous. Advertising agencies and other attackers have begun using inaudible sound beacons to detect the location of otherwise cloaked devices. It is also possible to monkey with the motion sensors on a mobile device, via inaudible sounds.
Basically, these devices are sensitive to sound frequencies that the human user can’t hear, which can be used to secretly communicate with and subvert the device.
Guoming Zhang and colleagues turn this idea on speech-activated assistants such as Alexa and Siri. They describe a method for encoding voice commands into innocent-seeming, inaudible sounds. The human can’t hear the words, but the computer decodes them and takes them as a voice command.
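For the curious, the trick is basically classic amplitude modulation: the voice command rides as the envelope of an ultrasonic carrier, and the nonlinearity of an ordinary microphone front end demodulates it back into the audible band. Here’s a minimal Python sketch of the modulation step (the function name and parameters are mine, not the paper’s):

```python
import numpy as np

def am_on_ultrasound(voice, sample_rate, carrier_hz=25_000, depth=0.8):
    """Amplitude-modulate a baseband voice command onto an ultrasonic
    carrier. The carrier sits above human hearing (~20 kHz), but a
    nonlinear microphone front end recovers the envelope, i.e. the
    hidden command. Assumes the signal is sampled fast enough to
    represent the carrier (e.g. 96 kHz, well above 2 x carrier_hz)."""
    t = np.arange(len(voice)) / sample_rate
    carrier = np.cos(2 * np.pi * carrier_hz * t)
    # Classic AM: keep the envelope positive, then ride the carrier.
    envelope = 1.0 + depth * (voice / np.max(np.abs(voice)))
    return envelope * carrier
```

Playing this back also takes an ultrasonic-capable speaker, which is part of why commodity hardware being “good enough to be dangerous” matters.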
These devices are capable of almost any operation on the Internet: sending messages, transferring money, downloading software. The works.
In other words, if this attack succeeds, the hacker can load malware or steal information, unbeknownst to the user.
Combine this with ultrasound beacons, and the world becomes a dangerous place for speech commanded devices.
The researchers argue that
“The root cause of inaudible voice commands is that microphones can sense acoustic sounds with a frequency higher than 20 kHz while an ideal microphone should not.”
This could be dealt with by deploying better microphones, or by software that filters out ultrasound or detects the difference between genuine voice commands and injected ones.
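The filtering option is easy to picture. Here’s a minimal sketch, assuming the capture chain samples fast enough (above 40 kHz) to see ultrasonic content at all; note that if the microphone hardware has already demodulated the attack into the audible band, a filter like this won’t save you, which is why the detection approach matters too:

```python
from scipy.signal import butter, sosfilt

def suppress_ultrasound(samples, sample_rate, cutoff_hz=20_000):
    """Low-pass filter that discards everything above the audible band
    before the samples reach the speech recognizer. Only valid when
    sample_rate > 2 * cutoff_hz."""
    sos = butter(8, cutoff_hz, btype='low', fs=sample_rate, output='sos')
    return sosfilt(sos, samples)
```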
I would add that a second root cause is the sheer number of functions these devices have, and the essentially monolithic design of the system. Recent voice-activated assistants are installed as programs on general-purpose computers with a complete operating system and multiple input and output channels, including connections to the Internet. In general, any program may access any channel and perform any computation.
This is a cheap and convenient architecture, but it is arguably overpowered for most individual applications. The general-purpose monolithic device requires the software to implement complicated security checks, in an attempt to limit privileges. Worse, it requires ordinary users to manage complex configurations, usually without adequate understanding or even awareness.
One approach would be to create smaller, specialized hardware modules, and to require explicit communication between modules. I’m thinking of little hardware modules, essentially one-to-one with apps. Shared resources such as I/O channels would have monitors to mediate access. (This is kind of “the IoT inside every device”.)
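To make the monitor idea concrete, here’s a toy sketch in Python (all the names are hypothetical, and a real version would live in separate hardware, not in the same address space as the apps it polices):

```python
class ChannelMonitor:
    """Mediates all access to one shared I/O channel. App modules hold
    no direct handle to the hardware; every read goes through the
    monitor, which checks an explicit grant table."""

    def __init__(self, channel_name):
        self.channel_name = channel_name
        self._grants = set()  # module ids explicitly allowed in

    def grant(self, module_id):
        self._grants.add(module_id)

    def read(self, module_id):
        if module_id not in self._grants:
            raise PermissionError(
                f"{module_id} has no grant for {self.channel_name}")
        return self._read_hardware()

    def _read_hardware(self):
        return b""  # stand-in for the actual device driver call


# The microphone monitor grants access only to the speech recognizer;
# any other module's read() fails loudly instead of silently succeeding.
mic = ChannelMonitor("microphone")
mic.grant("speech_recognizer")
mic.read("speech_recognizer")   # allowed
# mic.read("weather_widget")    # raises PermissionError
```

The point is that the default is deny: a module that was never granted the microphone simply cannot read it, no matter what the app-level software does.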
This kind of architecture is difficult to build, and painful in the extreme to program. (I’m describing what is generally called a secure operating system.) It might well reduce the number of apps in the world (which is probably a good thing) and increase the cost of devices and apps (which isn’t so good). But it would make personal devices much harder to hack, and a whole lot easier to trust.
- Guoming Zhang, Chen Yan, Xiaoyu Ji, Taimin Zhang, Tianchen Zhang, and Wenyuan Xu, DolphinAttack: Inaudible Voice Commands. arXiv:1708.09537, 2017. https://arxiv.org/abs/1708.09537