Smart Spies: How Alexa and Google Home expose users to vishing and eavesdropping

July 13, 2022

The increasing functionality of smart speakers also leads to a growing attack surface for hackers. In 2019, research by SRLabs unveiled two scenarios in which hackers might abuse both Alexa and Google Home to spy on users. The vulnerabilities allow third parties to create voice applications that use vishing (voice phishing) methods or eavesdrop on the user. The researchers demonstrated the hacks by creating voice applications for both device platforms, turning the smart assistants into ‘Smart Spies’.

Third-party developers get access to user inputs

Both Skills (Alexa) and Actions (Google Home) can be activated by calling out the invocation name chosen by the developer: “Alexa, open Netflix”. Users can then call functions (Intents) within the application by speaking specific phrases: “Play Star Trek: The Next Generation”. These phrases can also include slot values (variable user input that is forwarded to the application). The slot values are converted to text and sent to the application’s backend, which is often operated outside the control of Amazon or Google.
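
To make this data flow concrete, here is a minimal Python sketch of what a third-party backend might do with such a request. The payload loosely follows the shape of an Alexa intent request (an intent with a name and a map of slots), but it is heavily simplified. The intent name PlayShowIntent, the slot name title, and the helper extract_slots are hypothetical and chosen for illustration only.

```python
import json

# Hypothetical, simplified request body. Real platform requests carry
# many more fields (session, context, timestamps, ...).
SAMPLE_REQUEST = json.dumps({
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "PlayShowIntent",
            "slots": {
                "title": {
                    "name": "title",
                    "value": "Star Trek: The Next Generation",
                }
            },
        },
    }
})


def extract_slots(raw_body: str) -> tuple[str, dict[str, str]]:
    """Return the intent name and a {slot name: transcribed text} mapping."""
    request = json.loads(raw_body)["request"]
    intent = request.get("intent", {})
    slots = {
        name: slot["value"]
        for name, slot in intent.get("slots", {}).items()
        if slot.get("value") is not None
    }
    return intent.get("name", ""), slots


intent_name, slots = extract_slots(SAMPLE_REQUEST)
print(intent_name, slots)
```

The point of the sketch is that the backend receives the user's words as plain text, and whatever it does with them happens on a server the platform does not control.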

How hackers may abuse smart speaker development functionality

Through the standard development interfaces, the SRLabs researchers were able to compromise the data privacy of users in two ways:

  1. Request and collect personal data including user passwords
  2. Eavesdrop on users after they believe the smart speaker has stopped listening

The ‘Smart Spies’ hacks combine three building blocks:

  • The “fallback intent”, a voice app’s default response when it cannot assign the user’s last command to an intent and should offer help: “I’m sorry, I did not understand that. Can you please repeat it?”
  • The built-in 'stop intent', which reacts when users say “stop”. To eavesdrop on Alexa users, the team exploited it together with the fact that developers are allowed to change an intent’s functionality after the application has already passed the platform’s review process.
  • A quirk in the Text-to-Speech engines that allows developers to add long pauses to the speech output.

Hack 1: Requesting a user’s password

It is possible to ask for sensitive data such as a password from any voice app. To create a password phishing Skill/Action, a hacker would follow these steps:

  1. Create a seemingly innocent application that includes an intent triggered by “start” and takes the next words as slot values. This intent behaves like the fallback intent.
  2. Change the functionality after the app has been reviewed, as this does not prompt a second review. (Like most app marketplaces, Amazon and Google review a voice app before it is published.) In their test, the researchers changed the welcome message to an error message, making the user believe that the application has not started and is no longer listening: “This skill is currently not available in your country.”
  3. After the error message, add a long audio pause by making the voice app “say” the character sequence “�. ” (U+D801, dot, space). Because this sequence cannot be pronounced, the speaker remains silent but is still active. Repeating the sequence increases the length of this silence.
  4. After a while, end the silence by playing a phishing message: “An important security update is available for your device. Please say start update followed by your password.”

Now anything the user says after “start” is sent to the hacker’s backend. That’s because the intent, which previously acted like the fallback intent, saves the user input after “start” as a slot value.
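
The capture behaviour can be modelled in a few lines. The following is a toy sketch, not platform code: real intent matching is done by the platform's speech and language models, and capture_after_trigger is a hypothetical helper that only mimics the effect of an intent defined as a trigger word followed by a free-form slot.

```python
import re
from typing import Optional


def capture_after_trigger(transcript: str, trigger: str = "start") -> Optional[str]:
    """Return the words following `trigger` if the transcript begins with it.

    Models an intent whose sample utterance is "<trigger> {free text}":
    everything after the trigger word lands in the slot.
    """
    pattern = re.compile(rf"^\s*{re.escape(trigger)}\b\s*(.+)$", re.IGNORECASE)
    match = pattern.match(transcript)
    return match.group(1).strip() if match else None


print(capture_after_trigger("Start update correct horse battery staple"))
print(capture_after_trigger("What time is it?"))
```

In this model, the first call returns everything spoken after the trigger word, while the second returns None because the sentence does not begin with it.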

Hack 2: Eavesdropping on users

In a second experiment, the researchers at SRLabs were also able to listen to conversations after users believed they had stopped the voice app. To accomplish this, the team had to adapt their strategy for each of the smart speaker platforms.

Eavesdropping with Amazon Alexa

For Alexa devices, the voice recording is started when the user says certain trigger words that are defined by the developer of the skill. These can include common words such as “I” or words indicating that some personal information will follow, like “email”, “password” or “address”.

The researchers created an eavesdropping skill following these steps:

  1. Create a seemingly innocent skill that contains two intents:
    • an intent that is started by “stop” and copies the initial stop intent
    • another intent that is started by a certain, commonly used word and saves the following words as slot values. This intent behaves like the fallback intent
  2. Once Amazon has reviewed the skill, change the first intent to say goodbye, but keep the session open and extend the eavesdropping time by adding the character sequence “�. ” (U+D801, dot, space) multiple times
  3. Change the second intent to not react at all

If the user tries to end the malicious skill, they hear a goodbye message, but the skill continues to run for a few seconds. If the user starts a sentence beginning with the word selected in step 1 within this time, the intent will save the sentence as slot values and send them to the attacker.

Eavesdropping with Google Home

For Google Home devices, the hack is more powerful, because there is no need to specify certain trigger words and the hacker can monitor the user’s conversations indefinitely.

This is possible because Google allows putting the user in a loop where the device is constantly sending recognized speech to the hacker’s server while only outputting short silences in between.

To create such an eavesdropping Action, a hacker follows these steps:

  1. Create an Action and submit it for review
  2. After review, change the main intent to end by using the Speech Synthesis Markup Language (SSML) to play a recording of the Bye earcon sound (this sound usually indicates that a voice app has finished) and setting expectUserResponse to "true". After that, add several noInputPrompts consisting only of a short silence, using the SSML <break> element or the unpronounceable Unicode character sequence “�. ”.
  3. Create a second intent that is started whenever an actions.intent.TEXT request is received. This intent returns a short silence and defines several silent noInputPrompts.

After outputting the requested information and playing the Bye earcon, the Google Home device waits for approximately 9 seconds. If no speech input is detected, the device "plays" a short silence and waits for user input again. The Action actually stops if no speech input is detected within 3 iterations.

If speech input is detected, a second intent is started. This intent only consists of one silent output, again with multiple silent reprompt texts. Every time speech is detected, this Intent is called and the reprompt count is reset.

The hacker receives a full transcript of the user’s subsequent conversations, until there is a break of at least 30 seconds with no speech input. (Hackers could extend this break by extending the silence duration, during which the eavesdropping is paused.)
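
The timing described above can be checked with a small simulation. This is a rough model under stated assumptions, not a reproduction of the device's behaviour: the 9-second wait and the 3 iterations come from the text, while the 1-second length of each silent prompt (SILENCE_S) is an assumption picked so that three iterations add up to roughly the 30 seconds mentioned. The function simulate_session is hypothetical.

```python
from typing import List, Tuple

WAIT_S = 9.0        # approximate listening window per iteration (from the text)
MAX_ITERATIONS = 3  # iterations without speech before the Action stops (from the text)
SILENCE_S = 1.0     # ASSUMED length of each silent prompt; not from the text


def simulate_session(speech_times: List[float]) -> Tuple[List[float], float]:
    """Return (times of captured utterances, time at which the session ends).

    Times are seconds since the Bye earcon. Simplifications: utterances are
    instantaneous, and speech that falls into a silent prompt is not modelled.
    """
    timeout = MAX_ITERATIONS * (WAIT_S + SILENCE_S)
    last_activity = 0.0
    captured = []
    for t in sorted(speech_times):
        if t - last_activity >= timeout:
            break  # the Action already stopped before this utterance
        captured.append(t)
        last_activity = t  # detected speech resets the reprompt count
    return captured, last_activity + timeout


captured, ended_at = simulate_session([5.0, 20.0, 45.0, 80.0])
print(captured, ended_at)
```

With these inputs, the utterances at 5, 20 and 45 seconds each reset the counter, so the session only ends 30 seconds after the last of them. The utterance at 80 seconds falls outside the window and is not captured.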

In this state, the Google Home device will also forward all commands prefixed by “OK Google” (except “stop”) to the hacker. Therefore, the hacker could also use this hack to imitate other applications, act as a man in the middle of the user’s interaction with the spoofed Actions and start believable phishing attacks.

The researchers would also have been able to request the corresponding email address and try to gain access to the user’s Amazon or Google account.

Conclusion

Alexa and Google Home are powerful, and the smart devices can be very useful, especially in private environments. However, their implications for privacy reach further than many users might know. Users need to be aware that hackers can use malicious voice apps to abuse their smart speakers. Using a new voice app should be approached with a similar level of caution as installing a new app on your smartphone.

Amazon and Google need to implement better protection, starting with a more thorough review process of third-party applications made available in their voice app stores. The voice app review needs to check explicitly for copies of built-in intents. Unpronounceable characters like “�. ” and silent SSML messages should be removed to prevent arbitrarily long pauses in the speakers’ output. Suspicious output texts including “password” deserve particular attention or should be disallowed completely.
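
As an illustration of what such checks could look like, here is a minimal Python sketch of an automated review pass. It is not how Amazon or Google review voice apps. The function names, the word lists and the 3-second silence threshold are all assumptions made for this example, and the SSML handling only covers the time attribute of the <break> element.

```python
import re
from typing import List

BUILT_IN_UTTERANCES = {"stop", "cancel", "help"}  # illustrative subset
SENSITIVE_WORDS = {"password", "passcode"}        # illustrative subset
MAX_SILENCE_MS = 3000                             # arbitrary threshold for this sketch

BREAK_RE = re.compile(r'<break\b[^>]*\btime="(\d+(?:\.\d+)?)(ms|s)"', re.IGNORECASE)


def review_output(text: str) -> List[str]:
    """Return a list of findings for one output text of a voice app."""
    findings = []
    # Lone surrogates (U+D800..U+DFFF), such as U+D801, cannot be pronounced.
    if any(0xD800 <= ord(ch) <= 0xDFFF for ch in text):
        findings.append("unpronounceable character")
    # Add up all SSML breaks, since repeating them lengthens the silence.
    total_ms = sum(
        float(value) * (1000 if unit.lower() == "s" else 1)
        for value, unit in BREAK_RE.findall(text)
    )
    if total_ms > MAX_SILENCE_MS:
        findings.append("excessive silence")
    if set(re.findall(r"[a-z]+", text.lower())) & SENSITIVE_WORDS:
        findings.append("asks for sensitive data")
    return findings


def review_sample_utterances(utterances: List[str]) -> List[str]:
    """Return custom sample utterances that duplicate a built-in intent."""
    return [u for u in utterances if u.strip().lower() in BUILT_IN_UTTERANCES]


print(review_output('Goodbye. <break time="10s"/><break time="10s"/>'))
print(review_output("Please say start update followed by your password."))
print(review_sample_utterances(["stop", "play music"]))
```

Such checks only help if they also run whenever an already published voice app changes its responses, because the hacks described above rely on modifying the app after the initial review.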

The original research was done by Fabian Bräunlein (@breakingsystems) & Luise Frerichs and published on SRLabs.com.

The researchers shared their findings with Amazon and Google through their responsible disclosure process.