Developing for Google Home vs. Amazon Echo
Smart speakers are a new class of gadget that uses voice-recognition and machine learning technology to provide access to information, news, and music. As developers may already know, some of these devices are programmable—for example, you can build “skills” for Amazon’s Alexa digital-assistant platform, which powers hardware such as the Echo. But is developing for these platforms a pleasant experience?

Amazon Echo

Amazon launched the Echo in mid-2015 in the United States, followed by rollouts in other countries; it followed up with the Echo Dot, a smaller device, and the Amazon Tap, an ultra-portable version. Each proved a solid seller, and the e-commerce giant soon ramped up production on a version with a screen (the upcoming Echo Show). These devices’ “brain,” Alexa, has two parts: Alexa Skills and the Alexa Voice Service.

In simplest terms, an Alexa skill is a mini-application invoked by a phrase prefixed with “Alexa.” Amazon has tried to make the development process as simple as possible, offering an Alexa Skills Kit that you can use to create a skill. Each skill features two components: an Intent Schema (a JSON structure) and Spoken Input Data, which consists of Sample Utterances (a structured text file that maps spoken phrases to intents) and Custom Values (the values of specific terms referenced in those intents). There are over a dozen different ways to say the same thing, including Ask, Tell, Search, Talk To, Open, Launch, Start, Resume, Run, Load, Begin and Use, all combined with the invocation name and the intent. Things can get complex in rapid order. If the invocation name is “games pack” and the intent is “poker,” you might say: “Alexa, launch games pack and run poker.”

While Amazon’s documentation presents building for Alexa in a straightforward way, you really need to diagram the skill with care. For example, let’s say you want to build a skill that orders a taxi. The user needs to specify where they want to be picked up, where they’re going, when they want to be picked up, the number of passengers, and any special requests (disability-accessible, and so on). That’s a lot of dialog and responses, and it takes a lot of time to build a natural flow. If you don’t take that care, the skill will have incomplete functionality or hit a conversational “dead end,” and nobody will use it.

Amazon provides specialized APIs as part of Alexa’s skills: for example, the Smart Home Skill API (for lights, thermostats and so on), the Flash Briefing Skill API (for news flashes), and the Video Skill API (change TV channels, pause video playback, and so on). For these, you have to use AWS Lambda, and can write code in Java, Node.js, Python or C#. For anything that falls outside these specialist Skill APIs (such as taxi-ordering, to loop back to our previous example), you have to write custom code yourself; in these instances, you can either use AWS Lambda or provide your own web service. I suspect that most developers will go down the custom route, but if you don’t have time to write code yourself, you can use the Alexa Voice Service, as well as any of the skills available in the Alexa Skills Store.

If you write your own code, you have to define the requests that the skill can handle. For a taxi, you might say: “Order a Car.” This is mapped to an OrderCar intent: “Alexa, order a car.” We’ll create an OrderTaxiSpeechlet in Java:
public class OrderTaxiSpeechlet implements Speechlet {

The OrderCar intent will now have to walk the user through the following steps (a sketch of the matching Intent Schema follows the list):
  1. Request a Pickup address.
  2. Get the Address from the User.
  3. Request a DropOff address.
  4. Get the Address from the User.
  5. Request the Number of Passengers.
  6. Book the Taxi.
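Each of those steps maps onto an intent in the skill’s Intent Schema. A minimal sketch of that JSON for this hypothetical taxi skill might look like the following (the custom intent and slot names are purely illustrative, though AMAZON.PostalAddress, AMAZON.NUMBER and the built-in Help/Stop/Cancel intents are standard Alexa types):

{
  "intents": [
    { "intent": "Pickup", "slots": [{ "name": "Address", "type": "AMAZON.PostalAddress" }] },
    { "intent": "DropOff", "slots": [{ "name": "Address", "type": "AMAZON.PostalAddress" }] },
    { "intent": "NumberPassengers", "slots": [{ "name": "Count", "type": "AMAZON.NUMBER" }] },
    { "intent": "BookTaxi" },
    { "intent": "AMAZON.HelpIntent" },
    { "intent": "AMAZON.StopIntent" },
    { "intent": "AMAZON.CancelIntent" }
  ]
}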
Handling these intents is done by overriding the onIntent() method, something like this:
@Override
public SpeechletResponse onIntent(final IntentRequest request, final Session session)
        throws SpeechletException {
    log.info("onIntent requestId={}, sessionId={}", request.getRequestId(),
            session.getSessionId());

    Intent intent = request.getIntent();
    String intentName = (intent != null) ? intent.getName() : null;

    if ("Pickup".equals(intentName)) {
        return getAddress(intent, session, true);
    } else if ("DropOff".equals(intentName)) {
        return getAddress(intent, session, false);
    } else if ("NumberPassengers".equals(intentName)) {
        // Prompt for the passenger count and keep the session open for the answer
        PlainTextOutputSpeech passengers = new PlainTextOutputSpeech();
        passengers.setText("How many passengers will be travelling?");
        Reprompt reprompt = new Reprompt();
        reprompt.setOutputSpeech(passengers);
        return SpeechletResponse.newAskResponse(passengers, reprompt);
    } else if ("BookTaxi".equals(intentName)) {
        return bookTaxi(intent, session);
    } else if ("AMAZON.HelpIntent".equals(intentName)) {
        return getHelp();
    } else if ("AMAZON.StopIntent".equals(intentName)) {
        PlainTextOutputSpeech outputSpeech = new PlainTextOutputSpeech();
        outputSpeech.setText("Goodbye");
        return SpeechletResponse.newTellResponse(outputSpeech);
    } else if ("AMAZON.CancelIntent".equals(intentName)) {
        PlainTextOutputSpeech outputSpeech = new PlainTextOutputSpeech();
        outputSpeech.setText("Goodbye");
        return SpeechletResponse.newTellResponse(outputSpeech);
    } else {
        throw new SpeechletException("Invalid Intent");
    }
}
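These intents only fire when Alexa can match what the user actually says, which is where the Sample Utterances file comes in. A purely illustrative sketch for this hypothetical skill, with each line mapping a spoken phrase (slots in braces) to an intent, might read:

Pickup pick me up from {Address}
Pickup the pickup address is {Address}
DropOff take me to {Address}
DropOff drop me off at {Address}
NumberPassengers there will be {Count} passengers
NumberPassengers just {Count} of us
BookTaxi book the taxi
BookTaxi confirm my booking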

Google Home

Google’s own digital-assistant hardware, Home, opened up to development in late 2016. Home is powered by Google Assistant, which also runs on smartphones. When it comes to voice apps, Assistant boasts a lot of similarities to Alexa. With Google, Actions define your app’s external interface. There are two components: an intent, which is a simple messaging object that describes the action, and fulfillment, which is the web service that can fulfill the intent. Actions are defined in a JSON action package.

If you’re interested, start by heading over to the Google Developers portal for Actions. Create an Actions project and an API.AI agent that defines your intents. Once those are set up, provide fulfillment: the code that handles requests, with both request and response messages in JSON. You can either run it as Cloud Functions for Firebase (much like AWS Lambda), or provide your own web service, so long as it supports HTTPS and can parse JSON. Google provides a Node.js client library, as well as examples for deploying to Google App Engine and Heroku.

Each action package has one actions.intent.MAIN intent. This is the default action, and is specified in JSON. Here’s a useful example from the Eliza sample on GitHub: first, the main intent handler, then one for raw input:
const mainIntentHandler = (app) => {
  console.log('MAIN intent triggered.');
  const eliza = new Eliza();
  app.ask(eliza.getInitial(), {elizaInstance: eliza});
};

const rawInputIntentHandler = (app) => {
  console.log('raw.input intent triggered.');
  const eliza = new Eliza();
  const invocationArg = app.getArgument(INVOCATION_ARGUMENT);
  const elizaReply = invocationArg ? eliza.transform(invocationArg) : eliza.getInitial();
  app.ask(elizaReply, {elizaInstance: eliza});
};
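For context, the action package that ties handlers like these to Assistant is just a small JSON file. A minimal sketch, with a placeholder conversation name and fulfillment URL rather than values from the real Eliza sample, might look something like this:

{
  "actions": [
    {
      "description": "Launch the Eliza conversation",
      "name": "MAIN",
      "fulfillment": { "conversationName": "eliza" },
      "intent": { "name": "actions.intent.MAIN" }
    }
  ],
  "conversations": {
    "eliza": {
      "name": "eliza",
      "url": "https://example.com/eliza-fulfillment"
    }
  }
}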

Conclusion

Both Amazon and Google provide Voice UI design documents and simulators, so you can do a lot of research and experimentation before you actually build. (For Amazon, that functionality is available via Echosim; for Google Assistant, there's an Actions Simulator.) Thanks to all those resources, it’s easy to begin fleshing out an idea for an action or skill, no matter which platform you choose; but voice commands are deceptively difficult to get right, and developers could become snarled up unless they carefully diagram out functionality beforehand.

The game will only get more complicated later this year, when Apple introduces the HomePod, which leverages Siri. That means developers who want to work on all available platforms will need to become familiar with yet another company’s software. Fortunately, there’s SiriKit, as well as documentation that breaks down Apple’s take on intents. Welcome to the beginning of the future.