Case Study 1:  Voice Portal    

April, 2002

 

       What’s a Voice Portal? 

These days, you often hear about voice portals in the telecommunications trade media.  What are they?  Although the term is sometimes used to refer to any voice-automated business-to-consumer telephone application, the narrower definition we consider here is that voice portals, like the web portals Yahoo and Lycos, are single sources for a wide variety of general-interest consumer information such as news, weather, sports scores, stock quotes, and personal e-mail.  The difference from web portals, of course, is that the information is accessed by voice.  (For the moment, that effectively means by telephone.)

Voice portals are made possible by the recent maturity of speech recognition and text-to-speech (speech synthesis) technologies.  Speech recognition allows callers to speak varied and complex requests that would be tedious or impossible using touch-tones.  And text-to-speech can be used to play rapidly-changing information back to callers without the time and cost of making voice recordings. 

A voice portal is the obvious choice any time you don’t have immediate access to the web, but do have a phone.  Although wireless phones now offer web access, the displays are mostly small and data entry is clumsy.  Voice offers an easier and richer interface.  And voice is often easier even when you can get to the regular web.  Consider this: you want a local weather report.  Do you boot up your computer, dial your ISP, open a web browser, select your favorite weather site, enter your zip code, and click “go”?  Or do you speed dial your favorite voice portal and say, “weather”?

As of early 2002, there were more than a half-dozen companies in the US offering voice portal applications or services.  AT&T Wireless recently inaugurated their #121 Service, using a portal from Tellme (try Tellme’s free voice portal: 1-800-555-TELL).  BeVocal sold their portal to BellSouth, Cingular Wireless, and Qwest.  Hey Anita provides a portal for Sprint PCS.  You can try Hey Anita’s free portal, too, at 1-800-44-ANITA. 

Yahoo offers a subscription service, Yahoo by Phone (1-800-MY-YAHOO), for a few dollars per month (see InformationWeek story).  So does America OnLine.

Here we describe our experience with a prototype voice portal for a major telecommunications company.

 

       Business Models

Web portals make their money primarily through advertising, click-through fees, and sales of preferential placement in their search engine results.  Advertising is likewise possible with voice: ads can be short radio-like audio spots.  And the equivalent of click-through can be achieved by transferring the call to the advertiser’s 800 number.

Voice portals can easily support searches within constrained categories like movies or restaurants, and preferential positioning at the beginning of results lists could be sold to advertisers.  However, true web-like searches aren’t yet possible; practical limitations of recognition vocabulary size preclude open-ended entry of key words or phrases. 

But to our knowledge, none of these revenue sources has been used in any voice portal to date.  Although they may well be viable in the future, voice portal providers today are pursuing two other business models: subscription services and no-extra-charge value-adds to regular telephone services.  As noted above, Yahoo and AOL offer fee-based subscription services to their members, while some telcos, mostly wireless carriers, offer voice portals at no charge as differentiators.

The project discussed here was a no-charge, value-added service. 

 

       Practical Challenges

Early on, there were those who thought voice portals could just be voice-based versions of web portals.  But it’s not that simple.  There are innate differences between the two mediums in how information is entered, accessed, and presented.  Voice is a medium of speech and sound, while the web excels at text and graphics.  Voice is more suited for applications where requests can be spoken in a few words, and information can be read back in chunks of no more than one or two sentences at a time.  Charts and maps obviously don’t lend themselves to voice.  But things like traffic reports and driving directions work fine.  In situations like these, voice systems can be even better than web browsers because you can get the information while driving. 

 

Content

The content of greatest interest to voice portal users will likely be at least somewhat different from what’s most popular in web portals simply because users are likely to access the system in different situations, most notably from mobile phones.  So content should be chosen based on both users’ desires and what’s practical given the technology.  Weather reports and news briefs are naturals for voice, for example.  But more lengthy news articles must be played with synthesized speech, as recordings would be too expensive.  People might not be interested in hearing them over the phone anyway.  A good rule of thumb might be: if it works on the radio, it’ll work for a voice portal. 

And don’t forget content quality.  Technical and design wizardry is only a means to an end.  The concept will fail if the content isn’t accurate, timely, and useful.  This writer once called a voice portal (not the one discussed here) to get some information about the traffic jam where he was stuck.  According to the traffic report, there was no problem at his location.  Not much help!

Today’s consumers also expect their services to be slick and professional.   Voice portal vendors have realized that great production values make a big difference in acceptance and repeat usage.  They have creative staffs that employ professional voice talent, music, and sound effects to craft a “sound and feel” that adds tremendously to the service’s appeal.  In our view, the ultimate voice portals will be like entertaining radio channels where each user interactively controls the type and timing of the content provided. 

 

User Interface

Another big difference between web and voice portals is the user interface.  With voice, there’s only one train of thought at a time, so each transaction must be in the form of a sequential “dialog”.  Because graphical and voice user interfaces are so different, it’s unreasonable to think you can build a voice interface to a graphical web page.  In most cases, separate applications are needed for each medium, although the same back-end data sources can provide the content.

A major focus of voice user-interface design is site navigation.  The challenge is to keep users aware at each point in the dialog of what their options are and how requests should be phrased.  The navigation structure and allowed phrasings must be conveyed to the user while the dialog is in progress—often not an easy task, especially for complex applications with rich information content. 

One navigation approach that’s received a lot of attention is “natural language”.  Natural language allows callers to speak in extended phrases and sentences.  The goal is to collapse hierarchical menu structures and reduce the number of interactions needed to get any particular information.  For example, take a menu-based dialog for a weather report:  System: “What now?”  Caller: “Weather.”  System: “What city do you want?”  Caller: “Chicago.”  System: “Do you want the report for today or tomorrow?”  Caller: “Tomorrow.”  An equivalent natural language request might be, “What’s tomorrow’s weather for Chicago?” 
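To make the contrast concrete, here is a minimal sketch of how one dialog loop could serve both styles: it prompts for whichever piece of information is still missing, so a one-word reply fills one slot while a full sentence can fill several at once.  The word lists, prompt table, and function names are our own illustrative assumptions; a real system would rely on a speech recognition grammar rather than keyword spotting on text.

```python
import re

# Hypothetical word lists for this sketch; a real grammar would be far larger.
CITIES = {"chicago", "boston", "miami"}
DAYS = {"today", "tomorrow"}

# Prompt played for each piece of information that is still missing.
PROMPTS = {
    "topic": "What now?",
    "city": "What city do you want?",
    "day": "Do you want the report for today or tomorrow?",
}


def extract_slots(utterance):
    """Pick out any known topic, city, or day words from one utterance."""
    slots = {}
    for word in re.findall(r"[a-z]+", utterance.lower()):
        if word == "weather":
            slots["topic"] = word
        elif word in CITIES:
            slots["city"] = word
        elif word in DAYS:
            slots["day"] = word
    return slots


def run_dialog(utterances):
    """Feed caller utterances in order; return (filled slots, prompts played)."""
    slots, prompts_played = {}, []
    pending = iter(utterances)
    while len(slots) < len(PROMPTS):
        # Ask for the first slot that is still empty.
        missing = next(name for name in PROMPTS if name not in slots)
        prompts_played.append(PROMPTS[missing])
        reply = next(pending, None)
        if reply is None:
            break  # no more scripted replies (e.g., the caller hung up)
        slots.update(extract_slots(reply))
    return slots, prompts_played
```

Calling `run_dialog(["Weather.", "Chicago.", "Tomorrow."])` plays three prompts, while `run_dialog(["What's tomorrow's weather for Chicago?"])` fills the same three slots after a single prompt.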

A subsidiary goal of natural language is to make the system seem friendly and easy to use.  So it might encourage and accept requests like, “Can you tell me what the weather will be tomorrow in Chicago, please?” 

Natural language can tax the capabilities of the speech recognition engine because picking the right word sequence out of the huge number possible in even a modest-size grammar is a difficult task.  So error rates tend to be higher than when recognizing one- or two-word responses.

Even more problematic is the user-interface design: how can you inform users of the large, but still constrained, set of phrases the system can understand?  Although the recognition engine can be very powerful, the technology still isn’t up to the task of understanding completely free-form speech. 

 

Maintenance

One final, but very important, challenge is the maintenance effort needed to keep the system current.  On the voice side, maintenance is needed whenever the information content changes.  For example, IPOs, mergers, and bankruptcies require additions and deletions of company names for stock quotes.  These changes are needed on both sides of the dialog: in the speech recognition vocabulary, which determines which words can be understood (e.g., “Intel”), and in the “prompts” that are played back to the caller when the information is furnished (“Intel.  31.62, up .31”).  (Actually, you’d want to delay deletion of vocabulary for some time so that requests from users who don’t know about the change are still recognized.  Speech recognition systems can’t understand words that aren’t in their vocabulary; they’ll just make mistakes.  So when someone requests a quote for “Enron”, rather than the system dumbly responding, “I didn’t understand”, it would be better for it to say, “Sorry, Enron is no longer traded.”)
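The delayed-deletion idea can be sketched as a vocabulary that marks a name as retired instead of removing it right away.  This is only an illustration under our own assumptions: the class name, the 180-day grace period, and the `quote_for` callback are hypothetical, and a real deployment would use the recognition vendor’s vocabulary tools.

```python
from datetime import date, timedelta

# Assumed grace period during which a delisted name stays recognizable.
GRACE_PERIOD = timedelta(days=180)


class StockVocabulary:
    """Tracks names the recognizer should accept, including recently retired ones."""

    def __init__(self):
        self.display = {}  # lowercase spoken form -> name as read back in prompts
        self.retired = {}  # lowercase spoken form -> date the name was retired

    def add(self, name):
        key = name.lower()
        self.display[key] = name
        self.retired.pop(key, None)

    def retire(self, name, when):
        # Keep the name in the vocabulary, but mark it as no longer traded.
        self.retired[name.lower()] = when

    def purge(self, today):
        """Delete retired names whose grace period has run out."""
        for key, when in list(self.retired.items()):
            if today - when > GRACE_PERIOD:
                del self.retired[key]
                self.display.pop(key, None)

    def respond(self, heard, quote_for):
        key = heard.lower()
        if key not in self.display:
            return "I didn't understand."
        if key in self.retired:
            return f"Sorry, {self.display[key]} is no longer traded."
        return f"{self.display[key]}. {quote_for(key)}"
```

With this arrangement, a retired name keeps producing the helpful “no longer traded” reply until `purge` is run after the grace period, at which point it falls out of the vocabulary like any other unknown word.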

Most speech recognition vendors provide tools to update vocabularies, and the task is straightforward, except perhaps for very large vocabulary sizes (tens of thousands of words) where special tuning may be required.  A more problematic issue is prompt creation.  A human voice produces the most intelligible and pleasing result for callers.  And it’s very desirable to keep the voice consistent within each section of an application, since switching voices between or within prompts can be jarring and confusing.   One option is to hire a voice talent to record prompts for the application, and then keep him or her available to record updates as needed.  But recording sessions can be time-consuming and expensive.  And in this approach you have to maintain a long-term relationship, if possible, with the voice talent who did the original prompts. 

Computer-synthesized speech (also known as text-to-speech) is a very simple alternative to recorded prompts.  Maintenance is a snap: just add or change the text to be played, and the speech is generated automatically when needed.  Unfortunately, although text-to-speech products continue to improve, current offerings still have a somewhat disjointed or robotic sound quality that can degrade intelligibility, and some people may find the “voice” unpleasant.  Synthesized speech makes sense when the volume of material and frequency of updates make recordings impractical.  A good example is a personal e-mail reader.

 

       Solution

This project was a technology and marketing trial by a major telecom company.  The system was first tested with focus groups, then deployed in two small US cities.  It offered stock quotes, sports, weather, national news, horoscopes, television listings, and other information. 

The service was not formally advertised, but after some time received several thousand calls per week.  All calls were recorded and analyzed to measure speech recognition accuracy and to determine the content requested, the navigation paths taken, the words and phrases used, and how easily callers got what they wanted.

The content types offered were chosen based on earlier experience and market research.  Interestingly, the largest percentage of requests was for horoscopes, followed by stock quotes. 

One major emphasis was to create the easiest possible user interface.  The approach was to see how far “natural language” could be taken.   Callers were given a choice of using a hierarchical menu structure—a top level “main menu” with choices, “stocks”, “weather”, “horoscopes”, etc., and branches descending for each one—or natural language “shortcuts”, which allowed sentences like, “Can you tell me about the Marlins game?” and “What’s the horoscope for Libra, please?”  These requests were allowed not only from the main menu, but from lower branches and leaves of the menu structure.  So after hearing a weather report, you could immediately jump to the sports report for any team without returning to the main menu and descending the sports branch. 
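The combination of menu steps and shortcuts can be sketched roughly as follows.  The menu tree, leaf names, and `Navigator` class are hypothetical stand-ins, not the trial system’s actual design, and the sketch matches keywords in text where the real service used a speech recognition grammar.

```python
import re

# Hypothetical menu tree: each top-level branch maps to the leaves beneath it.
MENU = {
    "stocks": {"intel"},
    "weather": {"chicago", "miami"},
    "sports": {"marlins", "cubs"},
    "horoscopes": {"libra", "aries"},
}

# Reverse index so a leaf word spoken anywhere identifies its branch.
LEAF_TO_BRANCH = {leaf: branch for branch, leaves in MENU.items() for leaf in leaves}


class Navigator:
    def __init__(self):
        self.branch = None  # None means the caller is at the main menu

    def handle(self, utterance):
        """Return (branch, leaf) for the caller's new position; leaf may be None."""
        words = re.findall(r"[a-z]+", utterance.lower())
        # Shortcut: a leaf word jumps straight to that leaf from any position.
        for word in words:
            if word in LEAF_TO_BRANCH:
                self.branch = LEAF_TO_BRANCH[word]
                return self.branch, word
        # Menu step: a branch name moves to the top of that branch.
        for word in words:
            if word in MENU:
                self.branch = word
                return self.branch, None
        # Nothing recognized: stay where we are.
        return self.branch, None
```

A caller who has just stepped through “weather” and “Chicago” can then say, “Can you tell me about the Marlins game?” and land directly on the sports leaf, because the shortcut check runs before the menu check at every position.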

This design gave an extremely flexible interface.  But in practice, virtually no one ever used natural language.  Even focus group participants, when encouraged to use natural language, tended to say one natural phrase at the beginning and then revert to sequential, menu-based navigation. 

Originally, many of the content sources were text-based, and the information was played with a synthesized voice.  Users found the voice robotic and hard to understand.  Later, all static prompts needed for navigation were recorded, and the frequently changing content—weather reports, sports, stock quotes, horoscopes, etc.—was obtained from outside vendors in the form of regularly downloaded recordings.  This arrangement provided people-pleasing recordings with very simple application maintenance.  Only occasional updates were needed in the static recognition vocabulary and prompt recordings for such items as company names (for stock quotes) and sports teams.  

Overall, the system was very successful.  More than 90% of caller requests were fulfilled correctly.  Callers got the information they wanted, usually on the first try.  The system appeared to have a significant number of repeat callers, and demonstrated the viability of the voice portal concept. 

 

       Lessons Learned

This system provided a wealth of experience in the design and operation of a voice portal.

First, content.  As in any media business, you want to determine which content customers value most.  For voice portals, as with web sites, the simplest measure is how often each type of content is requested.
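As a rough illustration of this measure, request frequency can be tallied from a log of content categories, one entry per fulfilled request.  The function name and the sample log below are invented for the example and are not data from the trial.

```python
from collections import Counter


def request_shares(call_log):
    """Return (category, share of all requests) pairs, most requested first."""
    counts = Counter(call_log)
    total = sum(counts.values())
    return [(category, n / total) for category, n in counts.most_common()]


# Invented sample log for illustration only; not data from the trial.
sample_log = [
    "horoscopes", "stocks", "horoscopes", "weather",
    "horoscopes", "stocks", "sports", "horoscopes",
]

for category, share in request_shares(sample_log):
    print(f"{category}: {share:.0%}")
```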

In this service, we found a wide variation in request frequency among the different content categories.  Horoscopes were the top draw, followed by stock quotes.  Only rarely did users ask for any sports or other information. We believe these results would have been different if the service had been promoted to specific user groups. 

We learned a lot about user interface design.  It was clear that, at least for now, people don’t use natural language.   They know that they’re talking to a machine.  They don’t know exactly what the machine will understand, and they’re wary of making a “mistake”.  And they feel foolish engaging in human-like dialog.   We believe that natural language dialogs will become more common and useful when designers and users converge on a style of interaction that users find easier and quicker than menu-based dialogs.  This style will probably feature minimal phrases conveying just the salient data: “weather for Chicago tomorrow,” rather than a more polite, “What will the weather be in Chicago tomorrow?”

Finally, the use of third-party recorded-content vendors was in retrospect a good decision.  It provided very easy maintenance with content in a form users found appealing. 

 

© 2002 VoxMedia Consulting Inc.

 

 
