Final Report
Contents:
1. Executive Summary
2. Background and Research Objectives
3. Summary of Experimental Research Findings
4. Discussion
5. Further Research
6. Conclusions
7. References
1. Executive Summary

This 11-month project investigated the impact of dialogue constraints on both automatic speech recogniser performance and human behaviour. A key conclusion from the research is that human behaviour with highly constrained dialogues is predictable enough to allow the development of models of human and system behaviour which can be used in the early stages of design. Engineering models of this kind have the potential to reduce development time for new speech dialogues and to support the design of faster human-computer interactions. However, human behaviour is much less predictable when the constraint in the dialogue is implied rather than explicit, and further research is needed before the problem of modelling human responses to such dialogues becomes tractable. The current report details the main experimental findings and their key implications for design, and suggests further research which could build upon the current work and fully realise its value.
2. Background and Research Objectives

Talking to a computer should be easier and more natural than using manual input, but two features of automatic speech recognition (ASR) devices make it more difficult. First, they make errors. Second, they constrain what can be said. These two issues are closely related: other things being equal, the greater the level of constraint, the more accurate the recognition performance. However, constraint can also create barriers between users and the achievement of their task goals: they must first wait until the option they require is available within the active vocabulary of the system; they must then specify their request in terms which the machine can understand; finally, success depends on the system correctly recognising that utterance.

The current work was concerned with the impact of dialogue constraints on human-system performance, both through their effect on machine error rate and through their effect on user behaviour. This was investigated via a set of experiments using real interactions between users and speech systems. The aim of the project was to produce results which can aid the designers of speech-based systems during the early stages of design. In particular, it was hoped that the results would contribute to the development of a method for predicting the effects of design decisions about constraint. This aspect of the research is particularly important because, unlike the results of the experimental trials themselves, the output will not be limited to the particular technology used for the research. Keeping up with the latest technological developments is a major challenge for human factors, with the results of empirical studies often in danger of arriving too late to be of use to designers. A modelling approach is advantageous because the parameters representing system performance (e.g. recognition accuracy, vocabulary size) can be updated as necessary, and predictions obtained despite technological advances.
3. Summary of Experimental Research Findings

Several levels of constraint were investigated during the current research. These were embodied through query-style dialogues, menu-style dialogues and yes/no-style dialogues. Examples from each of these classes are shown in Table 1.
Table 1. Example prompts for the three dialogue styles.

| Mode | Example Prompt |
| --- | --- |
| Query | Which service do you require? |
| Menu | Which service: balance, cash transfer or other? |
| Yes/No | Hear your balance? |
Waterworth (1984) argues that these voice data entry modes are ordered along a dimension ranging from most sophisticated (in terms of the recognition technology needed) and least explicit (in terms of how the user is informed of the system's requirements) at the top, to least sophisticated and most explicit at the bottom. He also argues that there is a parallel hierarchy of size of acceptable user response class (smallest at the bottom, largest at the top). The importance of these parallel hierarchies becomes clear when one examines the performance implications. In general, recognition performance will improve as one moves from the top to the bottom of the hierarchy (as the vocabulary size falls). This should act to increase dialogue efficiency by reducing time-consuming error-correction moves. On the other hand, the restrictions on what can be said tend to introduce extra steps into the dialogue. This is time-consuming in itself and also brings more opportunities for recognition errors. Finally, there is the question of how users respond to the different levels of constraint, in particular whether they are able to limit their utterances to those that the system can understand; if they cannot, there are serious implications for performance. There are therefore clear trade-offs inherent in choosing a dialogue constraint level, and the research reported here investigated these trade-offs. A rough numerical illustration of the trade-off is sketched below.
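As a rough illustration of how these opposing effects interact, the expected transaction time for a dialogue style can be approximated as (number of steps) x (time per step) x (expected attempts per step), where the expected number of attempts at a step is 1/accuracy if the user simply repeats a rejected input until it is accepted. The sketch below uses this approximation; all of the step counts, timings and accuracies are invented for illustration and are not measurements from the experiments reported here.

```python
# Illustrative sketch of the constraint trade-off: more constrained styles
# need more dialogue steps, but their smaller vocabularies tend to be
# recognised more accurately. All numbers are invented for illustration.

def expected_time(steps, step_seconds, accuracy):
    """Expected transaction time, assuming a rejected input is simply
    repeated (at the same cost) until it is correctly recognised."""
    expected_attempts = 1.0 / accuracy  # mean of a geometric distribution
    return steps * step_seconds * expected_attempts

styles = {
    # style: (dialogue steps, seconds per step, recognition accuracy)
    "query":  (2, 6.0, 0.80),  # few steps, large vocabulary, lower accuracy
    "menu":   (4, 5.0, 0.90),
    "yes/no": (7, 3.0, 0.97),  # many steps, tiny vocabulary, high accuracy
}

for style, (steps, secs, acc) in styles.items():
    print(f"{style:7s} expected time: {expected_time(steps, secs, acc):5.1f} s")
```

Depending on the parameter values assumed, any of the three styles can come out fastest; the experiments summarised below provide empirical data on where real systems and users actually fall.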
3.1 The Relative Efficiency of Yes/No, Menu and Query Dialogue Styles

Two experiments were carried out to investigate the relative efficiency of yes/no, menu and query prompting strategies (see Table 1). While this issue had been investigated in a previous study by Brems, Rabin and Waggett (1995), their research has several shortcomings which limit the applicability of its results. First, Brems et al. (1995) relied on a Wizard of Oz simulation of the speech technology. This approach allowed them to simulate the rejection of non-vocabulary items by the recogniser, but not the differing recognition accuracies which would be associated with the different vocabulary sizes used. Second, the query and menu strategies in Brems et al.'s (1995) work were used to accomplish different types of task from the yes/no strategy, so their research did not allow direct comparison of all three approaches. The current work used an actual ASR device rather than a simulation of ASR capabilities, and all three styles of interaction were used to accomplish exactly the same task, allowing direct comparison between them. The home banking domain was chosen for the experiments as an example of a real-world application of speech technology. Performance was measured in terms of success in task completion, transaction time and user satisfaction ratings.
In the first experiment, forty-two novice users were tested with the three dialogue styles (a between-subjects design meant that each participant used only one dialogue type). It was found that participants were least successful with the query strategy (the lowest constraint level), completing significantly fewer sub-tasks than with either the menu strategy or the yes/no strategy. In fact, only three of the fourteen participants who used the query strategy were able to complete their interactions with the device. In terms of transaction times, it was found that the successful yes/no interactions were significantly faster than the successful menu interactions (there were insufficient data to compare the successful query interactions). This result was initially surprising given that the yes/no interactions involved significantly more individual steps than the menu interactions. However, it can be explained by three key facts:
In the second experiment, the dialogues were tested using twelve trained users (a counterbalanced within-subjects design was used, so each participant used all three dialogues). Users were given training with the speech recogniser and cue cards showing the words and phrases that the system could accept. Under these conditions all participants were able to complete all of the sub-tasks. Comparisons of the transaction times showed that the query strategy was fastest, the yes/no strategy next fastest and the menu strategy slowest (all post-hoc pairwise comparisons significant at p < 0.05). The comparison of the yes/no and menu strategies supports the results from the first experiment. The faster transaction time with the query strategy had previously been predicted using a task network modelling approach to simulate both the human and system responses (Hone and Baber, In Press).
3.2 The Role of Goals in User Interactions with Speech Systems
Previous research (Hone et al., 1998) had shown that users of a speech input-visual output system used words which were not currently available in the system vocabulary, even when the full list of available words was displayed as a menu on the screen. This behaviour could be explained by a conflict between the dialogue structure at that point and the user's task goals. It was predicted in the current research that users would make similar errors in using the telephone banking systems. However, throughout the two experiments reported above there were no examples of users attempting to bypass system-imposed constraints. In contrast, there was actually evidence of users following the system-imposed dialogue structure even where this was potentially in conflict with their own task goals. In the task instructions, users were told to transfer cash and then to obtain a statement; however, in the yes/no and menu dialogues the statement option was presented before the transfer cash option, and many users did the tasks in this order. This tendency was more pronounced in the highest constraint condition (the yes/no dialogues).
Given these findings from the first two experiments, a third experiment was designed to investigate further the factors which influence whether people obey the constraints implied by system prompts, most importantly the role of user goals. The first aim of the experiment was to test the limits of the initial studies' finding that users made no attempts to bypass dialogue constraints. This was done by deliberately using an ambiguous prompt at key dialogue points to encourage this type of user error behaviour. It was predicted that users would be more likely to try to bypass system constraints when the perceived cost of following those constraints was high. This cost was manipulated through the prompt type (menu or yes/no) and through task instructions which required users to cycle through the whole dialogue structure either once or twice. The second aim of the experiment was to investigate further the phenomenon of users following the system-imposed dialogue structure in preference to their goal order. In the first experiments, where this effect was observed, it was conceivable that task order would affect task outcome (a statement requested prior to a cash transfer may differ from one requested after it), but this was not made clear to users. In the third experiment, the salience of task order to goal achievement was manipulated in two ways: through instructions which either did or did not stress the importance of task order for goal achievement, and through the task domain (home banking versus weather information).
Forty-eight participants took part in the experimental trials. A mixed factorial design was used, with instruction type and dialogue style as the between-subjects variables and task type as the within-subjects variable. The first major finding was that, despite the deliberately ambiguous prompts at key parts of the dialogues, no users tried to bypass the implied constraint and ask directly for the next service. This suggests that verbal prompts may provide a stronger cue for constraint than equivalent prompts presented visually. The results also confirmed the prediction that increased dialogue constraint (through the use of the yes/no rather than the menu style) increased the likelihood of users following the dialogue-imposed task order rather than the order provided in their instructions (their goal order). This tendency was reduced by instructions stressing the importance of task order for goal achievement, but it was not entirely removed for the yes/no dialogue style. There was no effect of task type (home banking vs. weather information) on user behaviour.
Further details of this experiment are presented in Hone and Golightly (In Press).
4. Discussion

4.1 Design Guidelines

The main guidelines which arise from the experimental studies are as follows:
4.2 Proposed Generic Method for Specifying Constraint Levels in Speech Dialogues
The relationship between dialogue constraint and efficiency will be influenced by a number of factors, including the number of available task options, recognition accuracy, vocabulary size, length of individual prompts, error recovery route and user behaviour. These factors have been incorporated into task network models with the aim of predicting outcomes when these variables are altered. In Hone and Baber (In Press) we use this method to predict the transaction time effects of two different levels of constraint (menu and query dialogues). The main prediction in this paper is that query dialogues can be faster than menu dialogues, even if they have lower recognition accuracy, because they can be completed in fewer steps. This prediction is supported by the results from the second experiment reported above, where trained users interacted with both query and menu dialogues and were significantly faster with the former.
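A minimal simulation of this kind of model can be sketched as follows: the dialogue is a chain of prompt-response steps, each with a duration and a recognition accuracy, and a recognition failure sends the user round an error recovery loop (here, simply repeating the step). The step counts, timings and accuracies below are illustrative assumptions, not the parameters of the published model, but they reproduce the qualitative prediction that a query dialogue can beat a menu dialogue despite lower accuracy.

```python
import random

# Monte Carlo sketch of a task network model of a speech dialogue.
# All parameter values are illustrative assumptions.

def mean_transaction_time(steps, trials=10_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        for duration, accuracy in steps:
            total += duration
            while rng.random() > accuracy:  # recognition failure
                total += duration           # error recovery: repeat the step
    return total / trials

# Query: two long steps over a large vocabulary (lower accuracy).
query = [(6.0, 0.80), (6.0, 0.80)]
# Menu: four shorter steps over medium vocabularies (higher accuracy).
menu = [(5.0, 0.90)] * 4

print(f"query: {mean_transaction_time(query):5.1f} s")  # fewer steps wins
print(f"menu : {mean_transaction_time(menu):5.1f} s")
```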
Unfortunately, the prediction of the modelling approach was not supported by the results of the first experiment reported above, where users had not been previously trained to use the system. Here very few users were able to complete the transaction at all using the query dialogue. The crucial factor was user behaviour. In the modelling work, users were assumed to say a valid input and, if this was rejected by the machine, to repeat that same input until it was correctly accepted. However, real user behaviour did not follow this pattern. Instead, it was observed that novice users frequently misinterpreted rejections of valid vocabulary items as implying that they had used vocabulary which was not acceptable. This inappropriate model of the machine led them to attempt error recovery by rephrasing their input, a strategy which was on average less successful than simply repeating the original input. The assumptions about user behaviour used in the current task network models are therefore not appropriate for predicting the transaction times of novice users with query-type dialogues, and lead to overestimates of the success of these users with this interaction style.
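The mismatch could be captured by replacing the "always repeat" assumption with a simple rephrasing model, as in the sketch below: after each rejection the novice rephrases with some probability, and a rephrased utterance falls outside the system vocabulary with some probability, in which case it cannot be recognised. All of the probabilities and the attempt limit are illustrative assumptions, not measured values.

```python
import random

# Sketch of an alternative user-behaviour assumption for novices: after a
# rejection the user may rephrase rather than repeat, and a rephrased
# utterance may fall outside the system vocabulary. All probabilities are
# illustrative assumptions.

def step_succeeds(rng, accuracy, p_rephrase, p_out_of_vocab, max_attempts=5):
    in_vocab = True  # the first attempt is assumed to be a valid input
    for _ in range(max_attempts):
        if in_vocab and rng.random() < accuracy:
            return True  # input recognised
        if rng.random() < p_rephrase:
            # rephrasing may drift out of (or back into) the vocabulary
            in_vocab = rng.random() >= p_out_of_vocab
    return False  # user gives up

def success_rate(p_rephrase, trials=20_000, seed=1):
    rng = random.Random(seed)
    wins = sum(step_succeeds(rng, 0.80, p_rephrase, 0.50)
               for _ in range(trials))
    return wins / trials

print(f"always repeats : {success_rate(0.0):.1%}")  # original assumption
print(f"often rephrases: {success_rate(0.9):.1%}")  # observed novice style
```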
In contrast to user behaviour with the query dialogue, user behaviour with the menu and yes/no styles of interaction was highly predictable. Unlike previous work using visual system output (Hone et al., 1998), there were no instances of users making errors by trying to jump ahead of the system-imposed constraints. This was true for both novice and experienced users and for different task types (banking and weather), despite attempts to encourage such behaviour through the use of ambiguous prompts at key dialogue points. In addition, there was a strong tendency for users of menu and yes/no dialogues to perform tasks in the order in which they were presented in the dialogue structure. These findings provide evidence of a close correspondence between real user behaviour and the current task network modelling assumptions. This is a very positive result, meaning that the task network modelling approach could be used to make early design decisions about dialogues using the yes/no and menu styles. One such design decision might be choosing between the yes/no and menu styles for a particular application domain. Another important design decision where the approach might be relevant is determining the order in which services are offered within a dialogue in order to optimise efficiency.
5. Further Research

The current research suggests that task network modelling will be a productive method for predicting the transaction times of speech dialogues using spoken yes/no and menu style prompts. It is suitable for these dialogues because of the high degree of predictability shown in user behaviour with these prompts. To make the most of this potential, further work is needed to develop the task network modelling approach specifically for the speech technology domain. In particular, it would be desirable to automate the assignment of transaction times to individual prompts. Ideally designers would use a dedicated tool in which all they need do is draw a flow chart of their proposed dialogue structure, typing in the prompts and expected user responses at each step. Such work would simplify the use of task network modelling and encourage its use by designers.
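A tool of this kind might represent the dialogue as a simple list of steps in which the designer supplies only prompt text and expected responses, with step times estimated automatically (for instance from word counts at an assumed speaking rate). The sketch below is hypothetical: the Step class, the 0.4 s-per-word speaking rate and the 0.5 s recognition latency are all assumptions introduced for illustration.

```python
from dataclasses import dataclass

# Hypothetical input format for a dialogue-design tool: the designer
# supplies prompts and expected responses, and step times are assigned
# automatically from word counts. All constants are illustrative.

SECONDS_PER_WORD = 0.4
RECOGNITION_LATENCY = 0.5

@dataclass
class Step:
    prompt: str
    responses: list[str]
    accuracy: float = 0.95  # designer's estimate for this vocabulary

    def estimated_seconds(self) -> float:
        prompt_time = len(self.prompt.split()) * SECONDS_PER_WORD
        # use the longest expected response for a pessimistic estimate
        response_time = max(len(r.split())
                            for r in self.responses) * SECONDS_PER_WORD
        return prompt_time + response_time + RECOGNITION_LATENCY

dialogue = [
    Step("Hear your balance?", ["yes", "no"], accuracy=0.97),
    Step("Which service: balance, cash transfer or other?",
         ["balance", "cash transfer", "other"], accuracy=0.92),
]

# Expected time with geometric retries, as in the models described above.
total = sum(s.estimated_seconds() / s.accuracy for s in dialogue)
print(f"estimated transaction time: {total:.1f} s")
```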
The modelling approach has also been shown to be valid for query style dialogues, but only where users already know the available system vocabulary. While this is very useful, there is also a need to predict outcomes for users who do not know the vocabulary. In this respect the research has highlighted deficiencies in the present modelling assumptions regarding novice user behaviour. In particular, users seem to engage in active problem solving, trying different formulations of their utterances following a system recognition failure. Further research is needed on the factors which influence these types of user response, and to determine the extent to which they can be predicted. Such research would aid the development of modelling approaches with wider applicability than the current version.
Finally, the results raise some interesting issues about user behaviour. It had been predicted that users would try to bypass dialogue constraints where these were in conflict with quick achievement of their goals. While error behaviour of this kind was not observed in the current studies, the behaviour which was observed can also be attributed to users attempting to optimise their path through the dialogue. Thus users showed a tendency to respond to each opportunity to satisfy their goals in the order in which those opportunities were presented. This behaviour was most pronounced when the dialogue structure was least flexible (the yes/no style) and the perceived cost of re-navigating the dialogue for earlier items was therefore highest (a simple step-counting illustration of this cost is given below). This explanation fits the rationalistic framework proposed by Anderson (1990). Further research could be conducted into this effect, with task network modelling being used to predict actual dialogue navigation costs. Such research could provide data on the relationship between real and perceived dialogue navigation cost, and allow more accurate prediction of which dialogue features most influence the path users follow through the dialogue.
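As a toy illustration of re-navigation cost, assume a cyclic yes/no dialogue that offers each of n services in a fixed order and returns to the start after the last; counting declined prompts is an assumed proxy for cost, whereas a full task network model would use predicted times.

```python
# Toy step-counting illustration of re-navigation cost in a cyclic
# yes/no dialogue offering n services in a fixed order. Counting the
# declined prompts is an assumed proxy for cost.

def extra_prompts(n_services, passed_over, current):
    """Yes/no prompts that must be declined to get back to a service
    that was passed over earlier in the cycle."""
    return (passed_over - current - 1) % n_services

# Passing over service 1 in a five-service cycle and returning to it
# after finishing service 4 costs one declined prompt (service 0),
# whereas a menu dialogue would allow it to be selected directly.
print(extra_prompts(5, passed_over=1, current=4))  # -> 1
```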
6. Conclusions

The research has provided guidelines which are applicable to the design of small-vocabulary speech input-output systems. It has also considerably furthered the development of predictive models which have the potential to be applicable to the design of speech systems in general. A number of avenues for further research have been suggested which could build upon the current findings and fully realise their value.