next up previous contents
Next: Conclusion Up: Planning of Dialogue Previous: Dialogue management and user   Contents


Evaluating dialogue systems

The value of a dialogue system is determined by the purpose for which it was built, and purposes vary widely. Systems are often task-based: there is a definite goal whose achievement is the system's main purpose, and a task-based system is expected to achieve that goal using few resources. Since dialogue acts consume only the time of the system and the user, the objective reduces to the time taken to execute the dialogue, balanced against the reward obtained by achieving the goal. Task completion and execution time are both objective measures that can be determined by the system itself, allowing self-training or planning with these measures as the objective. Happily, task-based systems are the kind examined in this thesis. Beyond the execution of a task, dialogue systems can be built for other purposes. For example, the well-known Eliza system, which attempts to simulate a psychotherapist, is not related to any task, yet users seem to appreciate the intelligent dialogue it produces. Similarly, a system that produces weather forecasts would need to be evaluated by a mix of objective measures and human judgement of whether its dialogue is informative, error-free, and pleasant. Without a task model, it is more difficult to tell why the system is of use to the user.
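The trade-off described above can be written down directly: the net value of a dialogue is the reward for achieving the goal minus a cost proportional to elapsed time. The following is a minimal sketch under that assumption; the function name and the reward and cost figures are invented for illustration.

```python
def dialogue_value(task_completed: bool, elapsed_seconds: float,
                   task_reward: float = 100.0,
                   cost_per_second: float = 1.0) -> float:
    """Net value of a dialogue: goal reward balanced against time cost.

    Both inputs are objective measures the system can compute itself,
    so this quantity can serve as a self-training or planning objective.
    """
    reward = task_reward if task_completed else 0.0
    return reward - cost_per_second * elapsed_seconds

# A successful 30-second dialogue outscores a failed 10-second one.
assert dialogue_value(True, 30.0) > dialogue_value(False, 10.0)
```

Under these (hypothetical) figures, a completed task is worth incurring up to 100 seconds of dialogue.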

Unfortunately, human judgement is often the final word in the acceptance of a dialogue system, and it is rarely perfectly determined by objective measures of the system. Even where a relation with objective measures exists, it can be irrational. For example, a system that exhibits frequent speech recognition errors because it uses a freer, user-initiative dialogue strategy might be perceived as of poor quality, even though that strategy improves the execution time of the system. As a result, human judgement, which is comparatively expensive to obtain, must often be used in combination with objective measures.

One attempt to relate objective measures to human judgement is the PARADISE framework [74]. The framework uses a set of objective measures of the dialogue, such as task completion, execution time, response time, and number of errors. In user trials, a quantitative measure of performance is obtained from human judges, and this performance is assumed to be a weighted sum of the objective measures. Over the set of user trials, the weights that minimise the error between the weighted sum and the judges' values are then obtained. It is interesting to examine the weights so obtained, and this was done for two dialogue systems: task completion, response time, and elapsed time for the dialogue generally received the greatest weights. These experiments also showed that users irrationally value recognition accuracy over elapsed time, even though they ought to value only task completion and elapsed time.
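The fitting step just described is an ordinary least-squares regression: judged performance is modelled as a weighted sum of the objective measures, and the weights minimising squared error over the trials are computed. The sketch below illustrates this; the trial data and measure values are invented for illustration and are not from the PARADISE experiments.

```python
import numpy as np

# One row per user trial; columns are objective measures of that dialogue:
# task completion (0/1), response time, elapsed time, number of errors.
# All figures here are hypothetical.
measures = np.array([
    [1.0, 2.0,  60.0, 1.0],
    [1.0, 4.0,  90.0, 3.0],
    [0.0, 3.0, 120.0, 5.0],
    [1.0, 1.5,  45.0, 0.0],
])

# Quantitative performance scores given by the human judges, one per trial.
judged_performance = np.array([4.5, 3.0, 1.0, 5.0])

# Least-squares fit: find weights w minimising ||measures @ w - judged||^2.
weights, *_ = np.linalg.lstsq(measures, judged_performance, rcond=None)

# The fitted weights predict judged performance from objective measures alone.
predicted = measures @ weights
```

Examining the relative magnitudes of the fitted weights then shows which objective measures most influence the judges, which is how the PARADISE experiments identified task completion, response time, and elapsed time as dominant.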

Where suitable objective measures can be found for a dialogue system, automatic training of the system becomes possible without the need for human judgements. For example, a reinforcement learning system could train on dialogues with users by calculating the measures at the end of each example dialogue. A planned approach to dialogue places a further requirement on the objective measures: they must be compositional over the plan structure, in that the measure for a plan equals the sum of the measures for the acts in the plan. This is because a planner, in contrast to reinforcement learning, searches over plans rather than reinforcing plans that appear in the training data, and so it must be able to predict the value of the plans in its search space. This could be a disadvantage, but the measures given in the previous paragraph are compositional, being additive over the acts in the plan structure. Task completion is a function of the plan structure obtained at the close of the dialogue. Response time is a function of the act that the system chooses at its turn. Elapsed time is the sum, over the system and user acts, of the time taken to execute each dialogue act, which might be predicted from measures such as the number of words in the utterance that corresponds to the act, or from recorded examples of the act's use in real dialogues.
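The compositionality requirement can be sketched directly: if each act carries its own predicted cost, the value of any candidate plan in the search space is just the sum over its acts. The act names and timing figures below are hypothetical; the per-act estimates stand in for predictions from word counts or recorded examples, as described above.

```python
from dataclasses import dataclass

@dataclass
class DialogueAct:
    name: str
    predicted_seconds: float  # e.g. estimated from utterance word count
                              # or from recorded uses in real dialogues

def plan_elapsed_time(plan: list[DialogueAct]) -> float:
    """Elapsed time is additive over the acts in the plan, so a planner
    can score any plan it constructs, not only plans seen in training."""
    return sum(act.predicted_seconds for act in plan)

# A hypothetical candidate plan the planner might evaluate.
plan = [
    DialogueAct("greet", 2.0),
    DialogueAct("ask-destination", 4.5),
    DialogueAct("confirm", 3.0),
]
assert plan_elapsed_time(plan) == 9.5
```

A non-compositional measure, by contrast, could only be evaluated after executing the whole dialogue, which is exactly why it suits reinforcement learning but not plan search.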


bmceleney 2006-12-19