Evaluation method

The planner is evaluated by dialogue simulation rather than by trials with real users. Each approach has its advantages. User trials are generally preferable to simulation, since they provide concrete evidence of the system's performance in its intended setting. Unfortunately, they are laborious: it can take hours to produce perhaps a dozen dialogues, and hundreds of trials may be necessary to obtain statistically significant results.

Because it uses a stereotype belief model, the planner is supposed to adapt to different distributions of belief states. The problem then arises of recruiting enough different subjects to obtain a suitable distribution of stereotype values. Ideally, several runs of training and testing should be used, with each run involving many dialogues and each producing a different stereotype value, since different settings would have different characteristics, with different distributions of belief states among the subjects. With human trials it would be difficult enough to investigate just one dialogue problem with one distribution of users, producing just one stereotype value.

The advantage of simulation is that it allows dialogues to be generated for many different stereotype states, covering a wide range of user populations and problems. Many of the experiments conducted for this thesis examined around one hundred different stereotype belief states to obtain sufficiently detailed results. Such detail would have been impossible with human trials.

The main disadvantage of simulation is that the simulation model of the user could be incorrect, leading to incorrect results. Human trials do not have this problem, provided that enough data is available to ensure statistical significance. Simulation treats the human user as an ideal decision-maker. It is well known that the performance of most humans is less than ideal [20,68,75], and so in using simulations, there is a missed opportunity to discover and accommodate the human decision-making process. There is no assurance that the characteristics discovered about the planning problems transfer to the human setting, nor that the efficiency claims about the planner transfer.

On the other hand, experiments have been conducted to investigate human communicative choices in game-theoretic problems, for example, exchange of information in a war exercise [25]. The human choices were found to be close to the ideals chosen by the computer. The simulation approach also has some popularity in developing dialogue systems, being used by [30] and [36] in studies of initiative in dialogue, and by [64] in evaluating different reinforcement learning systems on a simulated user.

The simulation method is easy to apply, since the planner computes the strategies of both the user and the system. Because the planner already simulates the user when it generates the game tree, that tree can itself be used to represent the simulation outcomes, and can then be used to explore the dialogue outcomes obtained from different belief states.
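The idea of reusing one game tree to explore outcomes across many belief states can be illustrated with a minimal sketch. It is not the thesis planner: the node types, the example tree, the utilities, and the names `expected_utility` and `sweep` are all hypothetical, and it assumes for simplicity that the system and the simulated user maximise a shared utility and that the belief model is a single probability `p`.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    utility: float  # utility of the finished dialogue (hypothetical values)


@dataclass
class Choice:
    # Decision node: the acting agent (system or simulated user)
    # picks the child with the highest expected utility.
    children: List["Node"]


@dataclass
class Chance:
    # Belief node: the user holds the belief with probability p.
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Choice, Chance]


def expected_utility(node: Node, p: float) -> float:
    """Evaluate the tree for one belief state p (assumed to be in [0, 1])."""
    if isinstance(node, Leaf):
        return node.utility
    if isinstance(node, Chance):
        return (p * expected_utility(node.if_true, p)
                + (1 - p) * expected_utility(node.if_false, p))
    return max(expected_utility(child, p) for child in node.children)


def sweep(tree: Node, steps: int = 100) -> List[Tuple[float, float]]:
    """Evaluate the same tree at steps + 1 evenly spaced belief states."""
    return [(i / steps, expected_utility(tree, i / steps))
            for i in range(steps + 1)]


# Hypothetical problem: the system either asks first (fixed utility 9) or
# assumes the user holds a belief. If the user lacks it, the simulated user
# chooses between clarifying (4) and guessing (2).
TREE = Choice([
    Leaf(9.0),
    Chance(if_true=Leaf(10.0), if_false=Choice([Leaf(4.0), Leaf(2.0)])),
])

results = sweep(TREE)
```

Sweeping about one hundred values of `p` in this way mirrors the scale of the experiments described above, although a real belief model would have more than one dimension.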

The main differences between a simulated dialogue and one with a human participant relate to the bounded reasoning resources of the user. Some of the game trees used in the examples are quite deep, and the dominance of different strategies is closely related to the probability values in the belief model. A human decision-maker could not be expected to compute the game tree with such depth or precision. A second problem is that human users tend to fit themselves to an interface by reinforcing routine responses over time. If the system responds to a changing belief model by changing its strategy, there may be a period thereafter in which the user performs badly while learning a new policy to fit the new strategy. For example, most users would be upset if the menu options or dialogue sequence available to them in a graphical user interface were to change from those to which they have become accustomed.

The planner is required to show an advantage over systems that have no user model. A system with no user model cannot change its strategy as the user or population of users changes. Therefore, the only way to design such a system is to make a good choice of strategy at design time, and to fix it for the lifetime of the system. For example, if there is a choice between two strategies, the planner is compared with the better of these as the system's fixed strategy. It is difficult, though, to quantify the gain: if the belief model does not change much during the lifetime of the system, then the system does not need to change its strategy, and so there is no gain over the better fixed-strategy planner. It is only when the belief model drifts across the decision surface that a gain is obtained. Without user data the path that the belief model takes cannot be known, and so the gain obtained over the system's lifetime is not clear.
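The dependence of the gain on the path of the belief model can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the two strategies, their utility functions, the single-probability belief state `p`, and the name `lifetime_gain`. The fixed strategy is chosen with hindsight over the whole path, which favours the fixed system, so the figure computed is a lower bound on the gain.

```python
from typing import Callable, Dict, Sequence

# A strategy maps a belief state p in [0, 1] to an expected utility.
Strategy = Callable[[float], float]

# Hypothetical strategies whose utilities cross at p = 5/6,
# which plays the role of the decision surface here.
STRATEGIES: Dict[str, Strategy] = {
    "ask": lambda p: 9.0,
    "assume": lambda p: 4.0 + 6.0 * p,
}


def lifetime_gain(strategies: Dict[str, Strategy],
                  path: Sequence[float]) -> float:
    """Utility of an adaptive system minus that of the best fixed strategy.

    `path` is the sequence of belief states visited over the system's
    lifetime, one entry per dialogue.
    """
    # The adaptive system picks the best strategy for each belief state.
    adaptive = sum(max(u(p) for u in strategies.values()) for p in path)
    # The fixed system commits to one strategy for the whole path.
    best_fixed = max(sum(u(p) for p in path) for u in strategies.values())
    return adaptive - best_fixed


# A path that stays on one side of the decision surface yields no gain.
no_drift = lifetime_gain(STRATEGIES, [0.1, 0.2, 0.3])
# A path that crosses the decision surface yields a positive gain.
with_drift = lifetime_gain(STRATEGIES, [0.5, 1.0])
```

In this toy setting the gain is zero for any path confined to one side of the crossover, which is the reason the lifetime gain cannot be stated without knowing the path.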

One of the stated objectives of the system is to provide an efficiency gain over current dialogue systems that use a finite-state or frame-based dialogue manager. It is not hard to translate the rules of such dialogue managers into equivalent hierarchical plan rules of the sort used by the fixed-strategy planner. Each parent in a hierarchical rule can be used to represent the state, the first child the output of the system, and the second child the next state. Conversely, the fixed-strategy form of each example given in this chapter has a finite-state equivalent, since each parent can be written as a state, with alternative decompositions represented by alternative edges and next states. The preconditions must of course be ignored in performing this transformation. Because of this equivalence, the comparison with the fixed-strategy planner can be regarded as a comparison with finite-state systems.
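The correspondence described above can be sketched as a pair of conversion functions. This is an illustrative encoding rather than the planner's actual rule format: the types `Decomposition`, `FSM` and `Rules`, the function names, and the example states are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Decomposition(NamedTuple):
    output: str                         # first child: the system's output
    next_state: str                     # second child: the next state
    precondition: Optional[str] = None  # ignored when converting to an FSM


# Finite-state form: state -> list of (system output, next state) edges.
FSM = Dict[str, List[Tuple[str, str]]]
# Hierarchical form: parent -> list of alternative decompositions.
Rules = Dict[str, List[Decomposition]]


def fsm_to_rules(fsm: FSM) -> Rules:
    """Each state becomes a parent; each edge becomes one decomposition."""
    return {state: [Decomposition(out, nxt) for out, nxt in edges]
            for state, edges in fsm.items()}


def rules_to_fsm(rules: Rules) -> FSM:
    """Each parent becomes a state; preconditions are dropped."""
    return {parent: [(d.output, d.next_state) for d in decomps]
            for parent, decomps in rules.items()}


# Hypothetical three-state dialogue manager.
EXAMPLE_FSM: FSM = {
    "greet": [("ask-destination", "get-destination")],
    "get-destination": [("confirm", "done"),
                        ("ask-again", "get-destination")],
    "done": [],
}

example_rules = fsm_to_rules(EXAMPLE_FSM)
```

Converting to rules and back returns the original machine, whereas converting rules that carry preconditions to a finite-state machine loses them, which is the sense in which the preconditions are ignored.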

bmceleney 2006-12-19