

Deciding dialogue strategies using policies

It is clear that human speakers do not derive a dialogue plan from first principles every time they must think of something to say. Often a situation arises that has been seen before, and the speaker need only recall a policy for that situation; flight-booking dialogues, for example, often have many states in common. For planning problems that involve long sequences of actions over a limited number of states, planning from first principles can be inefficient, since the agent must search over several action alternatives at each of many steps in the plan. By using reinforcement learning, on the other hand, a learned policy provides a compact record of the solution to every problem instance.

To design a system that uses a policy, the designer must first specify a set of states. In a flight-booking system that fills a frame of information supplied by the user, for example, the possible frame states would form the state space. For each state, a set of actions is defined, and a state transition function specifies the state that results from applying an action in a state. Different actions can represent different strategies: there might be a system-initiative strategy or a user-initiative strategy, or different confirmation strategies for different states. Learning a policy is then a matter of evaluating the utility of each state-action pair, by looking up the utility of the resultant state and adding any reward gained in the current one. By iterating this learning rule many times over each state in the state space, passing utility values back from the outcomes of the dialogue, the table of state-action utilities converges on the optimal policy for the problem. Once the policy has been learned, planning becomes a trivial matter of looking up the table to find the best action to take in a state.
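The learning rule described above can be made concrete with a small sketch. The following is a minimal illustration rather than the system discussed in this thesis: it assumes a toy frame-filling task whose states record which of two invented slots (origin and destination) are filled, invented actions ask_origin, ask_dest and confirm, and a hand-written deterministic transition and reward table. A tabular utility backup is iterated until the state-action utilities converge, after which acting is a table lookup.

# Minimal sketch of learning a dialogue policy by iterating a tabular
# utility (Q-value) backup.  The states, actions, transitions and rewards
# below are invented for illustration only.

GAMMA = 0.95  # discount factor

# States: which slots of the frame are filled; 'done' is the terminal state.
STATES = ['empty', 'origin', 'dest', 'both', 'done']
ACTIONS = ['ask_origin', 'ask_dest', 'confirm']

# Deterministic transition function: (state, action) -> next state.
TRANSITIONS = {
    ('empty', 'ask_origin'): 'origin',
    ('empty', 'ask_dest'): 'dest',
    ('empty', 'confirm'): 'empty',
    ('origin', 'ask_origin'): 'origin',
    ('origin', 'ask_dest'): 'both',
    ('origin', 'confirm'): 'origin',
    ('dest', 'ask_origin'): 'both',
    ('dest', 'ask_dest'): 'dest',
    ('dest', 'confirm'): 'dest',
    ('both', 'ask_origin'): 'both',
    ('both', 'ask_dest'): 'both',
    ('both', 'confirm'): 'done',
}

def reward(state, action, next_state):
    """Each turn costs a little; completing the booking earns a large reward."""
    return (20.0 if next_state == 'done' else 0.0) - 1.0

# Table of state-action utilities, initialised to zero.
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

for _ in range(100):                       # iterate the backup until convergence
    for s in STATES:
        if s == 'done':
            continue                       # terminal state: nothing to back up
        for a in ACTIONS:
            s_next = TRANSITIONS[(s, a)]
            future = 0.0 if s_next == 'done' else max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] = reward(s, a, s_next) + GAMMA * future

# Acting with the policy is just a table lookup.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES if s != 'done'}
print(policy)   # e.g. {'empty': 'ask_origin', 'origin': 'ask_dest', ...}

In this toy version the backup is applied to the transition model directly; a system trained on dialogues would instead apply the same update to the state-action pairs actually visited, using the rewards observed at the end of each dialogue.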

Using a policy has the advantage that the true reward of dialogues can be used in training the system, whereas when the dialogue is planned, the reward is only estimated, from the value of the goal and the costs accrued by each of the actions in the plan. The downside of using a policy is the cost of the exploration needed to obtain training data. In reinforcement learning, an agent must strike a balance between exploration and exploitation. Using softmax selection [71], an agent can try actions that are not currently optimal, so that examples can be collected to reinforce those actions; as more and more examples are collected, the agent tends to exploit the optimal action rather than explore. Softmax selection is also useful in cutting down the complexity of the state space, which might be the combinatorial composition of the states of many beliefs. A good example of the use of dialogue policies is that of Walker et al. [76], who use the PARADISE evaluation framework to compute the utility of the dialogues.
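As a brief illustration of softmax selection, the sketch below chooses among actions with probabilities given by a Boltzmann distribution over their current utility estimates; the temperature parameter would normally be annealed over time so that the agent shifts from exploration towards exploitation. The utility values here are invented and carry over from the toy example above.

import math
import random

def softmax_select(q_values, temperature):
    """Pick an action with probability proportional to exp(Q/temperature).

    High temperatures explore almost uniformly; low temperatures exploit
    the highest-valued action almost deterministically.
    """
    prefs = [math.exp(q / temperature) for q in q_values.values()]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(list(q_values.keys()), weights=probs, k=1)[0]

# Invented utility estimates for the actions available in some state.
q = {'ask_origin': 17.0, 'ask_dest': 17.0, 'confirm': 2.0}

print(softmax_select(q, temperature=10.0))   # exploratory: 'confirm' chosen sometimes
print(softmax_select(q, temperature=0.5))    # greedy: almost always an 'ask_*' action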

Using reinforcement learning is an attractive approach to deciding dialogue strategies, and it is important to contrast this approach with the planned approach that will be taken in this thesis, since both can adapt themselves to users by training on their dialogues. For the planner to be competitive with reinforcement learning, two qualities are important. First, the planner should be as easy to use as a reinforcement learning system, and in the next chapter this will be shown to be the case. The second quality is its performance, that is, the quality of the dialogues produced given a certain amount of training material. While reinforcement learning is very useful for problems with limited numbers of states that are well covered by training data, planning is useful where there are many more states, and hence where good decisions must be made in novel situations. The arguments for and against planning and reinforcement learning in robot planning carry over to dialogue planning, especially where the dialogue is non-routine, such as a meta-level negotiation over a robot plan.

The model used for basic reinforcement learning is the Markov Decision Process (MDP), in which the agent is assumed to observe its current state with certainty. Just as in robot planning, where actions and observations are uncertain, dialogue planning must accommodate uncertainty, since errors occur in the speech recognition process. For this reason, several researchers have addressed the use of Partially Observable Markov Decision Processes (POMDPs) in dialogue planning. In a POMDP, actions have a probability distribution over effects, and states give rise to a probability distribution over observations, so the agent must maintain a belief distribution over the states it might be in. Since the agent does not know which state it is in, reinforcement is more difficult, and POMDPs can be difficult to train. Roy et al [58] used a POMDP to deal with speech recogniser uncertainty in a speech-controlled robot, showing a significant improvement in performance when uncertainty in the belief state of the robot is accommodated. Zhang et al [79] address the state complexity issue by using a Bayesian network to map several state variables into one.
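To make the role of the belief state concrete, the following is a minimal sketch of the Bayesian belief update that a POMDP dialogue manager performs after each noisy observation, here a simulated speech-recogniser output. The states and probabilities are invented, and the transition component of the full POMDP update is omitted for brevity.

# Minimal sketch of the belief update at the heart of a POMDP dialogue
# manager.  The states and recogniser confusion probabilities below are
# invented for illustration only.

# Hidden states: what the user actually wants.
STATES = ['wants_origin', 'wants_dest']

# P(observation | state): a noisy speech recogniser sometimes hears the
# wrong slot.
OBSERVATION_MODEL = {
    'wants_origin': {'heard_origin': 0.8, 'heard_dest': 0.2},
    'wants_dest':   {'heard_origin': 0.3, 'heard_dest': 0.7},
}

def update_belief(belief, observation):
    """Bayes update: new belief is proportional to P(obs | state) * prior."""
    unnormalised = {s: OBSERVATION_MODEL[s][observation] * belief[s] for s in STATES}
    total = sum(unnormalised.values())
    return {s: p / total for s, p in unnormalised.items()}

belief = {s: 1.0 / len(STATES) for s in STATES}   # start out maximally uncertain
belief = update_belief(belief, 'heard_dest')
print(belief)   # e.g. {'wants_origin': 0.22, 'wants_dest': 0.78}

A POMDP policy then maps this belief distribution, rather than a single known state, to an action, which is one reason POMDPs are harder to train than MDPs.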

