In Section 2.9.1, reinforcement learning was discussed as an alternative to planning for dialogue. Both approaches derive a dialogue strategy by training on dialogue data, so it is natural to compare them. It might be argued that reinforcement learning eliminates much of the complexity involved in using plan rules, by defining the form of the dialogue in terms of states and state transitions rather than a mental state and plan rules. Learning a policy is then straightforward, and might appear to be the best that can be done with the available dialogue evidence. However, Chapter 4 showed that example dialogue problems can be specified quite easily with the planner. To compete with reinforcement learning, the planner need only produce better dialogues given the same amount of training material.
Reinforcement learning ignores much of the information that ought to be shared between states. Consider, for example, the flight booking problem given in Chapter 4 and the game tree for that problem in figure 4.15. The precondition intend(book-flight-window) appears at many different choice nodes in the game tree, so in many different states the agent has the opportunity to gather evidence for that precondition. An MDP system, on the other hand, allows no such sharing of information between states: it trains only on dialogues whose path passes through a given state. In effect, the planner draws training data from three different states where the MDP system uses only one. Other examples are easy to find. Two robots assembling a car, for instance, would quickly learn which of each other's plans are applicable and inapplicable by inferring preconditions: if robot 1 sees robot 2 use a particular tool for one task, and having that tool is also a precondition of a second task, then robot 1 already has some evidence about the applicability of the second task without ever having experienced an instance of it. Reinforcement learning, by contrast, would learn about the second task only from direct evidence of its execution. For problems like this, in which agents must try plans they have never seen tried before, planning rather than reinforcement learning is the appropriate choice.
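The evidence-sharing argument can be sketched concretely. In the toy Python fragment below, three dialogue states all test the same precondition; a planner-style learner pools observations by precondition, while an MDP-style learner keeps a separate record per state. All names here (the state labels, the data structures) are illustrative assumptions, not interfaces from the thesis.

```python
from collections import defaultdict

# Illustrative sketch: three distinct dialogue states whose choice
# nodes all test the precondition intend(book-flight-window).
PRECONDITION = "intend(book-flight-window)"
STATE_PRECONDITIONS = {"s1": PRECONDITION, "s2": PRECONDITION, "s3": PRECONDITION}

planner_evidence = defaultdict(list)  # planner pools evidence by precondition
mdp_evidence = defaultdict(list)      # MDP learner keeps evidence per state

def observe(state, holds):
    """Record one observation that the precondition held (or not) in `state`."""
    planner_evidence[STATE_PRECONDITIONS[state]].append(holds)
    mdp_evidence[state].append(holds)

# Suppose the training dialogues only ever pass through state s1.
for holds in [True, True, False, True]:
    observe("s1", holds)

# The planner can bring all four observations to bear at s2...
print(len(planner_evidence[STATE_PRECONDITIONS["s2"]]))  # 4
# ...while the MDP learner has seen nothing at s2.
print(len(mdp_evidence["s2"]))                           # 0
```

The design point is simply that indexing evidence by precondition rather than by state lets every state sharing that precondition benefit from each observation.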
The question of which approach is better often depends on the amount of training data available. If training data is inexpensive, or if the number of states is small, an MDP may be the better choice because of its simplicity. If training data is expensive, the planner may well be the better candidate. Training material is often obtained by running the system with real users, so the expense of training is a real consideration.
It is proposed that an experiment be carried out to compare the two approaches on a suitable practical example, perhaps the flight booking problem seen in example 1 of Chapter 4. This could be run as a simulation: randomly generated belief states would be fed to the planner to produce artificial dialogues, and these dialogues would then serve as training material both for a system based on reinforcement learning and for the ordinary planner. Each would be evaluated in simulated dialogue against the ordinary planner. The experiment might show that the performance of the reinforcement learning approach is worse than that of the planner. Better still, training and testing could be done with real users, especially since the planner might perform particularly well when the dialogue partner is another instance of the planner. It would be important to choose a good distribution of problems, since each style of planning is suited to a particular kind of problem.
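The shape of the proposed simulation experiment might be sketched as follows. Everything in this fragment is a stand-in of my own devising (the belief-state representation, the planner, the trivial "RL" learner, and the scoring function are not specified in the thesis); it shows only the structure of the loop: generate belief states, let the planner produce artificial dialogues, train on those dialogues, then evaluate both strategies in simulation.

```python
import random

random.seed(0)

def random_belief_state(n=4):
    """Stand-in belief state: n randomly true/false propositions."""
    return tuple(random.random() < 0.5 for _ in range(n))

def planner_act(belief):
    """Stand-in 'ordinary planner': act on every believed proposition."""
    return [i for i, b in enumerate(belief) if b]

def make_training_dialogue(belief):
    """Step 1: the planner produces an artificial dialogue from a belief state."""
    return belief, planner_act(belief)

def train_rl(dialogues):
    """Step 2: stand-in RL learner -- it only acts on propositions it has
    seen acted on in a training dialogue (no sharing across states)."""
    seen = set()
    for _belief, acts in dialogues:
        seen.update(acts)
    return lambda belief: [i for i, b in enumerate(belief) if b and i in seen]

def score(policy, beliefs):
    """Step 3: evaluate a policy over simulated dialogues (higher is better)."""
    return sum(len(policy(b)) for b in beliefs)

training = [make_training_dialogue(random_belief_state()) for _ in range(5)]
rl_policy = train_rl(training)
test = [random_belief_state() for _ in range(100)]
print(score(planner_act, test), score(rl_policy, test))
```

By construction the stand-in RL policy can never outscore the planner here, since it acts on a subset of the planner's actions; a real experiment would of course use genuine dialogue strategies and a principled evaluation measure.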