Stolcke et al. [70] showed that information about dialogue context can slightly improve speech recognition performance. Their approach was to use dialogue act classification based on n-grams to predict the dialogue act class of the next utterance. Once a distribution over classes for the utterance is predicted, a language model is determined by mixing language models for each act type in the proportions of the distribution. It may be possible to apply a similar method by using expectations generated by the planner to obtain a model of the user's expected utterance.
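As a rough illustration (not taken from [70]), the mixing step might look like the following Python sketch. The names ACT_PRIOR, ACT_LM, and mixed_lm_prob are invented, and unigram models stand in for the n-gram models a real system would use:

```python
# Hypothetical act-conditioned language-model mixture, in the style of
# Stolcke et al. [70]. All probabilities here are toy values.

# Predicted distribution over dialogue act classes for the next utterance,
# e.g. from an n-gram discourse model (or, here, from the planner).
ACT_PRIOR = {"statement": 0.5, "question": 0.3, "backchannel": 0.2}

# Per-act unigram language models (illustrative numbers only).
ACT_LM = {
    "statement":   {"the": 0.20, "ball": 0.10, "red": 0.05},
    "question":    {"the": 0.15, "ball": 0.05, "red": 0.02},
    "backchannel": {"the": 0.01, "ball": 0.01, "red": 0.01},
}

def mixed_lm_prob(word: str) -> float:
    """P(word) under the mixture: sum over acts of P(act) * P(word | act)."""
    return sum(p_act * ACT_LM[act].get(word, 1e-6)
               for act, p_act in ACT_PRIOR.items())

print(mixed_lm_prob("ball"))  # mixture probability used to rescore hypotheses
```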
Here is an outline of the mechanism that could be used. A typical speech recognition system consists of a hidden Markov model (HMM) acoustic model combined with a language model for word sequences, which together determine the probability $P(U \mid A)$ of the given signal $U$ given the dialogue act $A$. The planner estimates the belief state $B$, and, through the planning mechanism, the conditional dependence $P(A \mid B)$ of the dialogue act on the belief state. Therefore, the conditional dependence of the utterance signal on the belief state can be obtained as $P(U \mid B) = \sum_A P(U \mid A)\, P(A \mid B)$, given a model for producing the signal from the act. To perform belief revision, the conditional dependence $P(B \mid U)$ of the belief state upon the utterance signal is required. This can be computed using Bayes' rule as follows:
$$P(B \mid U) = \frac{P(U \mid B)\, P(B)}{P(U)} \qquad (6.1)$$
The only unknown in this formula is $P(U)$, but since it does not vary with the belief state, it cancels out when only the relative probabilities of belief states are needed.
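A minimal Python sketch of how equation (6.1) could be implemented follows. The dictionary representations and the names likelihood and revise_beliefs are assumptions for illustration, not part of the mechanism described above:

```python
def likelihood(u, b, p_u_given_a, p_a_given_b):
    """P(u | b): marginalise the dialogue act A out of P(u | a) * P(a | b)."""
    return sum(p_u_given_a[a][u] * p_a_given_b[b][a] for a in p_a_given_b[b])

def revise_beliefs(u, prior, p_u_given_a, p_a_given_b):
    """Equation (6.1): P(b | u) = P(u | b) P(b) / P(u).

    P(u) is obtained by summing P(u | b) P(b) over all belief states,
    which is why it cancels when only relative probabilities matter.
    """
    joint = {b: likelihood(u, b, p_u_given_a, p_a_given_b) * p_b
             for b, p_b in prior.items()}
    p_u = sum(joint.values())  # the normalising constant P(U)
    return {b: j / p_u for b, j in joint.items()}
```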
As an example, consider a belief state in which the hearer believes that the speaker intends to have the red ball at $l_1$, and the blue ball at $l_2$, and suppose that there are only two signals, "red" and "blue". Suppose the system picks up the signal "red". The acoustic model might give $P(\text{"red"} \mid intend(red)) = 0.8$, $P(\text{"red"} \mid intend(blue)) = 0.2$, $P(\text{"blue"} \mid intend(red)) = 0.2$, and $P(\text{"blue"} \mid intend(blue)) = 0.8$. Starting from equal priors, the revised belief would then be $P(intend(red)) = 0.8$ and $P(intend(blue)) = 0.2$. Notice that unlike before, where beliefs were revised to 0 or 1, this revision reflects the error and ambiguity in the meaning of the signal.
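Run through the sketch above, the example reproduces these numbers. The act model is assumed to be deterministic (each intention produces exactly one act), so all of the confusion sits in the acoustic model; the labels say_red and say_blue are hypothetical:

```python
# The red/blue example, using revise_beliefs() from the sketch above.
p_a_given_b = {"intend(red)":  {"say_red": 1.0},
               "intend(blue)": {"say_blue": 1.0}}
p_u_given_a = {"say_red":  {"red": 0.8, "blue": 0.2},
               "say_blue": {"red": 0.2, "blue": 0.8}}
prior = {"intend(red)": 0.5, "intend(blue)": 0.5}

print(revise_beliefs("red", prior, p_u_given_a, p_a_given_b))
# -> {'intend(red)': 0.8, 'intend(blue)': 0.2}
```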
In a noisy environment, the planner could respond to such weak revisions with clarification subdialogues chosen on the basis of value of information, with cut-off points for clarification appearing near 0 and 1. Choices could also be made between more and less risky types of signal: for instance, "pass the red ball" would have a higher probability given intend(red) than just saying "red". The planner might go so far as to test different strategies by generating a speech signal using text-to-speech and feeding it straight into the speech recogniser, performing belief revision at the next level in the game tree. This would be useful for predicting how much information the subdialogue is likely to provide, so that its value of information can be calculated.
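To make the cut-off behaviour concrete, here is a hedged sketch of a value-of-information test; the utility scale and the clarification cost are invented, and the point is only to show why clarification pays for middling beliefs but not near 0 or 1:

```python
# Hypothetical value-of-information test for a two-way belief.

def expected_utility(p: float) -> float:
    """Expected utility of acting now on belief P(intend(red)) = p,
    assuming a correct act is worth 1 and a wrong act is worth 0."""
    return max(p, 1.0 - p)  # act on whichever intention is more likely

def value_of_clarification(p: float) -> float:
    """Expected gain from a clarification that fully resolves the
    intention: after a perfect answer the utility is 1."""
    return 1.0 - expected_utility(p)

COST = 0.1  # cost of a clarification subdialogue (invented)

for p in (0.05, 0.2, 0.5, 0.8, 0.95):
    print(f"P = {p:.2f}: clarify = {value_of_clarification(p) > COST}")
# Clarification is chosen only for middling beliefs; near 0 and 1 the
# value of information drops below the cost, giving the cut-off points.
```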
In the spirit of the dry-land algorithm, where a search is made for the belief state that maximises the probability of an observed act, the planner could be combined with an acoustic model to search for the belief state that maximises the probability of the observed signal. This approach throws away the remainder of the "n-best" list typically used in speech recognition, which could be useful on subsequent turns; on the other hand, it is a useful simplification that conforms with the current design of the planner.
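As a sketch, this search could reuse the likelihood function and toy models from the revision example above, returning only the single maximising belief state and discarding the rest; best_belief is an illustrative name, not part of the dry-land algorithm itself:

```python
def best_belief(u, candidate_beliefs, p_u_given_a, p_a_given_b):
    """Return the b that maximises P(u | b), discarding the full
    posterior (and hence the rest of the n-best list)."""
    return max(candidate_beliefs,
               key=lambda b: likelihood(u, b, p_u_given_a, p_a_given_b))

print(best_belief("red", ["intend(red)", "intend(blue)"],
                  p_u_given_a, p_a_given_b))  # -> 'intend(red)'
```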