Abstract
We present an architecture for integrating real-time, multimodal input into a
computational agent's contextual model. Using a human-avatar interaction in a
virtual world, we treat aligned gesture and speech as an ensemble where content
may be communicated by either modality. With a modified nondeterministic
pushdown automaton architecture, the computer system: (1) consumes input
incrementally using continuation-passing style until it achieves sufficient
understanding of the user's aim; (2) constructs and asks questions where necessary
using established contextual information; and (3) keeps track of prior
discourse items using multimodal cues. This type of architecture supports
special cases of pushdown and finite state automata and can integrate
outputs from machine learning models. We present examples of this
architecture's use in multimodal one-shot learning interactions of novel
gestures and live action composition.
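
The three behaviors listed above can be illustrated with a minimal sketch. It is not the system described here: it is deterministic rather than nondeterministic, it only approximates continuation-passing style by having each step return the function that should handle the next input, and every name in it (Token, Agent, expect_action, expect_object, run, ACTIONS, OBJECTS) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical vocabularies, assumed for illustration only.
ACTIONS = {"grab", "push"}
OBJECTS = {"block", "cup"}


@dataclass
class Token:
    modality: str  # "speech" or "gesture"
    content: str   # recognized content, whichever modality carried it


@dataclass
class Agent:
    stack: List[str] = field(default_factory=list)      # prior discourse items
    questions: List[str] = field(default_factory=list)  # questions asked so far


# A step consumes one token and returns the step for the next token,
# or None once the user's aim is sufficiently understood.
Step = Callable[[Agent, Token], Optional[Callable]]


def expect_action(agent: Agent, token: Token) -> Optional[Callable]:
    if token.content in ACTIONS:
        agent.stack.append(token.content)
        return expect_object
    agent.questions.append("What should I do?")
    return expect_action


def expect_object(agent: Agent, token: Token) -> Optional[Callable]:
    # The object may arrive by speech or by gesture; modality is not checked.
    if token.content in OBJECTS:
        agent.stack.append(token.content)
        return None
    # Build the question from context already on the stack.
    agent.questions.append(f"Which object should I {agent.stack[-1]}?")
    return expect_object


def run(agent: Agent, tokens: List[Token]) -> bool:
    """Feed tokens one at a time; return True if the aim was understood."""
    step: Optional[Callable] = expect_action
    for token in tokens:
        if step is None:
            break
        step = step(agent, token)
    return step is None


agent = Agent()
understood = run(agent, [
    Token("speech", "grab"),
    Token("speech", "uh"),      # uninformative input triggers a question
    Token("gesture", "block"),  # the object is supplied by gesture
])
```

In this sketch the stack plays the role of the pushdown store, and the question is constructed from the action already recorded on it.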