Abstract
In this paper, we introduce an architecture for multimodal communication between humans and computers engaged in a shared task. We describe a representative dialogue between an artificial agent and a human that will be demonstrated live during the presentation. This assumes a multimodal environment and semantics for facilitating communication and interaction with a computational agent. To this end, we have created an embodied 3D simulation environment enabling both the generation and interpretation of multiple modalities, including: language, gesture, and the visualization of objects moving and agents performing actions. Objects are encoded with rich semantic typing and action affordances, while actions themselves are encoded as multimodal expressions (programs), allowing for contextually salient inferences and decisions in the environment.