Abstract
The demand for more sophisticated and natural human-computer and human-robot interactions is rapidly increasing as users become accustomed to conversation-like interactions with their devices. Meeting this demand requires not only the robust recognition and generation of expressions through multiple modalities (language, gesture, vision, action), but also the encoding of situated meaning: (a) the situated grounding of expressions in context; (b) an interpretation of each expression contextualized to the dynamics of the discourse; and (c) an appreciation of the actions and consequences associated with objects in the environment. In this paper, we introduce VoxWorld, a multimodal simulation platform for modeling human-computer interactions. It is built on the language VoxML and offers a rich platform for studying the generation and interpretation of expressions conveyed through multiple modalities, including language, gesture, and the visualization of objects moving and agents acting in their environment.