Abstract
The ability to understand and model human-object interactions is becoming increasingly important in advancing the field of human-computer interaction (HCI). To maintain more effective dialogue, embodied agents must utilize situated reasoning: the ability to ground objects in a shared context and understand their roles in the conversation [35]. In this paper, we argue that developing a unified multimodal annotation schema for human actions, in addition to gesture and speech, is a crucial next step towards this goal. We develop a new approach for visualizing annotation schemas such as Gesture AMR [5] and VoxML [33] by simulating their output with VoxWorld [21] in the context of a collaborative problem-solving task. We discuss the implications of this method and propose a novel testing paradigm that uses the generated simulation to validate these annotations for accuracy and completeness.