Abstract
Natural language descriptions of visual media present interesting problems for linguistic annotation of spatial information. This paper
explores the use of ISO-Space, an annotation specification to capturing spatial information, for encoding spatial relations mentioned in descriptions of images. Especially, we focus on the distinction between references to representational content and structural components of images, and the utility of such a distinction within a compositional semantics. We also discuss how such a structure-content distinction within the linguistic annotation can be leveraged to compute further inferences about spatial configurations depicted by images with verbal captions.