Abstract
Scene recognition is a computer vision task that aims to understand images at a higher level than the better-known task of object recognition. With recent advances in GPU-based learning and large-scale data sets, neural networks have been able to learn representations directly from images that result in very low error rates on data sets with hundreds of scene categories. Scene understanding is a crucial step towards human-like visual knowledge, as it contextualizes object recognition, a task on which modern systems have surpassed human performance.

This thesis adapts scene recognition from vision to language. We leverage image data to provide a visual semantic grounding for text in order to build a corpus for event localization, a novel classification task that aims to learn where the events described by sentences take place. First, we describe the annotation methodology used to construct the corpus and the steps taken to validate the data. We then describe the most important characteristics of the corpus. Finally, we present results from several classification models, both feature-based and neural, some of which achieve nearly 70% accuracy on the 13-way classification task presented here.

Our experiments show that training on the Event Localization Corpus allows classifiers to predict event locations accurately even when information about the location is not explicit in the sentence.