Abstract
Many tasks can benefit from the use of multimodal resources, such as vision-languagedatasets, which include images along with textual pieces of information. A common practice is to combine both the visual and textual representations into multimodal models, yet this approach is expensive and not necessarily essential. This work examines the way in which the textual-only parts of visual datasets, which we call “visual corpora”, capture the visualizability of language components, and makes use of their unique properties to solve related NLP tasks. We demonstrate the power of visual corpora in solving tasks like determining the concreteness level of words and phrases and the task of metaphor detection. Relative to multimodal models, working with only textual data has the clear advantage of a lower time complexity, as well as a lower risk of introducing noise originated in visual data.