Abstract
In this paper, we consider a novel research problem, music-to-text
synaesthesia. Unlike the classical music tagging problem, which classifies
a music recording into pre-defined categories, music-to-text
synaesthesia aims to generate descriptive texts from music recordings for
further understanding. Although this is a new and interesting application for
the machine learning community, to the best of our knowledge, existing
music-related datasets do not contain semantic descriptions of music
recordings and cannot serve the music-to-text synaesthesia task. In light of
this, we collect a new dataset that contains 1,955 aligned pairs of classical
music recordings and text descriptions. Based on this, we build a computational
model to generate sentences that can describe the content of the music
recording. To handle the highly non-discriminative nature of classical music, we design a
group topology-preservation loss in our computational model, which considers
more samples as a group reference and preserves the relative topology among
different samples. Extensive experimental results qualitatively and
quantitatively demonstrate the effectiveness of our proposed model over five
heuristic or pre-trained competitive methods and their variants on our
collected dataset.