Abstract
As scientific challenges become increasingly complex and data sources more diverse, there is a growing need for learning systems that move beyond purely data-driven methods. In domains such as chemistry and materials science, where data can be scarce, heterogeneous, or deeply tied to domain expertise, conventional models often struggle to produce accurate and interpretable results. This dissertation addresses these challenges by investigating unified scientific representation learning through the integration of multimodal data and relational similarity learning. It explores three interrelated directions that reflect a progression from unimodal analysis to full multimodal fusion. The first part focuses on predicting spatial orientation from image modalities, demonstrating how pixel information can guide models in learning directional patterns. The second part centers on multimodal alignment in chemistry, using contrastive learning to align molecular graphs with spectral data and ensure consistent, enriched representations across data types. The third part introduces a framework for multimodal fusion in molecular property prediction, showing how integrating multiple modalities within a graph-based architecture captures both local and global relationships critical for generalization. Collectively, these contributions advance the development of flexible, modality-aware representation learning systems that improve robustness, interpretability, and predictive performance in scientific machine learning.