Abstract
This work investigates whether open-ended prompts can evaluate large language models' (LLMs') understanding of word meaning in context, focusing specifically on polysemy and homonymy. The generation capabilities of LLMs are well known, but it is still unclear whether they understand the choices they make and why one generation is produced over another. Traditional evaluations of LLMs often rely on fixed-alternative benchmarks that risk measuring memorization rather than semantic competence. The open-ended prompting approach proposed in this thesis probes LLM knowledge across four tasks designed to test word sense discrimination, definition generation, and contextual substitution. Using a subset of noun senses from WordNet, responses were collected from four instruction-tuned open-weight models (LLaMA 3.2B, Mistral 7B, Gemma 4B, and DeepSeek R1) and evaluated using the same models in an LLM-as-a-Judge framework. Results indicate that while Gemma and Mistral outperform the other models in task performance and consistency, significant biases emerge in judgment behavior, including self-enhancement and format preferences. This illustrates that the models' behavior reflects not only knowledge of word meaning but also artifacts from training. These findings underscore the complexity of evaluating meaning understanding in LLMs and raise important questions about the reliability of LLM-based evaluations. The study concludes with a discussion of methodological implications and outlines future directions, including embedding-based analyses and the exploration of reasoning trajectories in LLM outputs.