Abstract
Although research exists on automatic summarization and automatic categorization, none of it has focused on the categorization of speaker sentiments. In this thesis, I attempt to fill this gap. To provide a usable framework of speaker sentiments for the Talmud, I implement a tagging system that applies machine learning to human annotations and uses the resulting models to tag segments of the Talmud for their basic purpose, speaker, target, and point of reference within a longer statement. For the training corpus, an annotation schema was designed that allows annotators to segment the Talmud and choose a purpose, speaker, target, and instance number for each segment. To increase the usefulness of the tagged data as a mark-up framework for the Talmud, I created a closed set of nine mutually exclusive purposes ranging from types of disagreement to types of agreement. The corpus itself consists of 50 pages of annotated Talmud, comprising approximately 3,000 segments. Two annotators annotated it independently, and the author adjudicated their annotations to create a gold-standard version of the corpus. Tagging was performed with both Conditional Random Field and Maximum Entropy taggers: the former where the tagging setup allowed adjacent feature sets to appear in the same order as their corresponding segments in the Talmud, and the latter where it did not. Purpose and speaker tagging were evaluated using accuracy, entropy, and a set-matching F-score; target tagging was evaluated using precision, recall, and a standard F-score.
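As a rough illustration of the target-tagging measures named above, the following minimal sketch computes precision, recall, and a balanced F-score over sets of (segment, target) pairs. It assumes that the F-score meant here is the usual harmonic mean of precision and recall (F1); the function name, the pair representation, and the sample targets are hypothetical and are not taken from the thesis.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, F1) for two collections of labels.

    Assumption: the F-score is the balanced F1, i.e. the harmonic
    mean of precision and recall.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: (segment id, target) pairs, invented for illustration.
gold = {(1, "Speaker A"), (2, "Speaker B"), (3, "Speaker A"), (4, "Speaker C")}
predicted = {(1, "Speaker A"), (2, "Speaker C"), (3, "Speaker A")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case two of the three predicted pairs match the gold set, and two of the four gold pairs are recovered, so precision exceeds recall and the F-score falls between them.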