Abstract
Semantic textual similarity (STS) is a fundamental NLP task that measures the
semantic similarity between a pair of sentences. To reduce the ambiguity
inherent in the sentences, recent work has proposed Conditional STS (C-STS),
which measures the sentences' similarity conditioned
on a certain aspect. Despite the popularity of C-STS, we find that the current
C-STS dataset suffers from various issues that could impede proper evaluation
on this task. In this paper, we reannotate the C-STS validation set and observe
annotator discrepancy on 55% of the instances, resulting from annotation
errors in the original labels, ill-defined conditions, and a lack of clarity
in the task definition. After a thorough dataset analysis, we improve the C-STS
task by leveraging the models' capability to understand the conditions under a
QA task setting. Using the generated answers, we present an automatic error
identification pipeline that identifies annotation errors in the
C-STS data with an F1 score above 80%. We also propose a new method that
substantially improves performance over baselines on the C-STS data by training
the models with the answers. Finally, we discuss conditionality annotation based
on the typed-feature structure (TFS) of entity types. We show through examples that
the TFS is able to provide a linguistic foundation for constructing C-STS data
with new conditions.