Beyond Benchmarks: Building a Richer Cross-Document Event Coreference Dataset with Decontextualization
Conference paper   Open access

Jin Zhao, Jingxuan Tu, Bingyang Ye, Xinrui Hu, Nianwen Xue and James Pustejovsky
2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Albuquerque, NM, 04/29/2025–05/04/2025)

Abstract

Cross-Document Event Coreference (CDEC) annotation is challenging and difficult to scale, so existing datasets are small and lack diversity. We introduce a new approach to CDEC annotation that simplifies the document-level annotation task to labeling sentence pairs by leveraging large language models (LLMs) to decontextualize event mentions. This enables the creation of the Richer EventCorefBank (RECB), a denser and more expressive dataset annotated at a faster speed. We show that decontextualization improves annotation speed without compromising quality and enhances model performance. Our baseline experiment indicates that systems trained on RECB achieve comparable results on the EventCorefBank (ECB+) test set, demonstrating the high quality of our dataset and its generalizability to other CDEC datasets. In addition, our evaluation shows that existing state-of-the-art CDEC models that perform well on other CDEC datasets still struggle on RECB. This suggests that the richness and diversity of RECB present significant challenges to existing CDEC systems and that there is much room for improvement. All the data and source code are publicly available.

