Extending and Scaling up the Chinese Treebank Annotation

Nianwen Xue; Xiuhong Zhang

Conference paper

Open access

2nd CIPS-SIGHAN Joint Conference on Chinese Language Processing, 2 (Tianjin, China, 12/21/2012 - 12/22/2012)

12/2012

Abstract

Treebanks (Linguistics)

Linguistic Annotation

Chinese Language or Literature

Computational Linguistics

We discuss ongoing efforts to scale up the Chinese Treebank annotation and to extend Chinese treebanking to informal genres such as conversational speech, newsgroups, weblogs, and discussion forums. The original Chinese Treebank annotation scheme was designed for formal genres such as newswire and magazine articles, where the language is very formal and each document is carefully edited. When moving to informal genres, we can no longer assume that the data is error-free, and we have to extend the annotation scheme to account for disfluencies. We show that the disfluencies can be classified into a finite set of categories, consistent with what has been reported in the theoretical linguistics literature. Treebanking is also a time-consuming process that requires extensive linguistic training of annotators, and the limited pool of qualified treebankers is a major obstacle for large-scale treebanking efforts. To address this bottleneck, we implemented a procedure that decomposes the treebanking process into five self-contained steps. In so doing, we reduced the cognitive load on the annotators at each step and thus enlarged the annotator pool, and we show that we are able to increase the throughput by 30%.

Files and links (1)

Paper text: CC BY-NC V3.0, Open

Details

Title: Extending and Scaling up the Chinese Treebank Annotation
Creators: Nianwen Xue (Author) - Brandeis University, Michtom School of Computer Science; Xiuhong Zhang (Author) - Brandeis University
Conference: 2nd CIPS-SIGHAN Joint Conference on Chinese Language Processing, 2 (Tianjin, China, 12/21/2012 - 12/22/2012)
Number of pages: 8
Identifiers: 9924148844501921
Academic Unit: Benjamin and Mae Volen National Center for Complex Systems; Interdepartmental Program in Linguistics and Computational Linguistics; Michtom School of Computer Science
Language: Chinese; English
Resource Type: Conference paper
