Abstract
Accurate identification of phrasal translation equivalents is critical to both phrase-based and syntax-based machine translation systems. We show that the extraction of many phrasal translation equivalents is made impossible by word alignments done without taking syntactic structures into consideration. To address the problem, we propose a new annotation scheme where word alignment and the alignment of non-terminal nodes (i.e., phrases) are done simultaneously to avoid conflicts between word alignments and syntactic structures. Relying on this new alignment approach, we construct a Hierarchically Aligned Chinese-English Parallel Treebank (HACEPT), and show that all phrasal translation equivalents can be automatically extracted based on the phrase alignments in HACEPT.