Abstract
Tree-based machine translation has attracted attention for its ability to handle long-distance reordering and discontinuous expressions. Inferring a synchronous context-free grammar (SCFG) from syntactic trees has also become popular, since it extracts cleaner linguistic phrases and has the potential to improve SCFG quality. However, the incompatibility between word alignment and subtree alignment, as well as the ambiguity among multiple subtree alignment points, hampers the extraction of high-quality rules.

In this thesis, we present a new rule extraction approach for Hiero-like models that uses syntax trees as constraints. We directly use a hierarchically aligned Chinese-English parallel treebank (HACEPT) as a training resource to infer a minimal but effective SCFG. We first explore how to obtain a cleaner word alignment that prevents correct subtree pairs from being filtered out. We then propose an efficient and effective approach to obtaining a minimal subtree alignment that yields high-quality hierarchical rules. We also analyze the impact of our rule extraction approach on machine translation quality using a large-scale parallel data set.

Our experiments show that our approach can reduce the size of Hiero’s rule table by almost 90%, with only a slight loss of translation quality as measured by BLEU score.