Abstract
Ancestral sequence reconstruction (ASR) is a powerful approach for testing hypotheses in molecular evolution. Using a maximum-likelihood framework, researchers can infer plausible ancestral protein sequences and resurrect them in the laboratory for experimental study. ASR depends critically on a model of sequence evolution, which helps estimate relationships among sequences in a multiple sequence alignment (MSA) and determine the most likely ancestral states. To reduce inaccuracies arising from model misspecification, the field relies on likelihood-based criteria to evaluate and rank phylogenies produced by competing evolutionary models. However, these criteria are typically applied only within data sets of fixed size, limiting their use in evaluating how alignment choices or model types (e.g., codon vs. amino acid) influence reconstruction accuracy.
Although often treated as raw data in phylogenetic analysis, an MSA is itself a model – one that defines residue homology across sites, similar to how a phylogeny defines homology across sequences. The placement of gaps in a sequence alignment reflects hypotheses about indel events and site homology. Different alignment algorithms resolve indels differently, producing MSAs of varying lengths. These differences in data set size complicate model comparison using likelihood-based metrics. Consequently, researchers have turned to alternative approaches, including simulation-based methods and structure-based alignment scoring, to assess MSA quality for phylogenetic inference. However, simulation methods remain constrained by the assumptions of their underlying models, and structural homology is not necessarily equivalent to sequence homology. Thus, the extent to which MSA choice affects ASR accuracy remains an open question.
Protein-coding sequences can be accurately encoded as either amino acid or codon strings. Although the two representations are related via the genetic code, codon models capture additional evolutionary signals – such as synonymous substitutions and selective pressures on protein sequences – that are not accessible to amino acid models. In principle, codon-based analyses should provide richer information. However, because codon data sets contain more information and are three times larger than their amino acid counterparts, likelihood-based comparisons again become problematic due to data set size effects. This creates a barrier to systematically comparing codon and amino acid models, phylogenies, or ancestral reconstructions.
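As a small illustration of how the two encodings relate, the following Python sketch translates codon strings using the standard genetic code. The two example sequences are hypothetical and were chosen so that they differ by a single synonymous substitution (GCC vs. GCT, both alanine); the function and variable names are illustrative assumptions and not part of this work.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt_seq):
    """Translate an in-frame DNA coding sequence into an amino acid string."""
    if len(nt_seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    codons = [nt_seq[i:i + 3] for i in range(0, len(nt_seq), 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical sequences differing by one synonymous change (GCC -> GCT).
seq_a = "ATGGCCAAA"
seq_b = "ATGGCTAAA"

print(translate(seq_a), translate(seq_b))
print(len(seq_a), len(translate(seq_a)))
```

Both sequences translate to the same amino acid string, so the substitution is invisible at the amino acid level, while each codon string contains three times as many characters as its translation.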
Cross-validation methods offer a robust alternative. Unlike raw likelihood, cross-validation supports statistical comparison across models trained on data sets of different sizes. Extant sequence reconstruction (ESR) is a cross-validation framework developed for ASR. ESR exploits the time-reversibility of most common evolutionary models to reconstruct modern sequences using the same methodology as ASR. Because the same biases affect reconstructions of both extant and ancestral sequences, ESR provides a way to evaluate ASR accuracy indirectly but reliably.
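The general idea of scoring a held-out sequence can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the ESR procedure itself: it assumes that a reconstruction has already produced, for each site of a held-out extant sequence, a posterior distribution over residues inferred without that sequence, and it scores the reconstruction by the log-probability assigned to the true residues. The function name and the posterior values are hypothetical.

```python
import math


def heldout_log_prob(true_seq, site_posteriors):
    """Log-probability of a held-out sequence under per-site posteriors.

    true_seq: the known extant sequence (one character per site).
    site_posteriors: one dict per site mapping residue -> probability,
        assumed to come from a reconstruction that excluded true_seq.
    """
    if len(true_seq) != len(site_posteriors):
        raise ValueError("need exactly one posterior per site")
    total = 0.0
    for residue, posterior in zip(true_seq, site_posteriors):
        p = posterior.get(residue, 0.0)
        if p <= 0.0:
            # The reconstruction gave the true residue zero probability.
            return float("-inf")
        total += math.log(p)
    return total


# Hypothetical posteriors for the same held-out sequence, as might be
# obtained from two different alignments of the remaining sequences.
true_seq = "MAK"
posteriors_msa1 = [{"M": 0.9, "L": 0.1}, {"A": 0.8, "S": 0.2}, {"K": 0.7, "R": 0.3}]
posteriors_msa2 = [{"M": 0.9, "L": 0.1}, {"A": 0.5, "S": 0.5}, {"K": 0.6, "R": 0.4}]

score1 = heldout_log_prob(true_seq, posteriors_msa1)
score2 = heldout_log_prob(true_seq, posteriors_msa2)
print(score1, score2)
```

Because the score is computed on the same true sequence in both cases, it does not depend on how many columns each alignment contains; in this toy example the first reconstruction assigns the higher probability to the true sequence.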
Using ESR, I demonstrate that the accuracy of ancestral reconstruction depends far more on MSA choice than on the model of amino acid substitutions. Furthermore, I show that current codon models are more prone to misspecification than amino acid models. This analysis suggests that improvements in codon model design – particularly with regard to model fit and flexibility – could significantly enhance the reliability of ASR.