Abstract
Ancestral sequence reconstruction (ASR) has been used to analyze the properties of ancient proteins and to elucidate the processes of molecular evolution that have led to extant proteins. By identifying mutations and the resulting functional changes along protein lineages, applications of ASR have addressed underlying evolutionary questions. Here, we take a novel approach by applying ASR to understand how the mechanism of Cryptosporidium parvum lactate dehydrogenase (CpLDH) evolved from an ancestral malate dehydrogenase (AncMDH). We find that a change in the rate-limiting step appears to accompany an active-site rearrangement in AncMDH that caused a switch in substrate specificity from malate to lactate. This is the first example in which a study elucidates the evolution of an enzyme mechanism.
Two long-standing questions in the ASR field are whether reconstructions are accurate and how different models affect that accuracy. The primary reason these two questions have not been addressed is that we cannot benchmark ancestral sequence reconstructions against a true ancestral sequence for real biological data on an actual geologic timescale. However, we are able to answer these questions by developing two cross-validation (CV) methods and applying them to real extant sequences, which lets us determine whether reconstructions are accurate and how various models of evolution affect accuracy.
Each phylogenetic dataset is unique, and one of the first steps in phylogenetics is to select the evolutionary model that best fits the data. We expect that selecting a more predictive model should positively affect ancestral reconstruction accuracy. However, it is not clear how to interpret and apply some of the more popular model-selection criteria, such as the Akaike information criterion (AIC), which was developed to assess the predictive power of a model. Therefore, to avoid the ambiguity in the AIC, we assess the ability of a chosen evolutionary model to predict each aligned column of extant data given the remaining columns (i.e., column-wise CV). We find that column-wise CV generally prefers less complex models than the AIC, which is significant because under certain restrictive assumptions the AIC and CV are supposed to be equivalent.
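As a rough illustration of the two criteria being compared, they can be written side by side. The notation here is an assumption introduced for this sketch and is not taken from the work itself: $X$ is the alignment with columns $x_1, \dots, x_N$, $k$ is the number of free model parameters, $\hat{\theta}$ is the maximum-likelihood estimate from all columns, and $\hat{\theta}_{-i}$ is the estimate obtained with column $i$ held out. The plug-in form of the column-wise CV score is likewise an assumed simplification.

```latex
% AIC: penalized maximized log-likelihood of the full alignment
\mathrm{AIC} = 2k - 2 \ln p\left(X \mid \hat{\theta}\right)

% Column-wise CV (assumed plug-in form): each column is predicted
% by a model fitted to the remaining columns
\mathrm{CV} = \sum_{i=1}^{N} \ln p\left(x_i \mid \hat{\theta}_{-i}\right)

% On a common scale, -AIC/2 is the quantity compared with CV
-\tfrac{1}{2}\,\mathrm{AIC} = \ln p\left(X \mid \hat{\theta}\right) - k
```

Written this way, the AIC approximates out-of-sample predictive performance with a fixed penalty of $k$, whereas column-wise CV measures it directly by holding out data.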
Each probabilistic phylogenetic model has different parameters that account for varying evolutionary phenomena. Yet we are unable to quantify the effect of these model parameters on ancestral sequence reconstructions because we do not know the true ancestral sequence. To address this problem, we use ASR methodology to calculate the posterior probability distribution for an individual extant sequence given the remaining sequences (i.e., sequence-wise CV), a method we term “extant sequence reconstruction” (ESR), to validate sequence reconstruction accuracy. Thus, we evaluate the accuracy of ASR methodology by comparing extant reconstructions to the corresponding true sequences. We find that a common ASR quality control metric, the average probability of the single most probable (SMP) sequence, is not a reliable indicator of model predictiveness because the average SMP sequence probability generally decreases as the model improves. In contrast, the entropy of the reconstructed distribution is a reliable indicator of the quality of a reconstruction, as the entropy provides an increasingly accurate estimate of the log-probability of the true sequence as model predictiveness increases and the model approaches the true model.
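The three quantities contrasted above can be sketched for a toy reconstruction. This is a minimal illustration and not the authors' implementation: the function names, the list-of-dictionaries data layout, and the assumption that sites are independent (so the sequence entropy and log-probability are sums over sites) are all introduced here for the example.

```python
import math

# Hypothetical data layout: one dict per alignment site, mapping each
# residue to its posterior probability at that site (values sum to 1).

def average_smp_probability(site_posteriors):
    """Mean, over sites, of the probability of the most probable residue."""
    return sum(max(site.values()) for site in site_posteriors) / len(site_posteriors)

def reconstruction_entropy(site_posteriors):
    """Shannon entropy (in nats) of the reconstructed distribution,
    summed over sites under the assumed site independence."""
    return -sum(
        p * math.log(p)
        for site in site_posteriors
        for p in site.values()
        if p > 0.0
    )

def true_sequence_log_probability(site_posteriors, true_sequence):
    """Log-probability that the reconstructed distribution assigns to the
    known (true) extant sequence; -inf if any true residue has probability 0."""
    total = 0.0
    for site, residue in zip(site_posteriors, true_sequence):
        p = site.get(residue, 0.0)
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total

if __name__ == "__main__":
    # Toy three-site reconstruction with a made-up true sequence.
    posteriors = [
        {"A": 0.9, "G": 0.1},
        {"L": 0.5, "I": 0.5},
        {"K": 1.0},
    ]
    truth = "AIK"
    print("average SMP probability:", average_smp_probability(posteriors))
    print("negative entropy:", -reconstruction_entropy(posteriors))
    print("true-sequence log-probability:",
          true_sequence_log_probability(posteriors, truth))
```

In this framing, the negative entropy is the log-probability the model expects for the true sequence, so comparing it with the realized log-probability of a known extant sequence indicates how well calibrated the reconstruction is.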
Both column-wise CV and ESR are useful methods to evaluate evolutionary models used for ASR and can be applied in practice to any phylogenetic analysis of real biological sequences. In the future, we hope to “resurrect” extant sequences to address other long-standing questions in the ASR field.