Resampling Methods for Protein Structure Prediction
Blum, Benjamin Norman
Technical Report Identifier: EECS-2008-184
December 22, 2008
Abstract: Ab initio protein structure prediction entails predicting the three-dimensional conformation of a protein from its amino acid sequence without the use of an experimentally determined template structure. In this thesis, I present a new approach to ab initio protein structure prediction that divides the search problem into two parts: sampling in a space of discrete-valued structural features, and continuous search over conformations constrained to the desired features. Both parts are carried out using Rosetta, a leading structure prediction algorithm. Rosetta is a Monte Carlo energy minimization method requiring many random restarts to find structures near the correct, or "native," structure. Our methods, which we call "resampling" methods, make use of an initial round of Rosetta-generated local minima to learn properties of the energy landscape that guide a subsequent resampling round of Rosetta search toward better predictions. One of the main innovations of this thesis is to attempt to deduce from the initial set of Rosetta models not the entire native conformation but rather a few specific features of the native conformation. Features include backbone torsion angles, per-residue secondary structure, exposure of residues to solvent, and a three-tiered hierarchy of beta pairing features. For each feature there is one "native" value: the one found in the native structure. Native feature values are generally enriched in structures with low energy, as the native structure of a protein is significantly lower in energy than non-native structures and the energy of a protein is to some extent the sum of spatially local contributions. We have developed two methods for feature-space resampling based on this observation. The first method employs feature selection techniques to identify structural feature values that give rise to low energy, which are then enriched in the resampling round. The second, more sophisticated method updates the sampling distribution for all features at once, not just a selected few, by predicting the likelihood that each feature value is native. Our results indicate that both methods, especially the second one, yield structure predictions significantly better than those produced by Rosetta alone.
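The observation that native feature values tend to be enriched among low-energy structures can be illustrated with a small sketch. It is neither of the two methods developed in the thesis, only a simplified frequency count under stated assumptions: each first-round model is reduced to an energy and a dictionary of discrete feature values, and the function name, the low_fraction and pseudocount parameters, and the feature label "ss_10" are all hypothetical.

```python
from collections import Counter, defaultdict


def resampling_distribution(decoys, low_fraction=0.25, pseudocount=1.0):
    """Estimate a per-feature sampling distribution from first-round models.

    decoys: list of (energy, {feature_name: value}) pairs (assumed format).
    Values that are frequent among the lowest-energy models receive more
    probability mass; the pseudocount keeps every observed value possible.
    """
    ranked = sorted(decoys, key=lambda d: d[0])  # lowest energy first
    n_low = max(1, int(len(ranked) * low_fraction))
    low_energy = ranked[:n_low]

    # Every value seen for each feature across the whole first round.
    observed = defaultdict(set)
    for _, features in ranked:
        for name, value in features.items():
            observed[name].add(value)

    # Counts of each value restricted to the low-energy subset.
    counts = defaultdict(Counter)
    for _, features in low_energy:
        for name, value in features.items():
            counts[name][value] += 1

    distribution = {}
    for name, values in observed.items():
        total = sum(counts[name][v] + pseudocount for v in values)
        distribution[name] = {
            v: (counts[name][v] + pseudocount) / total for v in values
        }
    return distribution


if __name__ == "__main__":
    # Toy first round: secondary structure at one residue ("H" or "E").
    toy = [
        (-120.0, {"ss_10": "H"}), (-118.0, {"ss_10": "H"}),
        (-90.0, {"ss_10": "E"}), (-85.0, {"ss_10": "E"}),
        (-80.0, {"ss_10": "E"}), (-75.0, {"ss_10": "E"}),
        (-70.0, {"ss_10": "H"}), (-60.0, {"ss_10": "E"}),
    ]
    print(resampling_distribution(toy))
```

In this toy input the value "H" is the minority overall but dominates the lowest-energy quarter, so it receives most of the probability mass in the resulting distribution. A second round of search would then draw feature values from that distribution and constrain the conformational search accordingly.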