Structural Biochemistry/Proteins/Developing Novel Classifications of Protein Structure

Proposed New Protein Structure Classification

Three scientists in the field of structural biochemistry at the University of California, San Diego (Ruben E. Valas, Song Yang, and Philip E. Bourne) have proposed a new method of protein classification. The proposal responds to two problems. First, a great many macromolecular structures have been solved and many more have yet to be determined, which makes it difficult to assimilate the large amount of structural information available. Second, the present manner of classification seems insufficient to reveal the network of structural lineages that evolution has produced. The authors' strategy is therefore to employ a reductionist approach to better interpret the evolutionary basis of protein structure and the lineages among the diverse populations of such structures.

Two methods of protein classification are widely used today:

Bottom-up Approach

The bottom-up approach uses algorithms to compare proteins based on geometry: how well two structures superimpose, as measured by the root-mean-square deviation (RMSD), together with the length of the alignment, the number of gaps, and a score of statistical significance. The end result is a comparison of protein domains that carries very little biological significance.
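As an illustration of the RMSD measure mentioned above, the following is a minimal sketch in Python. It assumes the two structures have already been aligned and superimposed, so that the i-th atom of one corresponds to the i-th atom of the other; the function name `rmsd` and the toy coordinates are illustrative only and do not come from any particular classification tool.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) coordinates that are assumed to be already superimposed."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy example: every atom in the second set is displaced by 5 units.
structure_1 = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
structure_2 = [(3.0, 4.0, 0.0), (3.0, 4.0, 0.0)]
print(rmsd(structure_1, structure_2))
```

In practice a structure comparison method would first search for the superposition that minimizes this value; that optimization step is left out of the sketch.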

Because of the diversity of methods available, there is usually more than one result for each sequence of amino acids analyzed. One drawback of the bottom-up approach is that, since the primary sequence of amino acids alone does not reveal much about the biological function of the protein, it is impossible to decide which of the results is the most biologically important. The benefit of the bottom-up approach is that it is a useful form of reductionism that gives a representative comparison of different protein domains.

Top-down Approach

Top-down approaches, exemplified by CATH and SCOP, are considered today's gold standard. These methods primarily use homologous sequence comparisons to reflect relationships among different protein domains and, as a result, provide a biological context. The authors hold that this technique can be taken one step further, based on the premise that structural classification is developed as a consequence of the evolutionary links among species. Furthermore, the authors propose that gene duplication, convergence versus divergence, and co-evolution in a functional context should be incorporated into future protein classification.

The protein domain: a good unit of structural classification?

Both the bottom-up and top-down approaches rely on protein domains as the units of comparison. Domains are complicated units. Some domains have similar sequences and are evolutionarily related; some are only vaguely related, with similar structures but different sequences; and some have similar topologies, but not similar enough to establish an evolutionary connection. The basic problem is that a domain can be an evolutionary or a non-evolutionary unit. Many proteins are multi-domain proteins, which further increases the complexity.

The presence of folds, which are considered discrete components in most top-down classifications, further complicates matters. Folds are not a direct result of evolution, but they do provide insight into evolutionary processes. Folds sometimes change during evolution; it is possible for an alpha fold to change into a beta fold through a change in secondary structure. It is also possible to create two peptides with similar sequences but different folds, leading to completely different functions. There are also chameleon sequences that can take on multiple different folds. Because of this structural variation, folds are not suitable units of classification. In essence, whether or not two proteins belong to the same fold is really a matter of semantics, whereas determining which one led to the other evolutionarily actually gives insight into the relationship between the proteins. The evolutionary approach has not been widely used simply because it is more difficult than clustering similar structures.

Examples of Evolutionary Selection

Valas et al. illustrate the prevalence of evolutionary selection by giving two examples that highlight this phenomenon. In the first, Basu et al. found, in a genomic analysis of 28 different eukaryotes, that there were 215 strongly promiscuous domains. Basu et al. define strongly promiscuous domains as those that occur in diverse domain architectures, where these architectures are represented as a linear combination of these domains. "Domain architectures arise through domain shuffling, domain duplication, and domain insertion and deletion leading to new functions." The degree of domain promiscuity depends on how frequently a domain occurs with different domain partners. In the second example, Vogel et al. found an over-representation of certain 2-domain or 3-domain combinations, which were termed "supradomains" or macrodomains. These are structures whose internal domain combinations have remained stable throughout protein evolution. Over 1400 of these macrodomains have been found, and they show a natural associativity that seems to be evolutionarily advantageous.
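The idea that promiscuity depends on how often a domain occurs with different partners can be illustrated with a short Python sketch. This is a simplified proxy, not the scoring scheme used by Basu et al.: it only counts the distinct domains found immediately next to each domain across a set of architectures. The function name `domain_partner_counts` and the example architectures are hypothetical.

```python
from collections import defaultdict

def domain_partner_counts(architectures):
    """Count, for each domain, how many distinct domains appear directly
    adjacent to it across a collection of domain architectures.

    Each architecture is a sequence of domain names in N- to C-terminal order.
    """
    partners = defaultdict(set)
    for arch in architectures:
        for i, domain in enumerate(arch):
            partners[domain]  # make sure single-domain proteins are recorded
            for j in (i - 1, i + 1):
                # Ignore out-of-range neighbours and tandem repeats.
                if 0 <= j < len(arch) and arch[j] != domain:
                    partners[domain].add(arch[j])
    return {domain: len(found) for domain, found in partners.items()}

# Hypothetical architectures, for illustration only.
example = [
    ("SH3", "SH2", "Kinase"),
    ("SH3", "PDZ"),
    ("PH", "SH3"),
    ("Kinase",),
]
counts = domain_partner_counts(example)
for domain in sorted(counts, key=counts.get, reverse=True):
    print(domain, counts[domain])
```

In this toy data set SH3 has the most distinct neighbours and would rank as the most promiscuous domain. A realistic analysis would also need to correct for how abundant each domain is overall.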

Pluralistic Approach to Protein Classification

The protein domain has been the only unit used to evaluate the evolution of protein structure. Although evolutionary analysis of the protein domain alone has proven successful at evaluating protein structure, it seems that other factors are needed to supply the unknown pieces of the evolutionary network. Therefore, the authors propose a pluralistic approach to protein structure classification that incorporates not just domains, but also subdomains, macrodomains, and both convergent and divergent evolution. With regard to subdomains, the authors mention several kinds of subdomain unit that could be important components in connecting the evolutionary network of proteins.

There are many tools that can be used to compare proteins at the subdomain level. One database called Fragnostic facilitates analysis based on fragments from different proteins that share structural and/or sequence similarity. The edges of the fragments are ambiguous; that is, they are not defined as divergent or convergent evolution, but combined with other information the fragments can be tested for structural evolution.

Closed loops are another subdomain unit. Most protein structures consist of loops spanning 25-30 residues. Domain Hierarchy and closed Loops (DHcL) uses van der Waals energies to elucidate domains and closed loops from protein structures. Researchers have discovered that fragments that correspond to closed loops are more likely to form large clusters, which have connections to one another. This description might represent a more detailed view of protein function. Similar closed loops in different structures can be evidence that those structures once shared a common ancestor.
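To make the notion of a closed loop more concrete, the sketch below scans a chain of alpha-carbon coordinates for segments of 25-30 residues whose two ends lie close together in space. This is a simple geometric stand-in and is not the van der Waals energy calculation used by DHcL; the function name `find_closed_loops` and the 10 angstrom end-to-end cutoff are assumptions chosen for illustration.

```python
import math

def find_closed_loops(ca_coords, min_len=25, max_len=30, max_end_dist=10.0):
    """Return (start, end) index pairs of segments whose length is between
    min_len and max_len residues and whose end residues are within
    max_end_dist of each other.

    ca_coords is a list of (x, y, z) alpha-carbon positions, one per residue.
    The default cutoff of 10.0 is an assumed, illustrative value.
    """
    loops = []
    n = len(ca_coords)
    for start in range(n):
        for length in range(min_len, max_len + 1):
            end = start + length - 1
            if end >= n:
                break  # longer segments would run past the chain end
            if math.dist(ca_coords[start], ca_coords[end]) <= max_end_dist:
                loops.append((start, end))
    return loops

# Toy chain: 24 residues on a straight line, then one residue that
# returns to the starting point, closing a 25-residue loop.
chain = [(i * 3.8, 0.0, 0.0) for i in range(24)] + [(0.0, 0.0, 0.0)]
print(find_closed_loops(chain))
```

Overlapping candidates are all reported; a fuller treatment would select a non-redundant set and would score contacts more carefully than a single distance cutoff.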

Another subdomain unit is the functional site. Many different proteins can bind to the same ligand, which implies that they may share a common ancestor that bound the ligand in question. The proteins diverged in structure during evolution, but the functional site remained. SMAP can find functional sites that show both sequence and structural conservation, a clear example of divergent evolution. On the other hand, unrelated proteins can converge on the same ligand. The PROCOGNATE database uses information from the PDB to record which proteins bind to which ligands. A combination of these methods could account for both divergent and convergent evolution.

Besides subdomains, macrodomains can also be used to aid in classification. Divergent evolution is evident in some protein–protein interaction sites (a macrodomain feature). In those cases, while the proteins differentiate over time, the domain interface stays the same. Many of the protein–protein interfaces in the PDB are very similar even though they occur in vastly different proteins.

In essence, a purely domain-based scheme would be less effective, as it could only determine that the proteins evolved from a common ancestor, while an examination that includes analysis of both subdomains and macrodomains would provide an evolutionary hypothesis. One problem facing the pluralistic approach to protein classification is convergent evolution. The fact that two proteins with completely different evolutionary lineages can arrive at very similar structures makes it much harder to connect the protein evolutionary network.

The authors argue that to trace proteins back to the last universal common ancestor (LUCA), it is necessary to look beyond the amino acid sequence, which has been the focus so far, and to incorporate other structural aspects in order to piece together the evolutionary puzzle.