Proteomics/Protein Identification - Mass Spectrometry/Applications for mass spectrometry


Protein Identification

The process of protein identification through mass spectrometry is done in two main ways:

Peptide Mass Fingerprinting
Tandem MS

Peptide mass fingerprinting compares the peptide masses measured in a spectrum against a database of predicted peptide masses. The predicted masses are calculated from the digestion of a list of well-documented proteins. If a significant number of a protein's predicted masses match the experimental values, there is a good chance that the protein is present in the sample. Matrix-Assisted Laser Desorption/Ionization Time-of-Flight (MALDI-TOF) mass spectrometers are the instruments most commonly used for this type of peptide analysis.
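The matching step described above can be sketched in Python. This is a minimal illustration, not a real search engine. The function names (tryptic_digest, peptide_mass, count_matches) and the 0.01 Da tolerance are assumptions made for the example. The cleavage rule is the simplified trypsin rule (cut after K or R unless the next residue is P, no missed cleavages), and the masses are neutral monoisotopic masses rather than the [M+H]+ values an instrument would report.

```python
import re

# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(sequence):
    """Split after K or R, except when the next residue is P (simplified rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed_masses, sequence, tol=0.01):
    """Count observed masses that fall within tol of any predicted peptide mass."""
    predicted = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(obs - pred) <= tol for pred in predicted)
        for obs in observed_masses
    )


# Hypothetical toy sequence and observed masses, for illustration only.
peptides = tryptic_digest("AGKPRSTKV")
hits = count_matches([peptide_mass("STK"), 500.0], "AGKPRSTKV")
```

A real search would score the matches statistically against every protein in the database. Counting hits for a single sequence only shows the principle.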

The other commonly used technique is tandem MS, which relies on collision-induced dissociation. This process breaks peptides along the peptide backbone, and the resulting fragmentation makes it possible to compare the observed fragment masses with a database of predicted masses. A very similar method to the tandem MS approach is peptide fragmentation fingerprinting, or 'PFF'. Instead of collision-induced dissociation, this method uses enzymatic digestion of a single peptide to generate a fragmentation pattern. These fragments are analyzed and compared to a database of observed fragments for the particular enzyme, in a manner similar to the tandem MS approach. Tandem MS and peptide fragmentation fingerprinting can also be used for protein sequencing.
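To make the idea of predicted fragment masses concrete, the sketch below computes the two fragment series most often considered for backbone cleavage by collision-induced dissociation: b ions (N-terminal pieces) and y ions (C-terminal pieces). The b/y nomenclature, the function name fragment_ions, and the restriction to singly charged ions are assumptions of this example and are not taken from the text above.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # y ions retain the C-terminal OH plus one extra H
PROTON = 1.00728   # charge carrier for a singly protonated ion


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b_i holds the first i residues; y_i holds the last i residues.
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


# Hypothetical three-residue peptide, for illustration only.
b_ions, y_ions = fragment_ions("GAK")
```

Each b ion and its complementary y ion add up to the peptide mass plus two protons, which gives a quick consistency check on the predicted lists.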

About four to five peptides are usually adequate to accurately identify a protein with a known amino acid sequence. Surprisingly, as the amount of information contained within databases increases, the amount of information required to identify a specific protein also increases, so the number of peptides required has risen over time. In some cases, a specific peptide fragment, or a few fragments in a rare combination, is unique to a particular type or class of protein. Such a fragment or group of fragments can be used as signature peptides to identify the presence of that protein. Signature fragments exist for both collision-induced dissociation and enzymatic digestion. Recognizing a fragment as an effective signature fragment requires a significant amount of background information.

Protein Sequencing

Tandem MS can be used in conjunction with cleavage methods such as trypsin digestion, which allows a sample to generate overlapping peptides of varying length. The approach is similar to that taken in genome sequencing, where overlapping DNA sequence fragments are generated and assembled into the genome. In the mass spectra, the mass difference between two peaks from overlapping fragments can be used to determine the difference in sequence, based on the well-documented masses of the amino acids. Sequences are usually determined using one or more search algorithms linked to an appropriate database, such as the International Protein Index (IPI) at the European Bioinformatics Institute or the nr (non-redundant) protein sequence database provided by NCBI. Commonly used search algorithms include SEQUEST, X! Tandem, and MASCOT.
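The mass-difference idea can be illustrated with a short Python sketch. It assumes an idealized, complete ladder of peaks in which neighbouring peaks differ by exactly one residue. The function name read_ladder, the 0.02 Da tolerance, and the example peak list are hypothetical. Leucine and isoleucine have identical masses, so the sketch reports them together.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol=0.02):
    """Translate gaps between consecutive peaks into candidate residues."""
    ordered = sorted(peaks)
    residues = []
    for low, high in zip(ordered, ordered[1:]):
        delta = high - low
        # Collect every residue whose mass matches the gap within tol.
        hits = [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - delta) <= tol]
        residues.append("/".join(hits) if hits else "?")
    return residues


# Hypothetical ladder of fragment peaks (m/z), for illustration only.
ladder = [58.02874, 129.06585, 242.14991, 370.24487]
calls = read_ladder(ladder)
```

Real spectra contain missing peaks, noise, and mixed ion series, which is why database search algorithms are normally used instead of reading the ladder directly.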

Quantitative Proteomics

There are several methods available for the quantitation of proteins using mass spectrometry. These typically involve a labeling method to differentiate between two samples, such as two different cell types. In isotopic labeling, heavier isotopes are introduced into one sample while lighter isotopes are used to label the other. Before mass spectrometry analysis, the two differentially labeled samples are mixed. The peptide fragments from each sample can then be distinguished by their mass difference, and relative abundance is derived by comparing the ratio of the peak intensities in the mass spectrum. Several methods are widely employed in quantitative proteomics, including stable isotope labeling with amino acids in cell culture, isotope-coded affinity tags, and metal-coded tags.
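As a rough illustration of the peak-ratio comparison, the Python sketch below turns paired light and heavy peak intensities into a single relative abundance estimate for one protein. Summarizing the peptides by the median of their log2 ratios is an assumption of this example, not a method named in the text. The function name and the intensity values are hypothetical.

```python
from math import log2
from statistics import median


def protein_log2_ratio(peptide_pairs):
    """Median log2(heavy / light) over the peptide pairs of one protein.

    peptide_pairs is a list of (light_intensity, heavy_intensity) tuples.
    Pairs with a non-positive intensity are skipped because the ratio
    is undefined for them.
    """
    ratios = [
        log2(heavy / light)
        for light, heavy in peptide_pairs
        if light > 0 and heavy > 0
    ]
    return median(ratios)


# Hypothetical intensities for three peptides of the same protein.
pairs = [(100.0, 200.0), (50.0, 100.0), (10.0, 40.0)]
ratio = protein_log2_ratio(pairs)
```

A value of 1 on the log2 scale corresponds to a two-fold higher abundance in the heavy-labeled sample.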

The objective of quantitative proteomics is to obtain quantitative information about a particular protein sample. Although qualitative data provide valuable insights, quantitative datasets give a fuller understanding. Quantitative proteomics is based on mass spectrometry and enables the detection of quantitative changes in proteins and their post-translational modifications in biological systems.

Generally, the major quantitative proteomics methods support the analysis of protein complexes, where the priority is to identify the bona fide interactions of an affinity-tagged protein of interest. This is crucial for understanding novel interactions with known proteins, which can help reveal the functions of unknown proteins. Bona fide protein interactions can be examined using bait proteins and protein complexes, with or without stable isotope labeling. The study of the interactions of protein complexes has been made possible by affinity purification combined with specific quantitative proteomics techniques, such as cleavable ICAT, iTRAQ, and DNA-affinity purification.

Quantitative proteomic analyses of protein complexes also serve to determine changes in protein-protein interactions within a complex under various cellular conditions. For example, an approach based on SILAC (Stable Isotope Labeling by Amino acids in Cell culture) has been used to examine the dynamics of TATA-binding proteins along with their phosphorylation states during the cell cycle. SILAC has also been used to analyze mechanistic features of the circadian rhythm and the interactions of extracellular signal-regulated kinases under certain conditions.

Quantitative proteomics has also become a useful tool for the analysis of protein interaction networks. Despite advances in the technology for discovering protein complexes, these complexes mostly function together with other complexes, and most proteins reside in more than one distinct complex. Quantitative proteomics can therefore help confirm the multiple functions of protein complexes as they interact with one another, and can address questions about the probability of protein-protein interactions.

Proteomics research is usually categorized into discovery and targeted proteomics. Discovery proteomics optimizes protein identification by dedicating more time to each sample and limiting the number of samples. In contrast, targeted proteomics limits the number of features to be analyzed and then optimizes chromatography to obtain the highest sensitivity across many samples.

Furthermore, mass spectrometry does not inherently produce quantitative results, because ionization efficiency varies between peptides. For this reason, relative and absolute quantitation developed as distinct sub-categories of techniques. Relative quantitation can be obtained by analyzing samples individually by mass spectrometry and comparing the spectra to determine the abundance of a peptide in one sample relative to another. Often, relative quantitation requires labeling with stable isotopes, which allows the mass spectrometer to distinguish between identical proteins from separate samples. Absolute quantitation uses isotopically labeled peptides: known concentrations of synthetic, heavy isotopologues of the target peptides are spiked into a sample, which is then analyzed by liquid chromatography-mass spectrometry (LC-MS). Like relative quantitation, absolute quantitation uses isotopic labels. However, the signal of the experimental target peptide is compared to that of the heavy peptide and back-calculated to the starting concentration of the standard with the help of a pre-determined standard curve, which yields the quantity of the target peptide.
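The back-calculation against a standard curve can be sketched as follows. The example assumes the curve is a straight line relating known concentration to the measured light-to-heavy signal ratio. The linear model, the function names, and all numbers are assumptions for illustration, since real assays may need weighting or a different curve shape.

```python
def fit_line(concentrations, ratios):
    """Ordinary least-squares fit of ratio = slope * concentration + intercept."""
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(ratios) / n
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(concentrations, ratios))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_from_ratio(ratio, slope, intercept):
    """Invert the standard curve to recover a concentration from a measured ratio."""
    return (ratio - intercept) / slope


# Hypothetical calibration points: concentration (arbitrary units) vs. ratio.
cal_conc = [1.0, 2.0, 4.0, 8.0]
cal_ratio = [0.5, 1.0, 2.0, 4.0]
slope, intercept = fit_line(cal_conc, cal_ratio)

# Measured ratio for the unknown sample (hypothetical).
estimated = concentration_from_ratio(1.5, slope, intercept)
```

Estimates are only trustworthy inside the concentration range covered by the calibration points.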

Quantitative proteomics is a constantly developing field of study. Quantitative proteomic analysis has helped address a wide variety of biological problems.

Bacterial Identification

Mass spectrometry has been applied as a tool for bacterial recognition by "fingerprinting" species based on their proteomic inventory. Until recently, work using this method was hindered by the procedure's reliance on high-resolution data and the consequently high cost of sufficiently accurate instruments. Recent work to make the method more accessible has succeeded by using MALDI (Matrix-Assisted Laser Desorption/Ionization) mass spectrometry, which uses comparatively low-energy nitrogen lasers for ionization and stabilizes large, fragile protein species by supporting them on a matrix. These modifications help minimize fragmentation of protein species and ultimately reduce the complexity of the resulting mass spectrum, lessening the need for high-resolution data and making the process possible on less expensive mid-range spectrometers.

Mass spectrometry-based bacterial identification is preferable to current differentiation methods, which generally involve multiple tests to identify species based on physiological, serological, chemotaxonomic, and overall biochemical characteristics, each of which requires time-consuming culturing steps. In contrast, mass spectrometry methods generally take a few hours and require a single culturing step to increase cell counts and purity.

Identification is achieved using databases not unlike those currently used to identify unknown chemical compounds. Multiple spectra of a specific bacterial species are taken and averaged to remove noise, and the resulting averaged spectrum is used as the species' proteomic fingerprint. This fingerprint is stored in a public database. When an unknown is tested, its spectrum is compared with the database entries via statistical algorithms specific to the database and identified based on closeness of fit.
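The averaging and closeness-of-fit steps can be sketched in Python. The text does not say which statistical algorithm a given database uses, so cosine similarity is used here purely as a stand-in. The sketch also assumes each spectrum has already been reduced to a list of intensities over the same mass bins. All names and numbers are hypothetical.

```python
from math import sqrt


def average_spectrum(replicates):
    """Average replicate spectra bin by bin to form a fingerprint."""
    count = len(replicates)
    return [sum(bin_values) / count for bin_values in zip(*replicates)]


def cosine_similarity(a, b):
    """Closeness of fit between two binned spectra, from 0 to 1 for non-negative data."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(unknown, database):
    """Return (best_name, best_score) for the closest fingerprint in the database."""
    best = max(database, key=lambda name: cosine_similarity(unknown, database[name]))
    return best, cosine_similarity(unknown, database[best])


# Hypothetical three-bin spectra for two species, two replicates each.
database = {
    "species_A": average_spectrum([[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]),
    "species_B": average_spectrum([[0.0, 1.0, 1.0], [0.0, 3.0, 1.0]]),
}
name, score = identify([5.0, 0.0, 1.0], database)
```

In practice a minimum score would be required before an identification is reported, so that a species missing from the database is not assigned to its nearest neighbour.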

Pharmaceuticals

Within pharmaceuticals, mass spectrometry is used because samples such as blood, urine, and lymph are complicated mixtures, and it is combined with high-sensitivity methods to measure low doses and long time-point data. The most commonly used technique is LC/MS coupled to a triple quadrupole mass analyzer.

The sensitivity of mass spectrometry makes it possible to carry out microdosing experiments, which reduce the need for animal experimentation. The observed effects of a compound show approximately 70% congruence between microdosing and animal model experiments.

Disease Biomarker Detection

Research has shown that tumors can introduce proteins into various bodily fluids that might not normally exist at those concentrations in those locations. This has led to interest in using mass spectrometry for the early clinical detection of cancer through analysis of these fluids. These disease-associated molecules are known as biomarkers. The main issue with the clinical use of such analytical methods is the problem of false positives and false negatives. Investigations are under way into whether a protein signature composed of multiple proteins can provide a diagnostic measure of sufficient quality for use in a clinical setting.
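The effect of false positives and false negatives on a screening test can be made concrete with a small calculation. The counts below are invented for illustration only: 10,000 people screened, 100 of whom have the disease, with a test that detects 90% of true cases and correctly clears 95% of healthy people.

```python
def diagnostic_metrics(true_pos, false_pos, true_neg, false_neg):
    """Standard summary measures computed from a 2x2 confusion table."""
    return {
        # Fraction of diseased individuals the test detects.
        "sensitivity": true_pos / (true_pos + false_neg),
        # Fraction of healthy individuals the test correctly clears.
        "specificity": true_neg / (true_neg + false_pos),
        # Fraction of positive results that are real (positive predictive value).
        "ppv": true_pos / (true_pos + false_pos),
    }


# Hypothetical screening outcome; these are not data from any study.
metrics = diagnostic_metrics(true_pos=90, false_pos=495, true_neg=9405, false_neg=10)
```

With these invented numbers, only about 15% of positive results correspond to real disease, which shows why a rare condition demands very high specificity from a biomarker signature.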

There are a number of hurdles that exist for this type of mass spectrometry analysis.

Variations in the protein content of samples between individuals, and even between samples from the same person, make the construction of a disease signature very complex
A high dynamic range of protein concentrations is present, possibly with biomarkers existing on the fringes of that range
Accuracy and reproducibility of the mass spectrometry method (most frequently SELDI-TOF)

Edge Fractionation Technology provides a powerful density-based separation and enrichment method for rapid screening of potential biomarkers. The method was used in the study "A Density-Based Proteomics Sample Fractionation Technology: Folate Deficiency–Induced Oxidative Stress Response in Liver and Brain" to examine glutathione peroxidase 1 (GPx1) and glucose-regulated protein 75 (GRP75), biomarkers for oxidative stress arising from folate deficiency.

References

Hopper S, Johnson RS, Vath JE, Biemann K. Glutaredoxin from rabbit bone marrow. Purification, characterization, and amino acid sequence determined by tandem mass spectrometry.
Shotgun identification of protein modifications from protein complexes and lens tissue
Hunt DF, Yates JR, Shabanowitz J, Winston S, and Hauer CR, Protein Sequencing by Tandem Mass Spectrometry
N Bandeira, H Tang, V Bafna, P Pevzner, Shotgun Protein Sequencing by Tandem Mass Spectra Assembly
Clarke W, Zhang Zhen, Chan DW. The application of clinical proteomics to cancer and other diseases. Clin Chem Lab Med 2003;41(12):1562-1570.
http://proteomics.cancer.gov/proteomics_basics/backgrounder.asp
Clinical proteomics: Are we there yet?
Proteomics and Cancer: Fact Sheet
Wagner M, Naiky D, Pothenz A, Protocols for Disease Classification from Mass Spectrometry Data
Lan W, Guhaniyogi J, Horn MJ, Xia JQ, Graham B. J Biomol Tech. 2007 September 18(4): 213–225. A Density-Based Proteomics Sample Fractionation Technology: Folate Deficiency–Induced Oxidative Stress Response in Liver and Brain
Sauer, S, et al. Classification and Identification of Bacteria by Mass Spectrometry and Computational Analysis. PLoS ONE 3(7): e2843. doi:10.1371/journal.pone.0002843

Articles Summarized

Classification and Identification of Bacteria by Mass Spectrometry and Computational Analysis

Sauer, S, et al. Classification and Identification of Bacteria by Mass Spectrometry and Computational Analysis. PLoS ONE 3(7): e2843. doi:10.1371/journal.pone.0002843

Main Focus

An alternative methodology coupled with database design in order to make bacterial identification with mass spectrometry more accessible.

Summary

The intention of Sauer et al. was to expand the use and improve the efficiency of mass spectrometry-based microbiological identification procedures. Such procedures have been shown to be successful before, but the generally high cost of the equipment was exacerbated by the need for high-resolution data in identifying bacterial species based on their proteome. Consequently, mass spectrometric identification of microbes is not often used and is therefore both underdeveloped and unproven in extensive field tests. However, the authors note that the potential of mass spec identification is appealing: current methods involve labor-intensive culturing and processing based on an assortment of tests used to elucidate the biochemical properties of the sample species, which takes considerable time and preparation. Genetic differentiation based on the 16S ribosomal RNA sequence is possible and common, but requires pre-defined sequence data. In contrast, the authors’ described method can (ignoring culture times) be performed in 90 minutes or so, and requires only that a proteome “fingerprint” be established beforehand for comparison.

Figure: Common bacterial plating method used to distinguish between species of bacteria.

The proposed method centers on three key components which are intended to make the use of mass spec identification more accessible and accurate. First, the use of matrix-assisted laser desorption/ionization (MALDI) mass spectrometers allows for the identification of large, often fragile particles (specifically proteins or DNA fragments) by first fixing them on a supportive matrix and then ejecting them with comparatively low-energy (nitrogen-based) lasers. This method allows much larger fragments to be maintained than is generally possible with other mass spec methods, which in turn gives much simpler mass spectra and consequently a decreased reliance on resolution. Thus, a lower-resolution (and therefore more affordable) mass spec system can be used for identification. The second change was the use of single nucleotide polymorphisms (SNPs) to differentiate between strains with very similar proteomes (in the case of this study, between European and North American strains of Erwinia amylovora). This was accomplished by sequencing the galE gene of the two strains and then using PCR with the necessary primers to amplify this sequence. This results in an amplified concentration of the galE gene, which could then be identified through MALDI mass spec as well. In this way it became possible to differentiate between strains with approximately identical proteomes. The third and most crucial component of the broad-range use of mass spectrometric identification is the generation of a database of bacterial fingerprints. Mass spec data were generated for each species twenty times and averaged to give the most common spectrum for the species, which could then be compared to field results via statistical software to give a closeness-of-fit result not unlike the methods currently used for BLAST searches.
The authors note that the database is available for free online, and that in addition to the 2800-plus strains already fingerprinted a global effort is anticipated which should allow the database to grow rapidly, particularly for species of interest in particular fields, such as plant infection.

With full implementation, the authors’ method for species identification is accomplished by first sampling a particular microbe species (such as a pathogen) and then culturing it to both increase the cell number and to allow for purification (minimizing the number of species being analyzed in a particular sample). Next, this culture can be fixed on the MALDI matrix and tested, possibly multiple times to allow for greater confidence levels. Assuming the species has already been fingerprinted, the resulting spectrum is statistically compared to those of the database and a closest fit is applied; with sufficiently high closeness values, a definitive identification can be made. In the event that the species is proteomically identical to other bacterial strains, database entries would also include SNP analysis results to allow this extra level of differentiation.

New Terms

MALDI: Matrix-assisted laser desorption/ionization mass spectroscopy; mass spec. method in which large, fragile molecules (such as proteins or DNA) are ionized by a “gentle” laser (such as nitrogen) and supported by a matrix surface, making them resistant to fragmentation. ( http://www.sigmaaldrich.com/analytical-chromatography/spectroscopy/maldi-mass.html )
SNP: Single Nucleotide Polymorphism; DNA sequence variation in which a genetic sequence is modified at one specific nucleotide, resulting in identical sequences with the exception of one pairing. ( http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism )
Robustness: Act of attaching bacterial species to a slide or stage, often with heat; in the case of this article, the act of attaching them to the support matrix. (http://en.wikipedia.org/wiki/Robust)
House-keeping genes: Genes expressed over any conditions of treatment of an organism (non-conditional expression). (http://www.biology-online.org/dictionary/Housekeeping_genes)
Dendrogram: Tree diagram used to illustrate gene clustering. (http://en.wikipedia.org/wiki/Dendrogram)

Course Relevance

This work is of considerable value in the study of metabolism given that it functions by establishing “fingerprints” of microbiological species which contain data from their entire proteome. While this “big picture” approach appears effective for differentiation between species, it is conceivable that mass spec identification using MALDI could be used to differentiate between different conditions of the same species; for example, between a microbe in aerobic vs. anaerobic conditions. This procedure is currently undertaken using SDS-PAGE or similar approaches, but these methods are difficult to automate and very time consuming. Mass spectrometry has shown itself to be easily automated, and with established methodology can be done very quickly.

A Density-Based Proteomics Sample Fractionation Technology: Folate Deficiency–Induced Oxidative Stress Response in Liver and Brain

Lan W, Guhaniyogi J, Horn MJ, Xia JQ, Graham B. J Biomol Tech. 2007 September 18(4): 213–225.

Main Focus

Folate is important in both the brain and the liver. By using folate deficiency to induce disease in rats, it is possible to identify unique proteins and potential biomarkers for detecting the disease outside a controlled experiment.

Summary

Folate is a cofactor in biosynthetic pathways crucial for DNA synthesis, DNA repair, and various methylation reactions. Most mammals, including humans, cannot synthesize folate and must obtain it from their diet. Folate deficiency is linked to stroke, Alzheimer’s disease, Parkinson’s disease, depression, cancer, and cardiovascular disease.

Figure: The process of protein fractionation.

Folate deficiency disrupts one-carbon metabolism, increasing the level of homocysteine, a cytotoxic amino acid that can induce DNA strand breakage, apoptosis, and oxidative stress. These effects in turn cause further DNA damage, alter one-carbon metabolism, and trigger multiple organ damage and neurodegenerative disorders. Oxidative stress and oxidative damage can alter protein functions in many ways, including directly modifying catalytic amino acids, changing critical amino acid residues in binding or regulatory sites, and altering proteolytic susceptibility and states of aggregation. In animals the liver is an important organ for folate storage and metabolism. Two groups of four rats each were given a folate-defined diet for two weeks, and then the experimental group was switched to a folate-deficient diet. After four more weeks their organs were harvested. The proteins obtained from the organs were separated into 11 samples based on density using Edge 200 Fractionation. Fractionation is used to increase the visibility of less prevalent proteins. Each of these samples was measured with mass spectrometry to determine at which densities folate and biomarkers indicative of oxidative stress occur. In folate-deficient rats, GPx1, a protein involved in the control of oxidative stress, was increased in abundance by 50 to 70% compared to rats with a diet containing folate. Oxidative stress induced by folate deficiency is less in brain tissue than in liver tissue. Several proteins related to oxidative stress were found in the experimental rats. Changes in many proteins of unknown function or unknown relation to folate deficiency were also observed in the experimental rats. Western blot results show changes in the relative percentage distribution of markers in each fraction, indicating their potential translocation in the cell.

New Terms

Edge 200 Fractionation: Non-denaturing front-end sample fractionation technology. It works by suspending the particles in an initial medium of known density. The suspension is pelleted and the supernatant is saved as the first sample. The remaining pellet is resuspended in a medium of increased density and the process is repeated for as many fractionated samples as desired.
Oxidative Stress: An imbalance between the production of reactive oxygen and a biological system's ability to readily detoxify the reactive intermediates or repair the resulting damage.([3] 04-12-09)
Epidemiology: Study of factors affecting the health and illness of populations.([4] 04-12-09)
Ad libitum: Latin for “at one's pleasure”([5] 04-12-09)
Snap Frozen: Frozen by dipping in liquid nitrogen.([6] 04-12-09)
Brain PNS: Peripheral Nervous System, resides outside the central nervous system(brain and spinal cord).([7] 04-12-09)

Course Relevance

One of the biggest problems in proteomics is separating specific proteins from a sample. This article shows the use of Edge 200 Fractionation to separate the proteins in a sample into a user-specified number of density-specific fractions.

Websites Summarized

Systems Biology

Ranish J. http://www.systemsbiology.org/technology/Data_Generation/Mass_Spectrometry_Analysis (3-20-09)

Main Focus

Mass spectrometry is used to find post-translational modifications (PTMs) of proteins, which play a key role in the control of a wide range of biological functions and activities. Detection is enhanced by increasing the concentration of the modified proteins.

Figure: Post Translational Modification of Insulin

Summary

Analysis of post-translationally modified proteins is difficult because the modified proteins exist in far lower concentrations than their unmodified versions and the modification itself can hinder mass spectrometry. By increasing the concentration of the modified form of the protein and using computational algorithms designed specifically for the analysis of the unique MS/MS fragmentation patterns generated by post-translationally modified peptides, the chances of identifying the modified peptide during mass spectrometric analysis are greatly improved. Enrichment methods include purification to homogeneity, conjugation to a solid support using hydrazide chemistry, and stable isotope labeling. Further advances in defining PTMs are being developed, such as methyl esterification of peptide mixtures to block carboxylates from further reaction in subsequent steps and to introduce stable isotopes for quantitative analysis, as well as elution and bonding to polyamine.

New Terms

Post Translational Modification (PTM): The chemical modification of a protein after its translation. It is one of the later steps in protein biosynthesis for many proteins. (http://en.wikipedia.org/wiki/Post_translational_modification 4-12-09)
Sumoylation: A PTM involving the attaching or detaching of Small Ubiquitin-like MOdifier proteins (http://en.wikipedia.org/wiki/SUMOylation 4-12-09)
MS/MS: Tandem mass spectrometry; multiple stages of mass spectrometry with some form of fragmentation occurring between the stages (http://en.wikipedia.org/wiki/MS/MS 4-12-09)
Sub-Stoichiometric Amounts: An amount too small to calculate the relationships between reactants and products. (http://www.answers.com/topic/stoichiometry 4-12-09)
Hydrazide Chemistry: Reactions involving binding to organic compounds sharing a common functional group characterized by a nitrogen to nitrogen covalent bond with 4 substituents with at least one of them being an acyl group. (http://en.wikipedia.org/wiki/Hydrazide 4-12-09)

Course Relevance

A large hurdle in proteomics is finding and isolating proteins of very low concentration. This website contains methods for surpassing this problem.

National Center for Biotechnology Information Taxonomy Database

http://www.ncbi.nlm.nih.gov/Taxonomy

Main Focus

Establishing an expansive, freely accessible resource for the cataloging of current and future biological data.

Summary

The National Center for Biotechnology Information (NCBI) acts as a global center for the accumulation of biotechnology-related information in the form of publicly accessible databases. Its Taxonomy database is kept as up-to-date as possible with the current classification of organisms, based on collaborative data. As data are generated by an assortment of independent research bodies, submission of these data to the NCBI website benefits authors in two ways in particular: by confirming their work through comparison to data in existence both before and after publication, and by making it available for citation.

For the general scientific community, the NCBI website is just one of many separate attempts to publicize the vast amounts of scientific data published every year and to make them both available and searchable. All fields benefit from easy access to relevant data, but nowhere is this more apparent than in biology, in which new data are constantly accrued as new methods are proven and then applied. The vast number of known organisms, coupled with the continuous improvement of analytical methods, ensures that a practically limitless amount of data will always remain to be gathered, for species and systems both newly and previously studied. Further, as much of the global community continues to improve in terms of education and economic status, the number of individual groups generating data is continuously increasing. This influx of data would be impossible to archive effectively with the methods of the past several centuries; individual genomes would fill entire volumes, and even these would need to be updated with such regularity that printing would need to occur more than annually. Most importantly, these data would be incredibly hard to access in terms of both cost and searching. Single procedures often necessitate the use of data gathered by anywhere between dozens and hundreds of sources, which would make locating this information incredibly difficult, and much data (such as protein or DNA sequence comparisons) requires the comparison of thousands of data points (base pairs), making computer application an absolute necessity.

For all of these reasons the NCBI Taxonomy website has been created and has quickly become an incredibly useful source for biological and related studies. The site allows for the searching of any species currently known and archived, and for each offers tabular summations of current works available, such as genome sequences or projects, known protein and/or other biochemical structures, and even the number of times a particular species has been cited in PubMed databases. Even more importantly, all of these references and data points are linked, allowing for incredible amounts of data to be accessed in moments. It is safe to assume that without NCBI and other public databases, many of our most recent biological and biochemical breakthroughs would have taken much longer, or may have been altogether impossible.

New Terms

GEO: Gene Expression Omnibus; a public database based at NCBI containing microarray and ChIP-chip data. (http://www.ncbi.nlm.nih.gov/geo/)

ChIP-chip: ChIP-on-chip; microarray method used to observe interactions between DNA and DNA-binding proteins. (http://www.chiponchip.org/)

PopSet: Collection of DNA which has been sequenced for statistical comparison of populations' evolutionary similarities. (http://www.ncbi.nlm.nih.gov/sites/entrez?db=popset)

SRA: Short Read Archive; archive for incomplete sequences of DNA from a particular organism generally containing several million base pairs each. (http://www.ncbi.nlm.nih.gov/sites/entrez?db=sra&cmd=Search&dopt=DocSum&term=txid562[Organism%3Aexp])

HomoloGene: Automated gene homologue detecting system; compares sequence data against completed genomes of eukaryotic organisms to detect shared gene characteristics. (http://www.ncbi.nlm.nih.gov/homologene)

Course Relevance

The value of websites like the NCBI Taxonomy website cannot be overstated in the current, ever-expanding state of biology, for both its usefulness in research ventures and its crucial archival functions.