Improving Mass Spectrometry Data Searching Workflow to Maximize Protein Identifications

Shadab Ahmad¹, Amol Prakash¹, David Sarracino¹, Bryan Krastins¹, MingMing Ning², Barbara Frewen¹, Scott Peterman¹, Gregory Byram¹, Maryann S. Vogelsang¹, Gouri Vadali¹, Jennifer Sutton¹, Mary F. Lopez¹
¹Thermo Fisher Scientific, BRIMS (Biomarker Research in Mass Spectrometry), Cambridge, MA
²Massachusetts General Hospital, Boston, MA

Overview

Purpose: Development of a comprehensive protein identification workflow that identifies more high-confidence peptide/protein IDs, including post-translational modifications, than traditional workflows.

Methods: Multiple search engines (e.g., SEQUEST and Mascot) were combined, and the combination of PTMs for each search node was chosen based on UniProtKB-relative PTM abundances from high-quality, manually curated, proteome-wide data [1].

Results: A marked increase in high-confidence, Percolator-validated peptide/protein identifications compared with the standard SEQUEST and Mascot workflows.

Introduction

Mass spectrometry has become an established method for protein identification and characterization in recent years. The number of proteins identified from complex biological samples depends on many factors, ranging from the data acquisition strategy to the MS/MS data searching method. Unfortunately, only a fraction of the spectra generated for any complex biological sample yield confident peptide matches. Many users overlook factors in the data searching strategy, including the appropriate combination of post-translational modifications (PTMs), coding SNPs [2], protein isoforms, and iterative searching, that may help identify these unmatched spectra. Here we develop a comprehensive protein identification workflow that identifies a larger number of high-confidence peptide/protein IDs and also identifies multiple PTMs and partially cleaved peptides in a single run.
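To make the "fraction of spectra matched" concrete, the following minimal sketch works through the matched-spectra percentages reported in Table 2 of this poster. The names `TABLE2`, `matched_gain`, and `summarize` are illustrative assumptions and not part of Proteome Discoverer. The spectrum counts it derives are only approximate, because they are back-calculated from the rounded percentages in the table.

```python
# Matched-spectra percentages transcribed from Table 2 of this poster.
# Tuple layout (an assumption of this sketch):
# (total spectra, general FDR<=0.05, comprehensive FDR<=0.05,
#  general FDR<=0.01, comprehensive FDR<=0.01)
TABLE2 = {
    "Sample1": (27215, 27.9, 43.5, 26.0, 38.5),
    "Sample2": (14005, 15.5, 34.4, 14.5, 30.1),
    "Sample3": (30026, 19.9, 32.8, 19.1, 30.1),
    "Sample4": (60770, 8.2, 18.1, 8.0, 16.8),
}


def matched_gain(general_pct, comprehensive_pct):
    """Return (gain in percentage points, gain relative to general search in %)."""
    absolute = comprehensive_pct - general_pct
    relative = 100.0 * absolute / general_pct
    return absolute, relative


def summarize(table=TABLE2):
    """Summarize the FDR<=0.01 columns for each sample."""
    rows = {}
    for sample, (total, _g05, _c05, g01, c01) in table.items():
        absolute, relative = matched_gain(g01, c01)
        rows[sample] = {
            # Approximate counts: percentages in the table are already rounded.
            "approx_matched_general": round(total * g01 / 100),
            "approx_matched_comprehensive": round(total * c01 / 100),
            "gain_points": round(absolute, 1),
            "gain_relative_pct": round(relative, 1),
        }
    return rows


if __name__ == "__main__":
    for sample, row in summarize().items():
        print(sample, row)
```

Reporting both the percentage-point gain and the relative gain is a deliberate choice: a sample with a low baseline match rate can show a large relative gain even when the absolute number of newly matched spectra is modest.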
Methods

Comprehensive workflow development

We developed a comprehensive MS/MS searching workflow within Proteome Discoverer that uses a combination of multiple search engines (Figure 1) in an iterative fashion to maximize the number of protein/peptide identifications by considering the most frequently found PTMs [1], sequence isoforms of proteins, and partially cleaved peptides. The effects of various factors on peptide identification were explored and implemented in the process, including protein isoforms, missed cleavage sites, semi-tryptic digestion and, most importantly, the appropriate combination of PTMs in each search node. The combinations of PTMs were chosen based on the UniProtKB-relative abundance of each PTM, found experimentally and putatively, from high-quality, manually curated, proteome-wide data [1]. The workflows were tested on plasma and urine samples acquired on a hybrid Orbitrap mass spectrometer.

FIGURE 1. Structure of the comprehensive workflow.

Sample Preparation

To evaluate the performance of the comprehensive workflow, we took four human samples from two sources: urine (one sample) and plasma (three samples). Human urine and plasma samples were collected with full consent and approval. The samples were subjected to reduction and alkylation followed by digestion with trypsin.

Liquid Chromatography and Mass Spectrometry

The digested samples were separated on a C18 column with a 5-45% acetonitrile gradient in 0.1% formic acid using a nano-LC system. The urine sample (sample 1) and one plasma sample (sample 2) were run for 140 minutes and 90 minutes, respectively, and the data were acquired on an LTQ Orbitrap Velos MS with top-11 and top-10 data-dependent MS/MS, respectively, using CID fragmentation. The other two plasma samples (samples 3 and 4) were run for 250 minutes and 240 minutes, respectively, and the data were acquired on a Thermo Q Exactive benchtop mass spectrometer with top-15 data-dependent MS/MS using HCD fragmentation.

Data Analysis

The acquired data were searched with Proteome Discoverer 1.4 (Thermo Fisher Scientific) using the comprehensive workflow and also using a general SEQUEST workflow with standard PTMs (oxidation of methionine as a dynamic modification and alkylation as a static modification) coupled with Percolator validation (General Search).

Results

Peptide Identification

We compared the results from our comprehensive searching workflow with those from the general search. On average, the number of high-confidence peptide identifications (FDR ≤ 0.01) increased by approximately 70% with the comprehensive workflow relative to the general search, whereas the number of medium-confidence peptide identifications (FDR ≤ 0.05) increased about twofold (Figure 2). Moreover, the comprehensive workflow identified several high-confidence peptides with multiple PTMs, which shows the importance of the right combination of PTMs in a search node (Table 1).

FIGURE 2. Comprehensive workflow increases the number of peptide identifications.

Table 1. Examples of peptides containing multiple PTMs from the comprehensive search.

| Sequence | Modification | q-Value |
|---|---|---|
| RATTVTGTPCQDWAAQEPHR | R1(ADP-Ribosyl); G7(Myristoyl); C10(Carboxymethyl) | ≤0.001 |
| VSHSPPPKQRSSPVTK | S2(Phospho); S4(Phospho); K8(Methyl); R10(Methyl) | ≤0.001 |
| LLIYAASSLETGVPSR | Y4(Phospho); A6(Acetyl) | 0.007 |
| LVRPEVDVMCTAFHDNEETFLK | M9(Oxidation); C10(Carboxymethyl); F13(Amidated); E17(Carboxy); F20(Amidated) | ≤0.001 |

The comprehensive workflow was found to increase the number of high-confidence proteins (FDR ≤ 0.01) by 63% and the number of high-confidence grouped proteins by 44% relative to the general search. Moreover, the comprehensive workflow increased the number of high-confidence grouped proteins with at least two high-confidence peptides for every protein in the group by 15% (Figure 3).

FIGURE 3. Comprehensive workflow increases the number of grouped protein identifications (with at least two peptide hits per protein).

We further investigated the matched and unmatched spectra for the general search and our comprehensive search. The percentage of matched spectra improved substantially with the comprehensive search workflow (Figure 4, Table 2).

FIGURE 4. Comprehensive workflow increases the number of matched spectra.

Table 2. Comparison of matched spectra.

| File | Total Spectra | Matched Spectra, General Search (FDR ≤ 0.05) | Matched Spectra, Comprehensive Search (FDR ≤ 0.05) | Matched Spectra, General Search (FDR ≤ 0.01) | Matched Spectra, Comprehensive Search (FDR ≤ 0.01) |
|---|---|---|---|---|---|
| Sample1 | 27215 | 27.9 % | 43.5 % | 26.0 % | 38.5 % |
| Sample2 | 14005 | 15.5 % | 34.4 % | 14.5 % | 30.1 % |
| Sample3 | 30026 | 19.9 % | 32.8 % | 19.1 % | 30.1 % |
| Sample4 | 60770 | 8.2 % | 18.1 % | 8.0 % | 16.8 % |

Conclusion

The comprehensive workflow identified approximately 70% more high-confidence peptides than the general search strategy.
The comprehensive workflow increased the number of high-confidence protein identifications and high-confidence grouped protein identifications by approximately 63% and 44%, respectively, compared with the general search approach.
The comprehensive workflow identifies a large number of high-confidence peptides with multiple PTMs.
The percentage of matched spectra improves substantially when using the comprehensive search workflow.

References

1. Khoury GA, Baliban RC, Floudas CA. Proteome-wide post-translational modification statistics: frequency analysis and curation of the Swiss-Prot database. Sci Rep. 2011 Sep 13;1.
2. Schandorff S, Olsen JV, Bunkenborg J, Blagoev B, Zhang Y, Andersen JS, Mann M. A mass spectrometry-friendly database for cSNP identification. Nat Methods. 2007 Jun;4(6):465-6.

SEQUEST and Percolator are registered trademarks of the University of Washington. All other trademarks are the property of Thermo Fisher Scientific and its subsidiaries. This information is not intended to encourage use of these products in any manner that might infringe the intellectual property rights of others.