MS-Fit Instructions

Description, Instructions, and Tips for MS-Fit

Purpose

This document provides instructions for MS-Fit.

Contents of this document:

Introduction and Background
AA Substitutions (homology mode)
Report Homologous Proteins
Minimum Number of Peptides Required to Match
Ranking / Scoring of Results
Multiply-charged ions
Monoisotopic/Average Flags
Searching for Mixtures
Looking for Peptides with Non-Specific Cleavages
Hit Statistics

Links to topics in the general instructions:

Search Times
Stopping / Cancelling a Search
Saving Hits from one Protein Prospector program, searching them with another
Databases
Species Filtering
Species Code Filtering
Intact Protein MW Filtering
Intact Protein pI Filtering
Enzyme specificity / Missed cleavages
Frame Translation in DNA databases
General features of links from program output
Link from the accession number in program output to an annotated remote database entry
Link from the MS-Digest index number in program output to MS-Digest
Link from the peptide sequence in program output to MS-Product
Constant Modifications
Modifying Amino Acids
Mass (m/z)
Mass type
Charge (z)
Mass Tolerance
Sample ID (comment)
Max. Reported Hits
Contaminant Masses

Introduction and Background

MS-Fit was the first program Karl Clauser and Peter Baker developed together. The name stems from the program's expected usage: correlating Mass Spectrometry data (parent masses only, not fragment masses) with the protein in a sequence database which best Fits the data. Note that the word fit was chosen and NOT the word identify. In the spring of 1995, when the name was selected, the typical peptide mass fingerprinting experiment preceding use of MS-Fit was to digest a protein with an enzyme, then perform MALDI mass spectrometry on the resulting mixture of peptides to determine the mass of each peptide. At that time the state-of-the-art mass accuracy using MALDI on a continuous-extraction, reflector time-of-flight instrument was +/- 0.5 Da. This level of mass accuracy was poor in comparison to the standard of +/- 10 ppm established with magnetic sector instruments several decades earlier. Thus, in our opinion, MS-Fit could in favorable cases (where both species and approximate intact protein molecular weight were known) merely suggest protein identity. To establish protein identity one needed, in our opinion, some sequence support. This support could be obtained from the combined use of MS/MS and our subsequently developed program MS-Tag.

The development of delayed extraction MALDI in 1996 tremendously improved the accuracy of mass measurement on reflector MALDI-TOF instruments. Mass accuracy in the range of 5-100 ppm is now possible. The low end of this range (best mass accuracy) is accessible with internal calibration and long flight tubes, while the high end of the range is typical of external calibration and short flight tubes.
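To compare the accuracy figures quoted above, it can help to convert between an absolute error in Da and a relative error in ppm. The following is a minimal sketch of that arithmetic only; the function names are illustrative and are not part of MS-Fit.

```python
def ppm_to_da(mz: float, ppm: float) -> float:
    """Absolute mass error (Da) equivalent to a relative error in ppm at a given m/z."""
    return mz * ppm / 1_000_000.0


def da_to_ppm(mz: float, delta_da: float) -> float:
    """Relative mass error (ppm) equivalent to an absolute error in Da at a given m/z."""
    return delta_da / mz * 1_000_000.0


def tolerance_window(mz: float, ppm: float) -> tuple:
    """Lower and upper bounds of a symmetric ppm tolerance window around m/z."""
    delta = ppm_to_da(mz, ppm)
    return (mz - delta, mz + delta)
```

For a peptide at m/z 2000, an error of +/- 0.5 Da corresponds to 250 ppm, whereas +/- 10 ppm corresponds to +/- 0.02 Da.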

Consequently, proteins can now be confidently identified by peptide mass fingerprinting using masses alone with MS-Fit. Identification certainty is primarily a function of the level of mass accuracy.

AA Substitutions (homology mode)

Selecting any search mode except identity puts MS-Fit into homology mode by invoking the MS-Tag mutation matrix routine. In this mode the MS-Fit routines for possible modifications are bypassed. Instead the set of modified AA's allowed in MS-Tag Homology mode is used.

In practice, homology mode should only be used when one or more of the following conditions applies:

peptide mass data has excellent mass accuracy (+/- 10 ppm or better)
a narrow intact protein MW filter is used
the Hits will be saved and searched via MS-Tag

MS-Fit matches a database sequence when its calculated peptide mass passes through one of the peptide mass filters. Normally the filters are determined by the user-supplied peptide masses +/- the peptide mass tolerance (standard filter). In Homology Mode these filters are re-configured to the user-supplied peptide masses +/- the peptide mass shift (see the section of the MS-Tag manual on parent mass shift for full details). However, a particular protein entry in the database is not subjected to these widened homology filters unless a preliminary cut-off number of user-supplied peptide masses first match in a standard-filter search. This preliminary cut-off is controlled by the parameter: Min. # matches with NO AA substitutions.
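The two-stage filtering described above can be sketched as follows. This is an illustration of the logic only, not the MS-Fit source: all function and parameter names are hypothetical, masses and windows are in Da, and the choice to report only pairs that fall outside the standard filter but inside the widened one is an assumption.

```python
def count_standard_matches(observed, calculated, tolerance):
    """Number of observed masses matched by some calculated mass within +/- tolerance."""
    return sum(
        any(abs(c - m) <= tolerance for c in calculated)
        for m in observed
    )


def homology_candidates(observed, calculated, tolerance, mass_shift, min_matches_no_sub):
    """Return (observed, calculated) pairs that pass only the widened homology filter.

    The widened filter (+/- mass_shift) is applied only if the protein first
    reaches the preliminary cut-off in a standard-filter search.
    """
    if count_standard_matches(observed, calculated, tolerance) < min_matches_no_sub:
        return []  # protein never reaches the homology stage
    pairs = []
    for m in observed:
        for c in calculated:
            difference = abs(c - m)
            # Assumption: keep pairs outside the standard window but inside the widened one.
            if tolerance < difference <= mass_shift:
                pairs.append((m, c))
    return pairs
```

With observed masses 1000.0, 1500.0 and 2000.0, calculated masses 1000.1, 1500.2 and 2014.0, a tolerance of 0.5 and a mass shift of 20, two masses match in the standard search. If the cut-off is 2, the pair (2000.0, 2014.0) is passed on as a homology candidate; if the cut-off is 3, nothing is passed on.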

Database sequences passing the homology-widened peptide mass filter are then passed through a mutation matrix to try to find a single AA substitution which would transform the calculated mass of the database sequence to the experimentally determined mass. The output displays the necessary substitution and the corresponding sequence consistent with the experimental peptide mass data (not the sequence present in the database).
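The idea of searching for a single AA substitution can be illustrated with a brute-force sketch. This is not the MS-Tag mutation matrix routine, whose details are not given here; it simply tries every residue at every position, using standard monoisotopic residue masses. The function names are hypothetical.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # monoisotopic mass of H2O


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def single_substitutions(sequence, observed_mass, tolerance):
    """List (position, old, new, new_sequence) for substitutions explaining the mass.

    Positions are 1-based. Leu and Ile are isobaric, so both are reported
    whenever either would fit.
    """
    delta = observed_mass - peptide_mass(sequence)
    hits = []
    for pos, old in enumerate(sequence):
        for new, new_mass in RESIDUE_MASS.items():
            if new == old:
                continue
            if abs((new_mass - RESIDUE_MASS[old]) - delta) <= tolerance:
                mutated = sequence[:pos] + new + sequence[pos + 1:]
                hits.append((pos + 1, old, new, mutated))
    return hits
```

For the database peptide GASK and an observed mass 14.01565 Da higher than calculated, the sketch reports two candidates, Gly to Ala (AASK) and Ser to Thr (GATK). Mass alone cannot distinguish between them, which is one reason homology mode benefits from excellent mass accuracy and follow-up with MS-Tag.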

Report Homologous Proteins

This defines how to deal with homologous proteins. The default is 'interesting' which means a homologous protein will only be reported if there is at least one unique peptide matching to the protein. Occasionally proteins will be reported as homologous when the level of homology may be fairly low (e.g. only two out of ten peptides are identical between proteins).

Minimum Number of Peptides Required to Match

For a particular protein in the database to generate a hit, the number of masses from the input data that it matches must be at least the value of the Minimum Number of Peptides Required to Match parameter.

Ranking / Scoring of Results

The MOWSE score reported by MS-Fit is based on the scoring system described in Pappin et al., Current Biology, 1993, Vol 3, No 6, pp 327-. As MS-Fit offers several options not available in the initial version of MOWSE, several modifications have had to be made.

After the species and molecular weight pre-searches the remaining proteins undergo theoretical digestion. The resulting peptides are then placed in bins based on their molecular weight and the intact molecular weight of the undigested protein they originated from. There are eleven intact molecular weight bins. Under 100000 Da there are ten bins of width 10000 Da. The other bin contains all the proteins over 100000 Da. There are thirty peptide molecular weight bins of width 100 amu between 0 and 3000 Da. Peptides above 3000 Da are not binned. Peptides with no missed cleavages contribute 1.0 to the bin total whereas peptides containing missed cleavages contribute pfactor (a user-supplied parameter).
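The binning step can be sketched as a small table of eleven rows (intact protein MW) by thirty columns (peptide MW). The code below is an illustration under stated assumptions, not the MS-Fit implementation: the function names are hypothetical, a protein of exactly 100000 Da is placed in the last row, and a peptide of exactly 3000 Da is treated as not binned.

```python
N_PROTEIN_BINS = 11   # ten bins of 10000 Da plus one for everything heavier
N_PEPTIDE_BINS = 30   # thirty bins of 100 Da covering 0-3000 Da


def new_bins():
    """Empty table of bin totals, indexed as bins[protein_bin][peptide_bin]."""
    return [[0.0] * N_PEPTIDE_BINS for _ in range(N_PROTEIN_BINS)]


def protein_bin(protein_mw):
    """Row index for the intact protein molecular weight."""
    return min(int(protein_mw // 10_000), N_PROTEIN_BINS - 1)


def peptide_bin(peptide_mw):
    """Column index for the peptide molecular weight, or None if not binned."""
    if peptide_mw < 0 or peptide_mw >= 3000:
        return None
    return int(peptide_mw // 100)


def add_peptide(bins, protein_mw, peptide_mw, missed_cleavages, pfactor):
    """Add one theoretical peptide to the table of bin totals."""
    column = peptide_bin(peptide_mw)
    if column is None:
        return
    # Peptides with missed cleavages are down-weighted by pfactor.
    weight = 1.0 if missed_cleavages == 0 else pfactor
    bins[protein_bin(protein_mw)][column] += weight
```

For example, a 45000 Da protein falls in row 4. A 1250.6 Da peptide from it with no missed cleavages adds 1.0 to column 12, and a 1299.0 Da peptide with one missed cleavage adds pfactor to the same cell.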

Bin frequency values are then calculated by dividing the bin totals by the sum of the bin totals for each 10000 Da protein interval. The bin frequency values are then normalised to the largest bin frequency value to yield frequency values between 0 and 1.
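A sketch of the normalisation step follows. The text does not state whether "the largest bin frequency value" is taken within each 10000 Da protein interval or over the whole table; this sketch assumes the former, and the function name is hypothetical.

```python
def normalise_bins(bin_totals):
    """Convert bin totals to frequencies scaled to the range 0-1.

    bin_totals is a list of rows, one row per intact protein MW interval.
    Assumption: each row is scaled to its own largest frequency value.
    """
    normalised = []
    for row in bin_totals:
        total = sum(row)
        if total == 0:
            # No peptides in this protein interval: leave all frequencies at zero.
            normalised.append([0.0] * len(row))
            continue
        frequencies = [value / total for value in row]
        largest = max(frequencies)
        normalised.append([f / largest for f in frequencies])
    return normalised
```

Under this assumption the most populated peptide bin in each protein interval receives a frequency of 1.0 and every other bin in that interval a proportionally smaller value.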

Masses in the theoretical digestion which match masses in the data set are divided into scoring matches and non-scoring matches. Scoring matches include unmodified peptides, acrylamide-modified Cys and, when the unmodified peptide is also present, N-terminal Gln to pyroGlu and oxidation of Met. Non-scoring matches include pyroGlu and oxidation of Met in the absence of the unmodified peptide, acetylated N-termini, phosphorylation of S, T and Y, and single amino acid substitutions. Unmatched masses are ignored. The score for each matching mass is assigned as the appropriate normalised distribution frequency value. In the case of multiple matching masses the scores are multiplied together. The final product score is inverted and normalised to an average protein molecular weight of 50 kDa.
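The final scoring arithmetic can be sketched as below. The exact expression used by MS-Fit is not given in this document; the sketch assumes the commonly quoted MOWSE form, score = 50000 / (product of frequencies x protein MW), and the function name is hypothetical.

```python
import math


def mowse_style_score(frequencies, protein_mw):
    """Score from the normalised frequency values of the scoring matches.

    frequencies: one normalised bin frequency (0-1) per scoring match.
    protein_mw:  intact molecular weight of the database protein in Da.
    Assumed form: 50000 / (product of frequencies * protein_mw).
    """
    if not frequencies:
        return 0.0  # no scoring matches, nothing to score
    product = math.prod(frequencies)
    if product == 0.0:
        return float("inf")  # a zero-frequency bin would otherwise divide by zero
    return 50_000.0 / (product * protein_mw)
```

Because the product is inverted, matches to peptides in rarely populated bins (low frequency values) raise the score more than matches in heavily populated bins. Two matches with frequencies 0.5 and 0.25 to a 50000 Da protein give a score of 8.0 under this assumed form.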

If scoring is not selected MS-Fit uses a simple ranking system. The results are sorted so that if multiple database entries are matched, more likely sequences are listed higher in the list. All database entries matching the input data and parameters are ranked on the following basis:

Database entries with the fewest unmatched masses are ranked higher.
Among equivalent matches (those with the same rank) the results are sorted in order of increasing index number.

Note that the last sort does NOT imply a BETTER ranking, even though one match will be listed higher than another, but is merely intended to provide some organization to the listing.
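The simple ranking can be expressed as a two-key sort. The record type and field names below are hypothetical and serve only to illustrate the ordering.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    index_number: int   # database index number of the entry
    unmatched: int      # number of input masses the entry did not match


def rank_hits(hits):
    """Order hits by fewest unmatched masses, then by increasing index number."""
    return sorted(hits, key=lambda hit: (hit.unmatched, hit.index_number))
```

Entries with the same number of unmatched masses share a rank; the index number only fixes their order in the listing.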

Multiply-charged ions

Multiply charged ions are handled in a similar way in all Protein Prospector programs.

Monoisotopic/Average Flags

Monoisotopic/Average Flags can now be set in a column to the right of the mass/charge column. You must first set Peptide masses are: to monoisotopic and then enter a column of 0s and 1s here to state whether each m/z value is monoisotopic (0) or average (1). There must be exactly one flag for each m/z value. If the column is left blank then all the values are assumed to be monoisotopic, as before.
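The rules for the flag column can be summarised in a short sketch. It shows how the two columns pair up and is not the MS-Fit input parser; the function name and return format are hypothetical.

```python
def label_mass_types(mz_values, flags=None):
    """Pair each m/z value with 'monoisotopic' (flag 0) or 'average' (flag 1).

    A blank flag column (None or empty) means every value is monoisotopic.
    """
    if not flags:
        return [(mz, "monoisotopic") for mz in mz_values]
    if len(flags) != len(mz_values):
        raise ValueError("there must be one flag for each m/z value")
    labels = {0: "monoisotopic", 1: "average"}
    labelled = []
    for mz, flag in zip(mz_values, flags):
        if flag not in labels:
            raise ValueError("flags must be 0 or 1")
        labelled.append((mz, labels[flag]))
    return labelled
```

For instance, m/z values 842.51, 1045.56 and 3211.48 with flags 0, 0 and 1 would mark only the last value as an average mass.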

This option is currently only available to licensees. The appropriate items on the MS-Fit HTML input page are normally commented out.

Searching for Mixtures

At the end of each hit in the MS-Fit detailed report there is a link which allows you to do a subsequent search just using the unmatched masses. Subsequent searches use the same ratio of masses submitted to masses required to match as was used in the original search.
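As an illustration of keeping the ratio constant, the sketch below derives the number of masses required to match in a follow-up search. How MS-Fit rounds the result is not stated here; rounding to the nearest whole number with a minimum of 1 is an assumption, and the function name is hypothetical.

```python
def follow_up_min_matches(submitted, required, unmatched):
    """Masses required to match when re-searching only the unmatched masses.

    submitted: number of masses in the original search
    required:  minimum number of peptides required to match in the original search
    unmatched: number of masses left unmatched by the chosen hit
    """
    if submitted <= 0 or required <= 0:
        raise ValueError("submitted and required must be positive")
    ratio = submitted / required          # e.g. 20 submitted / 5 required = 4
    # Assumption: round to the nearest integer and never go below 1.
    return max(1, int(unmatched / ratio + 0.5))
```

If the original search submitted 20 masses and required 5 to match, a follow-up search on 12 unmatched masses would require 3 under these assumptions.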

Looking for Peptides with Non-Specific Cleavages

At the end of each hit in the MS-Fit detailed report there is a list of unmatched masses. If you click on one of these masses you can see if the mass matches any peptides in the protein that was hit. The usual enzyme cleavage rules are not considered.
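Ignoring the enzyme cleavage rules amounts to testing every contiguous stretch of the protein sequence against the unmatched mass. The following is a minimal sketch of that idea, not the MS-Fit routine: names are hypothetical, masses are neutral monoisotopic peptide masses in Da, and modifications are not considered.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # monoisotopic mass of H2O


def nonspecific_matches(protein, target_mass, tolerance):
    """Find every subsequence whose mass is within +/- tolerance of target_mass.

    Returns (start, end, sequence) tuples with 1-based inclusive positions.
    """
    hits = []
    for start in range(len(protein)):
        mass = WATER
        for end in range(start, len(protein)):
            mass += RESIDUE_MASS[protein[end]]
            if mass > target_mass + tolerance:
                break  # extending further can only make the peptide heavier
            if abs(mass - target_mass) <= tolerance:
                hits.append((start + 1, end + 1, protein[start:end + 1]))
    return hits
```

For the toy sequence MKGASKW and a target of 233.1012 Da, a tolerance of 0.01 Da returns only GAS (residues 3-5), whereas a tolerance of 0.05 Da also returns SK (residues 5-6). Without the constraint of enzyme specificity, good mass accuracy matters even more.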

Hit Statistics

The percent TIC, mean error, data tolerance and mean number of missed cleavages are printed after an MS-Fit hit. If intensities aren't specified then the percent TIC value will be the same as the percent masses matched. The mean error is useful for diagnosing systematic errors in the results, which indicate a calibration problem. The data tolerance is twice the standard deviation of the errors and is the number that should be used as a tolerance parameter in the absence of systematic errors. This number is more reliable if there are a reasonable number of matching peaks (say 10). Also, the number is only valid if all the matched peptides are real hits.
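The statistics above can be reproduced from a list of matched peaks with a few lines of arithmetic. This sketch makes assumptions that the text does not settle: the sample standard deviation is used, errors are taken about their mean, and all names are hypothetical.

```python
import statistics


def hit_statistics(errors, missed_cleavages):
    """Mean error, data tolerance (2 x SD) and mean missed cleavages for one hit.

    errors:           mass error of each matched peak (Da or ppm, used consistently)
    missed_cleavages: number of missed cleavages of each matched peptide
    """
    if len(errors) < 2:
        raise ValueError("at least two matched peaks are needed for a standard deviation")
    return {
        "mean_error": statistics.mean(errors),
        # Assumption: sample standard deviation about the mean error.
        "data_tolerance": 2.0 * statistics.stdev(errors),
        "mean_missed_cleavages": statistics.mean(missed_cleavages),
    }


def percent_tic(matched, intensities=None):
    """Percentage of total intensity accounted for by matched peaks.

    matched is one boolean per input peak. Without intensities every peak
    counts equally, so the result equals the percent masses matched.
    """
    if intensities is None:
        intensities = [1.0] * len(matched)
    total = sum(intensities)
    matched_sum = sum(i for i, is_matched in zip(intensities, matched) if is_matched)
    return 100.0 * matched_sum / total
```

A mean error well away from zero together with a small data tolerance suggests a calibration offset: the masses are precise but shifted, so recalibrating is a better remedy than widening the search tolerance.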