Instructions for General Features
Common to Multiple Protein Prospector Programs

Purpose

This document provides instructions for features found across more than one program in the Protein Prospector package.


Search Times

Search times vary from a few seconds to a few minutes depending on the computer hardware Protein Prospector is running on, the size of the database being searched, the restrictiveness of the search parameters and the number of searches being simultaneously performed.

When two or more searches are performed simultaneously, each search slows noticeably. In general, more discriminating search parameters give faster searches: a single species, a narrow intact protein MW range and 0 missed cleavages. For DNA database searches set the intact protein MW filter to All.


Stopping / Cancelling a Search

Most of the Protein Prospector searches can be stopped by pressing an abort search button.


Output

The output format can currently be set to either HTML or XML. Sometimes a tab delimited text option is also available. All formats can be saved to a file if this option is available.


Saving Hits from one Protein Prospector program, searching them with another

One Protein Prospector search program can serve as a pre-filter for another search program. To accomplish this the Hits (index numbers for matching database entries) from the first program are saved to a user specified file. This file is then retrieved by the second program, and only those matching database entries are searched by the second program.

The following programs can both save hits and search saved hits:

  • MS-Fit
  • MS-Fit Upload
  • MS-Tag
  • MS-Seq
  • MS-Pattern
  • MS-Homology
  • DB-Stat

The following programs can use saved hits:

  • MS-Digest
  • MS-Bridge
  • MS-NonSpecific
  • Batch-Tag
  • Batch-Tag Web
  • FA-Index

You can also use the save hits to file option to create disk files of the HTML and XML outputs.


Protein Prospector programs search sequence databases that are stored locally on the server running the programs. The files actually searched are FASTA formatted copies of the source database containing minimal annotation. Search output typically contains a web link to a fully annotated version of the source database for each entry matched. Database search programs allow several databases to be selected at once. One of these can be User Protein, in which case the protein sequences are pasted into the User Protein Sequence field.

Protein Prospector programs currently allow searching of the publicly available Genome and Proteome databases listed below. However, nearly any sequence database in a suitable FASTA format can be set up for use by contacting the administrator of a Protein Prospector server.

Protein Databases

  • NCBInr: Current README file
    A non-redundant database compiled by NCBI by combining most of the public domain databases (EST's not included).
  • Genpept: Current Release Notes
    Protein translation of Genbank (EST's not included).
  • Swiss Prot
    A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc), a minimal level of redundancy and a high level of integration with other databases.
  • UniProtKB
    Merger of the information contained in Swiss-Prot, TrEMBL and PIR to produce a comprehensive database. All entries are highly annotated, some manually (Swiss-Prot and PIR) whilst others are annotated in an automated fashion using sequence similarity to previously annotated proteins (TrEMBL).
  • IPI
    Single species databases from species whose genome has been sequenced. Content includes protein entries in the UniProtKB database, plus predicted protein sequences from Ensembl and RefSeq.
  • Owl
    OWL is a non-redundant composite of 4 publicly available primary sources: SWISS-PROT, PIR (1-3), GenBank (translation) and NRL-3D. SWISS-PROT is the highest priority source, all others being compared against it to eliminate identical and trivially different sequences. OWL has not been maintained since 1999. However it is still available and Protein Prospector can search it.
  • Ludwignr
    The Ludwignr database is a non-redundant database made up from a number of other databases. The component databases can be downloaded individually or combined together by concatenation. The component databases are searched in the following order and duplicates are eliminated: Swiss-Prot, Trembl, Trembl-New, Genpept-updates, Genpept, Yeastpep, Wormpep. It is updated weekly.
    Also included are varsplic databases, Swiss-Prot Varsplic and Trembl Varsplic, which are a collection of isomeric proteins not really recorded in any of the other databases except for the SwissProt database in the (FT) features section. This new protein set, even though it is constructed from the original entry, can end up as a vastly different sequence with changes to amino acids, subtractions of segments and even addition of large segments of the sequence anywhere in the chain.

DNA Databases

Reasons to search particular databases:

  • UniProtKB
    Almost as comprehensive as NCBInr, it is well annotated and has significantly less protein redundancy than NCBInr.
  • NCBInr
    Largest protein database and updated most frequently.
  • Swiss Prot
    Smallest and best annotated.
  • dbEST
    Use when there are no matches in the protein databases, which suggests that the gene for your protein may not yet be cloned; an EST containing part of your protein may nevertheless be known. Search times will typically be longer because of multi-frame translation and because the dbEST file is more than 3 times larger than the NCBInr file.

Reasons NOT to search particular databases:

  • Owl
    No longer maintained.

The local copy of the database being searched with the programs is subject to updating by the administrator of a Protein Prospector server.


If you don't know the Latin taxonomic name for the species you're interested in try: NCBI Taxonomy Browser

Taxonomy filtered searches in Protein Prospector programs are performed by means of preliminary filtering of a database according to the user designated taxonomy or taxonomy collection. This taxonomy pre-filter is bypassed when the taxonomy is designated as All.

It is possible to select more than one item from the taxonomy menu as long as one of the selections isn't All.

The taxonomy pre-filtering is imperfect because of the poor usage of taxonomy (standard species naming conventions) in the databases, AND the poorly standardized location of this information in the FASTA database formats used by Protein Prospector programs.

Users who desire additional/changed taxonomy filtering capability should direct their local Protein Prospector Server administrator to the instructions To Add/Change Taxonomy Filter. For the World Wide Web version of Protein Prospector please send email to: Prospector email.

If you use DB-Stat or MS-Pattern with the Pre-search only option checked then the pre-search results will display the number of database entries for each species/sub-species identified in the taxonomy pre-search.


Checking the taxonomy remove option removes the selected taxonomy from the search.


This option allows you to search one or more taxonomies that are not available on the Taxonomy menu. Any identifier from the NCBI Taxonomy Browser can be used along with organism codes from the UniProt Knowledgebase. Also some common names are supported.

This item overrides any selections on the Taxonomy menu.

If you use DB-Stat or MS-Pattern with the Pre-search only option checked then the pre-search results will display the number of database entries for each species/sub-species identified in the taxonomy pre-search.


Intact protein MW limited searches in Protein Prospector programs are performed by means of preliminary filtering of a database according to the user designated intact protein MW. This pre-filter is bypassed when the MW range checkbox All is checked.

The intact protein MW pre-filtering is imperfect because sequences in protein databases often exist in pre, pro, and fragment forms. Sequences in DNA databases often exist as fragments (EST's) or as cDNA's.

Protein Prospector programs ALWAYS calculate the intact protein MW, according to the following constraints.

  1. Treat protein as uncharged.
  2. Use average mass scale.
  3. Treat amino acid C as unmodified.
  4. Treat amino acid X as leucine.
  5. Treat amino acid B as glutamic acid.
  6. Treat amino acid Z as glutamine.
  7. Treat amino acid U as Selenocysteine.
  8. Ignore amino acids J, O.

The molecular weight pre-search option is not available for DNA databases.
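As a rough illustration of the constraints listed above, the following Python sketch computes an average-mass intact protein MW. The residue masses are commonly tabulated average values and are an assumption; this document does not state the values Protein Prospector uses. The function name intact_protein_mw is hypothetical.

```python
# Average residue masses in Da. These are commonly tabulated values and are an
# assumption here, not values taken from Protein Prospector parameter files.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
    "U": 150.04,  # selenocysteine, approximate
}
WATER = 18.0153  # H at the N terminus plus OH at the C terminus

# Substitutions and exclusions taken from the numbered list above.
SUBSTITUTE = {"X": "L", "B": "E", "Z": "Q"}
IGNORED = {"J", "O"}


def intact_protein_mw(sequence):
    """Average mass of the uncharged protein with unmodified cysteines."""
    total = WATER
    for residue in sequence.upper():
        if residue in IGNORED:
            continue
        residue = SUBSTITUTE.get(residue, residue)
        total += AVERAGE_RESIDUE_MASS[residue]
    return total
```

A pre-filter built on this would keep only entries whose calculated MW falls inside the user designated range.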


Intact protein pI limited searches in Protein Prospector programs are performed by means of preliminary filtering of a database according to the user designated intact protein pI. This pre-filter is bypassed when the pI range checkbox All is checked.

The intact protein pI pre-filtering is imperfect because sequences in protein databases often exist in pre, pro, and fragment forms. Sequences in DNA databases often exist as fragments (EST's) or as cDNA's.

Protein Prospector programs ALWAYS calculate the intact protein pI, according to the following constraints.

  1. Treat amino acid C as unmodified.
  2. Treat amino acid X as leucine.
  3. Treat amino acid B as glutamic acid.
  4. Treat amino acid Z as glutamine.
  5. Treat amino acid U as selenocysteine. However the correct coefficients aren't currently known.
  6. Ignore amino acids J, O, U.

The pI pre-search option is not available for DNA databases.

The pK values used to calculate the pI values can be modified by Protein Prospector server administrators. You must remake the database index files using FA-Index if you change the pK values.


The name pre-filter examines the name field of each database entry's comment line for one or more case-insensitive regular expressions.

Name pre-filtering is not performed if the field is left blank.
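The behaviour described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the whole comment line is treated as the name field, and the regular expression dialect used by Protein Prospector may differ from Python's re module.

```python
import re


def name_matches(comment_line, patterns):
    """True if any regular expression matches the line, ignoring case."""
    return any(re.search(p, comment_line, re.IGNORECASE) for p in patterns)


def filter_by_name(comment_lines, patterns):
    # A blank field (no patterns) means no name pre-filtering is performed.
    if not patterns:
        return list(comment_lines)
    return [line for line in comment_lines if name_matches(line, patterns)]
```

For example, the patterns ["keratin", "^trypsin"] would keep any entry whose name contains "Keratin" in any letter case, or whose name starts with "trypsin".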


The accession number pre-filter can be used to specify a list of database entries on which to perform the search. If you are searching multiple databases (User Protein doesn't count as an extra database here) then you need to specify the databases that the accession numbers are from as shown below:

>NCBInr.2011.10.06
30377
13537138
>SwissProt.2011.10.06
Q99456
Q2M2I5

If accession number pre-filtering is in use then other pre-filters, such as name or intact molecular weight filtering, are disabled for the database for which accession numbers are specified. Thus it is possible to add a few accession numbers from a second database to search in addition to a first database.
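The list layout shown above is simple to process. The following Python sketch groups accession numbers under the database named on the preceding > line. The function name and the default_database argument are hypothetical and are not part of Protein Prospector.

```python
def parse_accession_list(text, default_database=None):
    """Group accession numbers by the database named on each '>' line.

    Entries that appear before any '>' line are stored under
    default_database, which covers the single database case.
    """
    groups = {}
    current = default_database
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            groups.setdefault(current, [])
        else:
            groups.setdefault(current, []).append(line)
    return groups


example = """>NCBInr.2011.10.06
30377
13537138
>SwissProt.2011.10.06
Q99456
Q2M2I5"""
accessions_by_database = parse_accession_list(example)
```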


Sometimes you may know that your sample contains proteins from a particular species but that other contaminant proteins such as keratin are also present.

This option allows you to enter a list of the accession numbers of the contaminant proteins and so include them in your search. If you are searching multiple databases (User Protein doesn't count as an extra database here) then you need to specify the databases that the accession numbers are from as shown below:

>NCBInr.2011.10.06
30377
13537138
>SwissProt.2011.10.06
Q99456
Q2M2I5


This option allows you to remove a list of accession numbers from a search. For example you could remove homologous proteins that don't contain any unique peptides. If you are searching multiple databases then you need to specify the databases that the accession numbers are from as shown below:

>NCBInr.2011.10.06
30377
13537138
>SwissProt.2011.10.06
Q99456
Q2M2I5

For a protein database you should enter a list of accession numbers or index numbers (one per line).

For example:

P38706
P05756
P48589
P26782
P04456
P02303

For a DNA database you should enter a list of:

(Accession Number/Index Number) (DNA Reading Frame) (Open Reading Frame)

For example:

113988 6 3
113989 4 1

DNA reading frames and open reading frames are described in the Frame Translation in DNA databases section.
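A DNA database entry in such a list carries three fields. As a minimal sketch, the lines could be read like this in Python; the DnaHit type and parse_dna_hits function are hypothetical names used for illustration only.

```python
from typing import NamedTuple


class DnaHit(NamedTuple):
    entry: str          # accession number or index number
    reading_frame: int  # DNA reading frame, 1 to 6
    orf: int            # open reading frame within that reading frame


def parse_dna_hits(text):
    """Parse lines of the form '<entry> <reading frame> <open reading frame>'."""
    hits = []
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got: {raw!r}")
        entry, frame, orf = fields[0], int(fields[1]), int(fields[2])
        if not 1 <= frame <= 6:
            raise ValueError(f"reading frame must be 1 to 6, got: {frame}")
        hits.append(DnaHit(entry, frame, orf))
    return hits
```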


DNA databases can NOT be searched with mass spectrometry data from DNA samples. Protein Prospector programs perform translation of DNA sequences to protein sequences.

Frames 1, 2, and 3, represent translation of the database sequence from left to right beginning in positions 1, 2, or 3 respectively. Frames 4, 5, 6 represent translation of the complement of the database sequence from right to left beginning in positions 1, 2, or 3 respectively.

Frame translation in Protein Prospector programs can be designated in 1, -1, 3, -3 or 6 frame translation modes. Frame mode 1 considers only frame 1 described above whereas frame mode -1 considers only frame 4. Frame mode 3 considers only frames 1, 2 and 3 whereas frame mode -3 considers only frames 4, 5 and 6. Frame mode 6 considers all 6 frames. A user should select frame mode 6 unless he/she knows that the database being searched contains sequences exclusively cloned in one direction or contains known genes with sequences already in frame.

Since the capability of searching DNA databases was intended for use with EST databases, translation initiation does not require a start codon. If a stop codon is encountered the polypeptide is terminated. Translation is then reinitialized and continued with the following codon, thus beginning a new open reading frame. MS-Fit requires all matches to a particular database entry to belong not only to the same translational frame, but also to the same open reading frame. Users who feel that any of these procedures are inappropriate or inadequate are urged to contact Prospector email. These procedures were implemented with significant uncertainty as to the optimal strategy.

An example is entry 24570 from dbEST human database.

The sequence in the database is:

GGCATTAAGTTGGGCTGGTTCTAGTAATAGTTCAATGGCACAGTTTTCACGTCAAATGCTTGGTTTAGCACCAGCTATCG
GGCCAGTTGTTACTGCTTTGGGTGGATTTATTACCAACGCTGGGAAGATTACTGGACTGGTAAAAGGTATGGGTTCAGCA
GTTATTGGTGCTGGTAAGGGAATGGTTAGTTTTGTTGCTAGATTGTTTGGAATGGCAGCAGGTAACACAGCGGTAGCAAC
ATCATCAACTGCTGCTGCAACTGGTACAAAAGCAGTTGGGG

This is translated in the different DNA reading frames as:

1:GIKLGWF...FNGTVFTSNAWFSTSYRASCYCFGWIYYQRWEDYWTGKRYGFSSYWCW.GNG.FCC.IVWNGSR.HSGSN
IINCCCNWYKSSW

2:ALSWAGSSNSSMAQFSRQMLGLAPAIGPVVTALGGFITNAGKITGLVKGMGSAVIGAGKGMVSFVARLFGMAAGNTAVAT
SSTAAATGTKAVG

3:H.VGLVLVIVQWHSFHVKCLV.HQLSGQLLLLWVDLLPTLGRLLDW.KVWVQQLLVLVREWLVLLLDCLEWQQVTQR.QH
HQLLLQLVQKQLG

4:PQLLLYQLQQQLMMLLPLCYLLPFQTI.QQN.PFPYQHQ.LLNPYLLPVQ.SSQRW..IHPKQ.QLAR.LVLNQAFDVKT
VPLNYY.NQPNLM

5:PNCFCTSCSSS..CCYRCVTCCHSKQSSNKTNHSLTSTNNC.THTFYQSSNLPSVGNKSTQSSNNWPDSWC.TKHLT.KL
CH.TITRTSPT.C

6:PTAFVPVAAAVDDVATAVLPAAIPNNLATKLTIPLPAPITAEPIPFTSPVIFPALVINPPKAVTTGPIAGAKPSI.RENC
AIELLLEPAQLNA

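So frame 1 starts with GGC = G, frame 2 with GCA = A and frame 3 with CAT = H. Frames 4, 5 and 6 are read from the right-hand end on the complementary strand: frame 4 starts with GGG (complement CCC) = P, frame 5 with GGG (complement CCC) = P and frame 6 with GGT (complement CCA) = P.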

The dots represent putative stop codons which delimit the open reading frames.
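The translation above can be reproduced with a short Python sketch. It assumes the standard genetic code and is not Protein Prospector's own implementation. The names translate_frame, open_reading_frames and FRAME_MODES are hypothetical, and dropping empty open reading frames between adjacent stop codons is also an assumption.

```python
# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Frame translation modes and the reading frames each one considers.
FRAME_MODES = {1: [1], -1: [4], 3: [1, 2, 3], -3: [4, 5, 6], 6: [1, 2, 3, 4, 5, 6]}


def translate_frame(dna, frame):
    """Translate one reading frame (1 to 6); stop codons are shown as dots."""
    dna = dna.upper()
    if frame > 3:
        # Frames 4 to 6: complement the sequence and read it right to left.
        dna = dna.translate(COMPLEMENT)[::-1]
        frame -= 3
    codons = (dna[i:i + 3] for i in range(frame - 1, len(dna) - 2, 3))
    # Codons containing letters other than A, C, G, T are reported as X.
    return "".join(CODON_TABLE.get(c, "X").replace("*", ".") for c in codons)


def open_reading_frames(protein):
    """Split a translation at the stop codons, dropping empty stretches."""
    return [orf for orf in protein.split(".") if orf]


EST = (
    "GGCATTAAGTTGGGCTGGTTCTAGTAATAGTTCAATGGCACAGTTTTCACGTCAAATGCTTGGTTTAGCACCAGCTATCG"
    "GGCCAGTTGTTACTGCTTTGGGTGGATTTATTACCAACGCTGGGAAGATTACTGGACTGGTAAAAGGTATGGGTTCAGCA"
    "GTTATTGGTGCTGGTAAGGGAATGGTTAGTTTTGTTGCTAGATTGTTTGGAATGGCAGCAGGTAACACAGCGGTAGCAAC"
    "ATCATCAACTGCTGCTGCAACTGGTACAAAAGCAGTTGGGG"
)
translations = {frame: translate_frame(EST, frame) for frame in FRAME_MODES[6]}
```

Calling open_reading_frames(translations[1]) returns the stretches between the dots in frame 1, the first of which is GIKLGWF.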

You can paste the sequence as a User Protein in MS-Digest (even in lower case and with the position numbers included) and it will report all the open reading frames for all DNA reading frames. For example, you could paste:

        1 ggcattaagt tgggctggtt ctagtaatag ttcaatggca cagttttcac gtcaaatgct
       61 tggtttagca ccagctatcg ggccagttgt tactgctttg ggtggattta ttaccaacgc
      121 tgggaagatt actggactgg taaaaggtat gggttcagca gttattggtg ctggtaaggg
      181 aatggttagt tttgttgcta gattgtttgg aatggcagca ggtaacacag cggtagcaac
      241 atcatcaact gctgctgcaa ctggtacaaa agcagttggg g

If you enter a number here then it limits the number of amino acids that are considered from the N-terminus of all the proteins searched.


Enzyme specificity / Missed cleavages

The termini of the matched peptides can be set to be consistent with the cleavage specificity of the enzyme used to generate the peptide. By selecting No enzyme (not available in MS-Fit, MS-Digest or MS-Bridge) the matched peptides have no constraint on their termini. Increasing the maximum number of missed cleavages allowed enables matching to sequences with uncleaved sites internal to the peptide.
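To make the effect of the missed cleavages setting concrete, here is a minimal Python sketch of an in silico digest. It assumes the commonly cited trypsin rule (cleave after K or R unless the next residue is P); the rule actually used is defined in the server's enzyme parameter files and may differ. The function names are hypothetical.

```python
def cleavage_sites(sequence):
    """Positions after which trypsin is assumed to cut (K or R, not before P)."""
    return [
        i + 1
        for i, residue in enumerate(sequence[:-1])
        if residue in "KR" and sequence[i + 1] != "P"
    ]


def digest(sequence, max_missed_cleavages=0):
    """List peptides containing up to max_missed_cleavages uncleaved sites."""
    bounds = [0] + cleavage_sites(sequence) + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        last = min(start + max_missed_cleavages + 1, len(bounds) - 1)
        for end in range(start + 1, last + 1):
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides
```

With 0 missed cleavages the hypothetical sequence MKAARPLKGR gives MK, AARPLK and GR. Allowing 1 missed cleavage adds MKAARPLK and AARPLKGR, so the number of candidate peptides grows with this setting.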

The option for the non-existent enzyme Slymotrypsin was created as a means for allowing Chymotryptic cleavages in Trypsin digests. When using this choice it is important to increase the missed cleavages allowed. Increasing to 9 will result in only a marginal increase in the search time.

Protein Prospector server administrators can edit the existing enzyme cleavage rules or add new ones.

It is possible to combine the rules for two or more enzymes by adding options to the Enzyme item on the HTML form. N-terminal cleavage rules can thus be mixed with C-terminal ones.


The Non-Specific options can be used to relax the enzyme cleavage rules at one or more of the peptide termini:

at 0 termini - The normal cleavage rules are followed;

at 1 termini - The cleavage rules are relaxed at either the N or the C terminus but not both at the same time;

at 2 termini - This is similar to No enzyme except that the missed cleavage option is considered;

at N termini - The cleavage rules are relaxed at the N terminus;

at C termini - The cleavage rules are relaxed at the C terminus.

N termini-1=D - Unlike the other options above, the selected enzyme specificity is ignored at the N-terminus and instead is fixed to be a cleavage after D; this is used for work with caspase or similar substrates. This was necessary in order to implement an enzyme with different specificities at the N- and C-termini.


The End Terminus parameter selects end terminal processing of the digest fragments. The stripping terminus parameter is used to specify which terminus the amino acids are stripped from. The stripping range specifies the range of the number of amino acids which are cleaved off.

For example, Carboxypeptidase Y is a C-terminal cleavage enzyme and Aminopeptidase cleaves peptides from the N-terminus. Either of these could be combined with any other enzyme that clipped proteins into peptides, such as trypsin.

It is possible to see a mixture of peptides in which anything from zero to perhaps four of the N-terminal (with aminopeptidase) or C-terminal (with carboxypeptidase) amino acids have been cleaved off. A stripping range of 0 to 4 is therefore appropriate to accommodate this.

This feature can also be used to predict the fragments produced by in-source decay.
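The stripping terminus and stripping range parameters can be pictured with the following Python sketch, which lists the ragged forms of a single digest fragment. The function name and argument names are hypothetical.

```python
def strip_terminus(peptide, terminus, min_strip=0, max_strip=4):
    """Return the peptide with min_strip..max_strip residues removed
    from the chosen terminus ('N' or 'C')."""
    if terminus not in ("N", "C"):
        raise ValueError("terminus must be 'N' or 'C'")
    variants = []
    for n in range(min_strip, max_strip + 1):
        if n >= len(peptide):
            break  # nothing would be left of the peptide
        if terminus == "N":
            variants.append(peptide[n:])
        else:
            variants.append(peptide[:len(peptide) - n])
    return variants
```

For a hypothetical fragment PEPTIDEK with C-terminal stripping and a range of 0 to 2, this gives PEPTIDEK, PEPTIDE and PEPTID.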


This option allows you to specify which amino acids are in all the peptides which are reported. If you select AND then all the specified amino acids have to be present. If you select OR then just one of the selected amino acids has to be present.
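The AND and OR choices amount to a simple test on each candidate peptide, sketched below in Python. The function name is hypothetical, and treating an empty selection as "no filtering" is an assumption.

```python
def passes_composition_filter(peptide, required, mode="AND"):
    """Check whether a peptide contains the selected amino acids."""
    if not required:
        return True  # assumption: nothing selected means no filtering
    present = [amino_acid in peptide for amino_acid in required]
    if mode == "AND":
        return all(present)
    if mode == "OR":
        return any(present)
    raise ValueError("mode must be 'AND' or 'OR'")
```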


For database search programs to operate on a user supplied sequence:

1). select User Protein as the Database option;

2). paste or type the sequence(s) in the User Protein Sequence box;

    Tabs, returns, and spaces are ignored.

    USE CAPITAL LETTERS for the amino acids. Any upper case letter is allowed.

    Certain characters, such as digits, '*' and '/', will be automatically removed from the sequence. This is to facilitate pasting sequences from other packages.

    DNA sequences are not currently supported.

    If your sequence is all in lower case letters Protein Prospector will try to convert it to upper case.

    You can use 3-letter amino acid codes. However the program will only attempt to interpret the sequence as containing 3-letter codes if it can't interpret it any other way. Thus the first letter of the 3 letter code should be upper case and the other 2 letters lower case.

    You can enter multiple proteins by separating the sequences by a line starting with a > character.

    For example:

    MPPKRAALIQNLRDSYTETSSFAVIEEWAAGTLQEIEGIAKAAAEAHGTIRNSTYGRAQAEKSPEQLL
    GVLQRYQDLCHNVYCQAETIRTVIAIRIPEHKEEDNLGVAVQHAVLKIIDELEIKTLGSGEKSGSGGA
    PTPIGMYALREYLSARSTVEDKLLGSVDAESGKTKGGSQSPSLLLELRQIDADFMLKVELATTHLSTM
    VRAVINAYLLNWKKLIQPRTGTDHMVS
    >
    RVCMGKSQHHSFPCISDRLCSNECVKEEGGWTAGYCHLRYCRCQKAC
    

    All characters after the > character are taken as the protein name.

    For example:

    >Protein 1
    MPPKRAALIQNLRDSYTETSSFAVIEEWAAGTLQEIEGIAKAAAEAHGTIRNSTYGRAQAEKSPEQLL
    GVLQRYQDLCHNVYCQAETIRTVIAIRIPEHKEEDNLGVAVQHAVLKIIDELEIKTLGSGEKSGSGGA
    PTPIGMYALREYLSARSTVEDKLLGSVDAESGKTKGGSQSPSLLLELRQIDADFMLKVELATTHLSTM
    VRAVINAYLLNWKKLIQPRTGTDHMVS
    >Protein 2
    RVCMGKSQHHSFPCISDRLCSNECVKEEGGWTAGYCHLRYCRCQKAC
    

3). set the other parameters as appropriate;

4). press the Start Search button.
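The sequence clean-up rules described in step 2 can be sketched in Python as shown below. This is an illustration under stated assumptions, not the program's parser: the function name is hypothetical, 3-letter codes are not handled, and only whitespace, digits, '*' and '/' are stripped.

```python
import re


def parse_user_proteins(text):
    """Return a list of (name, sequence) pairs from pasted text."""
    proteins = []
    name, chunks = "", []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            # A '>' line closes the previous protein and names the next one.
            if chunks:
                proteins.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            cleaned = re.sub(r"[\s\d*/]", "", line)
            if cleaned:
                chunks.append(cleaned)
    if chunks:
        proteins.append((name, "".join(chunks)))
    # An all lower case sequence is converted to upper case.
    return [(n, s.upper() if s.islower() else s) for n, s in proteins]
```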


For MS-Digest/MS-Bridge/MS-NonSpecific to operate on a user supplied sequence:

1). select User Protein as the Database option;

2). paste or type the sequence in the User Protein Sequence box;

    Tabs, returns, and spaces are ignored.

    USE CAPITAL LETTERS for the amino acids. The following lower case letters can be used:
    s,t,y - Phosphorylated S,T,Y
    d,e,f,g,h,i,j,k,l - user specified amino acids

    Use U for selenocysteine.

    Do NOT use the letters B, J, O, X, or Z.

    Certain characters, such as digits, '*' and '/', will be automatically removed from the sequence. This is to facilitate pasting sequences from other packages.

    If your sequence is over 85% A, C, G and T and is greater than 10 characters in length then it is deemed to be a DNA sequence and treated accordingly.

    If your sequence is all in lower case letters Protein Prospector will try converting it to upper case. d, e, f, g, h, i, j, k, l, s, t and y can't be used in conjunction with an all lower case sequence.

    You can use 3-letter amino acid codes. However the program will only attempt to interpret the sequence as containing 3-letter codes if it can't interpret it any other way. d, e, f, g, h, i, j, k, l, s, t and y can't be used in conjunction with 3 letter codes.

    You can specify more than 1 user protein by separating the sequences by a line starting with a > character.

    For example:

    MPPKRAALIQNLRDSYTETSSFAVIEEWAAGTLQEIEGIAKAAAEAHGTIRNSTYGRAQAEKSPEQLL
    GVLQRYQDLCHNVYCQAETIRTVIAIRIPEHKEEDNLGVAVQHAVLKIIDELEIKTLGSGEKSGSGGA
    PTPIGMYALREYLSARSTVEDKLLGSVDAESGKTKGGSQSPSLLLELRQIDADFMLKVELATTHLSTM
    VRAVINAYLLNWKKLIQPRTGTDHMVS
    >
    RVCMGKSQHHSFPCISDRLCSNECVKEEGGWTAGYCHLRYCRCQKAC
    

    All characters after the > character are taken as the protein name.

    For example:

    >Protein 1
    MPPKRAALIQNLRDSYTETSSFAVIEEWAAGTLQEIEGIAKAAAEAHGTIRNSTYGRAQAEKSPEQLL
    GVLQRYQDLCHNVYCQAETIRTVIAIRIPEHKEEDNLGVAVQHAVLKIIDELEIKTLGSGEKSGSGGA
    PTPIGMYALREYLSARSTVEDKLLGSVDAESGKTKGGSQSPSLLLELRQIDADFMLKVELATTHLSTM
    VRAVINAYLLNWKKLIQPRTGTDHMVS
    >Protein 2
    RVCMGKSQHHSFPCISDRLCSNECVKEEGGWTAGYCHLRYCRCQKAC
    

    If multiple proteins are entered the Separate Proteins option can be used to indicate whether or not you want a separate section in the report for each protein.

3). set the other parameters as appropriate.

4). press the Perform Digest button.
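The DNA detection rule in step 2 can be expressed as a small heuristic. The Python sketch below assumes that only letters are counted and that both thresholds are strict ("over" and "greater than"); the function name is hypothetical.

```python
def looks_like_dna(sequence, threshold=0.85, min_length=10):
    """Guess whether a pasted sequence is DNA rather than protein."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if len(letters) <= min_length:
        return False
    acgt = sum(c in "ACGT" for c in letters)
    return acgt / len(letters) > threshold
```

Short peptides made up largely of A, C, G and T residues are therefore still treated as protein because of the length requirement.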


The links in program output are intended to easily facilitate user access to obvious sources of additional information about proteins or peptides matched or under study. Some of the default parameters of these links can be changed by Protein Prospector server administrators.

  • Change the default parameters in the HTML links from the accession number
  • Change the default parameters in the HTML links from the MS-Digest index number


The outputs from Protein Prospector programs usually contain links to other Protein Prospector programs and Internet pages (general features of links from program output). You can disable these links by checking the Hide HTML Links option. This considerably reduces the size of the output report and hence the network traffic.


The database accession number in the search results has an HTML link to retrieve the complete entry including comments from a remote database. In order for this link to be created the programs need to know the URL for the remote database. Users who desire links to different fully annotated databases, or who find links to a particular database to be defective should contact their local Protein Prospector server administrator. For the World Wide Web version of Protein Prospector please send email to: Prospector email.

Server Administrators can change the default address of links from accession numbers in program output without requiring access to Protein Prospector source code. Those administrators who find improved options for links to publicly available databases are encouraged to send the modified parameter files to Prospector email for inclusion in subsequent Protein Prospector releases.


The MS-Digest index number in the search results has an HTML link to retrieve a listing of all the masses and sequences of peptides that can be produced by digesting the matched protein with the designated enzyme. If No enzyme was designated in the search parameters, then Trypsin is supplied in this HTML link. The number of missed cleavages is set to 2 unless a higher number was designated in the search parameters.
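The defaults described above reduce to two small rules, sketched here in Python. The function name and the dictionary keys are hypothetical and do not correspond to actual Protein Prospector parameter names.

```python
def ms_digest_link_parameters(search_enzyme, search_missed_cleavages):
    """Choose the enzyme and missed cleavages passed on in the MS-Digest link."""
    # No enzyme cannot be digested with, so Trypsin is supplied instead.
    enzyme = "Trypsin" if search_enzyme == "No enzyme" else search_enzyme
    # At least 2 missed cleavages, or more if the search allowed more.
    missed = max(2, search_missed_cleavages)
    return {"enzyme": enzyme, "missed_cleavages": missed}
```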

Server administrators can change the HTML link from the MS-Digest index number in the search results.

If the MS-Digest number link marked Coverage Map in the MS-Fit detailed results is pressed then the protein display at the top of the MS-Digest report has the matching peptides highlighted.


The peptide sequence in the search results has an HTML link to MS-Product for retrieving a listing of the theoretical fragment-ions that may be formed in an MS/MS experiment. The default set of ion types supplied in this link corresponds to those expected to be formed in post-source decay (PSD) experiments.


Some Protein Prospector programs allow the peptide terminal groups to be modified from the defaults of hydrogen at the N terminus and free acid at the C terminus.

Users who desire additional options for terminal groups should contact their local Protein Prospector server administrator. For the World Wide Web version of Protein Prospector please send email to: Prospector email.

Server Administrators can add terminal groups without requiring access to Protein Prospector source code. Those administrators who add terminal groups are encouraged to send the modified parameter files to Prospector email for inclusion in subsequent Protein Prospector releases.


Any of the 20 standard amino acids can be modified in a user designated way although this option will generally be used to modify cysteine residues. Peptide terminal groups can also be changed from the defaults of hydrogen at the N terminus and free acid at the C terminus. This feature is used for quantitation methods such as iTRAQ.

It is an error to specify more than one constant modification for a single amino acid or terminal group.

Users who want additional options for constant amino acid modifications should contact their local Protein Prospector server administrator. For the World Wide Web version of Protein Prospector please send email to: Prospector email.

Server Administrators can add constant modification options without requiring access to Protein Prospector source code. Those administrators who add constant modification options are encouraged to send the modified parameter files to Prospector email for inclusion in subsequent Protein Prospector releases.

Notes on Cysteine Modification

Carboxymethylation is the product of a reaction with iodoacetic acid, carbamidomethylation is the product of a reaction with iodoacetamide and pyridylethylation is the product of a reaction with vinylpyridine.

In every case it is assumed that all cysteines are modified by the addition of the appropriate group. For carboxymethylation, for example, an H is replaced by CH2COOH on every cysteine, i.e. a nominal mass increase of 58 Da per cysteine.
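The 58 Da figure follows from simple arithmetic on nominal (integer) element masses, as the Python sketch below shows. The helper names are hypothetical and only the carboxymethylation case from the text is worked through.

```python
# Nominal (integer) masses of the elements involved.
NOMINAL_MASS = {"C": 12, "H": 1, "N": 14, "O": 16, "S": 32}


def nominal_mass(composition):
    """Nominal mass of an elemental composition given as {element: count}."""
    return sum(NOMINAL_MASS[element] * count for element, count in composition.items())


# Carboxymethylation: one H is replaced by CH2COOH, i.e. C2H3O2.
ADDED_GROUP = {"C": 2, "H": 3, "O": 2}
LOST_GROUP = {"H": 1}
CARBOXYMETHYL_SHIFT = nominal_mass(ADDED_GROUP) - nominal_mass(LOST_GROUP)  # 59 - 1


def nominal_shift(sequence, shift_per_cysteine=CARBOXYMETHYL_SHIFT):
    """Total nominal mass increase when every cysteine is modified."""
    return sequence.count("C") * shift_per_cysteine
```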

There are miscellaneous reasons for people choosing particular alkylating agents, including the ease of carrying out the reaction, the efficiency and yield of the reaction, the desire to add relatively small mass increments per cysteine, changes in the properties of the protein and its peptides, etc. Acrylamide modification usually means that there was no deliberate attempt at alkylation before running a gel. Iodoacetic acid and iodoacetamide are both convenient and easy reagents to work with, that react with high yield to add a well-defined mass increment. The acid makes a protein more hydrophilic and tends to open up its structure to more efficient digestion. Other reagents may be more problematical but may offer particular advantages, e.g. vinylpyridine is not water soluble so the reaction is carried out in an organic solvent. This may be more effective for hydrophobic proteins, e.g. membrane proteins. Cyanylation and subsequent cleavage has been developed for identification of multiple bridged cysteines. Then there are the various reagents that add a tag such as biotin to assist with separation, and ICAT that combines this with isotopic labelling.

More than one method of modification (mixing) can NOT generally be designated at the same time for a single search. There is one exception to this rule in the MS-Fit program where it is possible to consider Acrylamide Modified Cys in addition to the selected cysteine modification (Modifying Amino Acids).


Each modification selected here is searched both as present and as absent. To select multiple modifications hold down the 'Ctrl' key (the Command key on a Macintosh) and click the modifications you would like to add. Similarly, to deselect modifications, hold down 'Ctrl' or Command and click on the modification name.

To deal with the increasing number of variable modifications some forms now have + and - buttons to add and remove modifications from the variable modifications menu. If this is the case the modifications are divided into 6 different categories: Frequent; Unusual; Glycosylation; Quant SILAC; Quant Others and Crosslinking. First select the appropriate category and then select the modification you want. To add the modification to the menu click the + button, to remove it click the - button. Note that rather than remove a modification from the menu you could alternatively deselect it if you don't want to include it in the search.

Next to the + and - buttons is another menu where the options are Common, Rare, Label 1, Label 2, Label 3 and Max 1, Max 2, etc up to 1 less than the Max Mods option. Only a single Rare modification is allowed in a peptide. The Max options override the Max Mods setting for a particular modification. If Common is set then the Max Mods option is used for that modification. The Label options are used to process quantitation data. Two modifications with different label numbers can't occur in the same peptide. Also a modified amino acid with a label number can't occur in the same peptide with the same unlabelled amino acid (modified or unmodified).

The menu to the right of this has the options All, Protein, Peptide and Site. This menu is relevant if the selected protein sequence database has an associated site modification database. A site modification database is used to relate particular positions within the protein sequence with modifications, such as Phosphorylation, or groups of modification, such as N-Glycosylation. The meaning of the various options is as follows:

All - All potential sites are considered. This is the standard setting.

Protein - The modification is only considered for proteins in which it occurs in the site database.

Peptide - The modification is only considered for peptides in which it occurs in the site database.

Site - The modification is only considered for sites in which it occurs in the site database.

A Motif can also be specified for a variable modification. If you don't want to specify a Motif then select the Off option. The other options on the Motif menu specify an offset of the modification site relative to the motif. For example, for N glycosylation modifications you could set the offset to zero and use the motif N[^P][ST] (meaning N[notP][SorT]). For hydroxyproline modifications you could use the motif PG. Another example would be to use a motif to target the phosphorylation sites of specific kinases.
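The motif examples above behave like ordinary regular expressions. The Python sketch below finds candidate sites for a motif and applies an offset. The function name is hypothetical, positions are 0-based, and Protein Prospector's own motif syntax may not match Python's re module in every detail.

```python
import re


def motif_sites(sequence, motif, offset=0):
    """Return 0-based positions of modification sites selected by a motif.

    The offset is added to the position where each motif match starts.
    """
    # A lookahead lets overlapping occurrences of the motif all be reported.
    pattern = re.compile(f"(?={motif})")
    return [match.start() + offset for match in pattern.finditer(sequence)]
```

In the hypothetical sequence ANGSNPTNKT the motif N[^P][ST] with offset zero selects the asparagines at positions 1 and 7 but not the one at position 4, which is followed by proline.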

If a modification is designated as Uncleaved this means it can't occur at a digest cleavage site.

Server Administrators can add variable modification options without requiring access to Protein Prospector source code. Those administrators who add variable modification options are encouraged to send the modified parameter files to Prospector email for inclusion in subsequent Protein Prospector releases.

In addition to the modifications specified on the Variable Mods menu users can specify their own variable modifications. You can specify the mass shift either as an exact mass or as an elemental composition. If you specify an elemental composition it is also necessary to specify a label by which the modification is labelled in the results. These naming guidelines should be followed when choosing a label. It is particularly important not to have unmatched brackets. A specificity for the modification must also be selected from the Specificity menu. You must further specify the Common, Rare, Label, Max and Motif parameters (these terms are explained in the Variable Modifications section). You should avoid selecting a User Defined Variable Modification when you have already selected it on the Variable Mods menu unless the specificity is different.

The Max Mods parameter restricts the number of variable modifications that can occur in a single peptide. Setting this too high can have a big impact on search time, particularly if a lot of variable modifications are selected and/or a wide precursor tolerance is used.

MS-Tag/Batch-Tag first use a precursor mass filter to identify candidate peptides. If variable modifications are also selected then a match will be to a given peptide with a given combination of modifications. Sometimes there are multiple amino acid sites on which these modifications could be located and Protein Prospector considers each permutation of these sites separately. When looking for certain modifications in highly charged data the number of site permutations can become very large, with a dramatic impact on the search speed. One modification where this is a particular problem is phosphorylation.

If the number of permutations of variable modifications in a peptide exceeds the value of the Max Peptide Permutations parameter then the peptide is skipped in the search and won't appear in the results.

If this parameter is left blank then all permutations are considered.
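
To give a feel for how quickly the number of permutations grows, here is a minimal Python sketch. It assumes a single type of modification, so the number of permutations is just the number of ways of choosing the modified sites. The function names are hypothetical and the way Protein Prospector counts permutations internally may differ.

```python
from math import comb

def site_permutations(peptide, residues, n_mods):
    """Ways of placing n_mods identical modifications on the candidate sites."""
    n_sites = sum(1 for aa in peptide if aa in residues)
    return comb(n_sites, n_mods)

def keep_peptide(peptide, residues, n_mods, max_permutations=None):
    """Mimic the Max Peptide Permutations rule: None (blank) means no limit."""
    if max_permutations is None:
        return True
    return site_permutations(peptide, residues, n_mods) <= max_permutations

# Three phosphorylations on a peptide with nine S/T/Y residues.
print(site_permutations("SSTTYYSSTK", "STY", 3))
print(keep_peptide("SSTTYYSSTK", "STY", 3, max_permutations=50))
```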


MS-Fit allows for a specialized set of modified amino acids:

Peptide N-terminal Gln to pyroGlu

Any instance of Glutamine at the N-terminus of a peptide (following digestion) is considered as either normal Gln or as pyro-glutamic acid.
Designation: Q -> q

Oxidation of M

Any instance of Methionine is considered as either normal Met or Met + oxygen.
Designation: M -> m

Protein N-terminus Acetylated

For any database entry with a Met at the N-terminus the N-terminal peptide is considered as either in its original form or in a form where the Met is removed and the next amino acid is acetylated. While this post-translational modification does not occur in bacteria, MS-Fit and MS-Digest don't know any better. Furthermore, if the database curators have removed the N-terminal Met from the sequence, then MS-Fit and MS-Digest will not apply the acetylation modification.

Acrylamide Modified Cys

Any instance of Cysteine is considered as either the Cysteine modification chosen on the Cys modified by: option or acrylamide modified Cys. This option would normally be used to consider each Cysteine as either unmodified or acrylamide modified.

User Defined 1/User Defined 2/User Defined 3/User Defined 4

Up to four of the considered modifications can be selected from a list of user defined modifications which a server administrator can add to. For example if Phosphorylation of S, T and Y is chosen from the list then any instance of Serine, Threonine, or Tyrosine is considered as either normal Ser, Thr, Tyr or phosphorylated Ser, Thr, Tyr.
Designation: S -> s, T -> t, Y -> y


Some Prospector forms have options for entering elemental compositions. An example of a suitable entry is shown below. If the entry is not of this format it will be rejected.

C138 H239 N34 O47 13C12 15N4

1). Only elements defined in the elements.txt file can be used. This file can be edited by a systems administrator. An element can either be a single capital letter (eg C) or a capital letter followed by a lower case letter (eg Na).

2). The isotopes 2H, 18O, 15N and 13C are included in the default elements.txt file. You can use D instead of 2H.

3). The number following the element is the number of atoms of that element. This number is not required in the case of a single atom. In cases where you are defining a modification the number can be negative. Eg:

N H O-1

4). Isotopes or elements in the formula must be separated by a single space. It is thus not acceptable to enter something like C2H3N1O1.
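
The format rules above can be expressed as a small Python parser. This is only a sketch: the KNOWN set is an illustrative stand-in for the symbols defined in elements.txt, and the function name parse_composition is hypothetical.

```python
import re

# Illustrative stand-in for the symbols defined in elements.txt (assumption).
KNOWN = {"C", "H", "N", "O", "S", "P", "Na", "D", "2H", "13C", "15N", "18O"}

# Optional isotope number, element symbol, optional (possibly negative) count.
TOKEN = re.compile(r"(\d*[A-Z][a-z]?)(-?\d+)?")

def parse_composition(text):
    """Parse e.g. 'C138 H239 N34 O47 13C12 15N4' into a dict of atom counts."""
    counts = {}
    # Splitting on a single space rejects entries with doubled spaces.
    for token in text.strip().split(" "):
        m = TOKEN.fullmatch(token)
        if m is None or m.group(1) not in KNOWN:
            raise ValueError(f"bad entry: {token!r}")
        symbol = m.group(1)
        counts[symbol] = counts.get(symbol, 0) + int(m.group(2) or 1)
    return counts

print(parse_composition("C138 H239 N34 O47 13C12 15N4"))
print(parse_composition("N H O-1"))
```

An entry such as C2H3N1O1 raises a ValueError because the elements are not separated by spaces.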


Some Protein Prospector programs allow the use of user specified amino acids for which you must supply the elemental compositions. To specify the user defined amino acid in a peptide or protein sequence use the appropriate letter (lower case d, e, f, g, h, i, j, k or l). The default elemental composition for all the user defined amino acids is that of glycine.


Protein Prospector programs expect the mass input values to represent the actual m/z values measured on a mass spectrometer. The mass of the charging protons (H+; other charging agents are not currently allowed) therefore need not be subtracted. However, input data that has had the mass of the protons subtracted can be used; simply designate the charge as 0.


Monoisotopic: only the lowest-mass common isotope of each element is used in the mass calculations: 12C, 1H, 14N, 16O, 32S, 31P.

Average: All isotopes for each element are used and with their abundances reflecting their "normal" proportion in the biosphere. The isotope abundances can be changed by editing the elements.txt file.

Par(mi)Frag(av): Parent masses are calculated as monoisotopic and fragment masses are calculated as average. The Par(mi)Frag(av) option should be chosen when the mass accuracy on fragment mass measurements is modest ( +/- 1000 ppm ).

Par(av)Frag(mi): Parent masses are calculated as average and fragment masses are calculated as monoisotopic. The Par(av)Frag(mi) option should be chosen when the mass accuracy of the parent mass measurement is modest ( +/- 1000 ppm ).
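
The difference between the monoisotopic and average options can be illustrated with a short Python sketch. The atomic masses below are standard rounded values typed in for the example and are not read from elements.txt; the glycine residue composition C2 H3 N O is used as the test case.

```python
# Monoisotopic masses of 12C, 1H, 14N, 16O (rounded standard values).
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
# Average atomic masses for natural isotope abundances (rounded).
AVG = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

def mass(composition, table):
    """Sum atom count times atomic mass over the composition."""
    return sum(table[element] * n for element, n in composition.items())

glycine_residue = {"C": 2, "H": 3, "N": 1, "O": 1}
print(round(mass(glycine_residue, MONO), 4))  # monoisotopic
print(round(mass(glycine_residue, AVG), 4))   # average
```

The two values differ by about 0.03 Da for a single glycine residue, and the gap grows with the size of the peptide.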


Protein Prospector programs can handle multiply charged data from both positive and negative ion experiments. Simply specify the integer charge state corresponding to the m/z value. Absence of charge specification in the input defaults to a charge state of +1. Input data that has had the mass of the protons subtracted can be used; simply designate the charge as 0. The charge is used to convert the m/z value to an MH+ value for search purposes. Output will show the m/z value with the charge as a superscript.
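
The conversion from m/z to MH+ can be sketched in Python as follows. This is an illustration rather than Protein Prospector's own code: it covers positive charges and the charge 0 case only, it assumes that charge 0 means the input is a neutral mass, and the function name to_mh_plus is hypothetical.

```python
PROTON = 1.007276467  # mass of a proton in Da

def to_mh_plus(mz, charge=1):
    """Convert a measured m/z value to the singly protonated mass MH+."""
    if charge < 0:
        raise ValueError("negative ion mode is not covered by this sketch")
    if charge == 0:
        # Protons already subtracted: add one back to get MH+.
        return mz + PROTON
    # m/z = (M + z * PROTON) / z, so MH+ = m/z * z - (z - 1) * PROTON.
    return mz * charge - (charge - 1) * PROTON

print(to_mh_plus(500.0))      # charge defaults to +1
print(to_mh_plus(500.0, 2))
print(to_mh_plus(999.0, 0))
```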


If the precursor charge is set to Automatic then the charge from the peak list is used. This can be overridden by specifying a charge on the menu.


Some peaklist generation software does not assign a charge state to precursor ions. If no charge is listed, the data will be searched using the entire list of charge states selected in the Precursor Charge Range field.


The mass tolerances should be set to be consistent with the mass accuracy of the instrument used to generate the data. It is generally a better idea to use units of ppm or % rather than Da, as mass spectrometers typically have an error associated with mass measurement that is mass dependent and thus cannot be uniformly expressed in Da.
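
The following Python sketch shows why relative units scale with mass while Da does not. The function name tolerance_da and the unit strings are chosen for the example only.

```python
def tolerance_da(mass, value, units):
    """Return the absolute tolerance window (in Da) at a given mass."""
    if units == "ppm":
        return mass * value / 1e6
    if units == "%":
        return mass * value / 100.0
    if units == "Da":
        return value
    raise ValueError(f"unknown units: {units!r}")

# A 20 ppm tolerance is three times wider at 3000 Da than at 1000 Da.
for mass in (1000.0, 3000.0):
    print(mass, round(tolerance_da(mass, 20, "ppm"), 4))
```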

Measuring masses as accurately as possible is the single most important thing one can do to achieve the highest certainty of protein identification in a peptide mass fingerprinting experiment.


Sometimes the calibration of data can contain a systematic mass error. For example, all precursor masses may have errors between +60 and +100 ppm. In this situation, searching the data with an 80 ppm systematic error and an MS/MS Parent Tolerance of 20 ppm would give more reliable results without losing any correct answers. The units of systematic error will be the same as for the Parent Tol.
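
The effect of the systematic error setting can be sketched in Python. The sketch assumes that the error is defined as measured minus theoretical mass and that the systematic error simply shifts the centre of the tolerance window; the function name is hypothetical.

```python
def within_tolerance(measured, theoretical, tol_ppm, systematic_ppm=0.0):
    """True if the ppm error lies inside systematic_ppm +/- tol_ppm."""
    error_ppm = (measured - theoretical) / theoretical * 1e6
    return abs(error_ppm - systematic_ppm) <= tol_ppm

theoretical = 1000.0
# Measured masses with errors of roughly +65, +95 and 0 ppm.
for measured in (1000.065, 1000.095, 1000.0):
    print(measured, within_tolerance(measured, theoretical, 20, 80))
```

With an 80 ppm systematic error and a 20 ppm tolerance the first two masses are accepted and the third is rejected, whereas a plain 100 ppm tolerance would accept all three.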


Two types of data set are used in Protein Prospector. The programs MS-Fit, MS-Bridge and MS-NonSpecific use MS data sets and the programs MS-Product, MS-Seq, MS-Tag and Batch-Tag use MS/MS data sets.

If you use either the M/Z Charge or M/Z Intensity Charge format then you are specifying the format of a single data point on a single line of an MS or MS/MS file. If the charge is not specified for a data point then it defaults to 1. Thus if you specify an M/Z Intensity Charge format for an M/Z Charge data set then the charges will be incorrectly read as intensities. An MS data set consists of a list of ion measurements - one per line. An MS/MS data set consists of a precursor ion measurement on one line followed by a list of fragment ion measurements each on separate lines. Intensities should not be specified for the precursor ion measurement for MS/MS data. You can include multiple spectra by separating them by a line just containing a '>' character.
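
The line format described above can be illustrated with a small Python reader. It assumes that the values on a line are separated by white space, and the function name parse_data_sets and the column names are hypothetical.

```python
def parse_data_sets(text, columns=("mz", "charge")):
    """Split pasted data into data sets on '>' lines and parse each point.

    columns is ("mz", "charge") for the M/Z Charge format or
    ("mz", "intensity", "charge") for the M/Z Intensity Charge format.
    """
    data_sets, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == ">":
            if current:
                data_sets.append(current)
            current = []
            continue
        fields = line.split()
        point = {"mz": float(fields[0]), "charge": 1}  # charge defaults to 1
        for name, value in zip(columns[1:], fields[1:]):
            point[name] = int(value) if name == "charge" else float(value)
        current.append(point)
    if current:
        data_sets.append(current)
    return data_sets

text = "842.51 1\n1045.56\n>\n1296.68 2\n"
print(parse_data_sets(text))
# Reading M/Z Charge data with the wrong format turns the charge into an intensity.
print(parse_data_sets("1296.68 2", ("mz", "intensity", "charge")))
```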

Industry standard formats mgf, pkl and dta may also be specified for MS/MS data.


Protein Prospector currently has three options for data sources:

1). Enter a list of files each containing a single data set. These files must be on the same computer file system as the Prospector software so this option is not available for people using one of our web sites. This option works differently depending on whether you are using a UNIX or Windows version of the software. This data source is used on the Batch MS-Fit and Batch MS-Tag forms.

2). Select a file containing the data from the local disk using the Browse button. The file must be a text file created using a program such as Windows Notepad. It can contain multiple datasets which should be separated by a > character. This data source is used on the MS-Fit Web Batch and MS-Tag Web Batch forms.

3). Paste the data into the data paste area. Multiple data sets separated by > characters can be entered into the paste area as shown above. This data source is used on the standard MS-Fit and MS-Tag forms.


Windows

If the files are all in the same directory then this directory should be specified in the Data File Directory item. The file names themselves should be specified one per line in the Data Files item. If the files are in different directories then the Data File Directory item should be left blank and the full path of each file should be specified in the Data Files item.

On Microsoft Windows systems you can use wildcard characters to specify more than one file at a time. The following rules thus apply when specifying filenames:

a). A filename can contain up to 255 characters, including spaces. But, it cannot contain any of the following characters:

\ / : * ? " < > |

b). Wildcard characters can represent one or more characters. The question mark (?) wildcard can be used to represent any single character, and the asterisk (*) wildcard can be used to represent any character or group of characters that might match that position in other filenames.

c). Capitalization doesn't matter.

The Data Files option can contain more than one wild carded filename.

Specifying a list of files in this way is only possible for MS-Fit and MS-Tag.
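
The wildcard rules can be approximated in Python with the standard fnmatch module. This is only an approximation of Windows behaviour: fnmatch additionally treats square brackets as character classes, and the function name match_files is hypothetical.

```python
from fnmatch import fnmatchcase

def match_files(patterns, filenames):
    """Return the filenames matching any pattern, ignoring capitalization."""
    matched = []
    for name in filenames:
        # Lower-casing both sides makes the comparison case insensitive.
        if any(fnmatchcase(name.lower(), p.lower()) for p in patterns):
            matched.append(name)
    return matched

files = ["spot01.txt", "spot02.txt", "SPOT10.TXT", "blank.txt", "spot1.pkl"]
print(match_files(["spot??.txt"], files))
print(match_files(["*.pkl", "blank.*"], files))
```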

UNIX

The files must all be in the same directory and this directory should be specified in the Data File Directory item. The file names or regular expressions representing groups of files should be specified one per line in the Data Files item.

The regular expressions which can be used to represent a group of file names are described in the regular expressions section.

Capitalization matters.

The Data Files option can contain more than one filename regular expression.

Specifying a list of files in this way is only possible for MS-Fit and MS-Tag.


data set 1
>
data set 2
>
data set 3

Note that for security reasons the Upload Data From File item is cleared once you submit the form so you have to reselect the file before you can do a subsequent search. This is a function of the web browser rather than Protein Prospector.

Only a single data set can be specified for MS-Seq, MS-Bridge and MS-Product.

Paste the data into the data paste area. Multiple data sets separated by > characters can be entered into the paste area as shown above.


This option is used to limit the maximum number of hits displayed. For example if the maximum number of reported hits is set to 50 and there are 100 hits then only the first 50 hits are displayed.


The search is aborted if this parameter is exceeded.


This option allows a user defined comment or sample identifier to be added to the output.


This option is used for controlling the number of amino acids from the protein before the hit peptide that are displayed.


This option is used for controlling the number of amino acids from the protein after the hit peptide that are displayed.


Searches can be restricted to matching sequences containing particular amino acid(s) by checking the appropriate boxes. This information can be derived from the masses of immonium and related low-mass ions or high-mass ions indicating side-chain losses from the parent ion. The programs do not actually use the mass values but instead filter the matched sequence for the presence of the designated amino acid(s).

In MS-Tag the masses of immonium and related low-mass ions can also be placed directly in the fragment-ion mass window. MS-Tag invokes the same rules as conveyed in the check box chart, and converts the masses to AA characters and filters matched sequences as above for presence of the described amino acid(s). Protein Prospector server administrators can control these immonium ion rules by editing the immonium.txt file.


MS-Comp considers the 20 naturally occurring amino acids as a default. If you know that your unknown peptide doesn't contain particular amino acids you can narrow the range of the search by excluding them. You might also wish to exclude either Leucine or Isoleucine.


MS-Comp considers the 20 naturally occurring amino acids as a default. They can also optionally include the following:

  • d - A User Specified Amino Acid
  • e - A User Specified Amino Acid
  • f - A User Specified Amino Acid
  • g - A User Specified Amino Acid
  • h - A User Specified Amino Acid
  • i - A User Specified Amino Acid
  • j - A User Specified Amino Acid
  • k - A User Specified Amino Acid
  • l - A User Specified Amino Acid
  • m - Oxidized Methionine
  • s - Phosphorylated Serine
  • t - Phosphorylated Threonine
  • y - Phosphorylated Tyrosine
  • z - Homoserine Lactone

Some Protein Prospector parameters are specific to an instrument type. Server administrators can modify these parameters or add new instrument types by editing the instrument.txt file.


The regular expressions used are of the form used by the UNIX grep facility. Examples (type man grep on a UNIX system for full details):

[AB] The character is either A or B.
[A-IK-Y] The character is either alphabetically between A and I or between K and Y.
[^AB] The character is anything but A or B.
. Any single character is possible.
.* Used to represent a sequence of zero or more unknown characters.

Where stated the regular expressions are case insensitive.
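
The constructs listed above behave the same way in Python's re module, so they can be tried out with a short sketch. The ^ and $ anchors used here are standard grep syntax but are not in the list above, and the function name select is hypothetical. Like grep, re.search looks for the pattern anywhere in the string unless it is anchored.

```python
import re

def select(pattern, names, ignore_case=False):
    """Return the names in which the pattern is found."""
    flags = re.IGNORECASE if ignore_case else 0
    return [n for n in names if re.search(pattern, n, flags)]

names = ["A1.txt", "B1.txt", "C1.txt", "J1.txt", "Z1.txt"]
print(select("^[AB]", names))       # starts with A or B
print(select("^[A-IK-Y]", names))   # starts with A-I or K-Y
print(select("^[^AB]", names))      # starts with anything but A or B
print(select("^A.*txt$", names))    # A, any characters, then txt
print(select("^a", names, ignore_case=True))
```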


Separate Proteins

If this option is selected there is a separate section in the MS-Digest/MS-Bridge report for each protein.


Hide Protein Sequence

The complete protein sequence is normally displayed in the MS-Digest/MS-Bridge/MS-NonSpecific output. You can disable this display using the Hide Protein Sequence option.

This option is also available on the FA-Index Database Summary Report form. However the protein sequence is off by default for these reports.


It is possible to retrieve entries from the database by specifying either the Accession Number or the Index Number. The accession number is a unique identifier for a protein within the database. It will not change between subsequent revisions of the database and is external to the Protein Prospector package. The index number for a particular protein is internal to the Protein Prospector package and is likely to change when you update the database. Both the index number and the accession number are reported in Protein Prospector search results. Entries are generally more efficiently retrieved using index numbers.


Tick the display graph option if you want to include a spectrum in the output.


Contaminant Masses

A list of singly charged contaminant masses can be entered. Data peaks which are within the tolerance of the contaminant masses will be deleted from the data set before the search takes place. All charge states are considered.
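
A possible reading of this filter is sketched below in Python. It assumes that "all charge states are considered" means each singly charged contaminant mass is converted to the m/z it would have at higher charges, and that the tolerance is in ppm. The max_charge parameter, the function name and the mass values are made up for the example.

```python
PROTON = 1.007276467  # mass of a proton in Da

def remove_contaminants(peaks, contaminants, tol_ppm, max_charge=3):
    """Drop peaks that match a contaminant MH+ value at any charge state."""
    # m/z of each contaminant at charges 1..max_charge.
    targets = [(mh + (z - 1) * PROTON) / z
               for mh in contaminants
               for z in range(1, max_charge + 1)]

    def is_contaminant(mz):
        return any(abs(mz - t) / t * 1e6 <= tol_ppm for t in targets)

    return [mz for mz in peaks if not is_contaminant(mz)]

peaks = [842.510, 421.7585, 1045.564]
print(remove_contaminants(peaks, [842.5094, 2211.104], tol_ppm=50))
```

Here the first peak matches the contaminant at charge 1 and the second matches the same contaminant at charge 2, so only the third peak remains.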


13C%, 15N%, 18O%, 2H%

These parameters allow the abundance of the given stable isotope to be adjusted. E.g. if you specify 95% for 13C% then any 13C in an elemental composition is treated as 95% 13C and 5% 12C.
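
The effect on the average mass of a labelled carbon can be worked out with a few lines of Python. This is an illustration of the arithmetic only; how Protein Prospector applies the setting internally is not described here, and the function name is hypothetical.

```python
MASS_12C = 12.0
MASS_13C = 13.0033548  # rounded standard value

def labelled_carbon_mass(purity_percent):
    """Average mass of a nominally 13C atom at the given isotope purity."""
    fraction = purity_percent / 100.0
    return fraction * MASS_13C + (1.0 - fraction) * MASS_12C

print(round(labelled_carbon_mass(95), 5))
print(round(labelled_carbon_mass(100), 5))
```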


Glycosylation Filter

MS-Product has an option for separating glycosylation B and Y ions from normal peptide ions providing the fragment mass accuracy is sufficient (say better than 25 ppm). There is a menu on MS-Product for controlling this. You should also use the Match Z option when using any other option than No Filter.

No Filter

This is an appropriate setting for low resolution spectra or spectra without glycosylation ions.

Glyco B

This setting displays a spectrum with just glycosylation B ions as specified in the parameter file glyco_by.txt. The ones that match the proposed glycosylation are labelled. Other glycosylation peaks are not labelled unless the All Glyco checkbox is checked.

Glyco Y

This setting displays a spectrum with just glycosylation Y ions as specified in the parameter file glyco_by.txt. The ones that match the proposed glycosylation are labelled. Other glycosylation peaks are not labelled unless the All Glyco checkbox is checked. Y ions from different charge states are combined. The Y ions are displayed on a negative axis with Y0 as the zero point. If there are any Y ions that correspond with peptide ions these are shown on the positive axis. This is quite rare with sufficient mass accuracy.

Glyco B/Y

This setting combines the B and Y ions above into a single spectrum.

Glyco B/Y/Pep

This setting has both the glycosylation ions and the peptide ions. However the Max MSMS Peaks number is used on both sets of peaks separately.

No Glyco B/Y

This setting attempts to filter out the B and Y ions and just displays the remaining ions with the peptide ions labelled. Any glycosylation ions that aren't in the glyco_by.txt parameter file will still be shown.


Please give feedback, by sending e-mail to Prospector email.