ProteinProspector Server Administration

Purpose

This document provides instructions for common ProteinProspector administrative tasks on both UNIX and Microsoft Windows platforms.


  1. Obtain FASTA formatted sequence database files for the seqdb directory:

    Locations to download FASTA formatted database files via ftp:

    The Ludwignr database is a non-redundant database assembled from several smaller databases in the directory ftp://ftp.ch.embnet.org/pub/databases/nr_prot. Download the ones you are interested in individually, then concatenate them into a single file. On UNIX, use the cat command from the command line. On Windows NT, one option is to install Windows Explorer extensions that include a concatenation feature: http://www.funduc.com/explorer_extensions.htm
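    Where neither cat nor an Explorer extension is available, the concatenation step can be sketched in Python. This is only an illustration, not part of ProteinProspector; the function name and the file names in the usage comment are hypothetical. It appends a newline when a part does not end with one, so that the next ">" comment line starts on its own line.

```python
def concatenate_fasta(parts, destination):
    """Write the files listed in `parts`, in order, into `destination`."""
    with open(destination, "wb") as out:
        for part in parts:
            ended_with_newline = True
            with open(part, "rb") as src:
                while True:
                    chunk = src.read(1 << 20)  # copy in 1 MiB pieces
                    if not chunk:
                        break
                    out.write(chunk)
                    ended_with_newline = chunk.endswith(b"\n")
            if not ended_with_newline:
                # Keep the next file's ">" header on a line of its own.
                out.write(b"\n")


# Hypothetical usage:
# concatenate_fasta(["part1.fasta", "part2.fasta"], "combined.fasta")
```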

  2. Uncompress and rename the database files according to the format: Genpept.##, Owl.##, SwissProt.##, NCBInr.##, dbEST.##, Ludwignr.##. The prefix (Genpept, Owl, SwissProt, NCBInr, dbEST or Ludwignr) is a required part of the name; it tells the software which dialect of the FASTA comment line the database uses. You may also use the corresponding lowercase prefixes gen, owl, swp, nr, or dbest, for example for a second database in the same format as the uppercase one. For more details, please read the FA-Index manual, particularly the filenaming sections.

  3. Create indices in the seqdb directory for each database by using the FA-Index program. The indices are necessary for preliminary filtering by species, protein MW and protein pI. FA-Index must be run after each update of a database, even if the update only adds new entries to the end of the original file.

    If you really want to know what FA-Index does and why, please read the manual. Don't even think about trying to use proprietary databases or update databases daily, UNLESS you read the FA-Index manual, particularly the generic database filenaming sections.

    FA-Index will create a file with a .usp suffix (e.g. Genpept.r95.usp) in which it writes the comment line of each FASTA entry for which it cannot parse out the species. Viewing this file can help troubleshoot FASTA format problems for anyone using proprietary databases.

  4. Update the database list on the HTML forms.


Edit the file: acclinks.txt

The database accession number in the search results has an HTML link to retrieve the complete entry, including comments, from a remote database. In order for this link to be created the programs need to know the URL for the remote database. This is accomplished through parameters contained in the acclinks.txt file. Occasionally the URLs for the remote databases may need to be updated, or new ones added for a new database. This requires editing the acclinks.txt file.

Within the acclinks.txt file an entry for an HTML link from the accession number MUST contain 1 line:

The line must contain the following information:

  1. The prefix name for the database as listed in the HTML input page for each program. The prefix should be long enough to uniquely identify the database or set of databases you wish to refer to.
  2. The URL to link to if the accession number for the entry is added to the end of the URL. The URL addition is internal to the programs and is expected to retrieve a fully annotated entry from a remote database.

    Note that this link need not be to a sequence database. The link could be to whatever a ProteinProspector server administrator specifies.

Example:

Below is an example of the entries for Genpept in acclinks.txt:

Genpept http://www3.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=p&form=6&dopt=ng&uid=
gen http://www3.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=p&form=6&dopt=ng&uid=

The lowercase prefixes gen, owl, swp, or nr are intended to be used for a second database that is of the same format as the uppercase one. See Linking for creating links into NCBI databases.

As mentioned above the prefix name can refer to a single database or a set of databases. For example if you have two user created databases called PA3_mouse and PA33_mouse, an entry in the acclinks.txt file of the form:

PA3 some_url_prefix

would give the databases the same accession number link. On the other hand entries of the following form:

PA3 some_url_prefix
PA33 another_url_prefix

would give the databases different accession number links.
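
The prefix behaviour described above can be sketched as follows. This is a minimal illustration, not the programs' actual code: it assumes that matching is a case-sensitive "starts with" test and that the longest matching prefix wins, which is one reading of the PA3/PA33 example. The function names and the example.org URLs are hypothetical.

```python
def parse_acclinks(text):
    """Return {prefix: url} from acclinks.txt-style text (one 'prefix url' pair per line)."""
    links = {}
    for line in text.splitlines():
        fields = line.split(None, 1)  # split on the first run of white space
        if len(fields) == 2:
            links[fields[0]] = fields[1].strip()
    return links


def accession_url(links, database_name, accession):
    """Append the accession number to the URL of the longest matching prefix.

    Assumption: the longest prefix wins. Returns None when no prefix matches.
    """
    matches = [prefix for prefix in links if database_name.startswith(prefix)]
    if not matches:
        return None
    return links[max(matches, key=len)] + accession


links = parse_acclinks(
    "PA3 http://example.org/first?id=\n"
    "PA33 http://example.org/second?id=\n"
)
print(accession_url(links, "PA3_mouse", "12345"))
print(accession_url(links, "PA33_mouse", "12345"))
```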

ProteinProspector server administrators who find improved options for links to publicly available databases are encouraged to send the modified parameter files to the ProteinProspector developers for inclusion in subsequent ProteinProspector releases.


Edit the file: idxlinks.txt

The MS-Digest index number in the search results has an HTML link to retrieve an MS-Digest listing for the matched database entry. In order for this link to be created the programs need to know the URL to MS-Digest and some default parameters. This is accomplished through information contained in the idxlinks.txt file. A server administrator can customize these parameters by editing the idxlinks.txt file.

Within the idxlinks.txt file an entry for an HTML link from the MS-Digest index number MUST contain 2 lines:

The lines must contain the following information:

  1. The program name for which the specified HTML link will be created from the index number link in the program's output.
  2. The URL to link to if the enzyme, MS-Digest index number, and modified AA parameters (from MS-Fit only) for the entry are added to the end of the provided URL. The URL addition is internal to the programs and is expected to provide an MS-Digest listing for the database entry corresponding to the index number.

    Note that this link need not be the same for each ProteinProspector program creating the link, and that the MS-Digest parameters can be customized. Furthermore, this link need not be to MS-Digest at all; the link could be to whatever a ProteinProspector server administrator specifies.

Example:

Below is an example of the entries for msfit and mstag in idxlinks.txt:

msfit
MSDIGEST?
mstag
MSDIGEST?mod_AA=Peptide+N-terminal+Gln+to+pyroGlu&mod_AA=Oxidation+of+M&mod_AA=Protein+N-terminus+Acetylated
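
As a rough check of an edited file, the two-line entries can be read and the default parameters listed with a short Python sketch. It is only an illustration: the function names are hypothetical, and it assumes blank lines can be ignored and that everything after the "?" is an ordinary URL query string.

```python
from urllib.parse import parse_qs

EXAMPLE = """\
msfit
MSDIGEST?
mstag
MSDIGEST?mod_AA=Peptide+N-terminal+Gln+to+pyroGlu&mod_AA=Oxidation+of+M&mod_AA=Protein+N-terminus+Acetylated
"""


def parse_two_line_links(text):
    """Return {program_name: url} from alternating name/URL lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2:
        raise ValueError("each entry needs a program name line and a URL line")
    return dict(zip(lines[0::2], lines[1::2]))


def default_parameters(url):
    """Return the default parameters found after the '?' as {name: [values]}."""
    _, _, query = url.partition("?")
    return parse_qs(query)


entries = parse_two_line_links(EXAMPLE)
for value in default_parameters(entries["mstag"])["mod_AA"]:
    print(value)
```

The seqlinks.txt file uses the same two-line layout, so the same sketch applies to it.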


Edit the file: seqlinks.txt

The peptide sequence in the search results has an HTML link to retrieve an MS-Product listing for the peptide sequence. In order for this link to be created the programs need to know the URL to MS-Product and some default parameters. This is accomplished through information contained in the seqlinks.txt file. A server administrator can customize these parameters by editing the seqlinks.txt file.

Within the seqlinks.txt file an entry for an HTML link from the peptide sequence MUST contain 2 lines:

The lines must contain the following information:

  1. The program name for which the specified HTML link will be created from the peptide sequence link in the program's output.
  2. The URL to link to if the peptide sequence is added to the end of the URL. The URL addition is internal to the programs and is expected to provide an MS-Product listing for the peptide sequence. Parameters for PSD fragmentation are the defaults specified in seqlinks.txt.

    Note that the parameters need not be for PSD and the link need not be to MS-Product. The link could be to whatever a ProteinProspector server administrator specifies, for example BLAST or another sequence homology search program.

Example:

Below is an example of the entry for mstag in seqlinks.txt:

mstag
MSPROD?parent_mass_convert=average&it=i&it=m&it=a&it=b&it=y&it=I&it=h&it=n&it=B


Edit the file: species.txt

In order to limit searches to a particular species, or a collection of species, the programs have to correlate the species name selected in the HTML form with the species names in the database entries. This is accomplished through the species alias file species.txt.

There are three types of entry in the species.txt file:

         
  • single species entries
  • multiple species entries
  • excluded species entries

Within the species.txt file a single species entry must contain at least ONE line. Entries are separated by a line containing only the ">" symbol.

Line 1 contains the species name as it appears in the HTML input files. Line 1 is only used to relate the HTML input entry to possible aliases in the databases. Every species in the HTML input pages should have an entry in this file. If one of those species is NOT present in any database leave the entry with only one line. No error message will be generated that way.

All other lines should contain names (aliases) by which the species may be found in the databases. The aliases can be in any order.

Examples:

>
HELICOBACTER PYLORI
HELPY
HELICOBACTER PYLORI
>
HOMO SAPIENS
HUMAN
H. SAPIENS
H.SAPIENS
HUMHBC
HOMO SAPIENS
>

In the first example HELPY is a typical SwissProt species alias and HELICOBACTER PYLORI is typical of what might be found in Genpept. A database such as Owl, which contains entries from several sources, would typically use several aliases.

If a program error directs you to the species.txt file, the likely problems are:

  1. Line 1 of the species doesn't match the HTML input page.
  2. The HTML input page lists a species which does not have an entry in this file.

Multiple species entries allow you to group species together in a search. A typical example which restricts the search to the HOMO SAPIENS, BOS TAURUS and SUS SCROFA species is:

[Mammals]
HOMO SAPIENS
BOS TAURUS
SUS SCROFA
>

Line 1 contains the identifier for the multiple species entry as it appears in the HTML input files. The identifier is enclosed by the '[' character and the ']' character as in the example. Every multiple species entry in the HTML input pages should have an entry in the species.txt file.

The other lines should contain the names of the species that you wish to include in the search. These can be either multiple or single species entries in the species.txt file.

Excluded species entries allow you to exclude species from a search. A typical example which includes all species except HOMO SAPIENS, BOS TAURUS and SUS SCROFA is:

]Model Organisms[
HOMO SAPIENS
BOS TAURUS
SUS SCROFA
>

Line 1 contains the identifier for the excluded species entry as it appears in the HTML input files. The identifier is enclosed by the ']' character and the '[' as in the example. Every excluded species entry in the HTML input pages should have an entry in the species.txt file.

The other lines should contain the names of the species that you wish to exclude. The species that you wish to exclude MUST have single species entries in the species.txt file.
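
The three entry types can be told apart from line 1 alone, as the following Python sketch shows. It is an illustration under stated assumptions, not the programs' parser: the function names are hypothetical, blank lines are ignored, and BOVIN is used only as a sample alias.

```python
SAMPLE = """\
>
HOMO SAPIENS
HUMAN
HOMO SAPIENS
>
BOS TAURUS
BOVIN
>
[Mammals]
HOMO SAPIENS
BOS TAURUS
>
]Model Organisms[
HOMO SAPIENS
>
"""


def parse_species(text):
    """Return {line 1: (kind, other lines)} for each '>'-separated entry."""
    entries, block = {}, []
    for raw in text.splitlines() + [">"]:  # the extra '>' flushes the last entry
        line = raw.strip()
        if line == ">":
            if block:
                name = block[0]
                if name.startswith("[") and name.endswith("]"):
                    kind = "multiple"
                elif name.startswith("]") and name.endswith("["):
                    kind = "excluded"
                else:
                    kind = "single"
                entries[name] = (kind, block[1:])
            block = []
        elif line:
            block.append(line)
    return entries


def listed_aliases(entries, name):
    """Collect the database aliases behind an entry, following group members.

    For an excluded entry these are the aliases to leave out of the search.
    """
    kind, members = entries[name]
    if kind == "single":
        return set(members)
    aliases = set()
    for member in members:
        aliases |= listed_aliases(entries, member)
    return aliases


entries = parse_species(SAMPLE)
print(sorted(listed_aliases(entries, "[Mammals]")))
```

A name listed in a group but missing from the file raises a KeyError here, which is a convenient way to spot the mismatches described above.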


The list of possible constant modifications is now generated automatically from the list of possible variable modifications.


Edit the aa.txt file.

Detailed information on all amino acids used in the programs is located on the server in the file: aa.txt.

You must edit this file to add or change an amino acid or modify amino acid pK values.

Within this file an entry for an amino acid MUST contain 9 lines:
line 1) contains a name for the amino acid, but isn't used anywhere.
line 2) contains a single letter code for the amino acid.
line 3) contains the elemental formula for each residue.
lines 4) and 5) contain elemental formulas for side-chains that are used in calculating d and w ions. If there are no beta substituents, or they are irrelevant, then enter 0 (zero) on these lines.
line 6) contains the pk_C_term for the amino acid.
line 7) contains the pk_N_term for the amino acid.
line 8) contains the pk_acidic_sc for the amino acid.
line 9) contains the pk_basic_sc for the amino acid.

Below is an example of the entry for Isoleucine in aa.txt:

Isoleucine
I
C6 H11 N1 O1
C1 H3
C2 H5
3.55
7.5
n/a
n/a
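
Before installing an edited aa.txt it can help to check that each entry really has nine lines and that the pK lines are numeric. The Python sketch below is a hypothetical checker, not part of the package; it assumes, from the Isoleucine example, that "n/a" marks a pK value that does not apply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AminoAcid:
    name: str
    code: str
    formula: str
    side_chain_1: str   # line 4; "0" if not relevant
    side_chain_2: str   # line 5; "0" if not relevant
    pk_c_term: Optional[float]
    pk_n_term: Optional[float]
    pk_acidic_sc: Optional[float]
    pk_basic_sc: Optional[float]


def _pk(value):
    """Convert a pK line to a float; 'n/a' (assumed marker) becomes None."""
    return None if value.lower() == "n/a" else float(value)


def parse_amino_acid(lines):
    """Build an AminoAcid from the nine lines of one aa.txt entry."""
    if len(lines) != 9:
        raise ValueError("an amino acid entry must contain 9 lines")
    name, code, formula, sc1, sc2 = (line.strip() for line in lines[:5])
    if len(code) != 1:
        raise ValueError("line 2 must be a single letter code")
    pks = [_pk(line.strip()) for line in lines[5:]]
    return AminoAcid(name, code, formula, sc1, sc2, *pks)


ISOLEUCINE = """Isoleucine
I
C6 H11 N1 O1
C1 H3
C2 H5
3.55
7.5
n/a
n/a"""

print(parse_amino_acid(ISOLEUCINE.splitlines()))
```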

Make sure the elements in your amino acid are present in the file elements.txt. See also, To Add/Change Elements.

If you add a new amino acid, please send the modified parameter file to the ProteinProspector developers for inclusion in subsequent ProteinProspector releases.


Edit the elements.txt file.

Detailed information on all elements used in the programs is located on the server in the elements.txt file. You must edit this file to add or modify an element.

Within the file elements.txt an entry for an element MUST contain 1 line:

The line contains the following information:
a). The symbol for the element.
b). The valency of the element.
c). The number of isotopes listed on the line.
d). A mass/abundance pair for each isotope.

Below is an example of the entry for hydrogen:

H 1 2 1.007825035 .99985 2.014101779 0.00015
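
The layout of an element line can be checked with a few lines of Python. This is a sketch with hypothetical function names; it takes the monoisotopic mass to be the mass of the most abundant isotope and the average mass to be the abundance-weighted sum, which are the usual conventions rather than anything stated in elements.txt.

```python
def parse_element(line):
    """Return (symbol, valency, [(mass, abundance), ...]) for one elements.txt line."""
    fields = line.split()
    symbol, valency, count = fields[0], int(fields[1]), int(fields[2])
    numbers = [float(value) for value in fields[3:]]
    if len(numbers) != 2 * count:
        raise ValueError("expected one mass/abundance pair per isotope")
    isotopes = list(zip(numbers[0::2], numbers[1::2]))
    return symbol, valency, isotopes


def monoisotopic_mass(isotopes):
    """Mass of the most abundant isotope."""
    return max(isotopes, key=lambda pair: pair[1])[0]


def average_mass(isotopes):
    """Abundance-weighted mean of the isotope masses."""
    return sum(mass * abundance for mass, abundance in isotopes)


symbol, valency, isotopes = parse_element(
    "H 1 2 1.007825035 .99985 2.014101779 0.00015"
)
print(symbol, valency, monoisotopic_mass(isotopes), round(average_mass(isotopes), 6))
```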

If you add a new element, please send the modified parameter file to the ProteinProspector developers for inclusion in subsequent ProteinProspector releases.


Edit the enzyme.txt file.

Detailed information on all enzymatic digests used in the programs is located on the server in the enzyme.txt file. You must edit this file to add or modify the rules for an enzymatic digest.

Within this file an entry for an enzymatic digest MUST contain 4 lines:
line 1) contains a name for the enzymatic digest;
line 2) contains a list of cleavage amino acids;
line 3) contains a list of exception amino acids (a '-' character indicates no exceptions);
line 4) either C for cleavage on the C terminus side of an amino acid or N for cleavage on the N terminus side.

Below is an example of the entry for Trypsin:

Trypsin
KR
P
C
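
To see what a four-line entry implies for a sequence, the rules can be applied with a small Python sketch. It is not the programs' digestion code. It assumes that an exception amino acid blocks cleavage when it is the residue on the other side of the cleavage point (as with Trypsin not cleaving before P); the function name and the N-terminal rule in the last line are hypothetical.

```python
def digest(sequence, cleave_at, exceptions, terminus):
    """Split `sequence` according to one enzyme.txt-style rule.

    cleave_at  -- line 2, the cleavage amino acids
    exceptions -- line 3, or '-' for no exceptions
    terminus   -- line 4, 'C' or 'N'
    """
    if terminus not in ("C", "N"):
        raise ValueError("line 4 must be C or N")
    no_exceptions = exceptions == "-"
    pieces, start = [], 0
    for i in range(1, len(sequence)):
        before, after = sequence[i - 1], sequence[i]
        # For C-terminal cleavage the site precedes the cut; for N-terminal it follows it.
        site, neighbour = (before, after) if terminus == "C" else (after, before)
        if site in cleave_at and (no_exceptions or neighbour not in exceptions):
            pieces.append(sequence[start:i])
            start = i
    pieces.append(sequence[start:])
    return pieces


print(digest("AKPRGKT", "KR", "P", "C"))   # the Trypsin entry
print(digest("AADBBDC", "D", "-", "N"))    # a hypothetical N-terminal rule
```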

The file enzyme_comb.txt is used to specify enzyme combinations. You can combine the cleavage rules for two or more enzymes by having them on the same line in this file separated by a '/' character. For example to have an option which combines the cleavage rules for CNBr and Trypsin you would need the following line:

Trypsin/CNBr

It is possible to mix enzymes which cleave on the N-terminus side with those that cleave on the C-terminus side.

If you add a new enzymatic digest, please send the modified parameter file to the ProteinProspector developers for inclusion in subsequent ProteinProspector releases.


Edit the imm.txt file.

The file contains the immonium ion elemental formulae and corresponding compositional information for use by ProteinProspector programs.

The first 2 entries in the file are for the immonium tolerance and the minimum fragment ion mass (both in Da). This is followed by a list of immonium ions.

An entry for an immonium ion contains:

1). The elemental formula using elements defined in elements.txt.

2). The compositional information. List all the amino acids corresponding to the elemental formula.

3). Ions labelled as M are major peaks; these are used to include an amino acid when using immonium ions to extract compositional ions in MS-Tag and MS-Seq. Minor ions are labelled m and are only likely to be present alongside major ions. They are reported in the immonium and related ions section of the MS-Product report.

4). Use I if the ion is an immonium ion or - otherwise.

5). A list of amino acids to exclude if the mass is missing or a dash (-) character if there are no amino acids to exclude. Excluding amino acids on the basis of missing peaks is a feature that can be turned off.

The fields must be separated by the | character.

For example:

C2 H6 N O|S|M|I|-
C4 H8 N|P|M|I|P
C4 H8 N|R|M|-|-
C4 H10 N|V|M|I|-
C3 H8 N O|T|M|I|-
C5 H10 N|KQ|M|-|-
C5 H12 N|IL|M|I|IL
C3 H7 N2 O|N|M|I|-
C4 H11 N2|R|M|-|-
C3 H6 N O2|D|M|I|-
C4 H10 N3|R|m|-|-
C5 H13 N2|K|M|I|-
C4 H9 N2 O|Q|M|I|-
C4 H8 N O2|E|M|I|-
C4 H10 N S|M|M|I|-
C5 H8 N3|H|M|I|H
C5 H10 N3|R|M|-|R
C8 H10 N|F|M|I|-
C6 H8 N O2|P|M|-|-
C6 H13 N2 O|K|m|-|-
C5 H9 N2 O2|Q|m|-|-
C8 H10 N O|Y|M|I|-
C6 H8 N3 O|H|m|-|-
C10 H11 N2|W|M|I|-
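
The five fields can be validated with a short Python sketch before the file is installed. The names below are hypothetical and the checks reflect only the field descriptions given above.

```python
from typing import NamedTuple


class ImmoniumIon(NamedTuple):
    formula: str
    amino_acids: str          # field 2, compositional information
    is_major: bool            # field 3, True for M, False for m
    is_immonium: bool         # field 4, True for I, False for -
    exclude_if_missing: str   # field 5, empty when the field is '-'


def parse_immonium_line(line):
    """Split one '|'-separated ion line into its five fields."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 5:
        raise ValueError("expected 5 fields separated by '|'")
    formula, amino_acids, peak, ion_type, exclude = fields
    if peak not in ("M", "m"):
        raise ValueError("field 3 must be M or m")
    if ion_type not in ("I", "-"):
        raise ValueError("field 4 must be I or -")
    return ImmoniumIon(
        formula,
        amino_acids,
        peak == "M",
        ion_type == "I",
        "" if exclude == "-" else exclude,
    )


print(parse_immonium_line("C5 H12 N|IL|M|I|IL"))
print(parse_immonium_line("C4 H10 N3|R|m|-|-"))
```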

Note that the file immonium.txt is still used by MS-Tag Unknome. Instructions for editing it are contained in the file.

Suggestions for improving this scheme should be sent to the ProteinProspector developers for inclusion in subsequent ProteinProspector releases.


Edit the usermod.txt file.

Detailed information on the user defined modifications used in MS-Fit and MS-Digest is located on the server in the usermod.txt file. You must edit this file to add or modify the rules for user defined modifications.

Within this file an entry for a user defined modification MUST contain 4 lines:
line 1) contains a name for the modification;
line 2) contains the code to be used for the modification in MS-Fit and MS-Digest reports;
line 3) contains an elemental formula for the modification (elements can be negative - eg Amidation would be N H O-1);
line 4) contains a list of amino acids to check for the modification.

Below is an example of the entry for Phosphorylation of S, T and Y:

Phosphorylation of S, T and Y
PO4
P O3 H
STY
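
The elemental formula on line 3, including negative counts, determines the mass shift the modification produces. The Python sketch below works this out for the two formulas mentioned above. It is illustrative only: the function names are hypothetical, the token pattern is an assumption about how formulas are written, and the mass table holds rounded standard monoisotopic masses rather than values read from elements.txt.

```python
import re

# Rounded standard monoisotopic masses in Da (not read from elements.txt).
MONOISOTOPIC = {
    "H": 1.007825,
    "C": 12.0,
    "N": 14.003074,
    "O": 15.994915,
    "P": 30.973762,
    "S": 31.972071,
}


def parse_formula(formula):
    """Turn 'P O3 H' or 'N H O-1' into {symbol: count}; a missing count means 1."""
    counts = {}
    for token in formula.split():
        match = re.fullmatch(r"([A-Z][a-z]?)(-?\d+)?", token)
        if match is None:
            raise ValueError("cannot parse formula token: " + token)
        symbol, number = match.groups()
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def mass_shift(formula, masses=MONOISOTOPIC):
    """Monoisotopic mass change implied by the formula."""
    return sum(masses[symbol] * count for symbol, count in parse_formula(formula).items())


print(parse_formula("N H O-1"))
print(round(mass_shift("P O3 H"), 4))    # phosphorylation
print(round(mass_shift("N H O-1"), 4))   # amidation
```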


MS-Fit/MS-Bridge

Edit the fit_graph.par.txt file.

MS-Tag/MS-Product

Edit the pr_graph.par.txt file.

MS-Isotope

Edit the sp_graph.par.txt file.

The graphs in the package are Java applets which use the information in their corresponding parameter file to control their appearance.

The files contain comment lines (starting with a # character) explaining the information fields beneath them. The following information is stored in each file:

  • The graph width in pixels.
  • The graph height in pixels.
  • The width of the graph axes and the lines used to draw the graph in pixels.
  • The graph background color (red green and blue values which must be between 0 and 255).
  • The graph axes color.
  • The default peak color.
  • The number of application colors (should be set to zero for MS-Isotope).
  • The application colors (not relevant for MS-Isotope).
  • The default font - the font for all text except the peak labels.
  • The peak label font.
  • The X-Axis label.

Colors are specified as 3 integers for the red, green and blue intensities respectively. The intensity values must be between 0 and 255.

A font specification is made up of a font family (Dialog, Helvetica, TimesRoman, Courier or Symbol), a font style identifier (PLAIN, BOLD or ITALIC) and a point size.


Edit the colors.txt file.

The following colors may be defined to override the default values.

  • body_background_color
  • text_color
  • link_color
  • vlink_color
  • fit_hit_color
  • tag_hit_color
  • msprod_charge2_color
  • msprod_charge3_color
  • msprod_charge4_color

Colors are defined as name value pairs separated by white space; the body_background_color is defined below as an example. A two digit hexadecimal number is used to define each of the red, green and blue values (eg. 000000 is black, FFFFFF is white, FF0000 is red, 00FF00 is green, 0000FF is blue, AAAAAA is grey, FFFF00 is yellow, etc).

For example:

body_background_color	DDFFDD

To add an Abort Search button to the programs add the following name/value pair:

kill_button 1

If the kill button is a security concern then you can disable it as follows:

kill_button 0
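
A quick way to check the name/value pairs in colors.txt, and to see which red, green and blue intensities a hexadecimal value stands for, is sketched below in Python. The function names are hypothetical and the sketch is not part of the package.

```python
def parse_colors(text):
    """Return {name: value} from white-space separated name/value lines."""
    settings = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            settings[fields[0]] = fields[1]
    return settings


def hex_to_rgb(value):
    """Convert a six digit hexadecimal color such as DDFFDD to (red, green, blue)."""
    if len(value) != 6:
        raise ValueError("a color needs three two-digit hexadecimal numbers")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


settings = parse_colors("body_background_color\tDDFFDD\nkill_button 1\n")
print(hex_to_rgb(settings["body_background_color"]))
print(settings["kill_button"])
```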


Edit the instrument.txt file.

An entry for an instrument MUST contain at least ONE line. Instruments are separated by a line containing only the ">" symbol.

line 1) Must contain the instrument name as it appears in the html input files. Every instrument in the html input pages should have an entry in this file.

This can be followed by optional lines which override the default instrument parameters. The additional lines have the form of name value pairs separated by a space. The possible parameters are listed below:

1). A list of amino acids which lose NH3 in MS/MS fragmentation.

name: nh3_loss
default value: RKNQ

2). A list of amino acids which lose H2O in MS/MS fragmentation.

name: h2o_loss
default value: STED

3). A list of positive charge bearing amino acids.

name: pos_charge
default value: RHK

4). The maximum internal ion mass.

name: max_internal_ion_mass
default value: 700.0

5). The number of decimal places used when printing out parent ions in reports.

name: parent_precision
default value: 4

6). The number of decimal places used when printing out fragment ions in reports.

name: fragment_precision
default value: 2

7). A list of fragment ion types (one per line) which occur in MS/MS fragmentation.

name: it
possible values: a
                 a-H2O
                 a-NH3
                 a-H3PO4
                 b
                 b-H2O
                 b-NH3
                 b+H2O
                 b-H3PO4
                 b-SOCH4
                 c
                 x
                 y
                 y-H2O
                 y-NH3
                 y-H3PO4
                 y-SOCH4
                 Y
                 z
                 I                      Internal ions.
                 C                      C-ladder ions.
                 N                      N-ladder ions.
                 i                      Immonium and low mass ions.
                 m
                 d
                 v
                 w
                 h                      MH-H2O, b-H2O if b, y-H2O if y.
                 n                      a-NH3 if a, b-NH3 if b, y-NH3 if y.
                 B                      b+H2O if b.
                 P                      a-H3PO4 if a, b-H3PO4 if b, y-H3PO4 if y.
                 S                      b-SOCH4 if b, y-SOCH4 if y.
                 MH-H2O

The following ion types are possible in MS-Tag.

a,a-NH3,a-H2O,a-H3PO4,b,b-H2O,b-NH3,b+H2O,b-H3PO4,b-SOCH4,c
y,y-NH3,y-H2O,y-H3PO4,y-SOCH4
I,C,N,h,n,B,P,S

None are defined by default.

Below is an example of the entry for MALDI-TOF:

MALDI-TOF
nh3_loss RKQ
h2o_loss ST
pos_charge RHK
it a-NH3
it a
it b
it b-NH3
it b-H2O
it b+H2O
it y
it y-NH3
it y-H2O
it I
>
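
How the optional lines override the defaults can be sketched in Python as follows. This is an illustration, not the programs' parser: the function name is hypothetical, values are kept as text, and the only behaviour assumed beyond the description above is that blank lines are ignored.

```python
# Default values as listed above; 'it' has no default.
DEFAULTS = {
    "nh3_loss": "RKNQ",
    "h2o_loss": "STED",
    "pos_charge": "RHK",
    "max_internal_ion_mass": "700.0",
    "parent_precision": "4",
    "fragment_precision": "2",
}

MALDI_TOF = """\
MALDI-TOF
nh3_loss RKQ
h2o_loss ST
pos_charge RHK
it a-NH3
it a
it b
it b-NH3
it b-H2O
it b+H2O
it y
it y-NH3
it y-H2O
it I
>
"""


def parse_instruments(text):
    """Return {instrument name: parameters}, with the defaults filled in."""
    instruments, block = {}, []
    for raw in text.splitlines() + [">"]:  # the extra '>' flushes the last entry
        line = raw.strip()
        if line == ">":
            if block:
                params = dict(DEFAULTS)
                ion_types = []
                for entry in block[1:]:
                    name, _, value = entry.partition(" ")
                    if name == "it":
                        ion_types.append(value.strip())  # 'it' may repeat
                    else:
                        params[name] = value.strip()
                params["it"] = ion_types
                instruments[block[0]] = params
            block = []
        elif line:
            block.append(line)
    return instruments


maldi = parse_instruments(MALDI_TOF)["MALDI-TOF"]
print(maldi["nh3_loss"], maldi["max_internal_ion_mass"], maldi["it"])
```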


Edit the homology.txt file.

An entry for a homology/modified amino acid matrix MUST contain at least TWO lines. Homology matrices are separated by a line containing only the ">" symbol.

line 1) Must contain the matrix name as it appears in the html input files. Every matrix in the html input pages should have an entry in this file.

Subsequent lines (of which there must be at least one) should contain the following information separated by a space:

a). an amino acid;

b). a list of amino acids that the amino acid in a) can mutate or be modified to.

Any of the amino acids in b) may be followed by (N) or (C) to denote that the modification can only take place at the N or C terminus of a peptide.

Below are examples of entries for a comprehensive homology option and for an option which allows one unknown amino acid per peptide:

homology
A CDEFGHIKLMNPQRSTVWYmq(N)sty
C ADEFGHIKLMNPQRSTVWYmq(N)sty
D ACEFGHIKLMNPQRSTVWYmq(N)sty
E ACDFGHIKLMNPQRSTVWYmq(N)sty
F ACDEGHIKLMNPQRSTVWYmq(N)sty
G ACDEFHIKLMNPQRSTVWYmq(N)sty
H ACDEFGIKLMNPQRSTVWYmq(N)sty
I ACDEFGHKLMNPQRSTVWYmq(N)sty
K ACDEFGHILMNPQRSTVWYmq(N)sty
L ACDEFGHIKMNPQRSTVWYmq(N)sty
M ACDEFGHIKLNPQRSTVWYmq(N)sty
N ACDEFGHIKLMPQRSTVWYmq(N)sty
P ACDEFGHIKLMNQRSTVWYmq(N)sty
Q ACDEFGHIKLMNPRSTVWYmq(N)sty
R ACDEFGHIKLMNPQSTVWYmq(N)sty
S ACDEFGHIKLMNPQRTVWYmq(N)sty
T ACDEFGHIKLMNPQRSVWYmq(N)sty
V ACDEFGHIKLMNPQRSTWYmq(N)sty
W ACDEFGHIKLMNPQRSTVYmq(N)sty
Y ACDEFGHIKLMNPQRSTVWmq(N)sty
>

Unknown Amino Acid
X ACDEFGHIKLMNPQRSTVWY
>
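
The structure of a matrix line, including the (N) and (C) markers, can be read with the Python sketch below. It is an illustration with a hypothetical function name; codes are kept exactly as written, and characters that fit neither a letter nor a terminus marker are silently skipped.

```python
import re


def parse_substitutions(line):
    """Return (amino acid, [(target, terminus), ...]) for one matrix line.

    terminus is 'N', 'C' or None when the target is not restricted.
    """
    source, _, targets = line.strip().partition(" ")
    parsed = []
    for target, terminus in re.findall(r"([A-Za-z])(?:\(([NC])\))?", targets):
        parsed.append((target, terminus or None))
    return source, parsed


source, targets = parse_substitutions("A CDEFGHIKLMNPQRSTVWYmq(N)sty")
print(source, len(targets))
print([pair for pair in targets if pair[1] is not None])
```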


Computer optimisation options are currently only relevant to the Windows version.

Edit the computer.txt file.

The following parameters are currently available:

1). The default memory block size used in memory mapping.

name: block_size
default value: 65536

This number is applicable for Windows systems and should not be changed.

2). The number of blocks to use as a default memory map size when reading a database.

name: num_blocks
minimum value: 1
default value: 256
maximum value: 16384

The default value maps in 16 MBytes at a time (256 blocks of 65536 bytes); the maximum value maps in 1 GByte. You might want to vary this parameter to see whether it affects search times. If you have a lot of RAM, a much larger number may be appropriate.
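
The sizes quoted for num_blocks follow from multiplying it by block_size. A small Python sketch of the arithmetic, with a hypothetical function name:

```python
BLOCK_SIZE = 65536  # default block_size in bytes


def map_size_bytes(num_blocks, block_size=BLOCK_SIZE):
    """Size in bytes of the memory map for a given num_blocks setting."""
    if not 1 <= num_blocks <= 16384:
        raise ValueError("num_blocks must be between 1 and 16384")
    return num_blocks * block_size


for blocks in (1, 256, 16384):
    print(blocks, map_size_bytes(blocks) // 1024, "KBytes")
```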