Description, Instructions, and Tips for FA-Index

Purpose

This document provides instructions for FA-Index.

You need not bother reading this document unless you are administering a server running the Protein Prospector programs.



Introduction

FA-Index was developed for the following reasons:

  1. To enable an internal means for the Protein Prospector programs to store an index number when a hit is recorded during a search, then later use that number to retrieve that database entry for output/report generation purposes. This cuts down the memory requirements for program execution.
  2. To provide indices which can be used to accelerate searches that are pre-filtered by intact protein MW, protein pI and/or taxonomy.
  3. To aid the Protein Prospector programs in addressing some of the hindrances inherent in FASTA comment line format heterogeneity.
  4. To allow users to create subset databases based on either a Taxonomy/Protein MW pre-filter or the results of a previous search. Searches performed on these smaller databases are often very much faster than searches performed on complete databases.
  5. To allow users to create databases containing user defined proteins.
  6. To provide a simple means of looking at the content of a database.
  7. To create random, reversed and concatenated databases.
  8. To create a six frame protein translation of a database containing DNA sequences.
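
As an illustration of reason 1, the following is a minimal sketch of an offset index: one byte offset is recorded per entry, and an index number is later used to seek straight to that entry. The function names (build_offset_index, fetch_entry) and the in-memory list are assumptions made for this example; the actual layout of the FA-Index index files is not described here.

```python
import io

def build_offset_index(handle):
    """Return a list of byte offsets, one per '>' comment line (hypothetical helper)."""
    offsets = []
    pos = handle.tell()
    line = handle.readline()
    while line:
        if line.startswith(b">"):
            offsets.append(pos)
        pos = handle.tell()
        line = handle.readline()
    return offsets

def fetch_entry(handle, offsets, index_number):
    """Seek to the entry with the given index number and return (comment, sequence)."""
    handle.seek(offsets[index_number])
    comment = handle.readline().decode("ascii").rstrip("\r\n")
    chunks = []
    for raw in iter(handle.readline, b""):
        if raw.startswith(b">"):
            break  # start of the next entry
        chunks.append(raw.decode("ascii").strip())
    return comment, "".join(chunks)

# Tiny in-memory database used only for demonstration.
demo = io.BytesIO(b">entry1 first\nMPPS\nISAF\n>entry2 second\nQAAY\n")
index = build_offset_index(demo)
print(fetch_entry(demo, index, 1))
```

A search only needs to keep the small integer index number for each hit; the comment line and sequence are read back from disk when the report is generated.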

The FASTA format for sequence databases was originally developed by Pearson for use with the FASTA program. Today it is probably the most widely used standard format, primarily because its brevity results in the smallest possible file size for sequences.

An example of the format is shown below:

>sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.
MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA
LVIPLAILINIGPRTYFHTCLKVACPVLILTQSSILALLAMAVDRYLRVKIPLRYKTVVT
PRRAVVAITGCWILSFVVGLTPMFGWNNLSAVERDWLANGSVGEPVIECQFEKVISMEYM
VYFNFFVWVLPPLLLMVLIYMEVFYLIRKQLSKKVSASSGDPQKYYGKELKIAKSLALIL
FLFALSWLPLHILNCITLFCPSCHMPRILIYIAIFLSHGNSAMNPIVYAFRIQKFRVTFL
KIWNDHFRCQPAPPIDEDAPAERPDD

As a standard it leaves something to be desired. The only fixed rules are that each entry has a single comment line, which must begin with the ">" character, and that all subsequent lines for the entry contain sequence. Beyond that, there are many competing "standards" for how the fields in the comment line are arranged and delimited. Often the comment line is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained.
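
The two fixed rules (one comment line starting with ">", followed by sequence lines) are enough to read any FASTA file. The reader below is a minimal sketch; the name read_fasta is an assumption for this example and is not part of Protein Prospector.

```python
def read_fasta(lines):
    """Yield (comment, sequence) pairs from an iterable of text lines."""
    comment, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if comment is not None:
                yield comment, "".join(chunks)
            comment, chunks = line[1:], []  # drop the leading '>'
        elif comment is not None and line:
            chunks.append(line.strip())     # sequence may span several lines
    if comment is not None:
        yield comment, "".join(chunks)

example = [
    ">sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.",
    "MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA",
    "LVIPLAILINIGPRTYFHTCLKVACPVLILTQSSILALLAMAVDRYLRVKIPLRYKTVVT",
]
for comment, sequence in read_fasta(example):
    print(comment, len(sequence))
```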

The FASTA format was chosen for use with Protein Prospector primarily because of its universality, its brevity, and the expected ease with which database files could be shared on the same computer with other programs for sequence analysis.

The FA-Index program creates several indices which are much smaller files than the FASTA database file. These indices aid the Protein Prospector programs in addressing some of the hindrances inherent in the FASTA comment line format heterogeneity.


We know of no reason that should prevent use of the FASTA database files by both Protein Prospector programs and other programs which accept FASTA format. Further, we believe it should be possible for the files to be read by more than one program at a time. It may be of interest to some users that the SEQUEST program from John Yates' group at the University of Washington also uses FASTA formatted databases.


Often the comment line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. However, this information is NOT consistently organized into fields in the comment lines of different FASTA databases, though within a specific database it is sometimes consistent.

The way Protein Prospector programs "know" which dialect of FASTA to "speak" with a particular database is via the filename. Acceptable filename prefixes are shown below in bold and the associated comment line format described.
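
Choosing a comment line dialect from the filename can be sketched as a simple prefix lookup. The prefixes are the ones described in this document; the function name dialect_for, the example filenames, and the case-sensitive match are assumptions for illustration only.

```python
import os

# Filename prefixes described in this document.
KNOWN_PREFIXES = (
    "Genpept", "Owl", "SwissProt", "UniProt",
    "IPI", "NCBInr", "dbEST", "Ludwignr",
)

def dialect_for(filename):
    """Return the matching prefix for a database filename, or None."""
    base = os.path.basename(filename)
    for prefix in KNOWN_PREFIXES:
        if base.startswith(prefix):  # assumption: match is case-sensitive
            return prefix
    return None

# Hypothetical filenames.
print(dialect_for("seqdb/NCBInr.example"))
print(dialect_for("seqdb/mydata.example"))
```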

Genpept

Sample entries:

>gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]
>gi|261706|bbs|120303 (S50809) protein LG=immunoglobulin binding protein {immunoglobulin binding domains} [streptococcus, Peptide Recombinant, 455 aa]

Protein Prospector programs designate:

  • accession number: 216790 in the first example, as the number after the first | in the line. This can be delimited by a | or a space.
  • species: Mycoplasma hominis in the first example, as the string between the last set of square brackets in the line.
  • name: arginine deiminase in the first example, as the string between the first space and the last "[" in the line.

Some entries cause problems:

>gi|3928883 unknown

Previously the accession number was taken to be the number between the first set of round brackets in the line. However, entries like the one above don't have this field. This entry also doesn't contain a species field.

Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/Genpept....unr.
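
The Genpept rules and the UNREADABLE fallback can be sketched as follows. The function name parse_genpept and the returned dictionary are assumptions for this example. Two details are inferences rather than stated rules: the leading round-bracketed token is dropped from the name (so that the first sample gives "arginine deiminase", as in the example above), and the "entire comment line" is taken to exclude the leading ">".

```python
import re

def parse_genpept(comment):
    """Split a Genpept comment line into accession, species and name (sketch)."""
    line = comment[1:] if comment.startswith(">") else comment
    line = line.rstrip()
    # Accession: the token after the first '|', ended by '|' or a space.
    m = re.match(r"[^|]*\|([^| ]+)", line)
    accession = m.group(1) if m else None
    first_space = line.find(" ")
    open_b = line.rfind("[")
    close_b = line.rfind("]")
    if first_space == -1 or open_b == -1 or close_b < open_b:
        # No usable species field: fall back as described above.
        return {"accession": accession, "species": "UNREADABLE", "name": line}
    name = line[first_space + 1:open_b].strip()
    # Inference from the worked example: drop a leading "(D13314)"-style token.
    name = re.sub(r"^\([^)]*\)\s*", "", name)
    return {
        "accession": accession,
        "species": line[open_b + 1:close_b].strip(),
        "name": name,
    }

print(parse_genpept(">gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]"))
print(parse_genpept(">gi|3928883 unknown"))
```

Lines for which the sketch returns UNREADABLE correspond to the ones FA-Index writes to the .unr file.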

Some other entries are also potentially problematic.

This entry is very long and has been truncated; the truncation is marked by a trailing > character.

>gi|1387979 (L77099) 44% identity over 302 residues with hypothetical protein from Synechocystis sp, accession D64006_CD; expression induced by environmental stress; some similarity to glycosyl transferases; two potential membrane-spanning helices [Bacillus subtil>

Neither of the following contains an easily extractable species.

>gi|1123088 (U42436) coded for by C. elegans cDNA yk56a1.3; coded for by C. elegans cDNA CEMSG41FB; coded for by C. elegans cDNA yk81f4.5; coded for by C. elegans cDNA yk56a1.5; coded for by C. elegans cDNA yk81f4.3;  similar to the S5P family of ribosomal proteins
>gi|2330745|gnl|PID|e334350 (Z98598) SPAC1B3.11c, ras-related protein, len:234aa, similar eg. to RB4B_RAT, P51146, ra

This entry doesn't have anything in the name field but the species is OK.

>gi|1575686 (U70379)  [Synechococcus PCC7942]

The accession numbers previously taken from the round brackets are identical for these two entries.

>gi|3928875 (AF093611) putative chloroplast desaturase [Acetabularia acetabulum]
>gi|3928876 (AF093611) putative chloroplast desaturase [Acetabularia acetabulum]

This entry contains a zone delimited by [ ] characters which is not at the end of the line and doesn't contain a species.

>gi|3881286|gnl|PID|e1350785 (AL021507) [980325 dl] : Prediction spanned chimera, modified based on new 3' sequence information (o/l with F14D1); cDNA EST EMBL:D34402 comes from this gene; cDNA EST EMBL:D37454 comes from this gene; cDNA EST EMBL:D68054 comes from this gene; cDNA E>

Owl

This database can still be downloaded from NCBI but hasn't been updated since 1999. It is thus a redundant, non-redundant database. It is, however, still searchable by Protein Prospector.

The entries come from four different sources:

>owl|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).

Protein Prospector programs designate:

  • accession number: 100K_RAT as the string between the second | and the first space in the line.
  • species: RAT, as the characters after the underscore in the accession number.
  • name: 100 KD PROTEIN (EC 6.3.2.-). as the string between the first space and the last dash " -" in the line.

>owl|B40638|B40638 isocytochrome c2 - Rhodobacter sphaeroides

Protein Prospector programs designate:

  • accession number: B40638 as the string between the second | and the first space in the line.
  • species: Rhodobacter sphaeroides, as the text string following the last space-dash-space (" - ") in the line.
  • name: isocytochrome c2 as the string between the first space and the last dash " -" in the line.

>owl|Z31371|A7120FTSZ1 A7120FTSZ NID: g1100793 - Anabaena PCC7120.

Protein Prospector programs designate:

  • accession number: A7120FTSZ1 as the string between the second | and the first space in the line.
  • species: Anabaena PCC7120., as the text string following the last space-dash-space (" - ") in the line. Note that there is a full stop after the species which must be deleted.
  • name: A7120FTSZ NID: g1100793 as the string between the first space and the last dash " -" in the line.

>owl||NRL_1A00B hemoglobin beta chain mutant (V1M, W37Y) (deoxy), chain B - human

Protein Prospector programs designate:

  • accession number: NRL_1A00B as the string between the second | and the first space in the line.
  • species: human, as the text string following the last space-dash-space (" - ") in the line.
  • name: hemoglobin beta chain mutant (V1M, W37Y) (deoxy), chain B as the string between the first space and the last dash " -" in the line.

Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/Owl....unr.

A typical comment line causing problems is:

>owl|P15455|12S1_ARATH 12S SEED STORAGE PROTEIN PRECURSOR. - ARABIDOPSIS THALIANA (MOUSE-EAR...

There are three full stops at the end of the line.

Previously the comment lines had the following format:

>10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10). - VIGNA UNGUICULATA (COWPEA).
>AEOHFPA AEOHFPA NID: g141875 - A.hydrophila DNA, clone pPH4.
>pir|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).

SwissProt

Sample entry as output by the sp2fasta program:

>sp|P16105|H32_BOVIN HISTONE H3 (H3.2) 

Sample entry if the database is downloaded from NCBI:

>gi|122068|sp|P16105|H32_BOVIN HISTONE H3 (H3.2)

Protein Prospector programs designate:

  • accession number: P16105, as the alphanumeric string between the first sp| and the next | in the comment line.
  • species: BOVIN, as the string between the underscore and the space in the next field.
  • name: HISTONE H3 (H3.2), as the string following the species

Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line (this usually does not happen for any entries in SwissProt). All of these UNREADABLE lines are then written by FA-Index to the file seqdb/SwissProt....unr.

A few entries are of the following form:

>gi|400027|sp||HYEP_PSESP_2 [Segment 2 of 3] EPOXIDE HYDROLASE (EPOXIDE HYDRATASE)

In these cases the accession number is taken as being the first number, i.e. 400027 for the example shown.

The species is extracted from the code between the last vertical bar and the first space. It appears after the first underscore and before the second underscore (if present). The name is the rest of the line.
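
The SwissProt rules described so far, including the segment case with an empty accession field, can be sketched as below. The function name parse_swissprot is an assumption for this example, and the sketch covers only the sp| and gi|...|sp| layouts shown above.

```python
def parse_swissprot(comment):
    """Split a SwissProt comment line into accession, species and name (sketch)."""
    line = comment[1:] if comment.startswith(">") else comment
    line = line.strip()
    head, _, name = line.partition(" ")
    fields = head.split("|")
    accession = ""
    if "sp" in fields[:-1]:
        # The field after the first "sp" holds the accession number.
        accession = fields[fields.index("sp") + 1]
    if not accession and len(fields) > 1 and fields[0] == "gi":
        # Empty accession (sp||CODE): fall back to the first number.
        accession = fields[1]
    # Species sits after the first underscore of the last field,
    # and before a second underscore if there is one.
    code_parts = fields[-1].split("_")
    if len(code_parts) < 2 or not code_parts[1]:
        return {"accession": accession or None, "species": "UNREADABLE", "name": line}
    return {"accession": accession or None, "species": code_parts[1], "name": name.strip()}

print(parse_swissprot(">gi|122068|sp|P16105|H32_BOVIN HISTONE H3 (H3.2)"))
print(parse_swissprot(">gi|400027|sp||HYEP_PSESP_2 [Segment 2 of 3] EPOXIDE HYDROLASE (EPOXIDE HYDRATASE)"))
```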

Another format used is:

>P15711|104K_THEPA 104 kDa microneme-rhoptry antigen precursor (p104) - Theileria parva

Protein Prospector programs designate:

  • accession number: P15711 as the alphanumeric string before the first |
  • species: THEPA, as the string between the underscore and the space in the second field.
  • name: 104 kDa microneme-rhoptry antigen precursor (p104), as the string between the first space and the last dash "-" in the line.

UniProt

Sample entries:

>104K_THEPA (P15711) 104 kDa microneme-rhoptry antigen
>O05152_SULAC (O05152) Glycogen debranching enzyme

Protein Prospector programs designate:

  • accession number: P15711 as the alphanumeric string in the brackets following the first field
  • species: THEPA, as the string between the underscore and the space in the first field.
  • name: 104 kDa microneme-rhoptry antigen, as the string following the accession number field

>P15711|104K_THEPA 104 kDa microneme-rhoptry antigen precursor (p104) - Theileria parva

Protein Prospector programs designate:

  • accession number: P15711 as the alphanumeric string before the first |
  • species: THEPA, as the string between the underscore and the space in the second field.
  • name: 104 kDa microneme-rhoptry antigen precursor (p104), as the string between the first space and the last dash "-" in the line.

The latest UniProt format looks like this:

>sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1
>tr|A0AQI4|A0AQI4_9ARCH Putative ammonia monooxygenase (Fragment) OS=uncultured archaeon GN=amoA PE=4 SV=1

The format of these lines is as follows:

>db|UniqueIdentifier|EntryName ProteinName OS=OrganismName[ GN=GeneName] PE=ProteinExistence SV=SequenceVersion

Field - Description
db - 'sp' for UniProtKB/Swiss-Prot and 'tr' for UniProtKB/TrEMBL.
UniqueIdentifier - The primary accession number of the UniProtKB entry.
EntryName - The entry name of the UniProtKB entry.
ProteinName - The recommended name of the UniProtKB entry as annotated in the RecName field from release 14.0 on. For UniProtKB/TrEMBL entries without a RecName field, the SubName field is used. The 'precursor' attribute is excluded, 'Fragment' is included with the name if applicable.
OS - OrganismName is the scientific name of the organism of the UniProtKB entry.
GN - GeneName is the first gene name of the UniProtKB entry. If there is no gene name, OrderedLocusName or ORFname, the GN field is not listed.
PE - ProteinExistence is the numerical value describing the evidence for the existence of the protein:
   1. Evidence at protein level.
   2. Evidence at transcript level.
   3. Inferred from homology.
   4. Predicted.
   5. Uncertain.
SV - SequenceVersion is the version number of the sequence.

Protein Prospector programs designate:

  • accession number: as the UniqueIdentifier field.
  • species: as the string between the underscore and the space in the EntryName field.
  • name: as the ProteinName field.
  • UniProt ID: as the EntryName field.
  • organism: as the OrganismName field.
  • gene name: as the GeneName field.
  • existence: as the ProteinExistence field.
  • version: as the SequenceVersion field.
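
The field layout above maps naturally onto a regular expression. The following is a minimal sketch covering only the fields listed in this section; the names UNIPROT_HEADER and parse_uniprot, the None return for non-matching lines, and the UNREADABLE fallback for a missing underscore are assumptions for this example.

```python
import re

UNIPROT_HEADER = re.compile(
    r"^>?(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<uniprot_id>\S+)\s+"
    r"(?P<name>.*?)\s+OS=(?P<organism>.*?)"
    r"(?:\s+GN=(?P<gene_name>\S+))?"          # GN is optional
    r"\s+PE=(?P<existence>\d+)\s+SV=(?P<version>\d+)\s*$"
)

def parse_uniprot(comment):
    """Return a dict of the designated fields, or None if the line does not match."""
    m = UNIPROT_HEADER.match(comment)
    if m is None:
        return None
    fields = m.groupdict()
    entry = fields["uniprot_id"]
    # Species code: the part of EntryName after the underscore.
    fields["species"] = entry.split("_", 1)[1] if "_" in entry else "UNREADABLE"
    return fields

line = ">sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1"
print(parse_uniprot(line))
```

When the GN field is absent, gene_name is returned as None.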

IPI

Sample entries:

>IPI:IPI00177321.1|REFSEQ_XP:XP_168060|ENSEMBL:ENSP00000343431 Tax_Id=9606 similar to NOD3 protein
>IPI:IPI00015171.1|UniProt/Swiss-Prot:O43931 Tax_Id=9606 AFG3-like protein 1

Protein Prospector programs designate:

  • accession number: IPI00177321.1 as the alphanumeric string between the first colon and the first vertical bar
  • species: as UNREADABLE.
  • name: similar to NOD3 protein, as the string following the second space in the line
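
A minimal sketch of the IPI rules follows; the function name parse_ipi is an assumption for this example.

```python
def parse_ipi(comment):
    """Split an IPI comment line into accession, species and name (sketch)."""
    line = comment[1:] if comment.startswith(">") else comment
    line = line.rstrip()
    colon = line.find(":")
    bar = line.find("|")
    # Accession: between the first colon and the first vertical bar.
    accession = line[colon + 1:bar] if -1 < colon < bar else None
    # Name: everything after the second space.
    parts = line.split(" ", 2)
    name = parts[2] if len(parts) == 3 else line
    # Species is always reported as UNREADABLE for this format.
    return {"accession": accession, "species": "UNREADABLE", "name": name}

print(parse_ipi(">IPI:IPI00015171.1|UniProt/Swiss-Prot:O43931 Tax_Id=9606 AFG3-like protein 1"))
```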

NCBInr

The comment lines from this database are tricky to handle because it is a non-redundant database which collects entries from several databases, thus there are several formats present in the final database.

Further information is available from the NCBI site.

1. Genpept Entries

Note that this format has now been discontinued and all Genpept entries are now in the format described in section 2. Support for this format will be continued for a while for people who have an old copy of the database. The corresponding comment lines in the new format are given in section 2.

>gi|304881 (L07596) alaS [Escherichia coli]

Protein Prospector programs designate:

  • accession number: 304881, as all consecutive digits following the first "|"
  • species: Escherichia coli, as the text string inside the last set of square brackets.
  • name: (L07596) alaS, as the text string between the first space and the last set of square brackets.

Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/NCBInr....unr.

Lines which are too long are terminated by three full stops:

>gi|2429520 (AF025469) Similar to acetyl-CoA carboxylase; coded for by C. elegans cDNA yk16c3.3; coded for by C. elegans cDNA yk36b11.3; coded for by C. elegans cDNA yk43h8.3; coded for by C. elegans cDNA yk24d2.3; coded for by C. elegans cDNA yk24d2.5;...

This line has a space at the end of the species:

>gi|520517 (U10338) RNA polymerase II, largest subunit [Ilyanassa obsoleta ]

The following entry has no species:

>gi|3928883 unknown

Here are some more examples. Note that what is in the species field isn't always a species.

>gi|149575 (M76708) L(+)-lactate dehydrogenase [Lactobacillus casei]
>gi|45803 (X04609) gamma subunit (3'terminus); pid:g45803 [thermophilic bacterium PS3]
>gi|289135 (L10036) unknown [Anabaena PCC7120]
>gi|402254 (U01238) beta subunit of the molybdenum-iron nitrogenase [Frankia sp.]
>gi|414523 (U02284) beta-lactamase [Cloning vector pSP65]
>gi|439619 (L25848) [Salmonella typhimurium IS200 insertion sequence from SARA17, partial.], gene product [Salmonella typhimurium]
>gi|431128 (L15633) start [Transposon Tn916]
>gi|466378 (U07618) SSB [Unknown]
>gi|403947 (U01693) (M90060);  Homology to GenBank Accession numORF-X from STRATPASEA [Mycoplasma genitalium]
>gi|405516 (L22217) This ORF is homologous to nitroreductase from Enterobacter cloacae, Accession Number A38686, and Salmonella, Accession Number P15888. [Mycoplasma-like organism]
>gi|457139 (L29100) transposase [Insertion sequence IS150 homolog]
>gi|468279 (L31491) nreA [pTOM9]
>gi|413733 (L25424) orf 1 [Plasmid pCB2.4]
>gi|144453 (M94320) very similar to DNA polymerase of Bacillus subtilis bacteriophage SPO2; potential DNA polymerase; putative [Citrus greening disease-associated bacterium-like organism]
>gi|971400 (X88862) immunogenic polyprotein with 2A protease [Foot-and-mouth disease virus]
>gi|1008449 (L19624) envelope glycoprotein [Human immunodeficiency virus type 1]
>gi|1718307 (U75698) ORF 54; dUTPase homolog; EBV BLLF3 homolog [Kaposi's sarcoma-associated herpesvirus]
>gi|2271117 (AF008696) hemagglutinin [influenza A virus (A/South_Australia/68/92(H3N2))]
>gi|2444119 (U88974) ORF40 [Streptococcus thermophilus temperate bacteriophage O1205]
>gi|2662546 (AF036688) No definition line found [Caenorhabditis elegans]
>gi|4206510 (AF066801) ribulose 1,5-bisphosphate carboxylase [Dictamnus sp. M.W.Chase-1820K]

2. GenBank Entries

>gi|1680564|gb||S58174_1 (S58174) putative RNA polymerase [Pelargonium leaf curl virus]

gb|accession|locus

Protein Prospector programs designate:

  • accession number: 1680564, as all consecutive digits following the first "|"
  • species: Pelargonium leaf curl virus, as the text string inside the last set of square brackets.
  • name: (S58174) putative RNA polymerase, as the text string between the first space and the last set of square brackets.

Here are some more examples.

>gi|1683178|gb||S69825_2 (S69825) coat/capsid protein [Sweet potato feathery mottle virus (strain CH)]
>gi|1683615|gb||S81342_1 (S81342) unnamed protein product [Mus sp.]

Here are some example entries which have changed from format 1 to this format

>gi|304881|gb|AAA71918.1| (L07596) alaS [Escherichia coli]
>gi|520517|gb|AAA50229.1| (U10338) RNA polymerase II, largest subunit [Ilyanassa obsoleta]
>gi|3928883|gb|AAC79708.1| unknown
>gi|289135|gb|AAD04186.1| (L10036) unknown [Anabaena PCC7120]
>gi|402254|gb|AAA03325.1| (U01238) beta subunit of the molybdenum-iron nitrogenase [Frankia sp.]
>gi|414523|gb|AAB60535.1| (U02284) beta-lactamase [Cloning vector pSP65].gi|644827|gb|AAA64566.1| (U19867) beta-lactamase [Cloning vector pSPL3]
>gi|431128|gb|AAC36978.1| (L15633) start [Transposon Tn916]
>gi|466378|gb|AAA17041.1| (U07618) SSB [Plasmid R751]
>gi|403947|gb|AAB01006.1| (U01693) (M90060);  Homology to GenBank Accession numORF-X from STRATPASEA [Mycoplasma genitalium]
>gi|405516|gb|AAA18506.1| (L22217) This ORF is homologous to nitroreductase from Enterobacter cloacae, Accession Number A38686, and Salmonella, Accession Number P15888. [Phytoplasma sp.]
>gi|457139|gb|AAA98137.1| (L29100) transposase [Bacillus thuringiensis]
>gi|468279|gb|AAA72440.1| (L31491) nreA [Plasmid pTOM9]
>gi|413733|gb|AAA97418.1| (L25424) orf 1 [Plasmid pCB2.4]
>gi|144453|gb|AAA23103.1| (M94320) very similar to DNA polymerase of Bacillus subtilis bacteriophage SPO2; potential DNA polymerase; putative [Citrus greening disease-associated bacterium]
>gi|1008449|gb|AAA78793.1| (L19624) envelope glycoprotein [Human immunodeficiency virus type 1]
>gi|1718307|gb|AAC57136.1| (U75698) ORF 54; dUTPase homolog; EBV BLLF3 homolog [Kaposi's sarcoma-associated herpesvirus].gi|2246506|gb|AAB62631.1| (U93872) ORF 54, dUTPase homolog [Kaposi's sarcoma-associated herpesvirus]
>gi|2271117|gb|AAB66763.1| (AF008696) hemagglutinin [influenza A virus (A/South_Australia/68/92(H3N2))]
>gi|4206510|gb|AAD11686.1| (AF066801) ribulose 1,5-bisphosphate carboxylase [Dictamnus sp. M.W.Chase-1820K]

3. SWISS-PROT Entries

>gi|132349|sp|P15394|REPA_AGRTU REPLICATING PROTEIN

sp|accession|entry name

Protein Prospector programs designate:

  • accession number: 132349, as all consecutive digits following the first "|"
  • species: AGRTU, as the text string between the underscore and the next space when preceded by "sp|...|"
  • name: REPLICATING PROTEIN, as the text string following the species.

Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/NCBInr....unr.

Lines which are too long are terminated by three full stops:

>gi|123494|sp|P22291|SULD_STRPN BIFUNCTIONAL FOLATE SYNTHESIS PROTEIN (DIHYDRONEOPTERIN ALDOLASE (DHNA) / 2-AMINO-4-HYDROXY-6-HYDROXYMETHYLDIHYDROPTERIDINE PYROPHOSPHOKINASE (7,8-DIHYDRO-6-HYDROXYMETHYLPTERIN PYROPHOSPHOKINASE) (HPPK) (6-HYDROXYMETHYL-7...

A few entries are of the following form:

>gi|4033439|sp||LEC_VICVI_1 [Segment 1 of 4] LECTIN B4 (VVLB4)

In these cases the species field is terminated in an underscore (VICVI for the example shown).

4. GNL Entries

>gi|216351|gnl|PID|d1003451 (D13793) ORF [Bacillus subtilis]

Here gnl stands for general and the next field, PID in the above case, identifies the database.

gnl|database|identifier

Protein Prospector programs designate:

  • accession number: 216351, as all consecutive digits following the first "|"
  • species: Bacillus subtilis, as the text string inside the last set of square brackets.
  • name: (D13793) ORF, as the text string between the first space and the last set of square brackets.

5. NBRF PIR Entries

>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans

pir||entry

Protein Prospector programs designate:

  • accession number: 282349, as all consecutive digits following the first "|"
  • species: Bacillus circulans, as the text string following the last space-dash-space (" - ") in the line.
  • name: chitinase (EC 3.2.1.14) D, as the text string between the first space and the last dash " -" in the line.

Here are some more examples:

>gi|80297|pir||JN0146 hypothetical protein (div+ 3' region) - Bacillus subtilis (fragment)
>gi|77616|pir||A36125 branched-chain amino acid transport protein braC - Pseudomonas aeruginosa (strain PAO)
>gi|538696|pir||A40613 avirulence protein avrRpt2 - Pseudomonas syringae (strain DC3000, pv. tomato)
>gi|98505|pir||S21241 oligo-1,6-glucosidase (EC 3.2.1.10) - Bacillus "thermoamyloliquefaciens" (strain KP1071) (fragment)
>gi|320384|pir||A37388 probable DNA-binding protein 1A - Thermus aquaticus (strain HB8) insertion sequence IS1000
>gi|477498|pir||A49131 releasechannel homolog - fruit fly (Drosophila melanogaster) (fragment)

6. GenInfo Backbone Id Entries

>gi|3712669|bbs|85194 (S85224) vascular endothelial growth factor; VEGF 206 [Homo sapiens]

bbs|number

Protein Prospector programs designate:

  • accession number: 3712669, as all consecutive digits following the first "|"
  • species: Homo sapiens, as the text string inside the last set of square brackets.
  • name: (S85224) vascular endothelial growth factor; VEGF 206, as the text string between the first space and the last set of square brackets.

Sometimes the species field isn't present:

>gi|386067|bbs|133197 cytochrome c3

Sometimes it contains extra text apart from the species name:

>gi|386065|bbs|133195 cytochrome c3 {N-terminal} [Desulfovibrio vulgaris, NCIMB 8303, Peptide Partial, 22 aa]

Sometimes it appears twice:

>gi|236142|bbs|57690 (S57688) EF-G=elongation factor G [Thermotoga maritima, Peptide, 682 aa] [Thermotoga maritima]

Sometimes the species is recorded as unidentified:

>gi|913316|bbs|163145 (S76565) T-cell receptor beta chain VJ region {clone N4} [not specified, vesicular stomatitis virus-specific CTL, Peptide Partial, 15 aa] [unidentified]

Here are a couple of examples where the comment line has been truncated. In such cases it is terminated by three full stops:

>gi|435743|bbs|139151 (S66567) alpha-atrial natriuretic factor/coat protein, alpha-ANF/coat protein=fusion polypeptide(coat protein, alpha-atrial natriuretic factor, alpha-ANF) [human, bacteriophage fr, expression vector pFAN15, Peptide PlasmidSynthetic...
>gi|833965|bbs|160632 (S75335) polyprotein(structural protein C, structural protein E, structural protein M, structural protein PreM, nonstructural protein NS1) [dengue type 1 D1 virus, Mochizuki, Peptide Partial, 50 aa, segment 2 of 2] [Dengue virus ty...

7. Brookhaven Protein Data Bank Entries

>gi|230242|pdb|1PFK|A Escherichia coli
>gi|4139942|pdb|1BC5|T Chain T, Chemotaxis Receptor Recognition By Protein Methyltransferase Cher
>gi|231004|pdb|4ER4|I synthetic construct
>gi|494001|pdb|1EGF|  Epidermal Growth Factor (Egf) (Nmr, 16 Structures)
>gi|493782|pdb|146L|  Lysozyme (E.C.3.2.1.17) Mutant With Cys 54 Replaced By Thr, Cys 97 Replaced By Ala, Leu 121 Replaced By Met, Ala 129 Replaced By Leu, Leu 133 Replaced By Met, Val 149 Replaced By Ile, Phe 153 Replaced By Trp (C54t,C97a,L121m,A129l,...
>gi|230275|pdb|1R1A|1 Human rhinovirus 1A

pdb|entry|chain

Protein Prospector programs designate:

  • accession number: 230242 in the first example, as all consecutive digits following the first "|"
  • species: as UNREADABLE because it isn't reliably positioned within the comment line.
  • name: as the entire comment line.

All the comment lines of this format are written by FA-Index to the file seqdb/NCBInr....unr.

8. Protein Research Foundation Entries

>gi|742246|prf||2009326A beta glucosidase [Cellvibrio gilvus]

prf||name

Protein Prospector programs designate:

  • accession number: 742246, as all consecutive digits following the first "|"
  • species: Cellvibrio gilvus, as the text string inside the last set of square brackets.
  • name: beta glucosidase, as the text string between the first space and the last set of square brackets.

Here is another example.

>gi|225172|prf||1210227A amylase subtilisin inhibitor alpha [Hordeum vulgare var. distichum]

9. DNA Database of Japan (DDBJ) Entries

>gi|2440229|dbj||AB006689_5 (AB006689) ORF13 [Agrobacterium rhizogenes]

Protein Prospector programs designate:

  • accession number: 2440229, as all consecutive digits following the first "|"
  • species: Agrobacterium rhizogenes, as the text string inside the last set of square brackets.
  • name: (AB006689) ORF13, as the text string between the first space and the last set of square brackets.

dbj|accession|locus

Here is another example.

>gi|1805521|dbj||D90852_18 (D90852) ORF_ID:o250#11; similar to [SwissProt Accession Number P19779]; start codon is not identified yet [Escherichia coli]

10. EMBL Data Library Entries

>gi|6|emb|CAA42669.1| (X60065) beta-2-glycoprotein  I [Bos taurus]

emb|accession|locus

Protein Prospector programs designate:

  • accession number: 6, as all consecutive digits following the first "|"
  • species: Bos taurus, as the text string inside the last set of square brackets.
  • name: (X60065) beta-2-glycoprotein I, as the text string between the first space and the last set of square brackets.

Sometimes the species field isn't present:

>gi|6065756|emb|CAB58425.1| (AJ238324) Clostridium difficile binary toxin A

Here is an example where the comment line has been truncated. In such cases it is terminated by a > character:

>gi|6018922|emb|CAB58111.1| (AL121806) /prediction=(method:""genefinder"", version:""084"", score:""32.36"")~/prediction=(method:""genscan"", version:""1.0"")~/match=(desc:""EUKARYOTIC TRANSLATION INITIATION FACTOR 4E (EIF-4E) (EIF4E) (MRNA CAP-BINDING PROTEIN) (EIF-4F 25 KD SUBU>
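
Both truncation markers mentioned in this document (three full stops, or a trailing > character) can be detected with a small helper. This is a sketch only; the name truncation_marker is an assumption for this example, and it does not attempt to tell a genuine trailing ">" or "..." from a truncation mark.

```python
def truncation_marker(comment):
    """Return '...' or '>' if the comment line ends with that marker, else None."""
    body = comment.rstrip()
    if body.startswith(">"):
        body = body[1:]  # ignore the '>' that opens every comment line
    if body.endswith("..."):
        return "..."
    if body.endswith(">"):
        return ">"
    return None

print(truncation_marker(">gi|6065756|emb|CAB58425.1| (AJ238324) Clostridium difficile binary toxin A"))
```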

11. NCBI Reference Sequences Entries

>gi|5713315|ref|NP_002060.1| guanine nucleotide binding protein (G protein), alpha inhibiting activity polypeptide 1

ref|accession|locus

Protein Prospector programs designate:

  • accession number: 5713315 as all consecutive digits following the first "|"
  • species: as UNREADABLE because it isn't generally present in the comment line.
  • name: as the entire comment line.

All the comment lines of this format are written by FA-Index to the file seqdb/NCBInr....unr.
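
The eleven NCBInr formats differ mainly in where the species is found, so one way to picture the handling is a dispatch on the database tag that follows the gi number. The sketch below follows the rules stated in sections 1 to 11; the function name parse_ncbinr, the BRACKET_TAGS set, and the returned dictionary are assumptions for this example.

```python
import re

# Tags whose species is in the last set of square brackets.
# None stands for the old Genpept style with no tag after the gi number.
BRACKET_TAGS = {None, "gb", "gnl", "bbs", "prf", "dbj", "emb"}

def parse_ncbinr(comment):
    """Split an NCBInr comment line into accession, species and name (sketch)."""
    line = comment[1:] if comment.startswith(">") else comment
    line = line.rstrip()
    gi = re.match(r"gi\|(\d+)", line)
    accession = gi.group(1) if gi else None
    first_space = line.find(" ")
    head = line if first_space == -1 else line[:first_space]
    fields = head.split("|")
    tag = fields[2] if len(fields) > 2 else None
    unreadable = {"accession": accession, "species": "UNREADABLE", "name": line}
    if first_space == -1:
        return unreadable
    if tag in BRACKET_TAGS:
        open_b, close_b = line.rfind("["), line.rfind("]")
        if open_b == -1 or close_b < open_b:
            return unreadable
        species, name = line[open_b + 1:close_b], line[first_space + 1:open_b]
    elif tag == "sp":
        parts = fields[-1].split("_")  # e.g. REPA_AGRTU or LEC_VICVI_1
        if len(parts) < 2 or not parts[1]:
            return unreadable
        species, name = parts[1], line[first_space + 1:]
    elif tag == "pir":
        dash = line.rfind(" - ")
        if dash == -1:
            return unreadable
        species, name = line[dash + 3:], line[first_space + 1:dash]
    else:
        # pdb, ref and anything unrecognised: species is not reliably present.
        return unreadable
    return {"accession": accession, "species": species.strip(), "name": name.strip()}

print(parse_ncbinr(">gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans"))
```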

dbEST

This database wins the booby prize as the one with the least consistent comment lines.

Sample entry:

>gi|1705383|gb|N20717|N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5'

Protein Prospector programs designate:

  • accession number: 1705383, as all consecutive digits following "gi|"
  • species: Schistosoma mansoni; since this database is so haphazard in its placement of the species, FA-Index does a string search in the line after first consulting the file dbEST.spl.txt for valid species names. The string search method is possible with this particular database because there is a more limited range of species represented. However, this means that a server administrator needs to keep the dbEST.spl.txt file up to date to ensure continuous high quality species searching of dbEST with Protein Prospector programs. This task, though annoying, is made somewhat easier by consulting the seqdb/dbEST.unr file. Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/dbEST.unr.
  • name: N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5', as the string following the first space.

Sometimes the comment lines are very long and appear to consist of two comment lines appended together. The two comment lines are separated by a non-printable binary character (ASCII code Control A) shown here as a full stop. In such cases Protein Prospector only considers the first part of the comment line.

>gi|3771232|gb|AI209290|AI209290 SWOvAFCAP09G09SK Onchocerca volvulus adult female cDNA (SAW98MLW-OvAF) Onchocerca volvulus cDNA clone SWOvAFCAP09G09 5', mRNA sequence [Onchocerca volvulus].gi|3789602|gb|AI216948|AI216948 SWOvAFCAP10G11SK Onchocerca volvulus adult female cDNA (SAW98MLW-OvAF) Onchocerca volvulus cDNA clone SWOvAFCAP10G115', mRNA sequence [Onchocerca volvulus]
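
The dbEST species lookup and the Control A handling can be sketched as below. The layout of dbEST.spl.txt is not described in this document, so the sketch takes the valid species names as a plain list supplied by the caller; the function name parse_dbest and the longest-name-first search order are likewise assumptions for this example.

```python
import re

def parse_dbest(comment, known_species):
    """Return accession and species for a dbEST comment line (sketch)."""
    line = comment[1:] if comment.startswith(">") else comment
    # Only the part before the first Control A character is considered.
    first = line.split("\x01", 1)[0].rstrip()
    gi = re.match(r"gi\|(\d+)", first)
    accession = gi.group(1) if gi else None
    # Assumption: try longer names first so the most specific name wins.
    for candidate in sorted(known_species, key=len, reverse=True):
        if candidate in first:
            return {"accession": accession, "species": candidate}
    return {"accession": accession, "species": "UNREADABLE"}

species_list = ["Schistosoma mansoni", "Onchocerca volvulus"]
print(parse_dbest(">gi|1705383|gb|N20717|N20717 SMNHADA002044SK SmAW Schistosoma mansoni cDNA 5'", species_list))
```

A line that comes back as UNREADABLE is a hint that a species name is missing from the list.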

Ludwignr

This is another non-redundant database. The entries are of the following format:

db|accno|ID|CRC Description[species]

db - database
CRC - 64-bit cyclic redundancy check

Here are some examples, one from each of the components of the database:

>gp|M84711|182775|000037AE195F7A9D v-fos transformation effector protein [Homo sapiens]
>gp|AL391014|9716128|0006579AD1B1EEE8 putative DNA-binding protein [Streptomyces coelicolor A3(2)]
>pir|A91719|GGIC1A|0027F62F6F36BA36 globin CTT-IA - midge (Chironomus thummi thummi)[Chironomus thummi thummi]
>pir|JX0361|JX0361|00013E4475F84453 subtilisin-trypsin inhibitor, SIL10 - Streptomyces sp.[Streptomyces sp.]
>pir|JC7193|PC7055|0154D83E82AA822B cell division protein FtsQ - Streptomyces collinus (fragment)[Streptomyces collinus]
>pir|A29526|A29526|02AC2025766BCBC7 ubiquitin B processed pseudogene - human[Homo sapiens]
>sp|P55820|SN25_RABIT|00014F740FEB29C5 (SNAP..)SYNAPTOSOMAL-ASSOCIATED PROTEIN 25 (SNAP-25) (SUPER PROTEIN) (SUP) (FRAGMENTS).[Oryctolagus cuniculus]
>sp_vs|P16157-01|P16157|004EDB42F81EBDE8 ISOFORM 2.2 OF P16157[Homo sapiens]
>tr|AF247519|AAF71733|0001F06BB33BD2E8 Gag protein (Fragment).[Human immunodeficiency virus type 1]
>tr|U83613|O09751|0000148C132C06BD (POL)REVERSE TRANSCRIPTASE (FRAGMENT).[Human immunodeficiency virus type 1]
>tr_vs|P70390-01|P70390|0172F8C6825A0023 ISOFORM OG12B/PRX3B OF P70390[Mus musculus]
>wp|CE24847|C44C3.3|0205CAE438EE8B14 (ST.LOUIS) TR:P91157 protein_id:AAB37360.1[C. elegans]
>yp|ORFP:YDR094W|0642CC1F954A58E2 YDR094W, Chr IV from 635833-636168[S. cerevisiae]

Protein Prospector programs designate (example from first line above):

  • accession number: M84711 as the text string between the first vertical bar and the next vertical bar.
  • species: Homo sapiens as the text string inside the last set of square brackets.
  • name: v-fos transformation effector protein, as the text string between the first space and the last set of square brackets.
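The three rules above can be sketched in Python as follows. The function name parse_ludwignr is hypothetical, and the fallback used when no square brackets are found (species UNREADABLE, name set to the whole line) is an assumption borrowed from the other formats described in this document.

```python
def parse_ludwignr(comment_line):
    """Return (accession, species, name) from a Ludwignr style comment line."""
    line = comment_line.lstrip(">").strip()

    # Accession number: text between the first and the second vertical bar.
    first_bar = line.find("|")
    second_bar = line.find("|", first_bar + 1)
    accession = line[first_bar + 1:second_bar] if 0 <= first_bar < second_bar else ""

    # Species: text inside the last set of square brackets.
    open_br = line.rfind("[")
    close_br = line.rfind("]")
    if open_br == -1 or close_br < open_br:
        return accession, "UNREADABLE", line  # assumed fallback

    species = line[open_br + 1:close_br]

    # Name: text between the first space and the last set of square brackets.
    first_space = line.find(" ")
    name = line[first_space + 1:open_br].strip() if 0 <= first_space < open_br else ""
    return accession, species, name


example = ">gp|M84711|182775|000037AE195F7A9D v-fos transformation effector protein [Homo sapiens]"
print(parse_ludwignr(example))
```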

Often the comment line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. With well curated databases, this information is consistently organized into fields in the comment line of a FASTA formatted database.

For Protein Prospector programs the sequence field is subject to only two constraints: 1) it must be in CAPITAL letters, and 2) it must be in single-letter code (some people express amino acids in 3-letter code).

The way Protein Prospector programs "know" which dialect of FASTA to "speak" with a particular database's comment line is via the filename. Generic filename prefixes are shown below in bold and the associated comment line format described. These formats are handled in a relatively robust manner, to allow for the absence of fields or the presence of additional fields. The formats basically consist of "|" delimited fields of accession number, name, and species in that order.

DN and PN

The D forms designate that the sequence is DNA and will be translated into protein sequence by Protein Prospector programs. The P forms indicate protein sequence.

> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|

Protein Prospector programs designate:

  • accession number: 417909, as the integer before the first "|"
  • species: Mouse, as the string between the second "|" and third "|" (or the end of the line, if no third "|")
  • name: Better than sliced bread growth factor beta, as the string between the first "|" and second "|" (or the end of the line, if no second "|")

Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/DN.unr, or seqdb/PN.unr.

DA and PA

The D forms designate that the sequence is DNA and will be translated into protein sequence by Protein Prospector programs. The P forms indicate protein sequence.

Note that the DA and PA differ from the DN and PN set only in that the accession number can be alphanumeric rather than numeric. This second set is thus more robust. However, for large, frequently updated databases FA-Index can take an hour to run rather than several minutes simply because creation of the dbfilename.acc file involves the much slower process of sorting strings rather than integers.

> SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|

Protein Prospector programs designate:

  • accession number: SlowSort909, as the alphanumeric string before the first "|"
  • species: Mouse, as the string between the second "|" and third "|" (or the end of the line, if no third "|")
  • name: Better than sliced bread growth factor beta, as the string between the first "|" and second "|" (or the end of the line, if no second "|")

Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by FA-Index to the file seqdb/DA.unr or seqdb/PA.unr.
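The field rules for the DN/PN and DA/PA formats can be sketched with one Python function. The name parse_generic and the numeric_accession flag are hypothetical; converting the accession number with int() for the DN/PN case is an assumption about how a numeric accession might be handled, and it raises ValueError if the field is not an integer.

```python
def parse_generic(comment_line, numeric_accession=False):
    """Return (accession, species, name) from an 'accession|name|species|...' line."""
    line = comment_line.lstrip(">").strip()
    fields = [field.strip() for field in line.split("|")]

    accession = fields[0]
    if numeric_accession:          # DN and PN: integer accession numbers
        accession = int(accession)

    name = fields[1] if len(fields) > 1 else ""
    species = fields[2] if len(fields) > 2 and fields[2] else "UNREADABLE"
    if species == "UNREADABLE":
        name = line                # the name becomes the entire comment line
    return accession, species, name


print(parse_generic("> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|",
                    numeric_accession=True))
print(parse_generic("> SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|"))
```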

Any number of proprietary databases may be created with DA, DN, PA or PN prefixes. You must also create accession number links for any databases which you create.

Ddefault and Pdefault

If these prefixes are used then all attempts at trying to extract information from the comment line are abandoned.

Protein Prospector programs designate:

  • accession number: as the entry number (1 for the first entry, 2 for the second entry, etc).
  • species: as UNREADABLE.
  • name: as the entire comment line.

Ddefault is used for a database containing DNA sequences and Pdefault for one containing protein sequences.


Suffix (databasefilename.xxx)   Description
.idc Contains a list of byte offsets for the start of the comment line for each entry in the database.
.idp Contains a list of byte offsets for the start of the sequences for each entry in the database.
.idi Contains the number of entries in the database, the length of the longest comment line in the database and the length of the longest sequence in the database.
.mw Index containing the calculated protein Molecular Weight (MW) of each sequence in the database. For DNA sequences this MW is calculated by translating in frame 1 and ignoring stop codons. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .mw file is used to accelerate searches that are constrained by intact MW.
.pi Index containing the calculated protein pI of each sequence in the database. For DNA sequences this pI is calculated by translating in frame 1 and ignoring stop codons. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .pi file is used to accelerate searches that are constrained by intact pI.
.tax Index containing the taxonomy code for each sequence in the database. A line containing the taxonomy node and the number of times it occurs in the database is followed by lines containing the index numbers of the corresponding entries. The taxonomy nodes are in numerical order. Used to accelerate searches that are constrained by taxonomy.
.tl Contains a list in numerical order of the taxonomy nodes that are present in the database. A line in the file contains the taxonomy node followed by the number of times it occurs. This file is provided for diagnostic purposes and is not used by the Protein Prospector search programs.
.unr File created to list the comment lines of each entry for which FA-Index cannot read the species used for taxonomy searches. This file is never used by the Protein Prospector programs; it is created only for use by server administrators in troubleshooting species problems.
.acc Index of alphanumeric accession numbers, created only for database filename prefixes: Genpept, gen, SwissProt, swp, Owl, owl, DA, PA.
.acn Index of integer accession numbers, created only for database filename prefixes: NCBInr, nr, dbEST, dbest, DN, PN.
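To make the roles of the .idc, .idp and .idi files concrete, the Python sketch below computes the same kinds of values in memory: the byte offset of each comment line, the byte offset of each sequence, and the entry count with the longest comment line and longest sequence. It does not reproduce the on-disk layout of the real index files, which is not described here, and counting the leading ">" as part of the comment line length is an assumption.

```python
import io


def index_fasta(stream):
    """Scan a binary FASTA stream; return (comment_offsets, sequence_offsets, summary)."""
    comment_offsets, sequence_offsets = [], []
    longest_comment = longest_sequence = current_length = 0
    offset = 0
    for raw in stream:
        if raw.startswith(b">"):
            longest_sequence = max(longest_sequence, current_length)
            current_length = 0
            comment_offsets.append(offset)              # start of the comment line
            sequence_offsets.append(offset + len(raw))  # start of the sequence
            longest_comment = max(longest_comment, len(raw.rstrip(b"\r\n")))
        else:
            current_length += len(raw.strip())
        offset += len(raw)
    longest_sequence = max(longest_sequence, current_length)
    summary = (len(comment_offsets), longest_comment, longest_sequence)
    return comment_offsets, sequence_offsets, summary


data = b">a one\nMPPS\nIS\n>b two\nAAAAAAAA\n"
print(index_fasta(io.BytesIO(data)))
```

With offsets like these a program can seek straight to entry N instead of reading the database from the start, which is why a search only needs to remember an index number for each hit.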

Suffix (databasefilename.xxx)   Bypassable   How to by-pass if possible
.idc no Necessary for any Protein Prospector program that searches/consults a database file.
.idp no Necessary for any Protein Prospector program that searches/consults a database file.
.idi no Necessary for any Protein Prospector program that searches/consults a database file.
.mw yes Select All in the MW search parameters.
.pi yes Select All in the pI search parameters.
.tax yes Select All in the Taxonomy search parameters.
.tl yes This file is never used by the Protein Prospector programs; it is used to report the contents of the species fields in the database file.
.unr yes This file is never used by the Protein Prospector programs; it is created only for use by server administrators in troubleshooting taxonomy problems.
.acc, .acn yes Don't choose retrieve by Accession number in MS-Digest, or set the search mode to Accession number in MS-Pattern.


Once you've downloaded a new database into the seqdb directory you need to create the index files described above before you can start to use it. To do this:

1). Type the name of the database into the Newly Downloaded Database field.

2). Press the Create Indicies For New Database button.

This feature is only available to Protein Prospector licensees.


A DNA database can be converted to a 6 frame translation protein database which contains the protein sequences for each of the 6 possible frame translations (3 forward and 3 reverse) of the DNA sequence. This makes searching somewhat faster, as otherwise Protein Prospector has to do the translation during the search. The new database will be about twice the size of the original DNA one.

1). Type the name of the DNA database into the Newly Downloaded Database field.

2). Tick the DNA To Protein checkbox.

3). If you want to delete the original DNA database tick the Delete DNA Database checkbox.

4). Press the Create Indicies For New Database button.

The newly created database will have the same name as the original DNA database except it will have a p character at the start (eg dbEST would become pdbEST).
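The idea of a 6 frame translation can be sketched in Python as below. The sketch uses the standard genetic code; writing stop codons as "*" and codons containing unknown bases as "X" are assumptions, since this document does not say how FA-Index represents them.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order with the third base varying fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unknown codons become X (assumption)."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return the 3 forward translations followed by the 3 reverse ones."""
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    return [translate(strand[shift:]) for strand in (dna, reverse) for shift in range(3)]


print(six_frame_translation("ATGGCC"))
```

Each frame yields roughly one residue per three bases, so six frames give about two characters of protein per base of DNA.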

This feature is only available to Protein Prospector licensees.


A random database is a database where each protein sequence is shuffled in a random way. Thus many things about the database are conserved, such as the amino acid frequencies and the distribution of protein molecular weights. If the Use Seed checkbox is checked then the randomization is seeded by the seed entered. Thus the random database created should be the same if you run FA-Index multiple times on the same database. If the database is updated, or FA-Index is first run on a Linux node and then on a Windows node, then the random databases created will not be the same. If the Use Seed checkbox is not checked then the randomization will be seeded by the computer clock and different randomizations will be generated if you run FA-Index multiple times on the same database.

1). Type the name of the database into the Newly Downloaded Database field.

2). Check the Random Database checkbox.

3). Press the Create Indicies For New Database button.

The newly created database will have the same name as the original database except it will have a .random suffix (eg SwissProt would become SwissProt.random).

You can only create random databases from protein databases.
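The effect of the seed can be illustrated with a small Python sketch. It uses Python's own random number generator, so it will not reproduce the shuffles that FA-Index makes; the function name and the use of one generator for the whole database are assumptions.

```python
import random


def shuffle_database(sequences, seed=None):
    """Shuffle the residues within each sequence; composition and length are kept."""
    rng = random.Random(seed)  # seed=None lets Python choose a fresh seed each run
    shuffled = []
    for sequence in sequences:
        residues = list(sequence)
        rng.shuffle(residues)
        shuffled.append("".join(residues))
    return shuffled


proteins = ["MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA"]
first = shuffle_database(proteins, seed=42)
second = shuffle_database(proteins, seed=42)
print(first == second)  # the same seed and the same input give the same shuffle
```

Because only the order of the residues changes, the amino acid frequencies and the molecular weight of each protein stay the same.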

This feature is only available to Protein Prospector licensees.


A reverse database is a database where each protein sequence is reversed.

1). Type the name of the database into the Newly Downloaded Database field.

2). Check the Reverse Database checkbox.

3). Press the Create Indicies For New Database button.

The newly created database will have the same name as the original database except it will have a .reverse suffix (eg SwissProt would become SwissProt.reverse).

You can only create reverse databases from protein databases.

This feature is only available to Protein Prospector licensees.


A concatenated database is formed by concatenating a random or reversed database on to the end of a normal database. Note that after version 5.10.0 concatenated databases are no longer necessary. If you create either a random or reverse database and the normal database is also present, a concatenated entry will automatically be generated on the database menu. This saves disk space because no separate concatenated file has to be stored.

1). Type the name of the database into the Newly Downloaded Database field.

2). Check the Random Database or Reverse Database checkbox.

3). Check the Concat Database checkbox.

4). Press the Create Indicies For New Database button.

The newly created database will have the same name as the original database except it will have a .random.concat or .reverse.concat suffix (eg SwissProt would become SwissProt.random.concat or SwissProt.reverse.concat).

You can only create concatenated databases from protein databases.
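Reversal and concatenation are simple operations on the list of entries, as the Python sketch below shows. Entries are modelled as (comment line, sequence) pairs; leaving the comment lines of the reversed entries unchanged is an assumption, as this document does not say how FA-Index labels them.

```python
def reverse_database(entries):
    """Reverse every sequence; comment lines are left unchanged (assumption)."""
    return [(comment, sequence[::-1]) for comment, sequence in entries]


def concat_database(entries, decoy_entries):
    """Append the random or reversed entries to the end of the normal entries."""
    return list(entries) + list(decoy_entries)


normal = [(">sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.", "MPPSISAFQ")]
combined = concat_database(normal, reverse_database(normal))
for comment, sequence in combined:
    print(comment, sequence)
```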

This feature is only available to Protein Prospector licensees.


Protein Prospector licensees can create their own subset databases which have been pre-filtered for taxonomy, taxonomy names, molecular weight, pI and accession number. For example, to create a subset database of human proteins between 1000 and 100000 Da from the SwissProt database:

1). Choose a suitable suffix for the database such as human.

2). Select SwissProt.xx as the existing database.

3). Select HOMO SAPIENS as the taxonomy.

4). Enter 1000 to 100000 as the MW of the Protein and deselect All.

5). Press the Create Subset Database button.

Using subset databases is likely to dramatically decrease search times.
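The pre-filter in the example above can be pictured as a simple selection over entries whose species and MW are already known. In the Python sketch below the entry layout (dictionaries with species and mw keys) and the MW values are hypothetical, and matching the species by name is a simplification of the taxonomy index that FA-Index actually consults.

```python
def subset(entries, species, mw_low, mw_high):
    """Keep entries of one species whose MW lies within [mw_low, mw_high]."""
    wanted = species.upper()
    return [entry for entry in entries
            if entry["species"].upper() == wanted and mw_low <= entry["mw"] <= mw_high]


# Hypothetical entries with made-up MW values, for illustration only.
entries = [
    {"accession": "A1", "species": "Homo sapiens", "mw": 36000.0},
    {"accession": "A2", "species": "Homo sapiens", "mw": 250000.0},
    {"accession": "A3", "species": "Bos taurus", "mw": 36500.0},
]
print([entry["accession"] for entry in subset(entries, "HOMO SAPIENS", 1000, 100000)])
```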

This feature is only available to Protein Prospector licensees.


The Hits (index numbers for matching database entries) from Protein Prospector search programs can be saved to a user-specified file. This file can then be used to create a subset database containing only the Hit proteins from the search.

1). Choose a suitable suffix for the database. The suffix must be unique; if you use the same suffix twice then the previously created subset database will be overwritten.

2). Identify the database that was used in the original search.

3). Identify the file containing the saved hits by entering the Program and File Name.

4). Press the Create Subset Database with Indices from Saved Hits button.

You can't do this if you have updated the database since doing the original search as the index numbers could have changed.
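The Python sketch below shows why the saved index numbers are only meaningful for the database version that was searched: each hit is simply a position in the list of entries. Treating index numbers as starting at 1 is an assumption, and the function name is hypothetical.

```python
def subset_from_hits(entries, hit_indices):
    """Select entries by saved index number (assumed to start at 1)."""
    wanted = sorted(set(hit_indices))
    return [entries[i - 1] for i in wanted if 1 <= i <= len(entries)]


old_version = ["protein A", "protein B", "protein C", "protein D"]
new_version = ["protein X", "protein A", "protein B", "protein C", "protein D"]
hits = [3, 1]
print(subset_from_hits(old_version, hits))  # the proteins that were actually hit
print(subset_from_hits(new_version, hits))  # different proteins after an update
```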

This feature is only available to Protein Prospector licensees.


It is possible to create your own FASTA format database which can be searched by the Protein Prospector search programs. An entry for a single protein or DNA sequence is made up of a comment line containing accession number, species and name fields followed by one or more lines containing the sequence.

1). Enter the database name. There are several dialects of FASTA with the essential difference between them being the format of the comment line. You are strongly advised to use a proprietary format but it is also possible to use a public format. If you choose a database name that already exists on the disk then subsequent proteins will be appended to the end of the file; otherwise a new database file will be created. It is possible to append entries to the end of the publicly available databases but this is not advisable; firstly because the index files are remade after each entry, secondly because newer versions of the database won't contain your entries and thirdly because any errors in the information you supply when adding the entry could potentially damage the whole database. If you want to use a public database format you should use a database name such as NCBInr.user.

2). Enter a name for the entry. Whether you are using a proprietary format or a public format make sure you don't use characters in the name which might give the Protein Prospector programs problems in sorting out the fields in the comment line.

3). Enter a species for the entry. You should use the scientific name.

4). Enter an accession number for the entry. The accession number must be unique; the program will alert you if it isn't. If your database uses numeric accession numbers then the accession number must be numeric.

5). Enter the protein or DNA sequence using only the upper case symbols for the 20 naturally occurring amino acids or the four bases as appropriate. X may also be used if the sequence is unknown at a particular point. In a protein database U may be used for Selenocysteine.

6). Press the Create or Append to User Database button.
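The checks described in the steps above can be sketched in Python for a protein entry in a PA style database. The function name, the 60 character line width and the exact spacing of the comment line are assumptions; the field order (accession number, name, species) follows the PA format described earlier.

```python
ALLOWED_PROTEIN = set("ACDEFGHIKLMNPQRSTVWY" "XU")  # 20 amino acids plus X and U


def make_entry(accession, name, species, sequence, existing_accessions, width=60):
    """Return the text of one FASTA entry, or raise ValueError if a check fails."""
    if accession in existing_accessions:
        raise ValueError("accession number is not unique: " + accession)
    for field in (accession, name, species):
        if "|" in field or "\n" in field:
            raise ValueError("field would confuse comment line parsing: " + repr(field))
    bad = set(sequence) - ALLOWED_PROTEIN
    if not sequence or bad:
        raise ValueError("sequence is empty or has invalid symbols: " + "".join(sorted(bad)))
    lines = [">" + accession + "|" + name + "|" + species + "|"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


print(make_entry("USER0001", "Test protein", "Homo sapiens", "MPPSISAFQAAYIGIEVLIAL", set()))
```

A lower case or 3-letter code sequence fails the symbol check, which mirrors the two constraints that Protein Prospector places on the sequence field.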

Note that the default database of a particular type (SwissProt, UniProt, etc) chosen by Protein Prospector is the one with the largest .idc file that is not a random, reverse or concatenated database. Thus if you append entries to one of the standard databases it may get chosen as the default database.
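The selection rule can be sketched in Python as follows. The sketch takes a mapping from database name to the size of its .idc file; recognising random, reverse and concatenated databases by the .random, .reverse and .concat parts of their names follows the naming described in this document, while the function name and the sizes are hypothetical.

```python
DECOY_MARKERS = (".random", ".reverse", ".concat")


def default_database(idc_sizes, prefix):
    """Return the database of the given type with the largest .idc file, or None."""
    candidates = {name: size for name, size in idc_sizes.items()
                  if name.startswith(prefix)
                  and not any(marker in name for marker in DECOY_MARKERS)}
    return max(candidates, key=candidates.get) if candidates else None


sizes = {  # hypothetical .idc file sizes in bytes
    "SwissProt.2012.12.04": 4000,
    "SwissProt.2012.12.04.random": 4000,
    "SwissProt.user": 4100,
    "NCBInr.2012.12.04": 90000,
}
print(default_database(sizes, "SwissProt"))
```

In this made-up case the user-extended copy has the larger .idc file and so becomes the default, which is the situation the note above warns about.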


The database summary report option is used to list the accession numbers, species and name fields for a selected index number range of a selected database. Deselect the Hide Protein Sequence checkbox if you also want to see the protein sequences. You can also select the DNA Reading Frame if you are looking at a DNA database.


FA-Index can also be run from the command line. You might want to do this if you want to set up a batch job to automatically update the databases or if running it from the web page interface causes a time out.

On all operating systems the FA-Index program is expected to reside in the same directory as all other Protein Prospector programs (i.e. cgi-bin). The command line version of FA-Index can accept the same parameters as the web page version. The parameters are specified as name=value pairs separated by spaces. A list of all the parameters is given in the Protein Prospector Automation Manual.

All the Protein Prospector programs can be similarly run from a command line interface. See the Protein Prospector Automation Manual for details.

On UNIX systems issue a command from the cgi-bin directory of the form:

./faindex.cgi - create_database_indicies=1 database=SwissProt.2012.12.04

On Windows systems use an MS-DOS command prompt to issue a command of the form:

C:\Program Files\UCSF\Prospector\web\cgi-bin> faindex.cgi - create_database_indicies=1 database=SwissProt.2007.12.04
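For a batch job, the UNIX form of the command can be assembled and run from a script. The Python sketch below is one possible wrapper: the function names are hypothetical, the only parameters used are the two shown in the commands above, and the cgi-bin path must be supplied by the administrator.

```python
import subprocess


def faindex_command(database):
    """Build the argument list for the UNIX command shown above."""
    return ["./faindex.cgi", "-", "create_database_indicies=1", "database=" + database]


def index_databases(databases, cgi_bin_dir):
    """Run FA-Index once per database from the cgi-bin directory (UNIX only)."""
    for database in databases:
        # check=True stops the batch if FA-Index returns a non-zero exit status.
        subprocess.run(faindex_command(database), cwd=cgi_bin_dir, check=True)


print(" ".join(faindex_command("SwissProt.2012.12.04")))
```

Passing the arguments as a list avoids shell quoting problems with database names.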


There is a Perl script called autofaindex.pl in the cgi-bin directory which can be used to automatically download and index databases. The properties of the database downloads are contained in the dbhosts.txt file. This file also specifies whether or not to create random and/or reverse databases.

Multiple databases can be downloaded and indexed in a single run of the program.

On UNIX systems issue a command from the cgi-bin directory of the form:

./autofaindex.pl SwissProt

To process multiple databases simply add them to the command line:

./autofaindex.pl SwissProt NCBInr

On Windows systems use an MS-DOS command prompt to issue a command of the form:

C:\Program Files\UCSF\Prospector\web\cgi-bin> autofaindex.pl SwissProt