Description, Instructions, and Tips for MS-Homology

Purpose

This document provides instructions for MS-Homology.


Introduction

MS-Homology was developed for database searching with peptide de novo sequencing information. MS fragmentation data can first be searched using MS-Tag. If no matches are found, the cause may be insufficient data or the absence of the protein from the database. Before using MS-Homology, de novo sequencing is performed on each of the peptides by hand [K.F. Medzihradszky and A.L. Burlingame, Methods: A Companion to Methods in Enzymology 6: 284-303 (1994)] or by a de novo sequencing program. The resulting sequence possibilities are entered into MS-Homology in the sequence list. Different peptides from the same unknown protein can be entered in the list. The database search then looks for proteins containing peptides identical or homologous to the listed sequences. Edman sequencing data can also be entered into the program for protein identification. The quality of the results depends on the number of peptides sequenced and the accuracy of the sequence information entered, as well as on database completeness and species-to-species sequence variability for the peptides entered. For a match to be successful, a homologous protein must be in the database.


Possible Sequences

Sets of peptides obtained from de novo sequencing experiments can be entered as follows:

1 ATDGVVLAAEQK 6
1 ATDGVVIAAEQK 6
2 LVQLEYATTAASK 7
2 LVQIEYATTAASK 7
2 IVQLEYATTAASK 7
2 IVQIEYATTAASK 7
3 SESSYGLTTFSPSGR 8
3 SESSYGITTFSPSGR 8
4 TTSPLADSLTLHK 7
4 TTSPIADSLTLHK 7
4 TTSPLADSITLHK 7
4 TTSPIADSITLHK 7
4 TTSPLADSLTIHK 7
4 TTSPIADSLTIHK 7
4 TTSPLADSITIHK 7
4 TTSPIADSITIHK 7

In the above example each line contains a spectrum number followed by one possible peptide sequence which could explain the spectrum. The sequence is followed by the maximum number of amino acid substitutions allowed for that sequence. In the above example 8 different sequences could equally explain spectrum 4, and up to 7 individual amino acid substitutions are allowed for each sequence in the homology search. Amino acid deletions or insertions will usually constitute multiple substitutions. For example, the homologous sequence TTSPLASLTLHK with a D deletion would require 5 amino acid substitutions to match the named sequence TTSPLADSLTLHK.

If all the sequences are from the same spectrum, the spectrum number can be left out. If you enter a sequence without a maximum number of substitutions, the maximum is assumed to be zero.
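The line format described above can be illustrated with a small parser. This is only a sketch of the rules as stated in this section, not the program's actual input handling: the names `Entry` and `parse_entry` are hypothetical, and the default spectrum number of 1 for lines that omit it is an assumption.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    spectrum: int           # spectrum number (assumed to default to 1)
    sequence: str           # one candidate sequence for that spectrum
    max_substitutions: int  # defaults to 0 when omitted


def parse_entry(line: str) -> Entry:
    """Parse '[spectrum] SEQUENCE [max_substitutions]' (hypothetical helper)."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    spectrum = 1   # assumption: a single implicit spectrum when omitted
    max_subs = 0   # stated default: no substitutions allowed
    if tokens[0].isdigit():
        spectrum = int(tokens.pop(0))
    if len(tokens) > 1 and tokens[-1].isdigit():
        max_subs = int(tokens.pop())
    if len(tokens) != 1:
        raise ValueError(f"cannot parse line: {line!r}")
    return Entry(spectrum, tokens[0], max_subs)


print(parse_entry("4 TTSPLADSLTLHK 7"))
print(parse_entry("ATDGVVLAAEQK"))
```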

Sequence Syntax

A syntax similar to that developed to describe regular expressions in computer text processing has been adopted to facilitate entering a list of possible sequences in a compact manner.

Square brackets are used to bracket lists of possibilities that are separated by the | symbol. For example [I|L] means either I or L, [ELD|LDE|DEL] means ELD, LDE or DEL and [GG|N] means either the dipeptide GG or the amino acid N.

If only the composition of part of the peptide is known then it can be enclosed in curly brackets. The software will then generate sequences for each of the possible permutations. For example the expression {ACD} would generate the sequences ACD, ADC, CAD, CDA, DAC and DCA.

The first example in this section could have been entered as:

1 ATDGVV[I|L]AAEQK 6
2 [I|L]VQ[I|L]EYATTAASK 7
3 SESSYG[I|L]TTFSPSGR 8
4 TTSP[I|L]ADS[I|L]T[I|L]HK 7
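To make the compact syntax concrete, the following sketch expands an expression into the plain sequences it stands for. It is an illustration under stated assumptions rather than the program's own parser: the function name `expand` is hypothetical, curly brackets may appear inside square brackets, and nested square brackets are not handled.

```python
from itertools import permutations, product


def expand(expr: str) -> list[str]:
    """Expand [a|b] alternatives and {XYZ} permutations (hypothetical helper)."""
    parts = []  # one list of alternative strings per segment of the expression
    i = 0
    while i < len(expr):
        ch = expr[i]
        if ch == "[":
            j = expr.index("]", i)  # assumption: no nested square brackets
            options = []
            for alternative in expr[i + 1:j].split("|"):
                options.extend(expand(alternative))
            parts.append(options)
            i = j + 1
        elif ch == "{":
            j = expr.index("}", i)
            # every distinct ordering of the enclosed residues
            parts.append(sorted({"".join(p) for p in permutations(expr[i + 1:j])}))
            i = j + 1
        else:
            parts.append([ch])
            i += 1
    sequences = ["".join(combo) for combo in product(*parts)]
    return list(dict.fromkeys(sequences))  # drop duplicates, keep order


print(expand("{ACD}"))
print(len(expand("TTSP[I|L]ADS[I|L]T[I|L]HK")))
```

Applied to the four compact lines above, this kind of expansion yields the same 2, 4, 2 and 8 sequences that were written out in full in the first example.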

Entering masses

It is also possible to enter a portion of the sequence as a mass along with a tolerance. The mass entered is the sum of the unknown amino acid residue masses, and care must be taken when considering N- and C-terminal fragments. For example, if the spectrum of a 6-residue peptide contains the b ion series from b1 to b6, but b4 is missing, the mass difference b5 - b3 can be entered directly. However, if b1 is missing, the mass should be entered as b2 - 1.008 to account for the N-terminal hydrogen. If any residues are modified, such as N-terminal acetylation or cysteine carbamidylation, the user must also account for this before entering the mass into the program. The mass tolerance should be consistent with the mass accuracy of the original data.
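The two cases in the paragraph above amount to simple arithmetic, sketched here. The helper name `gap_mass` is hypothetical, the m/z values are illustrative and not taken from a real spectrum, and singly charged b ions with no modified residues are assumed.

```python
N_TERMINAL_H = 1.008  # Da, the correction quoted in the text above


def gap_mass(upper_b, lower_b=None):
    """Summed residue mass spanned by a gap in the b ion series.

    upper_b: m/z of the b ion just after the gap.
    lower_b: m/z of the b ion just before the gap, or None if the gap
             starts at the N terminus (for example, b1 is missing).
    """
    if lower_b is None:
        return upper_b - N_TERMINAL_H
    return upper_b - lower_b


# b4 missing: enter b5 - b3 directly (illustrative values).
print(round(gap_mass(485.235, 300.155), 3))
# b1 missing: enter b2 - 1.008 (illustrative value).
print(round(gap_mass(214.119), 3))
```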

The program converts the mass to all the possible corresponding compositions. For example the expression:

[213.1]ENFAGVGV[I|L]DFES

gets internally converted to:

[{GGV}|{GR}|{AAA}|{VN}]ENFAGVGV[I|L]DFES

In the above example a tolerance of 0.5 Da was used. This is a global parameter used for all the sequences entered.
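The conversion from a mass to candidate compositions can be sketched as a brute-force enumeration. This is not necessarily how the program does it. The function name `compositions` is hypothetical, and the table holds standard monoisotopic residue masses for the 20 unmodified amino acids (any modification would change the result).

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da (unmodified; I and L are isobaric).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def compositions(mass, tolerance=0.5):
    """List residue compositions whose summed mass is within tolerance."""
    lightest = min(RESIDUE_MASS.values())
    max_residues = int((mass + tolerance) // lightest)
    found = []
    for n in range(1, max_residues + 1):
        # each combination is an unordered composition, letters sorted
        for combo in combinations_with_replacement(sorted(RESIDUE_MASS), n):
            total = sum(RESIDUE_MASS[r] for r in combo)
            if abs(total - mass) <= tolerance:
                found.append("".join(combo))
    return found


print(compositions(213.1, 0.5))
```

For 213.1 with a 0.5 Da tolerance, this enumeration finds the same four compositions as the expression above (GGV, GR, AAA and VN), with the letters of each composition in alphabetical order.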

It should be noted that using masses in an expression can generate a lot of permutations, many of which could be eliminated by interpretation of the spectra. For example, if the MS/MS technique generates immonium ions which suggest that there is no arginine in the sequence, then the GR composition in the above sequence would not be possible.


For a particular protein in the database to generate a hit, homologous sequences must be found in it for at least the Minimum Number of Peptides Required to Match.
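The hit criterion can be sketched as follows. Several points are assumptions and not confirmed by this document: a peptide is compared against every ungapped window of the protein, a spectrum counts as matched when any of its candidate sequences fits within its substitution limit, and the minimum is counted in spectra. The names `min_substitutions` and `is_hit` and the test protein are hypothetical.

```python
def min_substitutions(peptide, protein):
    """Fewest mismatches over all ungapped windows (None if protein is too short)."""
    k = len(peptide)
    best = None
    for start in range(len(protein) - k + 1):
        window = protein[start:start + k]
        mismatches = sum(a != b for a, b in zip(peptide, window))
        if best is None or mismatches < best:
            best = mismatches
    return best


def is_hit(protein, entries, min_peptides):
    """entries: iterable of (spectrum, sequence, max_substitutions) tuples."""
    matched_spectra = set()
    for spectrum, sequence, max_subs in entries:
        best = min_substitutions(sequence, protein)
        if best is not None and best <= max_subs:
            matched_spectra.add(spectrum)
    return len(matched_spectra) >= min_peptides


# Hypothetical protein sequence, made up for this illustration.
protein = "MKATDGVVLAAEQKRLVQIEFATTAASKG"
entries = [(1, "ATDGVVIAAEQK", 1), (2, "LVQLEYATTAASK", 2)]
print(is_hit(protein, entries, min_peptides=2))
```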


Fragment tolerance is the mass tolerance for fragment masses entered into the sequence to be searched. The fragment tolerance should be consistent with the mass accuracy of the original data.


The scoring method used is based on a mutation matrix like that used in the BLAST and FASTA programs. Users can choose between the PAM30, PAM70, PAM120, BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM62MS, BLOSUM80 and GONNET matrices or define additional matrices by editing a text file. The final score is calculated by adding the scores for the individual peptide alignments together. If there are several possible alignments of a given peptide then the highest scoring alignment is used in the calculation.
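The scoring rule above can be sketched with a toy substitution function. The values in `pair_score` are hypothetical stand-ins and are not taken from any PAM, BLOSUM or GONNET matrix. Alignments are assumed to be ungapped, and all function names are illustrative.

```python
def pair_score(a, b):
    """Toy stand-in for a mutation matrix lookup (hypothetical values)."""
    if a == b:
        return 4
    if {a, b} == {"I", "L"}:
        return 2  # conservative substitution scored above a generic mismatch
    return -1


def best_alignment_score(peptide, protein):
    """Highest-scoring ungapped alignment of the peptide against the protein."""
    k = len(peptide)
    scores = [
        sum(pair_score(a, b) for a, b in zip(peptide, protein[s:s + k]))
        for s in range(len(protein) - k + 1)
    ]
    return max(scores) if scores else None


def protein_score(peptides, protein):
    """Sum of the best alignment scores of the given (matched) peptides."""
    total = 0
    for peptide in peptides:
        score = best_alignment_score(peptide, protein)
        if score is not None:
            total += score
    return total


# Hypothetical protein sequence, made up for this illustration.
protein = "MKATDGVVLAAEQKR"
print(protein_score(["ATDGVVIAAEQK", "VVLAAE"], protein))
```

In practice the lookup would come from the selected matrix, so absolute scores from this sketch are not comparable to the program's output.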