With recent improvements in MS/MS QTOF spectrometers, biologists can now generate very large amounts of spectral data (up to 1500 peptides per day), which can no longer be analyzed manually. There is therefore a growing need for computer systems (pipelines) that allow fully automated protein identification from raw MS/MS data. So far, two main approaches have been proposed for this purpose:
1) Direct identification, which consists of comparing the raw MS/MS spectrum with all entries of a virtual MS/MS spectra database.
2) Indirect identification, which involves two successive steps: i) MS/MS spectrum interpretation (i.e. determination of amino acid sequences, as in the de novo sequencing approach), followed by ii) protein identification from the corresponding peptides.
This paper presents an approach to automatic protein identification dedicated to high-throughput proteomics. The approach follows the indirect identification method but, unlike de novo sequencing, does not require the determination of long sequence stretches. It is based on the concept of the Protein Sequence Tag (PST). To fully exploit this concept, we designed two complementary software modules: Taggor, which generates PSTs from spectra (MS/MS data interpretation), and PepMap, which localizes PSTs on protein or genomic data (protein/gene identification).
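
To make the tag localization step concrete, the following minimal Python sketch locates short sequence tags on a set of protein sequences and ranks the proteins by the number of distinct tags they contain. It is an illustration only and does not reproduce the PepMap algorithm, which is not described here. It assumes that a PST can be reduced to a short amino acid string and that matching is exact; the function names (`normalize`, `locate_tags`, `rank_proteins`) and the example sequences are hypothetical. Isoleucine is collapsed into leucine because the two residues have the same mass and cannot be distinguished from fragment masses alone.

```python
from collections import defaultdict


def normalize(seq: str) -> str:
    """Upper-case the sequence and collapse I into L (identical residue masses)."""
    return seq.upper().replace("I", "L")


def locate_tags(tags, proteins):
    """Return {protein_id: [(tag, position), ...]} for every exact tag occurrence.

    Assumption of this sketch: a tag is a plain amino acid string and
    positions are 0-based offsets in the protein sequence.
    """
    hits = defaultdict(list)
    for protein_id, sequence in proteins.items():
        norm_seq = normalize(sequence)
        for tag in tags:
            norm_tag = normalize(tag)
            start = norm_seq.find(norm_tag)
            while start != -1:
                hits[protein_id].append((tag, start))
                # Resume one residue further so overlapping matches are kept.
                start = norm_seq.find(norm_tag, start + 1)
    return dict(hits)


def rank_proteins(hits):
    """Rank proteins by the number of distinct tags they contain (ties by id)."""
    counts = {pid: len({tag for tag, _ in found}) for pid, found in hits.items()}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical toy data, not taken from any real database.
    proteins = {"P1": "MKTAYIAKQR", "P2": "MSLLTEVETY"}
    tags = ["AYL", "KQR", "TEV"]
    hits = locate_tags(tags, proteins)
    for protein_id, n_tags in rank_proteins(hits):
        print(protein_id, n_tags, hits[protein_id])
```

A protein supported by several independent tags ranks above one matched by a single tag. A realistic implementation would presumably also use the mass information attached to each tag and would need to handle genomic sequences, both of which this sketch leaves out.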