Médecine / Sciences 2002 ; 18 : 237-250
Claudine Médigue, Stéphanie Bocs, Laurent Labarre, Catherine Mathé, David Vallenet
Abstract: For the first time in history, we have access to the entire genetic content of a growing number and variety of living organisms. This explosive growth of information is forcing changes in many scientific disciplines, particularly in computational biology and molecular genetics. One of the challenges is to predict and annotate the functions of gene products as rapidly and completely as possible, taking into account both molecular interactions and higher-order cellular processes. The first level of sequence annotation consists of gene finding and functional prediction of the gene products using similarity searches in protein databanks. This step remains easier in the context of prokaryotic genome analysis, since the gene structure of these organisms is much simpler than that of eukaryotes. Predicting function from sequence using computational tools is generally done for each gene individually. Other levels of annotation, such as the identification of interactions between the genomic elements characterized in the first step, are more difficult to achieve. Although protein function is currently best described in the context of molecular interactions, it will be possible in the near future to predict function in the context of higher-order processes such as the regulation of gene expression, metabolic pathways and signaling cascades. Besides the information from completely sequenced genomes, this latter analysis also uses additional information from proteomics and expression data. New infrastructures that integrate the various levels of sequence annotation and function prediction are clearly required. This paper focuses on the various facets of in silico sequence annotation, which is far from perfect despite the fact that sequencing itself is highly automated and accurate, and despite the fact that (or maybe because…) sequence information is described in simple linear form, using a four-letter alphabet.
There remains a long way to go before we are able to describe molecular processes quantitatively. However, there is no doubt that in silico sequence analysis is extremely powerful, and that generating hypotheses by computational methods will increasingly be the first successful step in the design of in vivo/in vitro experiments.