Another problem in the quest for gene functions stems from the fact that the merging of fragments originating from different genes allows totally new functions to emerge. This is what François Jacob meant by "evolutionary tinkering". In addition to these problems linked to the way living systems function, there are others which arise from the fact that the available sequence databases are incomplete and contain errors. For these reasons, the results produced by software are no more than hypotheses, which must in turn be tested experimentally in the laboratory, in particular by observing the effects of substituting or deleting a gene in the organism, or in one related to it. This is why the priority given to the human genome has sometimes been criticised. Some think it would be better to begin by sequencing and analysing the mouse genome, which has large numbers of genes homologous with human genes, and which can be experimented on, rather than to tackle the human genome straight away, with the risk of accumulating hypotheses which cannot be validated in the short term.
Whatever the answer, given the inadequacy of a purely computational approach, determining the function of genes (or rather of the proteins they code for) is now a matter for the experts. As soon as the Drosophila genome had been sequenced, Craig Venter hosted what he called a "jamboree" for forty-five of the world's top specialists in fly genetics, bioinformatics and proteins, where they spent eleven days comparing their opinions on the raw sequence he had just obtained in collaboration with more than thirty teams around the world. It was only after this brainstorming session that an annotated sequence was submitted to the rest of the scientific community, and published in the journal Science. Clearly, systematising this "annotation" process is a considerable challenge for bioinformatics.
Once we think we have identified the sequence of a gene, what is the best way to fit together data and knowledge of various kinds and various origins, relating to several organisms, in order to predict the functions of that gene?
In the "anything goes" strategy, one key element is the way data and information are structured within computer systems, whose powerful capabilities allow the researcher to search and browse, to visualise data from a different perspective, and thus to draw new inferences. Although it is easy to store basic data such as sequences, the computational representation of data about functions, for example those which relate to metabolic pathways, is still a problem for bioinformatics research. A look at the KEGG database will confirm this: there, the data are presented only as images, available "at a click", certainly, but impossible to process using software.
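To make the contrast with an image concrete, here is a minimal sketch, in Python, of what a software-processable representation of a pathway might look like. It is only an illustration under stated assumptions: the `Reaction` record and the `enzymes_on_route` function are hypothetical names invented for this example, not the format of KEGG or of any other database, and the data are limited to the first three steps of glycolysis with cofactors left out.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """One pathway step (hypothetical record, not a database format)."""
    substrate: str
    product: str
    enzyme: str


# First three steps of glycolysis, simplified: cofactors such as ATP are omitted.
REACTIONS = [
    Reaction("glucose", "glucose-6-phosphate", "hexokinase"),
    Reaction("glucose-6-phosphate", "fructose-6-phosphate",
             "phosphoglucose isomerase"),
    Reaction("fructose-6-phosphate", "fructose-1,6-bisphosphate",
             "phosphofructokinase"),
]


def enzymes_on_route(reactions, start, goal):
    """Return the enzymes along a shortest route from start to goal, or None.

    The pathway is treated as a directed graph (compounds are nodes,
    reactions are edges) and explored breadth-first.
    """
    graph = defaultdict(list)
    for reaction in reactions:
        graph[reaction.substrate].append(reaction)

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        compound, enzymes = queue.popleft()
        if compound == goal:
            return enzymes
        for reaction in graph.get(compound, []):
            if reaction.product not in seen:
                seen.add(reaction.product)
                queue.append((reaction.product, enzymes + [reaction.enzyme]))
    return None


if __name__ == "__main__":
    print(enzymes_on_route(REACTIONS, "glucose", "fructose-1,6-bisphosphate"))
```

Once the pathway is stored as data of this kind rather than as a picture, a question such as "which enzymes lie between these two compounds?" becomes a short query that a program can answer, which is precisely what an image does not allow.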