In eukaryotic organisms the situation is far more complicated, because coding regions make up only a small percentage of the total genome sequence (3 to 5% in mammals), largely because a eukaryotic gene is made up of several coding regions called exons, separated by non-coding regions called introns (fig. 1). The strategy used for bacteria therefore does not work, and to identify the coding sequences we have to turn to other properties of genes, which are less strictly defined and thus less effective. Firstly, the fact that a sequence codes for a protein imposes constraints that make bases more likely to appear in certain orders than in others. Secondly, the cellular machinery recognises the boundaries between exons and introns thanks to particular arrangements of consecutive bases, which software can learn from known examples. Of the mathematical tools currently available, Markov models seem to handle these two sorts of information most efficiently (see inset "Markov models"), but there are many others. As none of them is completely satisfactory, it is advisable to combine the results of several complementary or even rival methods. Only with this strategy is it becoming possible to make a reasonably accurate prediction of a complete gene (i.e. the succession of introns and exons), and then to reconstruct the protein or proteins it encodes, as well as the various regions involved in transcription and translation.
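To make the first of these ideas concrete, here is a minimal sketch of how a first-order Markov model can capture the tendency of bases to follow one another in particular orders. It is an illustration only, not the method of any particular gene-finding program: the function names `train_markov` and `log_odds`, the pseudocount and the tiny training sequences are all assumptions invented for the example, and real tools use higher-order models trained on thousands of known genes.

```python
import math
from itertools import product

BASES = "ACGT"

def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition log-probabilities log P(next | current)."""
    # Start every possible pair at the pseudocount so no probability is zero.
    counts = {a + b: pseudocount for a, b in product(BASES, repeat=2)}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            if a in BASES and b in BASES:
                counts[a + b] += 1
    model = {}
    for a in BASES:
        total = sum(counts[a + b] for b in BASES)
        for b in BASES:
            model[a + b] = math.log(counts[a + b] / total)
    return model

def log_odds(seq, coding_model, noncoding_model):
    """Sum of log-likelihood ratios over transitions; > 0 favours 'coding'."""
    return sum(
        coding_model[a + b] - noncoding_model[a + b]
        for a, b in zip(seq, seq[1:])
        if a in BASES and b in BASES
    )

# Toy training sets, invented for illustration (not real genomic data).
coding_examples = ["ATGGCCGCGGCCGGCGCC", "ATGGGCGCCGCGGGC"]
noncoding_examples = ["ATATTTAATATAAATTTA", "TTTAAATATATTAAAT"]

coding_model = train_markov(coding_examples)
noncoding_model = train_markov(noncoding_examples)

if __name__ == "__main__":
    for candidate in ("GCCGCGGCC", "ATATTTAA"):
        score = log_odds(candidate, coding_model, noncoding_model)
        label = "coding-like" if score > 0 else "non-coding-like"
        print(candidate, round(score, 2), label)
```

A positive score means the sequence resembles the "coding" training examples more than the "non-coding" ones. The same scheme, trained on short windows around known exon-intron boundaries, could in principle be used to score candidate splice sites.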
These advanced strategies produce fairly reliable results in prokaryotic genome analysis, but there is still a long way to go for eukaryotic genomes. How can we be sure that the computer predictions are correct? Computational data (in silico or 'dry lab') must be compared against biological data (in vitro and in vivo or 'wet lab'). For example, when a gene is expressed it is transcribed into RNA before being translated into protein. This RNA can be recovered and sequenced. It does not contain introns, so comparing it with the genome sequence shows where the exons actually lie. Jean Thierry-Mieg, who took part in sequencing the nematode C. elegans, has shown that about 50% of the predictions were wrong, sometimes significantly so [1]. It also appears that rather than the 18 000 genes originally predicted, there are in fact only 12 000. An error rate of 50% was also found for one of the very first prokaryotes to be sequenced, Mycoplasma pneumoniae, even though the process should theoretically be simpler, as we have seen. This last figure takes into account errors in gene function attribution [2].
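The comparison between predictions and sequenced RNA can be sketched in a few lines. The example below assumes that the transcripts have already been aligned to the genome and reduced to a list of exon coordinates. The function name `exon_level_accuracy` and all the coordinates are hypothetical, chosen only to show the principle: a predicted exon counts as correct only if both of its boundaries agree with the experimental evidence.

```python
def exon_level_accuracy(predicted, observed):
    """Compare predicted exons with exons supported by transcript evidence.

    Each exon is a (start, end) tuple of genome coordinates. Returns
    (sensitivity, precision): the fraction of observed exons that were
    predicted exactly, and the fraction of predicted exons that are confirmed.
    """
    pred, obs = set(predicted), set(observed)
    correct = pred & obs  # exact match on both boundaries
    sensitivity = len(correct) / len(obs) if obs else 0.0
    precision = len(correct) / len(pred) if pred else 0.0
    return sensitivity, precision

# Hypothetical coordinates, invented for illustration only.
predicted_exons = [(100, 250), (400, 520), (700, 910)]
observed_exons = [(100, 250), (400, 550), (700, 910), (1000, 1100)]

sensitivity, precision = exon_level_accuracy(predicted_exons, observed_exons)

if __name__ == "__main__":
    print(f"sensitivity = {sensitivity:.2f}, precision = {precision:.2f}")
```

In this toy case one predicted exon has a misplaced boundary and one real exon was missed entirely, so only two of the four observed exons are recovered. Requiring exact boundaries is a deliberately strict choice; a looser criterion based on overlap would count near-misses as partial successes.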