Finding genes in eukaryotic genomes

In eukaryotic organisms the situation is far more complicated, because the coding regions represent only a small percentage of the total genome sequence (3 to 5% in mammals), mostly because a eukaryotic gene is made up of several coding regions called exons, separated by non-coding regions called introns (fig. 1). The strategy used for bacteria therefore does not work, and to identify the coding sequences we have to turn to other properties of genes, which are less strictly defined and thus less effective. First, the fact that a sequence codes for a protein imposes constraints that make bases more likely to appear in certain orders than in others. Second, the cellular machinery recognises the boundaries between exons and introns thanks to particular arrangements of consecutive bases, which software can learn from known examples.

Of the mathematical tools currently available, Markov models seem to handle these two kinds of information most effectively (see inset "Markov models"), but there are many others. Since none of them is completely satisfactory, it is advisable to combine the results of several complementary or even rival methods. Only with this strategy is it becoming possible to predict a complete gene (i.e. its succession of introns and exons) with reasonable accuracy, and then to reconstruct the encoded protein or proteins, as well as the various regions involved in transcription and translation.
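The first property described above, that coding constraints make some orders of bases more likely than others, can be illustrated with a toy first-order Markov chain. This is a minimal sketch, not the method of any particular gene finder: the training sequences are invented for illustration, the function names (train_markov, log_odds) are hypothetical, and sequences are assumed to contain only the upper-case letters A, C, G and T. Real tools generally rely on more elaborate models trained on large sets of known genes.

```python
import math

BASES = "ACGT"

def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | current base)."""
    # Start every count at the pseudocount so that no probability is zero.
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    model = {}
    for a in BASES:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in BASES}
    return model

def log_odds(seq, coding, noncoding):
    """Sum of log-likelihood ratios over consecutive base pairs.

    Positive: the sequence looks more like the coding training set.
    Negative: it looks more like the non-coding training set.
    """
    return sum(
        math.log(coding[cur][nxt] / noncoding[cur][nxt])
        for cur, nxt in zip(seq, seq[1:])
    )

# Invented toy training data (assumption): GC-rich "coding" examples,
# AT-rich "non-coding" examples.
coding_examples = ["ATGGCCGCCGGCGCCGCC", "ATGGGCGCCGCCGGCGGC"]
noncoding_examples = ["ATATTTAAATATTTTAAA", "TTTAAATATATAATTTAT"]

coding_model = train_markov(coding_examples)
noncoding_model = train_markov(noncoding_examples)

for candidate in ("GCCGCC", "ATATAT"):
    score = log_odds(candidate, coding_model, noncoding_model)
    label = "coding-like" if score > 0 else "non-coding-like"
    print(candidate, round(score, 2), label)
```

The same scoring idea can in principle be applied to the short signals at exon/intron boundaries, by training on known boundary sequences instead of whole coding regions.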

Using these advanced research strategies produces fairly reliable results in prokaryotic genome analysis, but there is still a long way to go for eukaryotic genomes. How can we be sure that the computer predictions are correct? Computational data (in silico, or 'dry lab') must be compared against biological data (in vitro and in vivo, or 'wet lab'). For example, when a gene is expressed it is transcribed into RNA before being translated into protein. This RNA can be recovered and sequenced. It does not contain introns, and can be compared to the genome sequence. Jean Thierry-Mieg, who took part in sequencing the nematode C. elegans, has shown that about 50% of the predictions were wrong, sometimes significantly so [1]. It also appears that, rather than the 18 000 genes originally predicted, there are in fact only 12 000. An error rate of 50% was also found for one of the very first prokaryotes to be sequenced, Mycoplasma pneumoniae, even though the process should theoretically be simpler, as we have seen. This last figure takes into account errors in gene function attribution [2].
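The comparison between a predicted gene and a sequenced RNA can be sketched as follows. This is only an illustration under simplifying assumptions: the genome fragment, exon coordinates and RNA-derived sequence are invented, coordinates are 0-based with an exclusive end, and an exact string match stands in for the error-tolerant alignment that real data would require. The function names (splice, prediction_supported) are hypothetical.

```python
def splice(genome, exons):
    """Join the predicted exons, in genomic order, into a predicted transcript."""
    return "".join(genome[start:end] for start, end in sorted(exons))

def prediction_supported(genome, exons, sequenced_rna):
    """True if the intron-free predicted transcript equals the sequenced RNA."""
    return splice(genome, exons) == sequenced_rna

# Invented example: two exons separated by a 12-base intron.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "GGCTAA"
genome = exon1 + intron + exon2

# Sequence obtained from the expressed RNA (written with DNA letters): no intron.
sequenced_rna = exon1 + exon2

good_prediction = [(0, 6), (18, 24)]   # correct exon boundaries
bad_prediction = [(0, 6), (15, 24)]    # second exon starts 3 bases too early

print(prediction_supported(genome, good_prediction, sequenced_rna))
print(prediction_supported(genome, bad_prediction, sequenced_rna))
```

A misplaced boundary of only a few bases is enough to make the predicted transcript disagree with the experimental one, which is one way a prediction can be counted as wrong.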

[1] Thierry-Mieg, J. et al. Programme Génome/CNRS.

[2] Venter, C. & Bork, P. Conference papers, Pasteur Institute conference Génome 2000.
