A prokaryotic genome is fairly dense: almost the entire sequence corresponds to genes, and we know the codons (sets of three nucleotides) that mark the beginning and end of translation of a protein-coding region. Unfortunately, it is not that simple, because there are ambiguities. For example, the codons that mark the beginning of translation also code for an amino acid: ATG, the most common start codon, codes for methionine, so an ATG can just as well lie inside a gene as at its start. There is therefore only one possible "necessary condition" defining where to look for a coding sequence: between two codons that mark the end of translation (known as STOP codons) in the same reading frame, in what is called an Open Reading Frame (ORF).
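The necessary condition described above can be illustrated with a minimal Python sketch that cuts one reading frame into ORFs, that is, into maximal runs of codons containing no STOP. The STOP codons TAA, TAG and TGA are those of the standard genetic code; the function name `orfs_in_frame` and the 0-based, end-exclusive coordinates are assumptions made for this illustration, not part of any tool cited here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def orfs_in_frame(seq, frame=0):
    """Return (start, end) pairs of stop-free codon runs in one reading frame.

    Coordinates are 0-based and end-exclusive, and exclude the closing STOP.
    Runs that touch either end of the sequence are reported as well, even
    though they are not bounded by a STOP codon on that side.
    """
    seq = seq.upper()
    orfs = []
    run_start = frame  # first position of the current stop-free run
    last = frame       # end of the last complete codon read so far
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if i > run_start:          # skip empty runs (adjacent STOPs)
                orfs.append((run_start, i))
            run_start = i + 3
        last = i + 3
    if last > run_start:               # trailing run without a closing STOP
        orfs.append((run_start, last))
    return orfs


print(orfs_in_frame("ATGAAATAACCCGGGTGA"))           # frame 0
print(orfs_in_frame("ATGAAATAACCCGGGTGA", frame=1))  # frame 1
```

Shifting the frame by one nucleotide changes the codon boundaries and therefore the ORFs found, which is why each frame has to be examined separately.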
Any sequence within an ORF that begins with a START codon and is judged to be long enough (for example 300 nucleotides for a prokaryote, which corresponds to a protein of 100 amino acids) is considered a potential coding region. If significant sub-sequences, particularly a promoter or a ribosome binding site, are found upstream of this region, this supports the hypothesis, as does the existence of similar sequences in the nucleotide and protein databases. Finally, the same sequence can be "read" in three different ways, grouping the letters in threes, codon by codon, and each of the two complementary strands of DNA can be read, so that in practice the search for coding regions must be carried out on six different virtual sequences. Together with Antoine Danchin's group at the Institut Pasteur, the authors have developed software tools to facilitate genome analysis [1], but there are many others.
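As a rough sketch of the six-frame search just described, and not a description of the tools in [1], the Python code below scans the three frames of each strand and reports every stretch that starts at the first ATG after a STOP and runs to the next in-frame STOP, provided it is long enough. All names (`six_frame_candidates`, `min_len`, and so on) are hypothetical. The code assumes the standard STOP codons, accepts only ATG as START (prokaryotes also use GTG and TTG), and ignores upstream signals and database similarity.

```python
STOPS = {"TAA", "TAG", "TGA"}   # standard genetic code
STARTS = {"ATG"}                # simplification: GTG/TTG starts are ignored
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read in its own 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def candidates_in_strand(seq, min_len=300):
    """Yield (frame, start, end) for candidates in the 3 frames of one strand.

    start is the first START codon after the previous STOP; end is the
    position just past the closing STOP (0-based, end-exclusive).
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif codon in STOPS:
                # length counted from START up to, not including, the STOP
                if start is not None and i - start >= min_len:
                    yield frame, start, i + 3
                start = None


def six_frame_candidates(seq, min_len=300):
    """Scan both strands; '-' coordinates refer to the reverse complement."""
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in candidates_in_strand(s, min_len):
            hits.append((strand, frame, start, end))
    return hits


toy = "CC" + "ATG" + "AAA" * 4 + "TAA" + "G"
print(six_frame_candidates(toy, min_len=9))
```

The toy sequence is far shorter than a real gene, so the threshold is lowered to 9 nucleotides for the demonstration; with the default of 300 nothing would be reported. Taking the first ATG in each ORF gives the longest candidate, which is a common convention but only one of several possible choices.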