In the past few years, the main driver of the exponential growth of protein databases has been the sequencing of complete genomes. Currently, about one genome is submitted to the DNA databases every two weeks. These genomes vary greatly in size, from about 500 protein coding sequences (CDS) in 0.58 megabases (Mb) for Mycoplasma genitalium to about 8400 CDS in 9.11 Mb for Bradyrhizobium japonicum. As a result, in contrast with the situation 10 years ago, a very large fraction of protein sequences now has no experimental characterization data. Coding sequences are generally predicted very reliably in prokaryotic genomes, but the quality of the functional annotation of the predicted genes is highly variable.
Due to the amount of data involved, we have decided to supplement the traditional curation process of Swiss-Prot with a semi-automatic annotation approach that interacts closely with human experts. This process, named HAMAP for "High-quality Automated and Manual Annotation of microbial Proteomes", aims at achieving the same quality of annotation as manual curation rather than maximal coverage. The procedure therefore includes many checks to prevent the propagation of wrong annotation at the protein level and to spot problematic cases, which are channelled to manual curation. This annotation process is applied to only two very distinct subsets of proteins: sequences with no detectable similarity, and members of well-defined curated families whose functions are known (e.g. proteins involved in a metabolic process).
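The routing described above can be illustrated with a minimal sketch. It is not the HAMAP implementation: the `Protein` record, the `Route` labels, the family name `FAM_A` and the length-range check are all hypothetical, chosen only to show how a check can divert a problematic family member to manual curation while the two targeted subsets are annotated automatically.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    AUTO_ORPHAN = "automatic annotation (no detectable similarity)"
    AUTO_FAMILY = "automatic annotation (curated family)"
    MANUAL = "manual curation"
    OUT_OF_SCOPE = "not handled by the semi-automatic process"


@dataclass
class Protein:
    accession: str
    length: int
    family: Optional[str] = None   # curated family matched, if any
    has_similarity: bool = True    # any detectable similarity at all


# Hypothetical per-family constraint: acceptable sequence length range.
FAMILY_LENGTH_RANGE = {"FAM_A": (200, 260)}


def triage(protein: Protein) -> Route:
    """Decide how a protein is handled (illustrative rules only)."""
    if protein.family is None:
        # First subset: sequences with no detectable similarity.
        if not protein.has_similarity:
            return Route.AUTO_ORPHAN
        # Similar to something, but not in a curated family: out of scope.
        return Route.OUT_OF_SCOPE

    # Second subset: members of curated families, subject to checks.
    length_range = FAMILY_LENGTH_RANGE.get(protein.family)
    if length_range is None:
        return Route.MANUAL  # no constraint known, so ask an expert
    low, high = length_range
    if low <= protein.length <= high:
        return Route.AUTO_FAMILY
    return Route.MANUAL  # failed check: problematic case


if __name__ == "__main__":
    for p in (
        Protein("P1", 230, family="FAM_A"),
        Protein("P2", 90, family="FAM_A"),
        Protein("P3", 120, has_similarity=False),
    ):
        print(p.accession, "->", triage(p).value)
```

In this sketch a failed check never produces an annotation; it only changes the route, which mirrors the stated preference for annotation quality over coverage.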
As complete prokaryotic proteomes (i.e. the set of all annotated proteins of an organism) are available, it is now possible to check the consistency of annotation at the organism level. To help annotators with this task, a specific inference system using expert rules and metabolic knowledge on microbial organisms is being developed. This system is named Herbs, for "HAMAP Expert Rule Based System".
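As an illustration of what an organism-level consistency check might look like, the sketch below flags pathways that are only partially annotated in a proteome. It is an assumption-laden toy, not the Herbs rule language: the pathway name `pathway_X`, the step labels and the `check_proteome` function are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical expert knowledge: the steps each pathway requires.
PATHWAY_STEPS: Dict[str, Set[str]] = {
    "pathway_X": {"step_1", "step_2", "step_3"},
}


def check_proteome(annotated_functions: Set[str]) -> List[str]:
    """Return one warning per pathway that is only partially annotated."""
    warnings = []
    for pathway, steps in sorted(PATHWAY_STEPS.items()):
        present = steps & annotated_functions
        missing = steps - annotated_functions
        # A pathway that is entirely absent is consistent; a partially
        # present one suggests a missed or wrongly annotated protein.
        if present and missing:
            warnings.append(
                f"{pathway}: missing {', '.join(sorted(missing))}"
            )
    return warnings


if __name__ == "__main__":
    print(check_proteome({"step_1", "step_3"}))
```

A warning of this kind does not say which annotation is wrong; it only points the annotator to a set of proteins worth re-examining.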