Genomic databases
Translated from "Donner un sens au génome", La Recherche, n° 332, June 2000

Some of the sequences are deposited in databanks that are freely accessible via the Internet. Three banks share their data and in practice form a single bank with three entry points: EMBL in Europe (maintained by the European Bioinformatics Institute (EBI) at Hinxton, near Cambridge), GenBank in the United States (maintained by the National Center for Biotechnology Information (NCBI)), and the DNA Data Bank of Japan (DDBJ). The February 2000 release of GenBank holds 5.7 million sequences totalling 5.8 billion nucleotides, and the size of the bank now doubles every seven months, at a rate of 15 million new bases per day. It is obviously impossible to put a figure on the very large number of sequences that are kept out of these banks for confidentiality reasons related to the economic interests at stake. The human genome sequence that Craig Venter and his firm Celera say they have completed is not yet accessible either, but it should be soon: publication in the scientific journals is expected at the end of 2000.
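To give a feel for what a seven-month doubling time implies, here is a minimal sketch that extrapolates the size of GenBank from the February 2000 figure. It assumes the doubling time stays constant, which is an illustrative simplification rather than a claim of the article; the function name and its parameters are hypothetical.

```python
def projected_nucleotides(months_ahead, start=5.8e9, doubling_months=7.0):
    """Extrapolate bank size, ASSUMING a constant doubling time.

    start: nucleotides in the February 2000 release (figure from the text).
    doubling_months: doubling time quoted in the text.
    """
    return start * 2 ** (months_ahead / doubling_months)


# Size after one and after two doubling periods, under the assumption above.
after_7_months = projected_nucleotides(7)
after_14_months = projected_nucleotides(14)
print(f"{after_7_months:.3g} nucleotides after 7 months")
print(f"{after_14_months:.3g} nucleotides after 14 months")
```

Under this assumption the bank would quadruple in little more than a year, which is why storage and search methods have to scale with it.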

Each sequence comes with various pieces of information called "annotations". These naturally include the source organism, but also, where some of the genes have been identified experimentally or by computational analysis, a brief description of their function, as well as bibliographical links. The strength of these banks is that they bring together all the publicly available sequences, but they have several shortcomings. The quality of the sequences varies, and some of the data are redundant: there may be several copies of the same section of the genome of a given organism, sequenced and deposited by different laboratories. The annotations have little logical structure, which makes them difficult to interpret by computer, and they too are of very variable quality.

Because of this, a number of specialised databases have grown up alongside these banks. Some bring together sequences from a single organism, for example SubtiList and NRSub for the bacterium Bacillus subtilis, Cyanobase for the bacterium Synechocystis, and TAIR for the plant Arabidopsis thaliana. Others gather complementary annotations that cut across several sequence databases. This is the case with FlyBase for Drosophila, MGD (Mouse Genome Database) for the mouse and GDB (Genome Data Base) for the human genome. Others concentrate on a particular class of sequences across a group of organisms: the Eukaryotic Promoter Database (EPD) brings together promoter sequences from eukaryotic organisms. Finally, there are several databases devoted to proteins. SwissProt in Geneva is maintained by the group led by Amos Bairoch, in collaboration with the EBI, and contains more than 80,000 sequences from several hundred different organisms. Access to all these data on the Web has significantly changed biologists' research strategies.
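The notions of annotation and redundancy described above can be made concrete with a small sketch. It models a databank entry as a sequence plus free-form annotations, and flags entries from the same organism whose sequences are strictly identical. Everything here is an assumption made for illustration: the `SequenceRecord` type, the accession numbers and the sequences are invented, and real redundancy usually involves overlapping or near-identical fragments, which exact matching does not detect.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    accession: str   # hypothetical identifier, not a real accession number
    organism: str    # source organism
    sequence: str    # nucleotide sequence
    annotations: dict = field(default_factory=dict)  # free-form annotations


def group_redundant(records):
    """Return lists of accessions that share organism and identical sequence."""
    groups = {}
    for rec in records:
        # Compare sequences case-insensitively; exact identity only.
        key = (rec.organism, rec.sequence.upper())
        groups.setdefault(key, []).append(rec.accession)
    return [accs for accs in groups.values() if len(accs) > 1]


# Invented example entries: two laboratories deposit the same fragment.
records = [
    SequenceRecord("X0001", "Bacillus subtilis", "ATGGCTAAAGGT",
                   {"function": "hypothetical protein"}),
    SequenceRecord("X0002", "Bacillus subtilis", "atggctaaaggt"),
    SequenceRecord("X0003", "Synechocystis", "ATGGCTAAAGGT"),
]
duplicates = group_redundant(records)
print(duplicates)
```

Keeping the annotations as an unstructured dictionary mirrors the weakness noted in the text: nothing forces two laboratories to describe the same gene in the same way, so a program cannot compare their descriptions reliably.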
