Helix bioinformatics



Genomic databases


		Translated from "Donner un sens au génome", La Recherche, n° 332, June 2000

Some of the sequences are deposited in databanks that are freely accessible via the Internet. Three banks share their data and in practice form a single bank with three entry points: EMBL in Europe (maintained by the European Bioinformatics Institute (EBI) at Hinxton, near Cambridge), GenBank in the United States (maintained by the National Center for Biotechnology Information (NCBI)), and the DNA Data Bank of Japan (DDBJ). GenBank's February 2000 release holds 5.7 million sequences totalling 5.8 billion nucleotides, and the bank now doubles in size every seven months, growing by some 15 million new bases per day. It is obviously impossible to put a figure on the very large number of sequences that are withheld from these banks for confidentiality reasons related to the economic interests at stake. The human genome sequence that Craig Venter and his firm Celera say they have completed is not yet accessible either, but it should be soon: publication in the scientific journals is expected at the end of 2000.
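To give a feel for what a seven-month doubling time means, the short Python sketch below projects the size of the bank forward from the February 2000 figure quoted above. It assumes that the doubling time stays constant, which the text does not claim, and it assumes an average month of 30.44 days; the function names are chosen for illustration only.

```python
import math

# Figures quoted in the text for GenBank's February 2000 release.
START_NUCLEOTIDES = 5.8e9
DOUBLING_MONTHS = 7.0
DAYS_PER_MONTH = 30.44  # assumption: average length of a month


def projected_size(months_elapsed, start=START_NUCLEOTIDES,
                   doubling=DOUBLING_MONTHS):
    """Size after months_elapsed, assuming a constant doubling time."""
    return start * 2 ** (months_elapsed / doubling)


def daily_growth(size, doubling=DOUBLING_MONTHS):
    """Growth in bases per day implied by the doubling time at a given size."""
    return size * math.log(2) / (doubling * DAYS_PER_MONTH)


if __name__ == "__main__":
    for months in (0, 7, 14, 21):
        size = projected_size(months)
        print(f"after {months:2d} months: {size / 1e9:5.1f} billion nucleotides")
    rate = daily_growth(START_NUCLEOTIDES)
    print(f"implied growth now: {rate / 1e6:.1f} million bases per day")
```

Under these assumptions the implied daily growth comes out at a little under 19 million bases, the same order of magnitude as the 15 million quoted in the text, and the bank would hold eight times as much data after 21 months.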

Each sequence comes with various pieces of information called "annotations". These naturally include the source organism, but also, where some of the genes have been identified experimentally or by computational analysis, a brief description of their function, as well as bibliographical links. The great merit of these banks is that they bring together all the publicly available sequences, but they do have several shortcomings. The quality of the sequences varies, and some of the data are redundant: there may be several copies of the same section of the genome of a given organism, sequenced and deposited by different laboratories. The annotations have little logical structure, so they are difficult to interpret by computer, and they too are of very variable quality.

Because of this, a number of specialised databases have grown up alongside these banks. Some bring together sequences which relate to the same organism, for example SubtiList and NRSub for the bacterium Bacillus subtilis, Cyanobase for the bacterium Synechocystis, and TAIR for the plant Arabidopsis thaliana. Others group together complementary annotations, cutting across several sequence databases. This is the case with FlyBase for Drosophila, MGD (Mouse Genome Database) for the mouse and GDB (Genome Data Base) for the human genome. Others concentrate on a particular class of sequences across a group of organisms: the Eukaryotic Promoter Database (EPD) brings together promoter sequences from eukaryotic organisms. Finally, there are several databases devoted to proteins. SwissProt in Geneva is maintained by the group led by Amos Bairoch, in collaboration with the EBI, and contains more than 80,000 sequences relating to several hundred different organisms.

Access to all these data on the Web has significantly changed biologists' research strategies.
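The redundancy problem described above can be illustrated with a minimal Python sketch that groups entries whose sequences are strictly identical. This is only a toy: the accession identifiers and sequences are invented for the example, the function name is hypothetical, and real redundancy usually involves overlapping or nearly identical fragments, which calls for sequence similarity search rather than exact matching.

```python
from collections import defaultdict


def find_redundant(entries):
    """Group accession IDs whose sequences are identical.

    entries: iterable of (accession, sequence) pairs.
    Sequences are compared after removing whitespace and ignoring case.
    Returns a list of ID groups, one per sequence deposited more than once.
    """
    groups = defaultdict(list)
    for accession, sequence in entries:
        key = "".join(sequence.split()).upper()
        groups[key].append(accession)
    return [ids for ids in groups.values() if len(ids) > 1]


# Invented entries, for illustration only.
EXAMPLE_ENTRIES = [
    ("HYP0001", "atgaaacgc"),
    ("HYP0002", "ATGAAA CGC"),  # same sequence, different case and spacing
    ("HYP0003", "ATGCCCTAA"),
]

if __name__ == "__main__":
    for ids in find_redundant(EXAMPLE_ENTRIES):
        print("identical sequences:", ", ".join(ids))
```

Here the first two invented entries are reported as copies of one another, while the third is left alone.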

	The first genome projects
	Whole genome sequencing
	Genomic databases
	The problem of heterogeneous databases
	Searching for homology through similarity of sequences
	Finding genes in procaryotic genomes
	Finding genes in eucaryotic genomes
	Inferring gene functions from homology relationships
	The quest for gene function has not yet found an algorithmic solution
	Modeling and simulating gene interaction networks and metabolic pathways
	Biological data and knowledge need to be formalized