Steady progress in experimental techniques and devices has led to the so-called "high-throughput" production of genomic and post-genomic data.
Genomic data essentially consist of DNA sequences, which can be seen, from the computer scientist's point of view, as strings written in the 4-letter nucleotide alphabet. Whole genomes are now available. Their length, measured as the number of "letters" in the corresponding sequence, ranges from the order of millions for bacterial genomes to the order of billions or more for eukaryotic genomes. But knowing the sequence is only a starting point. The biologist first has to search for the regions that code for proteins or that are involved in the regulation of transcription. Many other regions of interest in a genome also need to be identified, including some in the non-coding parts, which can provide valuable clues for evolutionary studies.
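To make the "sequence as a string" view concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names (`is_dna`, `composition`, `find_motif`) and the toy sequence are hypothetical and do not come from any particular tool, and the naive motif scan stands in for the much more elaborate methods actually used to locate coding or regulatory regions.

```python
from collections import Counter

# The 4-letter nucleotide alphabet (assumption: no ambiguity codes such as N).
NUCLEOTIDES = frozenset("ACGT")


def is_dna(sequence: str) -> bool:
    """Return True if the non-empty string uses only the letters A, C, G, T."""
    return len(sequence) > 0 and set(sequence.upper()) <= NUCLEOTIDES


def composition(sequence: str) -> Counter:
    """Count how many times each letter occurs in the sequence."""
    return Counter(sequence.upper())


def find_motif(sequence: str, motif: str) -> list:
    """Return the 0-based start positions of every (possibly overlapping)
    occurrence of `motif` in `sequence`, using a naive left-to-right scan."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    sequence, motif = sequence.upper(), motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Restart one letter further so that overlapping hits are kept.
        start = sequence.find(motif, start + 1)
    return positions


# Hypothetical toy sequence, far shorter than any real genome.
toy = "ATGCGATACGCTTGA"
print(is_dna(toy))             # validity check against the alphabet
print(len(toy))                # sequence length in "letters"
print(composition(toy))        # letter counts
print(find_motif(toy, "ATG"))  # positions of a candidate start codon
```

On real genomes, whose length runs to millions or billions of letters, such a direct scan is only a starting point, and dedicated index structures and statistical models are normally required.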
The term "post-genomic" has come to denote all the data related to the various biological entities involved in the expression of genetic information. It thus covers data on transcriptomes, proteomes, genetic interaction networks and metabolic pathways. These data take multiple forms: images, matrices, graphs, strings over different alphabets, 2D and 3D geometric descriptions, etc.
In this context, the need for suitable computational methods is clear, owing both to the large volumes of data and to the wide diversity of their nature. These data have to be stored, structured and managed within databases, and the results of their analysis feed knowledge bases.
Research in bioinformatics thus aims, on the one hand, at designing adequate data and knowledge models for these databases and knowledge bases and, on the other hand, at developing algorithmic and statistical methods for data analysis. These efforts converge toward the development of integrated bioinformatics environments.