The exact number of genomic and post-genomic databases is not known, but it probably exceeds a thousand. Most of them are accessible on the Web. They gather various types of data, such as sequences, 2D and 3D structures, functional annotations or descriptions of metabolic pathways, and organise them in different ways according to the research questions that motivated their design.
Despite this volume and diversity, and despite increasing efforts to formalize biological knowledge, most biological data and the associated knowledge are still to be found only in texts, either as papers in the literature or as comments within the databases themselves.
In this context, the aim of information extraction (IE) techniques is to select pertinent sentences within a text and to extract from them structured facts that can be stored in databases. For example, a straightforward but already quite complex problem is the extraction of data on genetic interactions from texts. In this case, a fact could be a triple: the two genes or gene products that interact and the type of interaction (activation or inhibition).
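As an illustration, the triple described above could be represented as a small record ready to be stored as a database row. This is only a minimal sketch: the names InteractionFact, InteractionType, agent and target are hypothetical, the gene identifiers geneA and geneB are placeholders, and the assumption that the first gene acts on the second is ours, not part of the definition given above.

```python
from dataclasses import dataclass
from enum import Enum


class InteractionType(Enum):
    """The two interaction types named in the text."""
    ACTIVATION = "activation"
    INHIBITION = "inhibition"


@dataclass(frozen=True)
class InteractionFact:
    """One extracted fact: two genes (or gene products) and how they interact.

    Assumption: 'agent' acts on 'target'. The text only says that the
    two entities are in interaction.
    """
    agent: str
    target: str
    interaction: InteractionType

    def as_row(self):
        # Flatten the fact into a tuple suitable for a database insert.
        return (self.agent, self.target, self.interaction.value)


# Placeholder gene names, for illustration only.
fact = InteractionFact("geneA", "geneB", InteractionType.INHIBITION)
print(fact.as_row())
```

Making the record immutable (frozen) means that extracted facts can be collected in a set, which removes duplicates when the same interaction is reported in several sentences.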
The state of the art in natural language analysis rules out any hope of a generic solution to the information extraction problem. Specific terminological and ontological resources have to be built for every new kind of fact to be extracted. The objective of research in this domain is therefore to design systems that can be customized and tuned fairly easily for a given problem.
There are two main approaches to information extraction. The first one relies essentially on statistical analysis of the proper nouns and specialized terms that occur in a text. For example, the occurrence of two gene names or symbols in the same sentence could be a very simple hint of an interaction between the two genes. The presence in the same sentence of terms that have previously been associated with the description of interactions, such as "interact" or "inhibit", may increase the level of confidence.
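The co-occurrence idea can be sketched in a few lines. The code below is a deliberately naive heuristic rather than a real statistical model: the function name candidate_interactions, the gene names, the stem list (to which we added "activat") and the confidence values 0.5 and 1.0 are all arbitrary assumptions chosen for illustration.

```python
import re
from itertools import combinations

# Word stems assumed to signal an interaction (illustrative list).
INTERACTION_STEMS = ("interact", "inhibit", "activat")


def candidate_interactions(sentence, gene_names, stems=INTERACTION_STEMS):
    """Return (gene, gene, confidence) for each pair of known gene names
    that co-occur in the sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    # Known gene names present in the sentence, deduplicated and sorted.
    genes = sorted({t for t in tokens if t in gene_names})
    # An interaction term in the same sentence raises the confidence.
    has_term = any(t.lower().startswith(s) for t in tokens for s in stems)
    confidence = 1.0 if has_term else 0.5  # arbitrary illustrative values
    return [(a, b, confidence) for a, b in combinations(genes, 2)]


known_genes = {"geneA", "geneB", "geneC"}  # hypothetical dictionary
print(candidate_interactions("geneA inhibits geneB.", known_genes))
print(candidate_interactions("geneA and geneC were sequenced.", known_genes))
```

Note that this sketch says nothing about the direction or the type of the interaction; it only flags sentences worth examining, which is precisely the limitation of the first approach.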
The second approach makes use of natural language analysis tools, such as tokenizers, parsers and part-of-speech taggers. These tools are expected to allow the extraction of complex facts, but they are extremely demanding in terms of linguistic and computing resources.
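To give a flavour of this second approach, the toy pipeline below chains a tokenizer, a lexicon-based tagger and a single extraction rule. It is only a sketch under strong assumptions: the lexicon, the tag names, the gene names and the rule "nearest gene before the verb, nearest gene after it" are all hypothetical, and the rule stands in for what a genuine parser would provide.

```python
import re

# Toy lexicon standing in for a real part-of-speech tagger (hypothetical).
LEXICON = {
    "geneA": "GENE", "geneB": "GENE",
    "activates": "VERB", "inhibits": "VERB",
    "the": "DET", "expression": "NOUN", "of": "PREP",
}
# Mapping from interaction verbs to the interaction type of the fact.
VERB_TO_TYPE = {"activates": "activation", "inhibits": "inhibition"}


def tokenize(sentence):
    """Split a sentence into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def tag(tokens):
    """Attach a tag to each token; unknown words are tagged OTHER."""
    return [(t, LEXICON.get(t, "OTHER")) for t in tokens]


def extract_facts(sentence):
    """Return (agent, target, type) triples found around interaction verbs."""
    tagged = tag(tokenize(sentence))
    facts = []
    for i, (word, pos) in enumerate(tagged):
        if pos != "VERB" or word not in VERB_TO_TYPE:
            continue
        # Crude substitute for parsing: nearest gene on each side of the verb.
        agent = next((w for w, p in reversed(tagged[:i]) if p == "GENE"), None)
        target = next((w for w, p in tagged[i + 1:] if p == "GENE"), None)
        if agent and target:
            facts.append((agent, target, VERB_TO_TYPE[word]))
    return facts


print(extract_facts("geneA inhibits the expression of geneB"))
```

Even this toy version depends on a hand-built lexicon and a hand-written rule, which hints at why the approach is so demanding in linguistic resources. Passive or negated sentences would already defeat the rule used here.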
The Helix group has experience with both approaches, but is currently investigating the second one in close collaboration with the Xerox Research Center Europe (XRCE) in Meylan (Grenoble).