Helix bioinformatics











Research themes

Information extraction from texts


		Text mining: how can pertinent sentences be selected within a text, and how can structured facts be extracted from them and stored in databases?

The exact number of genomic and post-genomic databases is not known, but it probably exceeds a thousand. Most of them are accessible on the Web. They gather various types of data, such as sequences, 2D and 3D structures, functional annotations or descriptions of metabolic pathways, and organise them in different ways according to the questions that motivated their design.

Despite this volume and diversity, and despite increasing efforts to formalize biological knowledge, most biological data and the associated knowledge are still to be found in texts, either as papers in the literature or as comments within the databases themselves.

In this context, the aim of information extraction (IE) techniques is to select pertinent sentences within a text and to extract from them structured facts that can be stored in databases. For example, a straightforward but already quite complex problem is the extraction of data on genetic interactions from texts. In this case, a fact could be a triple: the two genes or gene products that interact and the type of interaction (activation or inhibition).
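As an illustration, such a triple could be represented as a small typed record before being stored in a database. This is only a minimal sketch: the class and field names, the placeholder gene symbols, and the assumption that the interaction is directed from a source to a target are not taken from the Helix work.

```python
from dataclasses import dataclass
from enum import Enum


class InteractionType(Enum):
    """The two interaction types mentioned in the text."""
    ACTIVATION = "activation"
    INHIBITION = "inhibition"


@dataclass(frozen=True)
class InteractionFact:
    """One extracted fact: two interacting partners and the interaction type.

    Assumption: the interaction is directed (source acts on target).
    """
    source: str
    target: str
    interaction: InteractionType

    def as_row(self) -> tuple:
        # Flatten the fact into a plain tuple, e.g. for insertion into a table.
        return (self.source, self.target, self.interaction.value)


# Placeholder gene symbols, for illustration only.
fact = InteractionFact("geneA", "geneB", InteractionType.INHIBITION)
print(fact.as_row())
```

Making the record immutable (`frozen=True`) lets identical facts extracted from different sentences be deduplicated with a set.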

The state of the art in natural language analysis dismisses any hope of a generic solution to the information extraction problem. Specific terminological and ontological resources have to be built for every new kind of fact to be extracted. The objective of research in this domain is therefore to design systems that can be customized and tuned fairly easily for a given problem.

There are two main approaches to information extraction. The first one essentially relies on statistical analysis of the proper nouns and specialized terms that occur in a text. For example, the occurrence of two gene names or symbols in the same sentence could be a very simple hint of an interaction between the two genes. The presence in the same sentence of terms that have previously been associated with the description of interactions, such as "interact" or "inhibit", may increase the level of confidence.
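The co-occurrence heuristic described above can be sketched in a few lines. The gene lexicon, the list of interaction terms and the two confidence levels (1 and 2) are hypothetical choices made for illustration; a real system would derive them from curated terminological resources and corpus statistics.

```python
import re
from itertools import combinations

# Hypothetical lexicons, for illustration only.
GENE_NAMES = {"geneA", "geneB", "geneC"}
INTERACTION_TERMS = {"interact", "interacts", "inhibit", "inhibits",
                     "activate", "activates"}


def score_sentence(sentence):
    """Return (gene1, gene2, confidence) for each gene pair in the sentence.

    Confidence is 1 for plain co-occurrence and 2 when an interaction
    term is also present in the same sentence (arbitrary scale).
    """
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    genes = sorted({t for t in tokens if t in GENE_NAMES})
    if len(genes) < 2:
        return []
    has_term = any(t.lower() in INTERACTION_TERMS for t in tokens)
    confidence = 2 if has_term else 1
    return [(a, b, confidence) for a, b in combinations(genes, 2)]


print(score_sentence("geneA inhibits geneB in vitro."))
print(score_sentence("geneA and geneC were sequenced."))
```

Note that this sketch only signals that an interaction may be present. It says nothing about the type or direction of the interaction.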

The second approach makes use of natural language analysis tools, such as tokenizers, parsers and part-of-speech taggers. These tools are expected to allow the extraction of complex facts, but they are extremely demanding in terms of linguistic and computing resources.

The Helix group has experience with both approaches, but is currently investigating the second one in close collaboration with the Xerox Research Center Europe (XRCE) in Meylan (Grenoble).


	Examples of sentences