Helix bioinformatics



Identification of gene, protein and species names

Identifying gene and protein names in text is difficult for several reasons that occur together, and the task cannot be reduced to looking names up in lists. First, authors often do not follow naming conventions or official nomenclatures, so the same gene can be designated by several names. Even when these synonyms are correctly registered, lexical variants occur: for example, "AB 1" instead of "AB1", or "T-shirt" instead of "Tshirt". Moreover, many gene names are also ordinary words, such as "if", "boss" or "arm". Only the context in which such a string occurs can therefore show whether it is a gene name.
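The lexical variants mentioned above can be illustrated with a small normalization step. This is a minimal sketch and not the method used by BioMiRe: the function names `normalize_variant` and `same_gene_name` are hypothetical, and the rule (drop spaces, hyphens and underscores, ignore case) is an assumption chosen only to cover the two examples given in the text.

```python
import re


def normalize_variant(name: str) -> str:
    """Collapse spacing, hyphenation and case differences in a candidate name.

    Assumption: whitespace, hyphens and underscores carry no meaning in a
    gene name. Real nomenclatures may need finer rules.
    """
    return re.sub(r"[\s\-_]+", "", name).lower()


def same_gene_name(a: str, b: str) -> bool:
    """Return True when two strings are lexical variants of the same name."""
    return normalize_variant(a) == normalize_variant(b)


if __name__ == "__main__":
    print(same_gene_name("AB 1", "AB1"))
    print(same_gene_name("T-shirt", "Tshirt"))
    print(same_gene_name("AB1", "AB2"))
```

Normalization of this kind only addresses spelling variants. It does nothing for names such as "if" or "arm", where the string is identical to an ordinary word and only the surrounding context can decide.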

In this context, the BioMiRe software combines several converging strategies. A pipeline of natural language analysis tools is applied to the text in order to progressively filter the terms that appear in it. Gene name dictionaries are looked up, contextual rules are applied, and lexical variants are recognized using appropriate matching algorithms.
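How the three strategies might fit together can be sketched as follows. This is an illustrative toy, not BioMiRe's implementation: the dictionary contents, the list of ambiguous names, the context cue words, the window size and all function names are assumptions made for the example.

```python
import re

# Hypothetical toy data; a real system would load per-species dictionaries.
GENE_DICTIONARY = {"boss", "arm", "if", "ab1", "tshirt"}
AMBIGUOUS_NAMES = {"boss", "arm", "if"}  # also ordinary English words
CONTEXT_CUES = {"gene", "genes", "protein", "mutant", "mutants", "allele"}


def normalize(token: str) -> str:
    """Map lexical variants (hyphens, case) to a single dictionary key."""
    return re.sub(r"[\s\-]+", "", token).lower()


def tokenize(text: str) -> list[str]:
    """Split text into alphanumeric tokens, keeping internal hyphens."""
    return re.findall(r"[A-Za-z0-9\-]+", text)


def find_gene_names(text: str, window: int = 2) -> list[str]:
    """Return tokens accepted as gene names.

    Step 1: dictionary lookup on the normalized token.
    Step 2: for ambiguous names, require a cue word within `window` tokens.
    """
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        key = normalize(token)
        if key not in GENE_DICTIONARY:
            continue
        if key in AMBIGUOUS_NAMES:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if not any(n.lower() in CONTEXT_CUES for n in neighbours):
                continue  # ordinary-word reading is more likely
        hits.append(token)
    return hits


if __name__ == "__main__":
    print(find_gene_names("The boss gene is required in the eye."))
    print(find_gene_names("Ask your boss if the arm hurts."))
```

The sketch works token by token, so a variant split across tokens, such as "AB 1", would need an extra step that joins adjacent tokens before lookup.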

The gene name dictionaries have been compiled from different sources for four species: Arabidopsis thaliana, Drosophila melanogaster, Mus musculus and Homo sapiens. The contextual rules were established by analyzing a manually annotated corpus of Medline abstracts concerning these species.


	Identification of gene, proteins and species names
	XRCE (Xerox Research Center Europe)