The Helix research group
Identification of gene, protein and species names
 

Identifying gene and protein names in text is difficult for several reasons at once, and the task cannot be reduced to looking names up in lists. First, authors often do not follow naming conventions or nomenclatures, so the same gene may be designated by several names. Even when these synonyms are correctly recorded, lexical variants occur: for example, « AB 1 » instead of « AB1 », or « T-shirt » instead of « Tshirt ». Moreover, many gene names are also ordinary words of natural language, such as « if », « boss » or « arm ». Only the context in which such a string occurs can therefore indicate whether it is a gene name.
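The lexical-variant problem can be illustrated with a minimal sketch. It assumes that variants differ only in case, spaces, hyphens or underscores; the function names and the tiny name list are hypothetical and do not come from any actual tool or dictionary.

```python
import re


def normalize_gene_name(name: str) -> str:
    """Drop whitespace, hyphens and underscores and lowercase the result,
    so that simple lexical variants share one key (illustrative assumption)."""
    return re.sub(r"[\s\-_]+", "", name).lower()


def build_variant_index(names):
    """Map each normalized key to the set of dictionary spellings sharing it."""
    index = {}
    for name in names:
        index.setdefault(normalize_gene_name(name), set()).add(name)
    return index


# Hypothetical two-entry dictionary, using the examples from the text.
index = build_variant_index(["AB1", "Tshirt"])
print(index.get(normalize_gene_name("AB 1")))     # {'AB1'}
print(index.get(normalize_gene_name("T-shirt")))  # {'Tshirt'}
```

Such aggressive normalization makes the ambiguity with ordinary words worse rather than better, which is why a lookup of this kind cannot be used without taking context into account.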

To address this, the BioMiRe software combines several converging strategies. A pipeline of natural language analysis tools is applied to the text to progressively filter the terms that appear. Gene name dictionaries are consulted, contextual rules are applied, and lexical variants are recognized using appropriate matching algorithms.
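How dictionary lookup and a contextual rule might work together can be sketched as follows. This is not the BioMiRe implementation: the dictionary, the list of ambiguous names, the cue words and the window size are all assumptions made for illustration, and names written as several tokens (such as « AB 1 ») are not handled.

```python
import re

# Hypothetical resources; real dictionaries and rules would be far larger.
GENE_DICTIONARY = {"boss", "arm", "if", "tshirt", "ab1"}
AMBIGUOUS_NAMES = {"boss", "arm", "if"}  # also ordinary English words
CONTEXT_CUES = {"gene", "genes", "protein", "mutant", "mutants",
                "allele", "expression", "locus"}


def tokenize(text):
    """Split text into alphanumeric tokens, keeping internal hyphens."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)


def normalize(token):
    """Remove hyphens and lowercase, so 'T-shirt' matches 'tshirt'."""
    return token.replace("-", "").lower()


def find_gene_names(text, window=2):
    """Return tokens accepted as gene names.

    A token must match the dictionary after normalization. If it is also an
    ordinary word, a cue word must appear within `window` tokens of it.
    """
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        key = normalize(token)
        if key not in GENE_DICTIONARY:
            continue
        if key in AMBIGUOUS_NAMES:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if not any(n.lower() in CONTEXT_CUES for n in neighbours):
                continue  # ambiguous word without supporting context
        hits.append(token)
    return hits


print(find_gene_names("The boss gene is required for eye development."))  # ['boss']
print(find_gene_names("Ask your boss if the arm hurts."))                 # []
print(find_gene_names("T-shirt mutants show defects."))                   # ['T-shirt']
```

The cue-word window is only one possible form of contextual rule; it is used here because it is short enough to read at a glance.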

The gene name dictionaries have been compiled from different sources for four species: Arabidopsis thaliana, Drosophila melanogaster, Mus musculus and Homo sapiens. The contextual rules were established by analysing a manually annotated corpus of Medline abstracts concerning these species.

On the same subject
XRCE (Xerox Research Center Europe)