Identifying the names of genes and proteins in text is difficult for several reasons at once, and the task cannot be reduced to merely looking names up in lists. First, authors often do not follow the naming conventions or nomenclatures, so the same gene can be designated by several names. Even when these synonyms are correctly recorded, lexical variants occur: for example, "AB 1" instead of "AB1", or "T-shirt" instead of "Tshirt". Moreover, many gene names are also ordinary words of natural language, such as "if", "boss" or "arm". Only the context in which such a string occurs can therefore establish that it is a gene name.
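The lexical variants mentioned above can be illustrated with a minimal sketch. It assumes that variants differ only in case, spacing and hyphenation, and it is not the matching algorithm used by BioMiRe; the function names `normalize` and `same_gene_name` are hypothetical.

```python
import re


def normalize(name: str) -> str:
    """Map a name to a canonical key by removing spaces, hyphens and
    underscores and folding case (an assumption made for illustration)."""
    return re.sub(r"[\s\-_]+", "", name).lower()


def same_gene_name(a: str, b: str) -> bool:
    """Return True when two strings are lexical variants of one name
    under the normalization above."""
    return normalize(a) == normalize(b)


# Variants taken from the text above.
print(same_gene_name("AB 1", "AB1"))
print(same_gene_name("T-shirt", "Tshirt"))
```

Case folding is a deliberate simplification here; it may be too aggressive where a nomenclature treats capitalization as meaningful.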
In this context, the BioMiRe software combines several converging strategies. A pipeline of natural language analysis tools is applied to the text to progressively filter the terms that appear in it. Gene name dictionaries are consulted, contextual rules are applied, and lexical variants are recognized using suitable matching algorithms.
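How a dictionary lookup and a contextual rule can work together may be sketched as follows. This is a toy illustration under stated assumptions, not the BioMiRe implementation: the dictionary `GENE_NAMES`, the list of ambiguous names, the cue words in `CUES` and the two-token window are all hypothetical.

```python
import re

# Hypothetical toy dictionary; real dictionaries are compiled from several sources.
GENE_NAMES = {"boss", "arm", "if", "ab1"}
# Dictionary entries that are also ordinary English words.
AMBIGUOUS = {"boss", "arm", "if"}
# Hypothetical cue words whose presence nearby supports a gene reading.
CUES = {"gene", "protein", "mutant", "allele", "expression"}


def tokenize(text):
    """Split text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def find_gene_names(text, window=2):
    """Return tokens found in the dictionary; ambiguous names are kept
    only when a cue word occurs within `window` tokens on either side."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    hits = []
    for i, tok in enumerate(lowered):
        if tok not in GENE_NAMES:
            continue
        if tok in AMBIGUOUS:
            context = lowered[max(0, i - window):i] + lowered[i + 1:i + 1 + window]
            if not CUES.intersection(context):
                continue  # no supporting context: treat as an ordinary word
        hits.append(tokens[i])
    return hits


print(find_gene_names("The boss gene is required in the eye."))
print(find_gene_names("Ask your boss if you can leave."))
```

The design choice in this sketch is to let unambiguous dictionary entries through directly and to require contextual evidence only for names that collide with common words.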
The gene name dictionaries have been compiled from different sources for four species: Arabidopsis thaliana, Drosophila melanogaster, Mus musculus and Homo sapiens. The contextual rules were established by analysing a manually annotated corpus of Medline abstracts concerning these species.