The Helix research group
Research themes
Work in progress and results
Software and databases
News from Helix
Home page
Site map Mail to Helix
Bases of motifs for generating repeated patterns with don't cares
Internal Report University of Pisa, TR-03-02, January 24th, 2003
N. Pisanti, M, Crochemore, R. Grossi, M.-F. Sagot
Identifying repetitive patterns in strings (sequences) is a computationally-demanding task on the large data sets available in computational biology, data mining, textual document processing, system security, and other areas. We consider patterns with don't cares in a given string s of n symbols drawn over an alphabet Sigma. The don't care is a special symbol "*" matching any symbol of Sigma; for example, pattern T*E matches both TTE and TEE inside s = COMMITTEE. (Note that a pattern cannot have a don't care at the beginning or at the end, as this is not considered informative.) Contrarily to string matching with don't cares, the pattern T*E is not given in advance for searching s. Instead, the patterns with don't cares appearing repeated in s are unknown and, as such, have to be discovered and extracted by processing s efficiently. In our example, T*E and M**T*E are among the patterns appearing repeated in COMMITTEE. In this paper we focus on the patterns called motifs, which appear at least q times in s for an input parameter q >= 2 called the quorum.
    Top of page   Home page  Prepare to print