Introduction and use case
Modern biology produces huge amounts of data. Stored in databases, these data have to be processed easily and, ideally, automatically. Computer science provides tools not only to store data, but also to model and exchange biological data.
In computer science, the questions of data representation and data exchange cannot be separated from each other, and both are therefore important issues. Using XML Schema, we have built a UML-based model for representing biological data. The use of XML guarantees that the data will be easy to exchange and to process with common, standardized tools.
Let us take the example of an EMBL database entry. It is formatted as a flat file in which each line is identified by a two-letter code, and it contains plenty of natural language that is hard to process. A biologist who wants to process one of these files has to write a complex parser to manipulate the data. Once the entry is converted to the BOXml format, the same biologist can use one of the numerous free, existing XML parsers. From the BOXml format, the data can be transformed into other formats, such as HTML if display is wanted. There are also many XML-aware tools, such as JAXB (a Java XML binding tool), which makes it possible to manipulate XML as Java objects and therefore eases integration into Java applications.
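The contrast described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flat-file fragment is made up in the style of an EMBL entry, and the XML element names (entry, description, organism) are hypothetical and are not taken from the actual BOXml schemas. The flat file needs a hand-written parser, whereas the XML version is read with the generic parser from the Python standard library.

```python
import xml.etree.ElementTree as ET

# Made-up fragment in the style of an EMBL flat file:
# a two-letter line code followed by free text, terminated by "//".
FLAT_ENTRY = """\
ID   X00001; SV 1; linear; mRNA; STD; PLN; 120 BP.
DE   Example sequence, first line of the description
DE   continued on a second line.
OS   Arabidopsis thaliana
//
"""

# Hypothetical XML rendering of the same entry.
# These element names are illustrative only, not the real BOXml vocabulary.
XML_ENTRY = """\
<entry id="X00001">
  <description>Example sequence, first line of the description continued on a second line.</description>
  <organism>Arabidopsis thaliana</organism>
</entry>
"""


def parse_flat(text):
    """Hand-written parser: group lines by their two-letter code."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, _, value = line.partition(" ")
        fields.setdefault(code, []).append(value.strip())
    # Lines sharing a code (e.g. a multi-line DE) are joined into one string.
    return {code: " ".join(values) for code, values in fields.items()}


def parse_xml(text):
    """Generic XML parser: no format-specific parsing code is needed."""
    root = ET.fromstring(text)
    return {
        "id": root.get("id"),
        "description": root.findtext("description"),
        "organism": root.findtext("organism"),
    }


if __name__ == "__main__":
    print(parse_flat(FLAT_ENTRY))
    print(parse_xml(XML_ENTRY))
```

Even in this toy case the flat-file parser has to know about line codes, continuation lines and the end-of-entry marker, and it still leaves the ID line as an unstructured string. The XML version delegates all of that to a standard tool.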
One of ORIEL's objectives is the definition of strict standards for the representation and exchange of factual biological information. In this context, the BOX project addresses the following points:
- The questions of the representation and of the exchange of factual data cannot be separated. A document (the exchange medium) containing biological data is always built upon an implicit or explicit model (the representation medium) of these data. The exchange model is often explicit (e.g. a flat file format) but the representation model is not. This often gives rise to ambiguities or difficulties when retrieving the information. In BOX we want to make the representation model as explicit as possible and its link to the exchange model as unambiguous as possible.
- Using well-accepted standards is an obvious requirement (provided these standards fulfill the objective). In this context, we make use of UML for the representation models and of W3C XML-Schema (WXS) for the exchange models.
- The BOX specifications (both for representation and for exchange) should be strict. This means that a document that does not conform to any one of the models should be rejected by a simple generic parser. The validity of a document is not only a question of syntax (e.g. typing "geene" instead of "gene"), but also a question of semantics (e.g. asserting that one gene belongs to more than one organism). In the BOX exchange model, we set up explicit constraints (e.g. identity and cardinality constraints) that a valid document must satisfy.
- The BOX specifications (both for representation and for exchange) should be "pragmatic". This means that already existing data should be easy to represent using BOX. Defining models so sophisticated that no existing data can fit into them would be of little practical interest.
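The difference between syntactic and semantic validity mentioned in the list above can be made concrete with a small sketch. In W3C XML-Schema such rules would normally be declared in the schema itself (occurrence constraints and identity constraints) and enforced by a validating parser. Since the Python standard library has no schema validator, the sketch below checks the same three rules by hand. The element names (genes, gene, organism) and the rules themselves are assumptions chosen to mirror the examples in the text. They are not the actual BOX specifications.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary for this sketch (not the real BOX element names).
ALLOWED_TAGS = {"genes", "gene", "organism"}


def check_document(xml_text):
    """Return a list of constraint violations; an empty list means valid."""
    errors = []
    root = ET.fromstring(xml_text)

    # Syntax-level check: every element name must belong to the vocabulary,
    # so a typo such as "geene" is rejected.
    for element in root.iter():
        if element.tag not in ALLOWED_TAGS:
            errors.append(f"unknown element: {element.tag}")

    seen_ids = set()
    for gene in root.iter("gene"):
        gene_id = gene.get("id")

        # Identity constraint: a gene id may appear only once per document.
        if gene_id in seen_ids:
            errors.append(f"duplicate gene id: {gene_id}")
        seen_ids.add(gene_id)

        # Cardinality constraint: a gene belongs to exactly one organism.
        count = len(gene.findall("organism"))
        if count != 1:
            errors.append(f"gene {gene_id}: expected 1 organism, found {count}")

    return errors


VALID = "<genes><gene id='g1'><organism>E. coli</organism></gene></genes>"

INVALID = (
    "<genes>"
    "<gene id='g1'><organism>A</organism><organism>B</organism></gene>"
    "<gene id='g1'><organism>A</organism></gene>"
    "<geene id='g2'/>"
    "</genes>"
)

if __name__ == "__main__":
    print(check_document(VALID))
    for problem in check_document(INVALID):
        print(problem)
```

Both documents are well-formed XML, so a non-validating parser accepts them. Only the explicit constraints distinguish the valid document from the invalid one, which is why a strict exchange model states them in the schema rather than leaving them to each application.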
BOX is primarily a core library of XML-Schema specifications. Its first goal is therefore to provide XML designers with XML-Schema components (rather than definitive format specifications). To this end, BOX is organised into modules (15 modules in BOX version 1.0) that may be reused either in BOX or in other XML projects.
Defining a standard (especially in biology) is a hard task, since specifications only become standards when people actually start using them. The current version of BOX has been designed by people at INRIA Rhône-Alpes. Although we took great care over the overall design, these specifications should not be considered definitive. Indeed, we wish the material contained in BOX (i.e. specifications and documentation) to be made available to the community in such a way that everybody who feels concerned by this work can participate actively.