XML Databases - The Business Case

The history of databases and a way forward for information exchange

For three decades, application developers have relied on relational databases as the bedrock for a persistent data storage layer. While the technology is mature, today's requirements are becoming more complex and relational databases may not be the tool for the job in hand, but what else does a designer / developer pick if they know no better? - Relational Databases were developed in the days of procedural programming languages (e.g. C, COBOL and RPG), programming techniques have evolved in many ways since 30 years ago most notably with introduction of an Object Oriented approach but the persistent storage model has stayed the same. This article tries to question if developers have been dumbing down and creating more work for themselves (unknowingly) for many years, this article also attempts to give an eye-opener into a new approach of storing and retrieving data.

Commonly today, data structures are often modelled in a hierarchical object manner, imagine a simple invoice in terms of an object hierarchy:

Simple Invoice, Theoretical Business Object

Invoice = {
  date : "2008-05-24"
  invoiceNumber : 421

  InvoiceItems : {
    Item : {
      description : "Wool Paddock Shet Ret Double Bound Yellow 4'0"
      quantity : 1
      unitPrice : 105.00
    }
    Item : {
      description : "Wool Race Roller and Breastplate Red Double"
      quantity : 1
      unitPrice : 75.00
    }
    Item : {
      description : "Paddock Jacket Red Size Medium Inc Embroidery"
      quantity : 2
      unitPrice : 67.50
    }
  }
}

The following is an example relational structure, containing this data

Table Invoices
dateinvoiceId
2008-05-24421

Table InvoiceItems
invoiceIddescriptionquantityunitPrice
421Wool Paddock Shet Ret Double Bound ...1105.00
421Wool Race Roller and Breastplate Red ...175.00
421Paddock Jacket Red Size Medium Inc ...267.50

Representing this simple single Invoice Object in a relational database can be done, but immediately even for something this simple you need more than 1 table, table joins based on keys and of course the Object has to be spanned over multiple tables. This leaves room for human error; when inserting and updating data it is up to the developer to ensure keys correctly match and when trying to rebuild the object from the persistent layer you need an SQL query which will select data from multiple tables, by nature the query returns the data as essentially a result set of flat 1 dimensional arrays and its then up to the developer to build this hierarchical object from scratch.

To a programmer who has been developing with relational databases for some time this may seem like second nature but for a new developer that has just learned the concepts of Object Oriented programming this may seem a little alien.

Leaving aside the programmer's responsibility to ensure the mapping between Object and relational structures, because the data types in SQL databases are quite simplistic all validation must be performed within the business logic layer of an application before any data can be inserted or updated in the database.

SQL "CREATE TABLE" and the SQL data type values a developer can bind to each column is too simplistic to be used as a means of validating data taken directly from a user's input. Often the business logic layer in today's applications performs additional validation, e.g. checks that a field is a valid phone number or a valid e-mail address or even that when the field is inserted into the SQL INSERT or UPDATE statement that it won't actually break the syntax or cause a security breach.

Object Relational Mapping has definitely eased these problems with relational databases because it allows a relational database to become a "virtual object database", but O/R Mapping has brought some problems of its own. O/R Mapping techniques and frameworks can be difficult to learn, it is by no means simple to map complex Java classes with multiple Java class descendants to a relational structure, validating user's input is still cumbersome and essentially still needs to be written in full in the business logic layer and it of course adds an additional performance overhead because essentially the O/R mapping process attempts to emulate the natural functionality of an Object oriented database.

Object oriented databases are designed to work well with object oriented programming languages such as Java, C# and C++. Object Databases use the same model as today's programming languages as they store and index theoretical objects. Object databases are generally recommended when there is a business need for high performance processing on complex data.

What has held Object databases back over the years is A. The industries resilience to change. B. The majority of developers in the industry can't be bothered to investigate about new or alternative technologies to the ones that are common place in industry.

However, thankfully change does happen. Today we are living in the information age, businesses are talking to each other via complex XML data structures, (SOAP and RESTful Web Services becoming the ever more popular means of information exchange between disparate applications and systems).

The XML messages exchanged are by nature hierarchical and deeply tree structured, sometimes the data is even unpredictable and sometimes the structure is prone to change at any time, developers trying to map this data to a relational structure may find their lives becoming more and more difficult.

XML Databases offer the same functionality of Object Databases, data is structured in a hierarchical manner except XML Databases store XML documents instead of theoretical Objects. While in principle this is the same concept of data storage, XML databases have the added benefit of being able exchange the data in its native format, which is perfect for today's requirements.

Where Object Databases have Object Query Language (OQL), XML Databases have XQuery which is a W3C standard. XQuery covers the major functionality from former language proposals like XML-QL, XQL, OQL and the SQL standard.

Going back to the Invoice object and a persistent layer. A developer working with an XML Database would just need to place an XML representation of the Object into a collection.

The following is an example of the invoice data but stored in XML format

Simple Invoice, XML Representation

<invoice>

 <number>421</number>

 <date>2008-05-24</date>

 <items>

  <item>
   <description>Wool Paddock Shet Ret Double Bound Yellow 4'0"</description>
   <quantity>1</quantity>
   <unitPrice>105.00</unitPrice>
  </item>

  <item>
   <description>Wool Race Roller and Breastplate Red Double</description>
   <quantity>1</quantity>
   <unitPrice>75.00</unitPrice>
  </item>

  <item>
   <description>Paddock Jacket Red Size Medium Inc Embroidery</description>
   <quantity>2</quantity>
   <unitPrice>67.50</unitPrice>
  </item>

 </items>

</invoice>

Pulling up the full invoice from the XML Database requires no long winded table joins, it is as simple as:

XQuery

collection("invoices")/invoice[number=421]

Pretty simple when you compare it to the equivilant SQL for Relational Databases:

Equivilant SQL

select * from invoiceitems inner join invoices on
invoiceitems.invoiceid = invoices.invoiceid where invoices.invoiceid = 421;

XML Databases can accept structured as well as unstructured data. XML documents do not have to conform to any set Schema so a developer can fire anything they wish at the database, no need to modify tables and columns. On the other hand, XML may conform to an XML Schema.

XML Schema allows one to define an XML document in both its node structure (e.g. elements and attributes) as well as the data types contained within these nodes. It allows one to define these data types in very explicit detail, e.g. a float with additional constraints like Maximum Number, Minimum Number, Total Digits, Fraction Digits, etc. Strings can also be given many additional constraints including Minimum and Maximum Lengths as well as matching a user defined Regular Expression, this is a perhaps the most effective constraint.

Because XML Schema is so powerful in terms of the explicitness of the constraints that can be placed on XML data, potentially large amounts of validation that would normally be performed in the business logic layer of an application can be reduced dramatically or even completely.

A great tool for Java/J2EE Developers is Java Architecture for XML Binding or JAXB which allows a developer to generate simple Java Bean classes which represent the structure of an underlying XML document, the classes can be generated from an existing XML Schema. Object/XML Mapping if you like.

JAXB allows a developer to convert XML documents into in-memory Java Bean Objects which act as an interface to the underlying XML, it also has the ability to serialize these in-memory Java Objects back into XML documents. Validation of the in-memory data is performed based on the original XML Schema from which the classes were generated, which means far less / no validation code would need to be written in the business logic layer of the application.

JAXB also allows the developer to generate an XML Schema based on existing Java code, so a developer can use an XML Database much like an Object database without ever getting into the detail of using XML, XQuery or SOAP / RESTful Web Services.

Conclusion

A new project which deals with XML and/or unpredictable data, choosing to use a Relational Database will not stop the project in its tracks but a great deal of time will be wasted on trivial matters that could be easily solved by making use of an XML Database instead.

If you would like to know more about any of the technologies discussed in this article, or are considering using them in your project or business direction, please get in touch.