Feb 1, 2006

Sparklines, internship ends, MS Office XML documents

In all the furore over my new and continually-evolving design I've been neglectful of my Sparklines code. Well, the good news is I've been working on it so intensively that the current version of sparklines.doc is so much more functional than the code I've posted here that I've seriously been thinking about deleting the code from the last two sparklines entries. But hell, it's amusing to look at.

The bad news is I've been working on even more exciting stuff for the last few days -- like my blog and conversion of the bank's daily reports into parseable XML form -- that the work I'm doing on sparklines.doc has slowed to a crawl. BUT to be fair, it meets my needs fully.

I'll take a brief interlude here and talk about some of the stuff I tried to do during my internship at ONE Bank, Dhanmondi branch. It'll lead up directly to why I'm so hyped-up about MS Office's new XML file format -- and this is weird, because just a few days ago I'd have told you OpenOffice.org's XML file format is better than Microsoft's. Now I'm strongly inclined to say otherwise.

At the bank, as I (think I) mentioned in Sparklines: can't resist, they have a lot of computer-generated output put in their hard drives daily. I guess their database-querying and -reporting software is tasked to process the day's transactions and output reports on the states of the various accounts, clients and such, every night. Now these reports are in plain-text format and currently the people in my branch, whenever they need to look up some information, just open up the report files in Wordpad and do a search for it.

This simple searching of plain-text files is well and good for small-scale information needs like looking up the account number of an account holder who can't recall the number, finding the interest rates offered on different types of deposits and loans, and sometimes also finding out historical interest rates. But it quickly starts sucking up your time if you have to keep doing things like:

  • prepare a monthly report on deposit mobilisation -- that is, a tally of the people who opened and closed deposit accounts, along with their account balances, and total amount of money deposited and withdrawn thus;

  • prepare reports with tallies of amounts grouped by type (deposit/loan), interest rate, and then economic sector code, as required by Bangladesh Bank;

  • prepare credit risk grading reports;

  • create mass-mailings to send out to account-holders and prospective clients;

  • email daily lists of transactions to the companies with which the bank has bill-collection arrangements;

  • and many more types of documents that the employees of each branch routinely have to prepare.


The common theme running through all of these different tasks is: the user has to process information output from the central database(s) in different ways and create documents showing these data in a nicely formatted way. And this has to be done month after month, with a lot of the document staying basically the same -- the changing data being the newly-processed information.

To me, the processes above are screaming to be automated. And this is where Office's new XML file formats come in. From what I've read about Office's (2003 and above) capabilities in Brian Jones' blog, Word lets you define arbitrary arrangements for your data and then lets you tell it how to format and display the data. This is done through the magic of XML schemas and stylesheets. For details, check out the article. But in short, suppose you start out with some raw data you're working on, information about about some accounts:









Account IDHolder NameBalanceDebit/Credit
1234567890Mr X100000Cr
2345678901Ms Y96000Cr
3456789012Dr Z45009.87Dr

You have this data in XML format, obviously ideal because of its parseability to both humans and computers. Say, this is your XML:

<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<al:accountlist xmlns:al="http://yawar.blogspot.com">
<al:account>
<al:accid>1234567890</al:accid>
<al:holdername>Mr X</al:holdername>
<al:balance>100000</al:balance>
<al:drcr>Cr</al:drcr>
</al:account>
<al:account>
<al:accid>2345678901</al:accid>
<al:holdername>Ms Y</al:holdername>
<al:balance>96000</al:balance>
<al:drcr>Cr</al:drcr>
</al:account>
<al:account>
<al:accid>3456789012</al:accid>
<al:holdername>Dr Z</al:holdername>
<al:balance>45009.87</al:balance>
<al:drcr>Dr</al:drcr>
</al:account>
</al:accountlist>

Now, you need a way to tell Word (or any other XML-processing program) what kind of values to expect in each field so that it doesn't goof up on bad data: the account ID should be a sequence of ten digits; the name should be a string; the balance a real number (greater than zero), and the Debit/Credit field should be either `Dr' or `Cr', and nothing else. In fact, we could really just use `d' and `c', but Dr and Cr are time-honoured abbreviations of the words. Turns out the way to do is is through another XML file, a schema definition file.

More about schemas at MSDN's Advanced XML Support in Word and the W3C's XML Schema Primer.


Schema generator at XSD Inference Demo.

More tools, including one that validates your XML file against its schema, at XML Tools.


The schema definition for our account listing should be something like:

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://yawar.blogspot.com"
xmlns:al="http://yawar.blogspot.com"
elementFormDefault="qualified">
<xsd:element name="accountlist" />
<xsd:complexType>
<xsd:sequence>
<xsd:element name="account" minOccurs="1" maxOccurs="unbounded">
<xsd:complexType>
<xsd:all>
<xsd:element name="accid">
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:pattern value="[0-9]{10}" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="holdername" type="xsd:string" />
<xsd:element name="balance">
<xsd:simpleType>
<xsd:restriction base="xsd:decimal">
<xsd:minInclusive value="0" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="drcr">
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:pattern value="[DC]r" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
</xsd:all>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>

Yes, it looks rather daunting, but it's not that hard; I whipped this schema up myself browsing through W3Schools' Schema tutorials.

The last piece of the puzzle is, how do we tell Word how to format and display our nice XML file? The answer is the standardised XML Stylesheet Language, XSL. Yet another piece of XML coding, this file instructs Word on how to create a Word XML document on-the-fly from the XML data file that you have (the accounts listing file). Let me try a whimsical explanation here. Imagine the stylesheet file is talking to Word, giving running instructions as the input file is being processed.

`Started reading the document? OK, write the heading, ``ACCOUNT LISTING''. Format it with the ``Heading 1'' style. Now leave a blank line and start a four-column table, with column headers ``Account ID'', ``Holder Name'', ``Balance'' and ``Debit/Credit''.

`Now for each <account>, create a new table row, and: put the contents of the <accid> in the first column; the contents of the <holdername> in the second column; <balance> in the third; and <drcr> in the fourth. Oh, and sort the table rows by account holder name.'

And remember, this is a Word document that is being created -- not an HTML file. Yeah, you can do all that with XSL inside Word!

As an aside, for someone like me, who cut his teeth on LaTeX and then a little bit of DocBook (SGML and XML) with PassiveTeX, Jade, Apache FOP, you name it, Word's new XML capabilities just blow me away. It looks like Word has become the powerful XML processing and transformation engine that documentation writers have always dreamed of.


Soon, I'll post the stylesheet file I've created to do the transformation, and hopefully graphical comparisons of the different views of the same XML document.

No comments: