Feb 6, 2006

Styling Office XML Documents

This post has been due for several days now. Been doing more research into Office 2003's XML file formats. The primary port of call for all budding Office 2003 XML developers is Office 2003 XML Reference Schemas. This is where you can download the schemas -- the formal descriptions -- and the explanatory documentation on the XML document formats for Word, Excel and others. Another important link is to the page for O'Reilly's new book, Office 2003 XML. There is a download for a sample chapter, Chapter 2: The WordprocessingML Vocabulary. Obviously these are very important references for someone who is just entering the field.

When I posted my last entry, I had already created the style file that tells Word how to display the raw account listing. I just wanted to play around with it a little bit, especially to see if I could get the table formatting right. The formatting as it currently is, is OK; but I wanted to customise it a little bit.

By now I've realised that mastery of tables in WordprocessingML will take some time and (at least) a couple of good references (see links above). So I'll just go ahead with the original plan.

Before I list the actual XSL transformations file that does the magic, I want to actually show its results, to get some oohs and aahs from the audience. Here they are:



The account listing as shown by Word when Word has no way of knowing how else to show it.


alist



The account listing with an XSL transformation applied by Word. That is, when the XSL file tells Word how to display it.


alist_transformed



OK, here is the XSL style file:


<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:al="http://yawar.blogspot.com">
<xsl:template match="/">
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core" xmlns:aml="http://schemas.microsoft.com/aml/2001/core" xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" w:macrosPresent="no" w:embeddedObjPresent="no" w:ocxPresent="no" xml:space="preserve">
<o:DocumentProperties>
<o:Title>Account Listing</o:Title>
<o:Author>Yawar Amin</o:Author>
</o:DocumentProperties>
<w:fonts>
<w:defaultFonts w:ascii="Times New Roman" w:h-ansi="Times New Roman" w:cs="Times New Roman"/>
</w:fonts>
<w:styles>
<w:style w:type="paragraph" w:default="on" w:styleId="Normal">
<w:name w:val="Normal"/>
<w:rPr>
<wx:font wx:val="Times New Roman"/>
<w:sz w:val="24"/>
<w:sz-cs w:val="24"/>
<w:lang w:val="EN-GB" w:fareast="EN-US" w:bidi="AR-SA"/>
</w:rPr>
</w:style>
<w:style w:type="paragraph" w:styleId="Heading1">
<w:name w:val="heading 1"/>
<wx:uiName wx:val="Heading 1"/>
<w:basedOn w:val="Normal"/>
<w:next w:val="Normal"/>
<w:rsid w:val="00B04D4D"/>
<w:pPr>
<w:pStyle w:val="Heading1"/>
<w:keepNext/>
<w:pBdr>
<w:top w:val="dotted" w:sz="4" wx:bdrwidth="10" w:space="1" w:color="auto"/>
</w:pBdr>
<w:spacing w:before="240" w:after="60"/>
<w:jc w:val="center"/>
<w:outlineLvl w:val="0"/>
</w:pPr>
<w:rPr>
<wx:font wx:val="Times New Roman"/>
<w:b/>
<w:b-cs/>
<w:kern w:val="32"/>
<w:sz w:val="48"/><w:sz-cs w:val="48"/>
</w:rPr>
</w:style>
<w:style w:type="table" w:styleId="MyTableContemporary">
<w:name w:val="My Table Contemporary"/>
<w:basedOn w:val="TableNormal"/>
<w:rPr>
<wx:font wx:val="Times New Roman"/>
</w:rPr>
<w:tblPr>
<w:tblInd w:w="0" w:type="dxa"/>
<w:tblBorders>
<w:insideH w:val="single" w:sz="18" wx:bdrwidth="45" w:space="0" w:color="FFFFFF"/>
<w:insideV w:val="single" w:sz="18" wx:bdrwidth="45" w:space="0" w:color="FFFFFF"/>
</w:tblBorders>
<w:tblCellMar>
<w:top w:w="0" w:type="dxa"/>
<w:left w:w="108" w:type="dxa"/>
<w:bottom w:w="0" w:type="dxa"/>
<w:right w:w="108" w:type="dxa"/>
</w:tblCellMar>
</w:tblPr>
<w:tblStylePr w:type="firstRow">
<w:rPr>
<w:b/>
<w:b-cs/>
<w:color w:val="auto"/>
</w:rPr>
<w:tblPr/>
<w:tcPr>
<w:tcBorders>
<w:tl2br w:val="none" w:sz="0" wx:bdrwidth="0" w:space="0" w:color="auto"/>
<w:tr2bl w:val="none" w:sz="0" wx:bdrwidth="0" w:space="0" w:color="auto"/>
</w:tcBorders>
<w:shd w:val="pct-20" w:color="000000" w:fill="FFFFFF" wx:bgcolor="F2F2F2"/>
</w:tcPr>
</w:tblStylePr>
<w:tblStylePr w:type="band1Horz">
<w:rPr>
<w:color w:val="auto"/>
</w:rPr>
<w:tblPr/>
<w:tcPr>
<w:tcBorders>
<w:tl2br w:val="none" w:sz="0" wx:bdrwidth="0" w:space="0" w:color="auto"/>
<w:tr2bl w:val="none" w:sz="0" wx:bdrwidth="0" w:space="0" w:color="auto"/>
</w:tcBorders>
<w:shd w:val="pct-5" w:color="000000" w:fill="FFFFFF" wx:bgcolor="FFFFFF"/>
</w:tcPr>
</w:tblStylePr>
<w:tblStylePr w:type="band2Horz">
<w:rPr>
<w:color w:val="auto"/>
</w:rPr>
<w:tblPr/>
<w:tcPr>
<w:tcBorders>
<w:tl2br w:val="none" w:sz="0" wx:bdrwidth="0" w:space="0" w:color="auto"/>
<w:tr2bl w:val="none" w:sz="0" wx:bdrwidth="0" w:space="0" w:color="auto"/>
</w:tcBorders>
<w:shd w:val="pct-20" w:color="000000" w:fill="FFFFFF" wx:bgcolor="F2F2F2"/>
</w:tcPr>
</w:tblStylePr>
</w:style>
</w:styles>
<w:docPr>
<w:view w:val="print"/>
<w:zoom w:percent="100"/>
<w:doNotEmbedSystemFonts/>
<w:validateAgainstSchema/>
<w:saveInvalidXML w:val="off"/>
<w:ignoreMixedContent w:val="off"/>
<w:alwaysShowPlaceholderText w:val="off"/>
</w:docPr>
<w:body>
<wx:sect>
<w:sectPr>
<w:pgSz w:w="11909" w:h="16834" w:orient="portrait" w:code="9"/>
</w:sectPr>
<w:p>
<w:pPr>
<w:pStyle w:val="Heading1"/>
</w:pPr>
<w:r>
<w:t>ACCOUNT LISTING</w:t>
</w:r>
</w:p>
<w:p></w:p>
<w:tbl>
<w:tblPr>
<w:tblStyle w:val="MyTableContemporary"/>
<w:tblW w:w="5000" w:type="pct"/>
<w:tblLook w:val="01E0"/>
</w:tblPr>
<w:tblGrid>
<w:gridCol w:w="2832"/>
<w:gridCol w:w="3238"/>
<w:gridCol w:w="2089"/>
<w:gridCol w:w="3061"/>
</w:tblGrid>
<w:tr>
<w:tc>
<w:p>
<w:pPr>
<w:jc w:val="right"/>
</w:pPr>
<w:r>
<w:t>Account ID</w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:p>
<w:r>
<w:t>Holder Name</w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:p>
<w:pPr>
<w:jc w:val="right"/>
</w:pPr>
<w:r>
<w:t>Balance</w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:p>
<w:r>
<w:t>Debit/Credit</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
<xsl:for-each select="al:accountlist/al:account">
<xsl:sort select="al:holdername"/>
<w:tr>
<w:tc>
<w:p>
<w:pPr>
<w:jc w:val="right"/>
</w:pPr>
<w:r>
<w:t><xsl:value-of select="al:accid"/></w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:p>
<w:r>
<w:t><xsl:value-of select="al:holdername"/></w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:p>
<w:pPr>
<w:jc w:val="right"/>
</w:pPr>
<w:r>
<w:t><xsl:value-of select="al:balance"/></w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:p>
<w:r>
<w:t><xsl:value-of select="al:drcr"/></w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
</xsl:for-each>
</w:tbl>
</wx:sect>
</w:body>
</w:wordDocument>
</xsl:template>
</xsl:stylesheet>


Yeah, whew! That was intense. Mostly though, it was the Word XML markup, which I won't even try to explain now. But for more on WordprocessingML, please check out the latest article at Brian Jones' blog. It's got an excellent mid-level overview.

I've made the XSL instructions bold so you can pick them out clearly and marvel at how few of them there are. (By the way, learned the XSLT at the W3Schools' XSLT Tutorial.) Basically, they say the same thing I described towards the end of my last entry.

Sorry about the underlines in the listing -- have been exploring off-the-top-of-my-head ways to best show listings in HTML, and this is the best compromise I've been able to find between source code editability and web page readability. Will update later if I find anything better. Leave comments with any ideas you might have. Well, got rid of the underlines with some cool new hacks I didn't know about before. Check out the rule for the <pre> tag in my stylesheet.


So what has this exercise accomplished? We see that Word has become, as a result of customers' demands on it, a full-fledged XML transformation and validation engine. With this power, businesses have an amazing new ability to juggle information around, push it into and pull it out of Office documents, change it, and just generally go crazy with it.

I know that the trend in business lately forever in our age has been to, whenever a new problem is faced, just throw more technology at it. Am I complaining? No way. Bring it on!

Feb 1, 2006

Sparklines, internship ends, MS Office XML documents

In all the furore over my new and continually-evolving design I've been neglectful of my Sparklines code. Well, the good news is I've been working on it so intensively that the current version of sparklines.doc is so much more functional than the code I've posted here that I've seriously been thinking about deleting the code from the last two sparklines entries. But hell, it's amusing to look at.

The bad news is I've been working on even more exciting stuff for the last few days -- like my blog and conversion of the bank's daily reports into parseable XML form -- that the work I'm doing on sparklines.doc has slowed to a crawl. BUT to be fair, it meets my needs fully.

I'll take a brief interlude here and talk about some of the stuff I tried to do during my internship at ONE Bank, Dhanmondi branch. It'll lead up directly to why I'm so hyped-up about MS Office's new XML file format -- and this is weird, because just a few days ago I'd have told you OpenOffice.org's XML file format is better than Microsoft's. Now I'm strongly inclined to say otherwise.

At the bank, as I (think I) mentioned in Sparklines: can't resist, they have a lot of computer-generated output put in their hard drives daily. I guess their database-querying and -reporting software is tasked to process the day's transactions and output reports on the states of the various accounts, clients and such, every night. Now these reports are in plain-text format and currently the people in my branch, whenever they need to look up some information, just open up the report files in Wordpad and do a search for it.

This simple searching of plain-text files is well and good for small-scale information needs like looking up the account number of an account holder who can't recall the number, finding the interest rates offered on different types of deposits and loans, and sometimes also finding out historical interest rates. But it quickly starts sucking up your time if you have to keep doing things like:

  • prepare a monthly report on deposit mobilisation -- that is, a tally of the people who opened and closed deposit accounts, along with their account balances, and total amount of money deposited and withdrawn thus;

  • prepare reports with tallies of amounts grouped by type (deposit/loan), interest rate, and then economic sector code, as required by Bangladesh Bank;

  • prepare credit risk grading reports;

  • create mass-mailings to send out to account-holders and prospective clients;

  • email daily lists of transactions to the companies with which the bank has bill-collection arrangements;

  • and many more types of documents that the employees of each branch routinely have to prepare.


The common theme running through all of these different tasks is: the user has to process information output from the central database(s) in different ways and create documents showing these data in a nicely formatted way. And this has to be done month after month, with a lot of the document staying basically the same -- the changing data being the newly-processed information.

To me, the processes above are screaming to be automated. And this is where Office's new XML file formats come in. From what I've read about Office's (2003 and above) capabilities in Brian Jones' blog, Word lets you define arbitrary arrangements for your data and then lets you tell it how to format and display the data. This is done through the magic of XML schemas and stylesheets. For details, check out the article. But in short, suppose you start out with some raw data you're working on, information about about some accounts:









Account IDHolder NameBalanceDebit/Credit
1234567890Mr X100000Cr
2345678901Ms Y96000Cr
3456789012Dr Z45009.87Dr

You have this data in XML format, obviously ideal because of its parseability to both humans and computers. Say, this is your XML:

<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<al:accountlist xmlns:al="http://yawar.blogspot.com">
<al:account>
<al:accid>1234567890</al:accid>
<al:holdername>Mr X</al:holdername>
<al:balance>100000</al:balance>
<al:drcr>Cr</al:drcr>
</al:account>
<al:account>
<al:accid>2345678901</al:accid>
<al:holdername>Ms Y</al:holdername>
<al:balance>96000</al:balance>
<al:drcr>Cr</al:drcr>
</al:account>
<al:account>
<al:accid>3456789012</al:accid>
<al:holdername>Dr Z</al:holdername>
<al:balance>45009.87</al:balance>
<al:drcr>Dr</al:drcr>
</al:account>
</al:accountlist>

Now, you need a way to tell Word (or any other XML-processing program) what kind of values to expect in each field so that it doesn't goof up on bad data: the account ID should be a sequence of ten digits; the name should be a string; the balance a real number (greater than zero), and the Debit/Credit field should be either `Dr' or `Cr', and nothing else. In fact, we could really just use `d' and `c', but Dr and Cr are time-honoured abbreviations of the words. Turns out the way to do is is through another XML file, a schema definition file.

More about schemas at MSDN's Advanced XML Support in Word and the W3C's XML Schema Primer.


Schema generator at XSD Inference Demo.

More tools, including one that validates your XML file against its schema, at XML Tools.


The schema definition for our account listing should be something like:

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://yawar.blogspot.com"
xmlns:al="http://yawar.blogspot.com"
elementFormDefault="qualified">
<xsd:element name="accountlist" />
<xsd:complexType>
<xsd:sequence>
<xsd:element name="account" minOccurs="1" maxOccurs="unbounded">
<xsd:complexType>
<xsd:all>
<xsd:element name="accid">
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:pattern value="[0-9]{10}" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="holdername" type="xsd:string" />
<xsd:element name="balance">
<xsd:simpleType>
<xsd:restriction base="xsd:decimal">
<xsd:minInclusive value="0" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="drcr">
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:pattern value="[DC]r" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
</xsd:all>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>

Yes, it looks rather daunting, but it's not that hard; I whipped this schema up myself browsing through W3Schools' Schema tutorials.

The last piece of the puzzle is, how do we tell Word how to format and display our nice XML file? The answer is the standardised XML Stylesheet Language, XSL. Yet another piece of XML coding, this file instructs Word on how to create a Word XML document on-the-fly from the XML data file that you have (the accounts listing file). Let me try a whimsical explanation here. Imagine the stylesheet file is talking to Word, giving running instructions as the input file is being processed.

`Started reading the document? OK, write the heading, ``ACCOUNT LISTING''. Format it with the ``Heading 1'' style. Now leave a blank line and start a four-column table, with column headers ``Account ID'', ``Holder Name'', ``Balance'' and ``Debit/Credit''.

`Now for each <account>, create a new table row, and: put the contents of the <accid> in the first column; the contents of the <holdername> in the second column; <balance> in the third; and <drcr> in the fourth. Oh, and sort the table rows by account holder name.'

And remember, this is a Word document that is being created -- not an HTML file. Yeah, you can do all that with XSL inside Word!

As an aside, for someone like me, who cut his teeth on LaTeX and then a little bit of DocBook (SGML and XML) with PassiveTeX, Jade, Apache FOP, you name it, Word's new XML capabilities just blow me away. It looks like Word has become the powerful XML processing and transformation engine that documentation writers have always dreamed of.


Soon, I'll post the stylesheet file I've created to do the transformation, and hopefully graphical comparisons of the different views of the same XML document.