Balisage 2009 Pre Conference Session Proposal

Title: Efficient Scripting of XML Process

Authors: David A. Lee, Epocrates Inc.
Norman Walsh, Mark Logic Inc.

Abstract:

Much attention is given to the efficiency and performance of individual XML operations, such as parsing, internal memory representations, processing (XSLT, XQuery), and serialization formats. However, real-world use cases typically involve many operations, often composed in a “scripting” environment. The overhead of the scripting environment can often overshadow any performance gains in individual operations. This session will compare several scripting languages and techniques used to perform a set of operations taken from real-world use at Epocrates, evaluate their performance characteristics, and suggest “best practices” for scripting XML processes. The scripting languages compared will be DOS Shell (CMD.EXE), Linux Shell (bash), xmlsh, and XProc (Calabash). These will be run (where possible) on multiple operating systems: Windows XP, Linux, and Mac OS.

Details:

David Lee and Norman Walsh will collaborate on producing a set of three typical use cases taken from real-world examples. Test cases will be prepared to perform equivalent operations in the four scripting languages (DOS, bash, xmlsh, XProc), with an attempt to use “typical best practices” to code each script in a fashion appropriate to that language. Where possible, the same underlying processor implementations will be used for the individual operations (e.g., Java, Apache, Saxon). The test cases will be run in standardized environments on at least three hardware and OS platforms (Windows XP, Linux (x86), and Mac OS). Performance results will be collected, analyzed, and reported.

Where interesting, different coding variants of a given test will be shown in order to demonstrate the usefulness (or not) of minor changes to scripts. This will help establish the validity of proposed “best practices” in the different scripting environments. Results will be compared across systems more for completeness and applicability than for direct comparison, as there will be no attempt to replicate identical hardware across operating systems.

Testing and Analysis Details

Source Documents

Tests will be executed against a standard corpus of XML documents taken from the Epocrates production environment. The corpus is a set of approximately 600 XML files, each containing a “monograph” (clinical reference article) about a specific disease. The files range from 20 KB to 250 KB and total approximately 70 MB.

Tests

Three types of tests based on Epocrates production data processing will be run. These will be chosen to represent a range of typical XML processing tasks performed by scripts.

Note to committee: The exact details of the tests and measurement methodology may change slightly by the time they are complete. An updated paper will be submitted along with the results prior to the conference.

Table of contents

This test exemplifies a simple query over a large set of XML files to produce a simple output.

A) Extract titles
An XQuery is run to extract the topic title from each file and generate a consolidated XML file.

B) Format as XHTML
An XSLT is run to format the results as an XHTML page.

The result is a single HTML file.
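The two steps above can be sketched in a few lines of Python. This is only an illustration of the data flow, not one of the proposed test scripts: it swaps in Python's standard ElementTree library for the XQuery and XSLT processors, and the `title` element name and the `topics`/`topic` output vocabulary are assumptions, since the monograph schema is not given here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_titles(corpus_dir):
    """Step A: collect one <topic> entry per monograph file."""
    topics = ET.Element("topics")
    for path in sorted(Path(corpus_dir).glob("*.xml")):
        # "title" is a hypothetical element name; the real schema is not given.
        title = ET.parse(path).getroot().findtext(".//title", default="")
        topic = ET.SubElement(topics, "topic", {"file": path.name})
        topic.text = title.strip()
    return topics


def format_as_xhtml(topics):
    """Step B: turn the consolidated topic list into a single XHTML page."""
    html = ET.Element("html", {"xmlns": "http://www.w3.org/1999/xhtml"})
    body = ET.SubElement(html, "body")
    ul = ET.SubElement(body, "ul")
    for topic in topics.findall("topic"):
        ET.SubElement(ul, "li").text = topic.text
    return ET.tostring(html, encoding="unicode")
```

Calling `format_as_xhtml(extract_titles("corpus"))` would return the page as a string; `corpus` is a placeholder directory name.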

Conditional Logic

This test examines the task of combining data from XML files with non-XML data in the environment in a single script. Each monograph (XML file) contains references to image files, which may or may not exist. The test extracts the set of target image file names from each XML file and tests for their existence in the file system. The output is an XML file that lists each topic along with only those images that exist in the file system, marked up with the caption names.
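As a rough sketch of the logic this test exercises, the following Python mixes XML extraction with a file system check. It uses the standard ElementTree library in place of the processors named in this proposal, and the `<image file="..." caption="..."/>` markup and `title` element are hypothetical stand-ins for the real monograph schema.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def existing_images(corpus_dir, image_dir):
    """List each topic with only the referenced images that exist on disk."""
    result = ET.Element("topics")
    for path in sorted(Path(corpus_dir).glob("*.xml")):
        root = ET.parse(path).getroot()
        title = root.findtext(".//title", default="")
        topic = ET.SubElement(result, "topic", {"title": title})
        # <image file="..." caption="..."/> is an assumed markup, not the real schema.
        for image in root.iter("image"):
            name = image.get("file", "")
            # The non-XML part of the test: consult the file system.
            if name and (Path(image_dir) / name).is_file():
                attrs = {"file": name, "caption": image.get("caption", "")}
                ET.SubElement(topic, "image", attrs)
    return result
```

The point of interest for the comparison is the `is_file()` check inside the loop: each scripting language has to cross from XML data to the operating system and back for every image reference.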

Content Generation

This test exemplifies complex content generation over a large set of XML files. Each XML file is the source of multiple content pages. The pages are defined in a separate XML file, which specifies an XQuery script to run for each page.

For each monograph (XML file) the following process is run:

A) For each page defined in the page configuration

a. An XPath expression is evaluated to determine whether that page is available for the given XML file

b. An XQuery is run to generate an intermediate XML file with the contents of that page

c. An XSLT is run that formats the XML page into an XHTML file

B) An XQuery is run to generate an intermediate XML file with references to all generated pages for that monograph, which is used as an intro page.

C) An XSLT is run which formats the intro page into XHTML

The output is a directory of approximately 4000 XHTML files.
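The control flow of steps A through C can be sketched as follows. This is a simplified illustration under stated assumptions, not the production process: the page configuration format (`<pages><page name="..." select="..."/></pages>`) is hypothetical, a single ElementTree path stands in for both the availability XPath and the per-page XQuery, and a small Python function stands in for the XSLT steps.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def to_xhtml(title, items):
    """Stand-in for the XSLT steps: wrap text items in a minimal XHTML page."""
    html = ET.Element("html", {"xmlns": "http://www.w3.org/1999/xhtml"})
    ET.SubElement(ET.SubElement(html, "head"), "title").text = title
    body = ET.SubElement(html, "body")
    for item in items:
        ET.SubElement(body, "p").text = item
    return ET.tostring(html, encoding="unicode")


def generate(monograph_path, config_path, out_dir):
    """Run steps A-C for one monograph; return the names of the files written."""
    doc = ET.parse(monograph_path).getroot()
    stem = Path(monograph_path).stem
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Hypothetical config format: <pages><page name="..." select="..."/></pages>
    for page in ET.parse(config_path).getroot().findall("page"):
        section = doc.find(page.get("select"))  # A.a: is the page available?
        if section is None:
            continue
        texts = ["".join(section.itertext()).strip()]  # A.b: intermediate content
        name = f"{stem}-{page.get('name')}.html"
        (out / name).write_text(to_xhtml(page.get("name"), texts))  # A.c
        written.append(name)
    # B and C: an intro page that references every generated page.
    intro = f"{stem}-index.html"
    (out / intro).write_text(to_xhtml(stem, written))
    return written + [intro]
```

Running `generate` once per monograph reproduces the nested loop described above, with several processor invocations per page. That nesting is where per-invocation overhead in a scripting environment would be expected to accumulate.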

Scripts

Scripts for each of the scenarios will be written in four scripting languages, all using the same underlying processors for the core components (XPath, XQuery, XSLT).

DOS (cmd.exe)

Linux (bash)

XProc (calabash)

xmlsh

Tests

The scripts will be executed in a controlled environment on three operating systems (Windows XP, Linux FC9, Mac OS). The DOS script, obviously, can be run in only one of these environments. Time to completion will be measured for each test. Within the constraints of the environment, the scripts will be written with reasonable effort to demonstrate “best practices” for that language. Time permitting, variations of a given script may be run to examine the effect of minor changes in the usage of a given scripting language.
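One way to measure time to completion uniformly across the scripting languages is a small external harness that launches each script as a child process. The sketch below is an assumption about how such a harness might look, not the measurement methodology the authors will use. In particular, the repeated runs (`runs=3`) are an illustrative choice not specified in this proposal.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run a command several times; return the elapsed seconds for each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the script fails, so a broken run is not timed as a success.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings
```

For example, `time_command(["bash", "toc.sh"])` would time a bash variant, where `toc.sh` is a placeholder script name. Timing from outside the process includes interpreter and JVM start-up cost, which is part of what a comparison of scripting environments needs to capture.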

Results

Within each environment, performance will be compared. Comparisons across environments will be used only to indicate a general trend, if one exists (due to hardware differences they will not otherwise be directly comparable). The authors will draw on the results of these tests, along with their experience in the various scripting languages, to identify strengths and weaknesses of the various scripting approaches within each language, and to suggest “best practices” both for scripting within a language and for choosing a scripting language.
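Because absolute times are comparable only within one environment, a convenient way to look for a cross-environment trend is to normalize each environment's timings against its own fastest language. The function below is a minimal sketch of that idea. The nested-dictionary input shape is an assumption, and no measured values are implied.

```python
def relative_to_fastest(timings):
    """Map {environment: {language: seconds}} to ratios against the fastest
    language in the same environment (the fastest language gets 1.0)."""
    ratios = {}
    for env, by_lang in timings.items():
        best = min(by_lang.values())
        ratios[env] = {lang: secs / best for lang, secs in by_lang.items()}
    return ratios
```

If the same language consistently shows the lowest ratio in every environment, that would indicate a general trend even though the underlying hardware differs.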

The actual results may not be what the authors predict. There may be little difference between the scripting languages, or there may be vast differences. Either way, the results should be very interesting.

Norman Walsh Bio

Norman Walsh is a Principal Technologist in the Information & Media group at Mark Logic Corporation where he assists in the design and deployment of advanced content applications. Norm is also an active participant in a number of standards efforts worldwide: he is chair of the XML Processing Model Working Group at the W3C where he is also co-chair of the XML Core Working Group. At OASIS, he is chair of the DocBook Technical Committee.

Before joining Mark Logic, he participated in XML-related projects and standards efforts at Sun Microsystems. With more than a decade of industry experience, Mr. Walsh is well known for his work on DocBook and a wide range of open source projects. He is the principal author of DocBook: The Definitive Guide.

David Lee Bio

David Lee has over 20 years' experience in the software industry, where he has been responsible for many major projects in small and large companies including Sun Microsystems, IBM, Centura Software (formerly Gupta), Premenos, Epiphany (formerly RightPoint), and WebGain. As principal senior software engineer at Epocrates, Inc., Mr. Lee is responsible for managing data integration, storage, retrieval, and processing of clinical knowledge databases for the leading clinical information provider.

Key career contributions include real-time AIX OS extensions for optimizing transmission of real-time streaming video (IBM), secure encrypted EDI over internet email (Premenos), porting the Centura Team Desktop system to Solaris (Gupta, Centura), optimizations of large enterprise CRM systems (Epiphany), and authorship of xmlsh, an open source scripting language for XML.
