Using Java XML API
September 1, 2008
Author: Fazal Gupta
Introduction
Java's XML APIs are among the most widely used APIs in the Java world. Even so, I think most of you would agree that they hold enough mysteries to be a little scary, even for developers who have worked with them for years, and the cumbersome implementations have only multiplied the horrors. Whether you love them or hate them, you have to live with them and use them as effectively as possible, solving the problem at hand while spending as little effort as you can on figuring out how to parse the XML. In this article I would like to mention a few small idiosyncrasies I ran into while working with the XML APIs. I have worked with the Xerces and Xalan implementations, so my experience is based on them.
Importance of METHOD property for Transformer
Take the case of the javax.xml.transform.Transformer class. Typically it is used to convert a source XML tree into a result XML tree by applying a set of rules, usually written in XSL. It is also what one uses to convert a given DOM Node to its string representation.
Several important properties can be set on the transformer before applying the transformation, using the setOutputProperty method. The method itself only takes two String arguments, but the standard set of property names can be found in the javax.xml.transform.OutputKeys class.
One of the interesting properties in this class is METHOD. It identifies the overall method used for outputting the result tree; the three standard values are xml, html and text. I had been using this API for more than two years before I realized how important this property is. Take the XHTML below as an example.
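As a minimal sketch of how these properties are set, the following runs an XML string through an identity transformer (one created without a stylesheet) with a few OutputKeys values applied. The class name OutputPropertiesDemo and the helper reserialize are made up for illustration; only the javax.xml.transform calls come from the standard API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of any library.
final class OutputPropertiesDemo {

    /** Copies the XML through an identity transformer with a few output properties set. */
    static String reserialize(String xml) {
        try {
            // newTransformer() without a stylesheet performs an identity transformation.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Property names are plain strings; OutputKeys only supplies the standard ones.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.INDENT, "no");

            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reserialize("<?xml version=\"1.0\"?><note><to>reader</to></note>"));
    }
}
```

Because the keys are strings, a misspelled property name is only caught at run time, which is one reason to stick to the OutputKeys constants.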
<html>
<head>
<p>This is head</p>
</head>
<body>
<p>&#160;</p>
<p>HelloWorld</p>
</body>
</html>
If one serializes this DOM node into a String with the METHOD property set to anything other than html, the numeric character reference for the non-breaking space (&#160;) is either dropped or converted into some junk character (the behavior is not consistent), and the final string looks something like this:
<html>
<head>
<p>This is head</p>
</head>
<body>
<p> </p>
<p>HelloWorld</p>
</body>
</html>
All looks fine so far, but if one opens this HTML string in an HTML editor such as TinyMCE, weird question mark symbols show up in the editor, because most HTML editors expect &nbsp; for a non-breaking space.
One could work around this by replacing the non-breaking space character with &nbsp; by hand, but there is an easier way. Just set the METHOD property of the transformer to html and &#160; is automatically converted into &nbsp; in the result. The result would be as follows:
<html>
<head>
<p>This is head</p>
</head>
<body>
<p>&nbsp;</p>
<p>HelloWorld</p>
</body>
</html>
As an example, the following code sets the METHOD property:
// Transformer, TransformerFactory and OutputKeys are in javax.xml.transform
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "html");
Setting output properties on the transformer looks a bit cumbersome, since the only way is the setOutputProperty method. I am not sure whether the API designers could have offered something easier (such as dedicated setters for these properties) so that discovering them would be simpler, but it is worth investigating.
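To try the effect of the METHOD property end to end, here is a small sketch that parses an XHTML string containing &#160; and serializes the resulting DOM once with the xml method and once with the html method. OutputMethodDemo and serialize are hypothetical names, the sample markup is my own, and the exact output (including whether the html method emits &nbsp;) depends on the transformer implementation on the classpath, so compare the two printed strings rather than relying on a fixed result.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of any library.
final class OutputMethodDemo {

    // Sample input (an assumption for this sketch): one paragraph holds a non-breaking space.
    static final String XHTML =
            "<html><head><title>Demo</title></head>"
          + "<body><p>&#160;</p><p>HelloWorld</p></body></html>";

    /** Parses the markup into a DOM and serializes it with the given output method. */
    static String serialize(String xml, String method) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, method);
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalStateException("Could not serialize with method " + method, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("xml : " + serialize(XHTML, "xml"));
        System.out.println("html: " + serialize(XHTML, "html"));
    }
}
```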
Converting HTML Into a DOM
This is the complementary scenario to the one above. Sometimes one wants to read a given HTML string and convert it into a Document object. HTML uses &nbsp; to represent a non-breaking space, and if one tries to parse such a string directly into a DOM, the following error occurs:
The entity “nbsp” was referenced, but not declared.
I am not sure what the best way to handle this is, but we converted &nbsp; to &#160; before parsing. This did solve the issue, but at the cost of an expensive string replacement operation.
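The replacement approach described above can be sketched as follows. NbspParsingDemo and parseHtmlFragment are made-up names for illustration, and the sketch assumes the input is otherwise well-formed XML; it does nothing for other HTML-only entities or unclosed tags.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of any library.
final class NbspParsingDemo {

    /** Parses markup into a DOM after swapping &nbsp; for its numeric reference. */
    static Document parseHtmlFragment(String markup) {
        // XML predefines only lt, gt, amp, apos and quot, so the HTML entity
        // is replaced by the equivalent numeric character reference first.
        String xmlSafe = markup.replace("&nbsp;", "&#160;");
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xmlSafe)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseHtmlFragment("<html><body><p>&nbsp;</p></body></html>");
        String text = doc.getElementsByTagName("p").item(0).getTextContent();
        System.out.println((int) text.charAt(0)); // prints 160
    }
}
```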
Copying a node from one document to another
Again, this is a common issue that probably every developer has faced while working with the DOM API. Let's say one is traversing a DOM and a given node needs to be replaced with a node that was created by a different document. Doing this directly results in the following error:
WRONG_DOCUMENT_ERR: A node is used in a different document than the one that created it.
There are two methods on the Document interface for this: adoptNode and importNode. The difference is that the former moves the node object itself into the target document, whereas the latter creates a copy of the node owned by the target document and leaves the original untouched. The following sample uses adoptNode to bring a node over from another document and replace a given node of the target document.
Node doc1Node; // node to be replaced, owned by the first document
Node doc2Node; // replacement node, created by a second document
// Move doc2Node into the document that owns doc1Node
Node adoptedNode = doc1Node.getOwnerDocument().adoptNode(doc2Node);
doc1Node.getParentNode().replaceChild(adoptedNode, doc1Node);
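To make the difference between the two methods concrete, the sketch below builds two small documents and replaces a node in the first with a node from the second, once by copying and once by adopting. CrossDocumentCopyDemo and its helpers are hypothetical names; the DOM calls are the standard org.w3c.dom ones.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class; not part of any library.
final class CrossDocumentCopyDemo {

    /** Builds a document of the form <rootName><childName/></rootName>. */
    static Document newDocument(String rootName, String childName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element root = doc.createElement(rootName);
            root.appendChild(doc.createElement(childName));
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DocumentBuilder available", e);
        }
    }

    /** Replaces target with a deep copy of source; source stays in its own document. */
    static Node replaceWithCopy(Node target, Node source) {
        Node copy = target.getOwnerDocument().importNode(source, true);
        target.getParentNode().replaceChild(copy, target);
        return copy;
    }

    /** Replaces target with source itself; source is removed from its own document. */
    static Node replaceByAdopting(Node target, Node source) {
        Node adopted = target.getOwnerDocument().adoptNode(source);
        target.getParentNode().replaceChild(adopted, target);
        return adopted;
    }

    public static void main(String[] args) {
        Document doc1 = newDocument("root1", "old");
        Document doc2 = newDocument("root2", "fresh");
        Node source = doc2.getDocumentElement().getFirstChild();

        replaceWithCopy(doc1.getDocumentElement().getFirstChild(), source);
        System.out.println(doc2.getDocumentElement().hasChildNodes()); // true: original kept

        replaceByAdopting(doc1.getDocumentElement().getFirstChild(), source);
        System.out.println(doc2.getDocumentElement().hasChildNodes()); // false: node was moved
    }
}
```

importNode is the safer choice when the source document is still in use; adoptNode avoids the copy when the source node is no longer needed where it was.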
Issue in Using Xalan 2.7 with OC4J
This is not an API issue but a peculiar problem caused by a bug in Xalan 2.7 that prevented it from being used with OC4J. Since this article is about XML APIs, I felt it was worth mentioning briefly. The problem has already been fixed in Xalan 2.7.1, so it is now mostly of academic interest. The issue was that Xalan 2.7 bundled BCEL classes that it did not need at all. More details can be found at the following link: http://issues.apache.org/jira/browse/XALANJ-2196
Finally something for Concurrency
Again, this is widely known, but it is worth mentioning in any discussion of the Java XML APIs. Thread safety is always an important concern when using any API. For the Java XML APIs, keep in mind that DocumentBuilder and DocumentBuilderFactory are not thread safe; it is up to the developers using objects of these classes to make sure they are not used by multiple threads at the same time. While profiling our application we observed that creating DocumentBuilder objects was quite expensive. At that point we overlooked the thread safety aspect of the DocumentBuilder class and ended up paying a heavy price in a later phase of the release. As is well known, concurrency errors come in such disguises that it is hard to even recognize them. In our case we were getting a NullPointerException. That is when I learnt the hard way about the importance of thread safety. Hopefully most of you are already aware of this.
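One common way to reuse expensive DocumentBuilder objects without sharing them between threads is to keep one per thread. The sketch below is not what our application did; it is only one possible arrangement, with ThreadConfinedParser as a made-up name, and it assumes Java 8 or later for ThreadLocal.withInitial.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of any library.
final class ThreadConfinedParser {

    // Each thread lazily creates, then reuses, its own factory-made builder.
    private static final ThreadLocal<DocumentBuilder> BUILDER = ThreadLocal.withInitial(() -> {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DocumentBuilder available", e);
        }
    });

    /** Parses the XML with the calling thread's own builder. */
    static Document parse(String xml) {
        DocumentBuilder builder = BUILDER.get();
        builder.reset(); // restore the builder's original configuration before reuse
        try {
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println(
                Thread.currentThread().getName() + " parsed <"
                + parse("<greeting>hello</greeting>").getDocumentElement().getNodeName() + ">");
        Thread a = new Thread(task, "worker-a");
        Thread b = new Thread(task, "worker-b");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

In a container with pooled threads, thread-local builders live as long as the threads do, so this trades some memory for the saved construction cost.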
Entry Filed under: Basic. Tags: Java XML API Usage.
2 Comments
1.
Christopher Sahnwaldt | September 2, 2008 at 6:01 pm
To parse HTML that may not be well-formed XML, try http://htmlcleaner.sourceforge.net
2.
fazalgupta | November 6, 2008 at 6:52 am
One interesting thing I found with the html output method was an issue with self-closing tags. For a self-closing tag, generating the HTML string with the html output method leaves the tag open, breaking the well-formed structure. I am not sure if this is a reported bug in Xalan, but I am still figuring out why this happens.
Also, in the OutputKeys class, the documentation for the method property states that xhtml can be used as a method if implemented by specific providers. Following is the Javadoc:
Other non-namespaced values may be used, such as “xhtml”, but, if accepted, the handling of such values is implementation defined. If any of the method values are not accepted and are not namespace qualified, then {@link javax.xml.transform.Transformer#setOutputProperty}
or {@link javax.xml.transform.Transformer#setOutputProperties} will
throw a {@link java.lang.IllegalArgumentException}.
But when I tried this with Xalan, the xhtml output method, though not supported (checked by debugging the code), never throws an exception.
I think the people implementing XML APIs need to become more consistent to make life easier for people like us.
In this context the comment by Christopher Sahnwaldt is relevant, as we may need to make the generated HTML string well formed.