Formatting and Indenting XML Documents

Oxygen XML Editor lets you create XML documents in several different edit modes. In text mode, you as the author decide how the XML file is formatted and indented. In the other modes, and when you switch between modes, Oxygen XML Editor must decide how to format and indent the XML. Oxygen XML Editor will also format and indent your XML for you in text mode if you use one of the Format and Indent options.

A number of settings affect how Oxygen XML Editor formats and indents XML. Many of these settings have to do with how whitespace is handled.

Significant and insignificant whitespace in XML

XML documents are text files that describe complex documents. Some of the whitespace (spaces, tabs, line feeds, etc.) in the XML document belongs to the document it describes (such as the space between words in a paragraph) and some of it belongs to the XML document (such as a line break between two XML elements). Whitespace belonging to the XML file is called insignificant whitespace. The meaning of the XML would be the same if the insignificant whitespace were removed. Whitespace belonging to the document being described is called significant whitespace.

Knowing when whitespace is significant or insignificant is not always easy. For instance, a paragraph in an XML document might be laid out like this:

<p>
NO Freeman shall be taken or imprisoned, or be disseised of his Freehold, or Liberties, or
free Customs, or be outlawed, or exiled, or any other wise destroyed; nor will We not pass
upon him, nor condemn him, but by lawful judgment of his Peers, or by the <xref 
href="http://en.wikipedia.org/wiki/Law_of_the_land" format="html" scope="external">Law of the land</xref>. 
We will sell to no man, we will not deny or defer to any man either Justice or Right.
</p>

By default, a single whitespace character between words is considered significant, and all other whitespace is considered insignificant. Thus the paragraph above could be written on one line, with no spaces between the start tag and the first word or between the last word and the end tag, and it would be treated as exactly the same paragraph. Removing the insignificant whitespace in markup like this is called normalizing space.
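As a rough illustration of normalizing space, the following Python sketch collapses each run of XML whitespace to a single space and trims both ends. The function name normalize_space is hypothetical and is not part of Oxygen XML Editor; the sketch only assumes the four XML whitespace characters (space, tab, carriage return, line feed).

```python
import re

# The four characters that XML treats as whitespace.
_XML_WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def normalize_space(text: str) -> str:
    """Collapse each whitespace run to one space and trim both ends."""
    collapsed = _XML_WHITESPACE_RUN.sub(" ", text)
    return collapsed.strip(" \t\r\n")


paragraph = """
NO Freeman shall be taken or imprisoned,
   or be disseised of his Freehold
"""
print(normalize_space(paragraph))
```

The single spaces between words survive, while the line breaks and indentation that only served the layout of the source file are dropped.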

In some cases, all the spaces inside an element should be treated as significant. For example, in a code sample:

<codeblock>
class HelloWorld
{
   public static void main(String args[])
   {
      System.out.println("Hello World");
   }
}
</codeblock>

Here every whitespace character between the codeblock tags should be treated as significant.

How Oxygen XML Editor determines when whitespace is significant

When Oxygen XML Editor formats and indents an XML document, it introduces or removes insignificant whitespace to produce a layout with reasonable line lengths and elements indented to show their place in the hierarchy of the document. To correctly format and indent the XML source, Oxygen XML Editor needs to know when to treat whitespace as significant and when to treat it as insignificant. However it is not always possible to tell this from the XML source file alone. To determine what whitespace is significant, Oxygen XML Editor assigns each element in the document to one of four categories:

Ignore space

In the ignore space category, all whitespace is considered insignificant. This generally applies to content that consists only of elements nested inside other elements, with no text content.

Normalize space

In the normalize space category, a single whitespace character between character strings is considered significant and all other spaces are considered insignificant. This generally applies to elements that contain text content only. This content can be normalized by removing insignificant whitespace. Insignificant whitespace may then be added to format and indent the content.

Mixed content

In the mixed content category, a single whitespace character between text characters is considered significant and all other spaces are considered insignificant. However, two special rules apply:

  • Whitespace between two child elements embedded in the text is normalized to a single space (rather than to zero spaces as would normally be the case for a text node with only whitespace characters, or the space between elements generally).

  • The lack of whitespace between a child element embedded in the text and either adjacent text or another child element is considered significant. That is, no whitespace can be introduced here when formatting and indenting the file.

For example:

<p>The file is located in <i>HOME</i>/<i>USER</i>/hello. This is a <strong>big</strong> 

<emphasis>deal</emphasis>.
</p>

In this example, whitespace should not be introduced around the i tags as it would introduce extra significant whitespace into the document. The space between the end </strong> tag and the beginning <emphasis> tag should be normalized to a single space, not zero spaces.
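To see why the lack of whitespace around inline elements matters, the following Python sketch compares the normalized text of a compact paragraph with a version that has been indented around the i tags. The helper visible_text is hypothetical and only approximates what a reader would see; it is not how Oxygen XML Editor is implemented.

```python
import re
import xml.etree.ElementTree as ET


def visible_text(xml_source: str) -> str:
    """Concatenate all text in the element, then normalize whitespace."""
    root = ET.fromstring(xml_source)
    text = "".join(root.itertext())
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" \t\r\n")


compact = "<p>The file is located in <i>HOME</i>/<i>USER</i>/hello.</p>"

# The same markup after a naive indent that puts each inline tag on its own line.
reflowed = (
    "<p>The file is located in\n"
    "  <i>HOME</i>\n"
    "  /\n"
    "  <i>USER</i>\n"
    "  /hello.</p>"
)

print(visible_text(compact))
print(visible_text(reflowed))
```

The second result contains spaces around the path separators that were not in the original text, which is the kind of change the mixed content rules are meant to prevent.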

Preserve space

In the preserve space category, all whitespace in the element is regarded as significant. No changes are made to the spaces in elements in this category. Note, however, that child elements may be in a different category, and may be treated differently.

Attribute values are always in the preserve space category. The spaces between attributes in an element tag are always in the default space category.

Oxygen XML Editor consults several pieces of information to assign an element to one of these categories. An element is always assigned to the most restrictive category (from Ignore to Preserve) that it is assigned to by any of the sources Oxygen XML Editor consults. For instance, if the element is named on the Default space elements list (as described below) but it has an xml:space="preserve" attribute in the source file, it will be assigned to the preserve space category. If an element has the xml:space="default" attribute in the source, but is listed on the Mixed content elements list, it will be assigned to the mixed content category.
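The "most restrictive category wins" rule can be sketched in Python as taking the maximum over an ordered set of categories. The names SpaceCategory and resolve_category are hypothetical, the relative order of normalize space and mixed content is assumed from the order in which the categories are listed above, and falling back to ignore space when no source has an opinion is also an assumption.

```python
from enum import IntEnum


class SpaceCategory(IntEnum):
    """Categories ordered from least to most restrictive (assumed order)."""
    IGNORE = 0
    NORMALIZE = 1
    MIXED = 2
    PRESERVE = 3


def resolve_category(votes):
    """Return the most restrictive category proposed by any source."""
    # Assumption: with no votes at all, fall back to the least restrictive category.
    return max(votes, default=SpaceCategory.IGNORE)


# One source proposes normalize space, another proposes preserve space.
print(resolve_category([SpaceCategory.NORMALIZE, SpaceCategory.PRESERVE]).name)
# One source proposes ignore space, another proposes mixed content.
print(resolve_category([SpaceCategory.IGNORE, SpaceCategory.MIXED]).name)
```

Because each source can only raise the result, the order in which the sources are consulted does not matter in this sketch.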

To assign elements to these categories, Oxygen XML Editor consults information from the following sources:

xml:space
If the XML element contains the xml:space attribute, the element is promoted to the appropriate category based on the value of the attribute.
CSS white-space property
If the CSS stylesheet controlling the Author mode editor applies the white-space: pre setting to an element, it is promoted to the preserve space category.
CSS display property
If a text node contains only whitespace:
  • If the node has a parent element with the CSS display property set to inline then the node is promoted to the mixed content category.
  • If the left or right sibling is an element with the CSS display property set to inline then the node is promoted to the mixed content category.
  • If one of its ancestors is an element with the CSS display property set to table then the node is assigned to the ignore space category.
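Of the sources listed above, the xml:space attribute is the one that can be read directly from the document itself. The following Python sketch, using the standard xml.etree.ElementTree module, shows how a tool might look it up; the function name declared_xml_space and the sample topic are hypothetical. It only reports the attribute on the element itself and does not model how Oxygen XML Editor combines it with the other sources.

```python
import xml.etree.ElementTree as ET

# The xml prefix is always bound to this namespace, so no declaration is needed.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def declared_xml_space(element):
    """Return the element's xml:space value, or None if it has none."""
    return element.get(XML_SPACE)


doc = ET.fromstring(
    '<topic><codeblock xml:space="preserve">  x = 1\n</codeblock><p>text</p></topic>'
)

print(declared_xml_space(doc.find("codeblock")))
print(declared_xml_space(doc.find("p")))
```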

Schema aware formatting

If a schema is available for the XML document, Oxygen XML Editor can use information from the schema to promote the element to the appropriate category. For example:

  • If the schema declares an element to be of type xs:string, the element will be promoted to the preserve space category because the string built-in type has the whiteSpace facet with the value preserve.

  • If the schema declares an element to be mixed content, it will be promoted to the mixed content category.

Schema aware formatting can be turned on and off.

  • To turn it on or off for Author mode, open the Preferences dialog box and go to Editor > Edit modes > Author > Schema aware > Schema aware normalization, format and indent.

  • To turn it on or off for the Text editing mode, open the Preferences dialog box and go to Editor > Format > XML > Schema aware format and indent.

Preserve space elements list

If an element is listed in the Preserve space list in the XML formatting preferences, it is promoted to the preserve space category.

Default space elements list

If an element is listed in the Default space list in the XML formatting preferences, it is promoted to the default space category.

Mixed content elements list

If an element is listed in the Mixed content list in the XML formatting preferences, it is promoted to the mixed content category.

Element content

If an element contains mixed content, that is, a mix of text and other elements, it is promoted to the mixed content category. (Note that, in accordance with these rules, this happens even if the schema declares the element to have element-only content.)

If an element contains text content, it is promoted to the default space category.

Text node content
If a text node contains any non-whitespace characters then the text node is promoted to the normalize space category.
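The content-based rules above can be approximated with a short Python sketch that inspects only what an element actually contains. The function name content_category is hypothetical. As assumptions, the sketch labels text-only elements as normalize space (the text above calls this the default space category) and treats elements with no text and no children as ignore space; it does not take schemas, CSS, or the preference lists into account.

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"


def content_category(element):
    """Classify an element by its actual content only."""
    has_children = len(element) > 0
    # Text directly inside the element: its leading text plus each child's tail.
    pieces = [element.text] + [child.tail for child in element]
    has_text = any(piece and piece.strip(XML_WS) for piece in pieces)
    if has_children and has_text:
        return "mixed content"
    if has_text:
        return "normalize space"  # assumption: text-only content
    return "ignore space"  # assumption: child elements only, or empty


print(content_category(ET.fromstring("<ul>\n  <li>a</li>\n</ul>")))
print(content_category(ET.fromstring("<p>See <i>HOME</i>/hello</p>")))
print(content_category(ET.fromstring("<title>Hello</title>")))
```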

An exception to the rule

In general, an element can only be promoted to a more restrictive category (one that treats more whitespace as significant). However, there is one exception. In Author mode, if an element is marked as mixed content in the schema, but the actual element contains no text content, it can be demoted to the ignore space category if all of its child elements are displayed as blocks by the associated CSS (that is, they have a CSS property of display: block). For example, in some schemas, a section or a table entry can be defined as having mixed content, but in many cases they contain only block elements. In these cases, any whitespace they contain cannot be significant and they can be treated as ignore space elements. This exception can be turned on or off using the option Editor > Edit modes > Author > Schema aware.

How Oxygen XML Editor formats and indents XML

You can control how Oxygen XML Editor formats and indents XML documents. This can be particularly important if you store your XML document in a version control system, as it allows you to limit the number of trivial changes in spacing between versions of an XML document. The following settings pages control how XML documents are formatted:

When Oxygen XML Editor formats and indents XML

Oxygen XML Editor formats and indents a document, or part of it, on the following occasions: