Include an XML Declaration
http://www.articles.webtechvision.com/articles/187/1/Include-an-XML-Declaration/Page1.html
By Mike Tayler
Published on 06/4/2007
 

Although XML declarations are optional, every XML document should have one. An XML declaration helps both human users and automated software identify the document as XML. It identifies the version of XML in use, specifies the character encoding, and can even help optimize the parsing. Most importantly, it's a crucial clue that what you're reading is in fact an XML document in environments where file type information is unavailable or unreliable.



The following are all legal XML declarations.

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<?xml version="1.0" standalone="yes"?>

In general the XML declaration must be the first thing in the XML document. It cannot be preceded by any comments, processing instructions, or even white space. The only thing that may precede it is the optional byte order mark.
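The placement rule is easy to check by experiment. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (the article itself names no particular parser); the helper name parses is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses(data: bytes) -> bool:
    """Return True if the parser accepts the bytes as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


DECL = b'<?xml version="1.0" encoding="UTF-8"?>'

first = DECL + b"<doc/>"                            # declaration is the first thing
after_space = b" " + DECL + b"<doc/>"               # white space precedes it
after_comment = b"<!-- note -->" + DECL + b"<doc/>"  # a comment precedes it
after_bom = b"\xef\xbb\xbf" + DECL + b"<doc/>"      # UTF-8 byte order mark precedes it

for label, data in [
    ("first", first),
    ("after white space", after_space),
    ("after comment", after_comment),
    ("after byte order mark", after_bom),
]:
    print(label, "->", "accepted" if parses(data) else "rejected")
```

With the expat-based parser that ships with Python, the first and last documents should be accepted and the two middle ones rejected. Other parsers word their errors differently, but a conforming parser should reject the same two cases.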

The XML declaration is not a processing instruction, even though it looks like one. If you're processing an XML document through APIs like SAX, DOM, or JDOM, the methods invoked to read and write the XML declaration will not be the same methods invoked to read and write processing instructions. In many cases, including SAX2, DOM2, XOM, and JDOM, the information from the XML declaration may not be available at all. The parser will use it to determine how to read a document, but it will not report it to the client application.
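To make the distinction concrete, here is a small sketch using Python's standard xml.sax module, chosen only for illustration. The class name PICollector and the sample document are hypothetical. The handler records every processing instruction target the parser reports.

```python
import xml.sax


class PICollector(xml.sax.ContentHandler):
    """Collect the target of every processing instruction the parser reports."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def processingInstruction(self, target, data):
        self.targets.append(target)


DOC = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<?xml-stylesheet type="text/css" href="style.css"?>'
    b"<doc/>"
)

collector = PICollector()
xml.sax.parseString(DOC, collector)
print(collector.targets)
```

Only the xml-stylesheet target should show up in the list. The XML declaration is consumed by the parser and never reaches the processingInstruction callback.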

Each XML declaration has up to three attributes.

  1. version: the version of XML in use. This is almost always 1.0; XML 1.1 exists but is rarely used.

  2. encoding: the character set in which the document is written.

  3. standalone: whether or not the external DTD subset makes important contributions to the document's infoset.

Like other attributes, these may be enclosed in single or double quotes, and any amount of white space may separate them from each other. Unlike other attributes, order matters. The version attribute must always come before the encoding attribute, which must always come before the standalone declaration. The version attribute is required. The encoding attribute and standalone declaration are optional.
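These rules can be sketched with Python's standard expat binding, which happens to expose the declaration through its XmlDeclHandler callback. This is one parser's API, used here as an assumption for illustration, and the helper name read_declaration is hypothetical.

```python
from xml.parsers import expat


def read_declaration(data: bytes):
    """Return (version, encoding, standalone) from the XML declaration.

    expat passes encoding as None when it is omitted, and standalone as
    1 for "yes", 0 for "no", and -1 when it is omitted. A malformed
    declaration raises expat.ExpatError. Returns None if the document
    has no XML declaration at all.
    """
    seen = []
    parser = expat.ParserCreate()

    def on_declaration(version, encoding, standalone):
        seen.append((version, encoding, standalone))

    parser.XmlDeclHandler = on_declaration
    parser.Parse(data, True)
    return seen[0] if seen else None


# Single quotes, double quotes, and extra white space are all fine.
print(read_declaration(b"<?xml version='1.0'   encoding=\"UTF-8\"?><doc/>"))
print(read_declaration(b'<?xml version="1.0" standalone="yes"?><doc/>'))

# Order matters: encoding may not come before version.
try:
    read_declaration(b'<?xml encoding="UTF-8" version="1.0"?><doc/>')
except expat.ExpatError as err:
    print("rejected:", err)
```

The first two calls should return the declared values, while the third should be rejected because the attributes are out of order.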


The version Info

The version attribute should have the value 1.0. XML 1.1 became a W3C Recommendation in 2004, so 1.1 is also a legal value, but parser support for it is far from universal. Regardless, you should always use XML 1.0, never version 1.1. XML 1.0 is more compatible and more robust, and it offers everything all but a handful of documents need.

The encoding Declaration

The encoding attribute specifies which character set and encoding the document is written in. Sometimes this identifies an encoding of the Unicode character set such as UTF-8 and UTF-16; other times it identifies a different character set such as ISO-8859-1 or US-ASCII, which for XML's purposes serves mainly as an encoding of a subset of the full Unicode character set.

The default encoding is UTF-8 if no encoding declaration or other metadata is present. UTF-16 can also be used if the document begins with a byte order mark. However, even in cases where the document is written in the UTF-8 or UTF-16 encodings, an encoding declaration helps people reading the document recognize the encoding, so it's useful to specify it explicitly.
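The effect of the encoding declaration on a parser can be seen in a short experiment. This is a sketch using Python's standard xml.etree.ElementTree module; the sample element name and text are made up.

```python
import xml.etree.ElementTree as ET

text = "Jos\u00e9"                        # "José"; the last letter is U+00E9
latin1_bytes = text.encode("iso-8859-1")  # one byte per character: b'Jos\xe9'

# Same bytes, with and without an encoding declaration.
declared = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?><name>'
    + latin1_bytes
    + b"</name>"
)
undeclared = b"<name>" + latin1_bytes + b"</name>"

print(ET.fromstring(declared).text == text)

try:
    ET.fromstring(undeclared)
except ET.ParseError as err:
    # With no declaration the parser assumes UTF-8, and the lone 0xE9
    # byte is not a valid UTF-8 sequence.
    print("rejected:", err)
```

The declared document should decode to the original text. The undeclared one is read as UTF-8 and should fail, because the bytes are not valid in that encoding.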

Try to stick to well-known standard character sets and encodings such as ISO-8859-1, UTF-8, and UTF-16 if possible, and always use the standard names for them. Table 1-1 lists the names defined by the XML 1.0 specification. All parsers that support these character sets should recognize these names. For character encodings not defined in XML 1.0, choose a name registered with the IANA. You can find a complete list at http://www.iana.org/assignments/character-sets/. Avoid nonstandard names. In particular, watch out for Java names like 8859_1 and UTF16. Relatively few parsers not written in Java recognize these, and even some Java parsers don't recognize them by default. All parsers, including those written in Java, should recognize the IANA standard equivalents such as ISO-8859-1 and UTF-16.
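A check along these lines can be automated. The following is a rough sketch in Python, not a complete tool. The names STANDARD_NAMES, declared_encoding, and uses_standard_name are hypothetical, and the set contains only the names from Table 1-1, so IANA-registered names outside the table are flagged as well. It also assumes an ASCII-compatible encoding, so it will not find the declaration in a UTF-16 document.

```python
import re

# Names from Table 1-1, uppercased for case-insensitive comparison.
# ISO-8859-12 is skipped because the table does not list it.
STANDARD_NAMES = {
    "UTF-8", "UTF-16", "ISO-10646-UCS-2", "ISO-10646-UCS-4",
    "ISO-2022-JP", "SHIFT_JIS", "EUC-JP",
} | {f"ISO-8859-{n}" for n in range(1, 17) if n != 12}

# Matches the encoding value inside an XML declaration at the very start.
DECL_ENCODING = re.compile(rb"""^<\?xml\s[^>]*?encoding\s*=\s*(["'])([^"']+)\1""")


def declared_encoding(data: bytes):
    """Return the encoding name from the XML declaration, or None if absent."""
    match = DECL_ENCODING.match(data)
    return match.group(2).decode("ascii", errors="replace") if match else None


def uses_standard_name(data: bytes) -> bool:
    """True if no encoding is declared or the declared name is in Table 1-1."""
    name = declared_encoding(data)
    return name is None or name.upper() in STANDARD_NAMES


for doc in (
    b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>',
    b'<?xml version="1.0" encoding="8859_1"?><doc/>',
    b'<?xml version="1.0" encoding="Cp1252"?><doc/>',
):
    print(declared_encoding(doc), uses_standard_name(doc))
```

The comparison is case-insensitive, so utf-8 and UTF-8 are treated as the same name.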

For similar reasons, avoid declaring and using vendor-dependent character sets such as Cp1252 (U.S. Windows) or MacRoman. These are not as interoperable as the standard character sets across the heterogeneous set of platforms that XML supports.


The standalone Declaration

The standalone attribute has the value yes or no. If no standalone declaration is present, then no is the default.

A yes value means that no declarations in the external DTD subset affect the content of the document in any way. Specifically, the following four conditions apply.

  1. No default attribute values are specified for elements.

  2. No entities referenced in the instance document are defined.

  3. No attribute values need to be normalized.

  4. No elements contain ignorable white space.

Table 1-1. Character Set Names Defined in XML

Name             Set
---------------  ------------------------------------------------------------
UTF-8            Variable width, byte order independent Unicode
UTF-16           Two-byte Unicode with surrogate pairs
ISO-10646-UCS-2  Two-byte Unicode without surrogate pairs; plane 0 only
ISO-10646-UCS-4  Four-byte Unicode
ISO-8859-1       Latin-1; mostly compatible with the standard U.S. Windows
                 character set
ISO-8859-2       Latin-2
ISO-8859-3       Latin-3
ISO-8859-4       Latin-4
ISO-8859-5       ASCII and Cyrillic
ISO-8859-6       ASCII and Arabic
ISO-8859-7       ASCII and Greek
ISO-8859-8       ASCII and Hebrew
ISO-8859-9       Latin-5
ISO-8859-10      Latin-6
ISO-8859-11      ASCII plus Thai
ISO-8859-13      Latin-7
ISO-8859-14      Latin-8
ISO-8859-15      Latin-9, Latin-0
ISO-8859-16      Latin-10
ISO-2022-JP      A combination of ISO 646 (a slight variant of ASCII) and
                 JIS X0208 that uses escape sequences to switch between the
                 two character sets
Shift_JIS        A combination of JIS X0201:1997 and JIS X0208:1997 that uses
                 the value of the lead byte, not escape sequences, to switch
                 between the two character sets
EUC-JP           A combination of four code sets (ASCII, JIS X0208-1990, half
                 width Katakana, and JIS X0212-1990) that uses high-bit and
                 single-shift bytes, not escape sequences, to switch between
                 the character sets

If the four conditions listed above hold, the parser may choose not to read the external DTD subset, which can save a significant amount of time when the DTD is at a remote and slow web site.

A nonvalidating parser will not actually check that these conditions hold. For example, it will not report an error if an element does not have an attribute for which a default value is provided in the external DTD subset. Obviously the parser can't find mistakes that are apparent only when it reads the external DTD subset if it doesn't read the external DTD subset.
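One place where the declaration can visibly change what such a parser reports is an entity reference it cannot resolve. The sketch below uses Python's standard expat binding, which does not read the external DTD subset by default. The file name doc.dtd, the entity name copyright, and the helper name accepted are hypothetical, and the behavior shown is what this particular parser is expected to do, not a guarantee about every nonvalidating parser.

```python
from xml.parsers import expat


def accepted(data: bytes) -> bool:
    """Return True if expat parses the document without raising an error."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)
        return True
    except expat.ExpatError:
        return False


# The entity "copyright" is not declared anywhere the parser can see.
# It might be declared in doc.dtd, which the parser never fetches.
BODY = b'<!DOCTYPE doc SYSTEM "doc.dtd"><doc>&copyright;</doc>'

standalone_no = b'<?xml version="1.0" standalone="no"?>' + BODY
standalone_yes = b'<?xml version="1.0" standalone="yes"?>' + BODY

print("standalone=no :", accepted(standalone_no))
print("standalone=yes:", accepted(standalone_yes))
```

With standalone="no" the parser has to give the document the benefit of the doubt, because the entity could be defined in the DTD it did not read. With standalone="yes" the document has promised that nothing it needs lives in the external subset, so the undefined entity is an error the parser can detect without fetching anything.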

A validating parser is supposed to report a validity error if standalone has the value yes and any of these four conditions is not true.

It is always acceptable to set standalone to no, even if the document could technically stand alone. If you don't want to be bothered figuring out whether all of the above four conditions apply, just set standalone="no" (or leave it unspecified because the default is no). This is always correct.

The standalone declaration only applies to content read from the external DTD subset. It has nothing to do with other means of merging in content from remote documents such as schemas, XIncludes, XLinks, application-specific markup like the img element in XHTML, or anything else. It is strictly about the DTD.

Whatever values you pick for the version, encoding, and standalone attributes, and whether you include encoding and standalone attributes at all, you should provide an XML declaration. It only takes a few bytes and makes it much easier for both people and parsers to process your document.
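Most XML libraries can write the declaration for you. As one example, here is a minimal sketch with Python's standard xml.etree.ElementTree module (version 3.8 or later for the xml_declaration argument to tostring); the element name and text are placeholders.

```python
import xml.etree.ElementTree as ET

root = ET.Element("doc")
root.text = "Hello"

# Ask the serializer to emit an XML declaration with an explicit encoding.
with_declaration = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
print(with_declaration.decode("utf-8"))

# The output round-trips through the parser unchanged in content.
print(ET.fromstring(with_declaration).text)
```

This library happens to write the attribute values in single quotes, which, as noted earlier, is just as legal as double quotes.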