(C) 2019 Masaryk University -- Tomáš Pitner, Luděk Bártek, Adam Rambousek
Logical and physical structure
Concepts: node, element, attribute, processing instruction, text node, comment
Fundamental requirement for every XML document: it must be well-formed:
It contains a prolog (heading) and exactly one root element. Processing instructions and comments (Misc) may appear before and after the root element.
It meets all the well-formedness constraints given in the specification.
Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.
Higher-level requirement for an XML document: it can be valid.
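The well-formedness requirement can be tried out with any non-validating parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are made up for illustration. It checks well-formedness only, not validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Prolog, comments (Misc) around the root, and exactly one root element.
good = "<?xml version='1.0'?><!-- misc --><doc><p>ok</p></doc><!-- misc -->"
two_roots = "<a/><b/>"          # more than one root element
bad_nesting = "<a><b></a></b>"  # tags closed in the wrong order

print(is_well_formed(good), is_well_formed(two_roots), is_well_formed(bad_nesting))
```

A validating parser would be needed to check the higher-level requirement (validity against a DTD or schema); the standard-library parser used here does not do that.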
In the structure of an XML document we distinguish between:
physical and
logical structure.
Application programmers are usually interested only in the logical structure,
while the physical structure may also matter to content authors, XML editors and processors.
Logically, a document is divided into elements (one of them is the root), their attributes, text nodes in elements, processing instructions, notations and comments.
Physically, one logical document may be stored in one or more entities; there is always at least the document entity.
A node is a generic type; we further distinguish:
element (sometimes incorrectly called a "tag"; a tag is just the opening or closing markup, not the whole element)
attribute (always attached to some element)
text node (the text between some markup)
processing instruction (carries neither text content nor attributes; it serves processing purposes only)
comments (usually targeted to human readers)
Czech terms: uzel (node), element (element), atribut (attribute), textový uzel (text node), instrukce pro zpracování (processing instruction), komentář (comment).
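The node kinds listed above can be made visible by walking a parsed document. This is a sketch only, using Python's standard `xml.dom.minidom`; the sample document, the `app-hint` processing instruction and the helper `walk` are invented for illustration.

```python
from xml.dom import minidom, Node

SOURCE = (
    '<?app-hint mode="demo"?>'
    '<body background="yellow">'
    '<!-- a comment -->'
    '<p>text node</p>'
    '</body>'
)

KIND = {
    Node.TEXT_NODE: "text node",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}


def walk(node, found):
    """Collect (kind, name-or-data) pairs in document order."""
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            found.append(("element", child.tagName))
            # Attributes are attached to their element; they are not children.
            for name, value in child.attributes.items():
                found.append(("attribute", f"{name}={value}"))
            walk(child, found)
        elif child.nodeType in KIND:
            found.append((KIND[child.nodeType], child.data.strip()))
    return found


nodes = walk(minidom.parseString(SOURCE), [])
for kind, detail in nodes:
    print(f"{kind:24} {detail}")
```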
Elements are objects delimited by start- and end-tags, for example:
<body background="yellow">
<h1>text node: content of element h1</h1>
<p>text node: content of element p</p>
</body>
If an element is empty (it has neither child elements nor text content), we can write just an empty-element tag, e.g.:
<tagname tagattribute1 tagattribute2 ... />
such as
<hr width='507'/>
or equivalently (from the logical viewpoint):
<hr width='507'></hr>
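That the two notations are logically equivalent can be checked by parsing both and comparing what an application gets. A small sketch with Python's standard `xml.etree.ElementTree`; the helper `logical_view` is a made-up name.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<hr width='507'/>")
long_form = ET.fromstring("<hr width='507'></hr>")


def logical_view(element):
    """Reduce an element to the parts an application sees:
    name, attributes, text content, number of child elements."""
    return (element.tag, dict(element.attrib), element.text, len(element))


same = logical_view(short_form) == logical_view(long_form)
print(same)
```

The difference between the two spellings is purely physical; after parsing, nothing in the tree records which one was used.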
Attributes are always placed in the start-tag of an element, e.g. <hr width='507'>
The physical order of attributes within a start-tag is NOT significant and generally is NOT considered.
Attributes are thus simply "attached" to elements and carry additional information about them, e.g. an ID, the required formatting (style) in the case of (X)HTML, or links to other elements.
Conceptually, we could replace all attributes with elements, but we keep attributes to maintain readability.
The attribute content should NOT be further structured.
As far as the XML standards are concerned, an attribute value has no internal structure. An application may interpret it otherwise, but this is generally not recommended; the same holds for attributes in the relational data model.
An attribute is composed of its name and value.
Attributes are written in the start-tag, or in the empty-element tag if the element is empty.
The attribute value is always enclosed in single quotes (') or double quotes (") and separated from the attribute name by =.
Writing width="750" or width = "750" or width='750' means exactly the same.
The same rules hold for attribute names as for element names.
One element can never have two or more attributes with the same name.
If namespaces are used, two attributes with the same local name belonging to the same namespace are not allowed either.
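The attribute rules above (order is irrelevant, the quoting style and whitespace around = do not matter, duplicate names are forbidden) can be observed with a parser. A minimal sketch using Python's standard `xml.etree.ElementTree`; the helper names `attrs` and `rejects_duplicate` and the `align` attribute are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def attrs(markup):
    """Parse one element and return its attributes as a plain dict."""
    return dict(ET.fromstring(markup).attrib)


# Attribute order inside the start-tag does not matter.
order_irrelevant = (
    attrs('<hr width="750" align="left"/>')
    == attrs('<hr align="left" width="750"/>')
)

# The three spellings from the text denote the same attribute.
spellings = [
    attrs('<hr width="750"/>'),
    attrs('<hr width = "750"/>'),
    attrs("<hr width='750'/>"),
]


def rejects_duplicate():
    """A repeated attribute name violates well-formedness."""
    try:
        ET.fromstring('<hr width="750" width="507"/>')
    except ET.ParseError:
        return True
    return False


print(order_irrelevant, spellings, rejects_duplicate())
```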
<table border='1'>
<tr>
<td>jedna</td>
<td>dve</td>
</tr>
<tr>
<td>tri</td>
<td>ctyri</td>
</tr>
</table>
Text nodes carry textual information (textual content).
E.g. in the next sample, the text node is the text ahoj! and not the whole element em:
<em>ahoj!</em>
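The distinction between the element and its text node shows up directly in a DOM tree. A short sketch with Python's standard `xml.dom.minidom`; the variable names are illustrative.

```python
from xml.dom import minidom, Node

em = minidom.parseString("<em>ahoj!</em>").documentElement
text = em.firstChild  # the only child of <em> is a text node

print(em.nodeType == Node.ELEMENT_NODE, em.tagName)
print(text.nodeType == Node.TEXT_NODE, text.data)
```

The element `em` and the text node are two separate nodes: the element is the parent, and the text is its child.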
Processing instructions are written using <?target content?> markup.
They inform an application about the expected processing or setting.
They do not carry content.
<?xml-stylesheet href="mystyle.xsl" type="text/xsl"?>
Here href is not an attribute; processing instructions do not contain attributes. Everything after the target is plain content that the application interprets itself.
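A parser hands a processing instruction to the application as a target plus one uninterpreted string. The sketch below uses Python's standard `xml.dom.minidom` and the conventional `xml-stylesheet` target; the surrounding `doc` element is made up for illustration.

```python
from xml.dom import minidom, Node

doc = minidom.parseString('<?xml-stylesheet href="mystyle.xsl"?><doc/>')
pi = doc.firstChild  # the processing instruction precedes the root element

print(pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE)
print(pi.target)      # the PI target
print(pi.data)        # everything after the target, as one string
print(pi.attributes)  # no attribute map exists for this node type
```

Splitting `href="mystyle.xsl"` into a name and a value is left to the application; the parser does not do it.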
A notation is declared using <!NOTATION name declaration> markup.
It is mostly used to describe binary (non-XML) entities, e.g. GIF or PNG images.
It declares how the binary data should be processed.
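As an illustration, a notation declared in the internal DTD subset can be read back from the document type node. This is a sketch under assumptions: the notation name `gif` and the system identifier `image/gif` are invented, and the code relies on Python's standard `xml.dom.minidom` exposing declared notations through `doctype.notations`.

```python
from xml.dom import minidom

SOURCE = """<!DOCTYPE doc [
  <!NOTATION gif SYSTEM "image/gif">
]>
<doc/>"""

doctype = minidom.parseString(SOURCE).doctype
notation = doctype.notations.item(0)  # first declared notation

print(notation.nodeName, notation.systemId, notation.publicId)
```

The parser only records the declaration; deciding what to do with data in that notation is up to the application.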
An entity is the basic unit of physical document composition.
It corresponds to a character, a string, or a whole file.
Parsers should process the entities so that the applications do not know about them.
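That the application does not see entities can be shown with an internal entity: the parser expands the reference before handing over the text. A minimal sketch with Python's standard `xml.etree.ElementTree`; the entity name `greeting` and its value are hypothetical.

```python
import xml.etree.ElementTree as ET

SOURCE = """<!DOCTYPE doc [
  <!ENTITY greeting "hello">
]>
<doc>&greeting;, &lt;world&gt;</doc>"""

doc = ET.fromstring(SOURCE)

# Both the declared entity and the predefined ones (&lt;, &gt;)
# are already replaced in the text the application receives.
print(doc.text)
```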
We distinguish:
the document as a whole, which is the parent of the root element and may also contain PIs, notations, DOCTYPE etc., and
the root element, which is the core part of an XML document; every file has exactly one.
Continued in the next chapter, XML family standards.
Comments
As in HTML, a comment is enclosed in <!--content--> markup.
The comment content is only the content, NOT the whole comment including the markup.
Comments are usually not important for processing, but this may depend on the application; e.g. Server Side Includes (SSI) use comments.
Parsers therefore should be able to forward comments to the applications.
SAX 1 parsers do not report comments; SAX2 parsers can report them through the LexicalHandler extension interface (in Java, the package org.xml.sax.ext).
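For comparison, a DOM parser keeps comments in the tree, so an application can read them if it wants to. The sketch below uses Python's standard `xml.dom.minidom` rather than SAX, and the `doc` element is made up for illustration.

```python
from xml.dom import minidom, Node

doc = minidom.parseString("<doc><!--content--></doc>")
comment = doc.documentElement.firstChild

print(comment.nodeType == Node.COMMENT_NODE)
print(comment.data)  # the comment content, without the <!-- --> markup
```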