Bulldozer: Internal Structure of the HTML File
Background on ADT trees
Bulldozer represents a file internally as an Abstract Data Type (ADT) tree.
A tree is structured like the familiar family tree, and a node is equivalent
to a person in a family tree. For our purposes, assume that the nodes of our
tree are asexual (a node may have a child by itself). The nodes directly
below a node are its children, any node above it (parent or grandparent) is
an ancestor, and its immediate ancestor is its parent. Any given node may
have any number of children but only one parent. The root node is the only
node in the tree without a parent. So a tree may look like the following:
          one
        /  |  \
       /   |   \
    two  five  six
    /\
   /  \
three  four
In this diagram the following are true:
one is the root node.
two is a child of one.
one is the parent of two.
three and four are leaf nodes. A leaf node is a node without children.
two and six are siblings of five, as they share the same parent.
two is the left sibling of five and six is the right sibling.
two is the leftmost child of one and six is the rightmost child of one.
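
The terminology above can be made concrete with a small sketch. The Node
class and its method names below are hypothetical, chosen only for
illustration; they are not taken from Bulldozer's source.

```python
class Node:
    """One node of a general tree: any number of children, at most one parent."""

    def __init__(self, label):
        self.label = label
        self.parent = None    # stays None only for the root node
        self.children = []    # ordered from leftmost to rightmost

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def is_leaf(self):
        return not self.children

    def ancestors(self):
        """Return the parent, grandparent, and so on up to the root."""
        found, node = [], self.parent
        while node is not None:
            found.append(node)
            node = node.parent
        return found

    def siblings(self):
        """Return the other children of this node's parent."""
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]

    def left_sibling(self):
        if self.parent is None:
            return None
        i = self.parent.children.index(self)
        return self.parent.children[i - 1] if i > 0 else None

    def right_sibling(self):
        if self.parent is None:
            return None
        kids = self.parent.children
        i = kids.index(self)
        return kids[i + 1] if i + 1 < len(kids) else None


# Build the tree from the diagram.
one = Node("one")
two = one.add_child(Node("two"))
five = one.add_child(Node("five"))
six = one.add_child(Node("six"))
three = two.add_child(Node("three"))
four = two.add_child(Node("four"))
```

Each statement in the list corresponds to a simple query on this structure,
for example five.left_sibling() for the left sibling of five, or
three.is_leaf() for the leaf test.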
HTML and the ADT tree
The most natural structure for an HTML file (or any file described by an SGML
DTD) is the tree. An HTML file consists of many different mark types, each of
which defines an attribute for the elements nested within that mark. For
example, bold italic text within an HTML file is written as follows:
<html>
<head>
<title>Test Page</title>
</head>
<body>
<b><i>This is a sample of bold italics text</i></b>
</body>
</html>
As a tree structure, this would look like:
        <html>
        /    \
       /      \
   <head>    <body>
     |         |
  <title>     <b>
     |         |
 Test Page    <i>
               |
          This is a...
This is precisely how Bulldozer represents the HTML files stored in its edit
buffers.
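
As a rough illustration of how such a tree could be built from the sample
file, here is a minimal sketch using the HTMLParser class from Python's
standard library. The Mark, TreeBuilder and outline names are hypothetical,
the sketch assumes well-formed input, and it makes no claim about how
Bulldozer itself parses files.

```python
from html.parser import HTMLParser


class Mark:
    """A tree node: either a mark such as <b> or a run of text."""

    def __init__(self, name, text=None):
        self.name = name        # tag name, or "#text" for a text node
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree from well-formed HTML by tracking the open marks."""

    def __init__(self):
        super().__init__()
        self.root = Mark("#document")
        self.open_marks = [self.root]   # path from the root to the current mark

    def handle_starttag(self, tag, attrs):
        node = Mark(tag)                # attributes are ignored in this sketch
        self.open_marks[-1].children.append(node)
        self.open_marks.append(node)

    def handle_endtag(self, tag):
        # Assumes well-formed input: an end tag closes the innermost open mark.
        if len(self.open_marks) > 1 and self.open_marks[-1].name == tag:
            self.open_marks.pop()

    def handle_data(self, data):
        if data.strip():                # skip whitespace between marks
            self.open_marks[-1].children.append(Mark("#text", data.strip()))


def outline(node, depth=0):
    """Return the tree as a list of lines, indented by depth."""
    label = node.text if node.name == "#text" else "<%s>" % node.name
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


SAMPLE = """<html>
<head>
<title>Test Page</title>
</head>
<body>
<b><i>This is a sample of bold italics text</i></b>
</body>
</html>"""

builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()
html_root = builder.root.children[0]
print("\n".join(outline(html_root)))
```

The printed outline has the same shape as the diagram: the html mark at the
top, head and body beneath it, and the text at the leaves.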
Advantages of the tree
The advantage of using the tree format, rather than converting to another
format, is that the HTML is represented in memory directly as it is in the
file (at least ideally, as we shall see). Therefore, each mark can be
addressed directly, modified and arranged as it would be in the resulting
HTML file. It also allows for some rather crafty algorithms for manipulating
the HTML.
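
To illustrate what addressing and rearranging marks directly can look like,
the following sketch removes one mark from the tree and writes the result
back out. The Mark class and the serialize and unwrap helpers are
hypothetical names for this example only; escaping of special characters in
text is left out.

```python
class Mark:
    """A tree node: either a mark such as <b> or a run of text."""

    def __init__(self, name, children=None, text=None):
        self.name = name                  # tag name, or "#text" for text
        self.children = children or []
        self.text = text


def serialize(node):
    """Write a subtree back out as HTML (no escaping, no attributes)."""
    if node.name == "#text":
        return node.text
    inner = "".join(serialize(child) for child in node.children)
    return "<%s>%s</%s>" % (node.name, inner, node.name)


def unwrap(parent, child):
    """Remove child from the tree, splicing its children into its place."""
    i = parent.children.index(child)
    parent.children[i:i + 1] = child.children


# The body of the sample page, built by hand.
text = Mark("#text", text="This is a sample of bold italics text")
italic = Mark("i", [text])
bold = Mark("b", [italic])
body = Mark("body", [bold])

before = serialize(body)
unwrap(bold, italic)      # drop the italics but keep the text
after = serialize(body)
```

Because the tree mirrors the file, the edit is a single list operation on
the parent mark, and the output is correctly nested by construction.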
Disadvantages
The major disadvantage of this format is that many browsers do not use this
representation and allow wrongly formatted HTML files to be read without
warning. This causes Bulldozer to have to do some extra processing on wrongly
formatted files, and the results are not always what the user intended.
Furthermore, Bulldozer aims to output only correctly formatted files, so it
attempts to fix bad files. Better algorithms are being designed to allow
fixing the files with minimal user interaction. Currently, only minimal
attempts are made at fixing really bad files.
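
To show the kind of extra processing a wrongly formatted file requires, here
is one simple repair strategy, sketched with Python's standard HTMLParser.
This is an assumption made for illustration, not Bulldozer's actual
algorithm: an end tag closes every mark still open inside it, a stray end
tag is dropped, and marks left open at the end of the file are closed. The
Repairer and repair names are hypothetical, and attributes and marks without
end tags (such as <br>) are not handled.

```python
from html.parser import HTMLParser


class Repairer(HTMLParser):
    """Re-emits HTML so that every mark is closed in properly nested order."""

    def __init__(self):
        super().__init__()
        self.out = []           # pieces of the repaired document
        self.open_marks = []    # names of marks opened but not yet closed

    def handle_starttag(self, tag, attrs):
        self.open_marks.append(tag)
        self.out.append("<%s>" % tag)    # attributes omitted in this sketch

    def handle_endtag(self, tag):
        if tag not in self.open_marks:
            return                       # stray end tag: drop it
        # Close every mark opened inside tag, innermost first, then tag itself.
        while True:
            name = self.open_marks.pop()
            self.out.append("</%s>" % name)
            if name == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def finish(self):
        self.close()                     # flush any buffered text
        while self.open_marks:           # close marks left open at the end
            self.out.append("</%s>" % self.open_marks.pop())
        return "".join(self.out)


def repair(html):
    repairer = Repairer()
    repairer.feed(html)
    return repairer.finish()


fixed = repair("<b><i>text</b></i>")     # the end tags are in the wrong order
```

A strategy like this always yields a well-nested file, but it guesses where
the author meant a mark to end, which is one reason the results are not
always what the user intended.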