Bulldozer: Internal Structure of the HTML File


Background on ADT trees

The internal representation used by Bulldozer is a tree, treated as an Abstract Data Type (ADT). A tree is structured like the familiar family tree, with each node equivalent to a person in the family tree. For our purposes, assume that the nodes of our tree are asexual (a node may have a child by itself). A node directly below another node is its child, the node directly above it is its parent, and any node on the path from it up to the root (parent, grandparent, and so on) is an ancestor. Any given node may have any number of children but only one parent. The root node is the only node in the tree without a parent. A tree may look like the following:
                                 one
                              /   |   \
                             /    |    \
                           two  five  six
                           /\
                          /  \
                       three four
In this diagram the following are true:
one is the root node.
two is a child of one.
one is the parent of two.
three and four are said to be leaf nodes. A leaf node is a node without children.
two and six are said to be siblings of five as they share the same parent.
two is the left sibling of five and six is the right sibling.
two is the leftmost child of one and six is the rightmost child of one.
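
The relationships listed above can be checked with a small sketch. This is only an illustration in Python, not Bulldozer's actual code; the class name Node and its methods are made up for this example.

```python
class Node:
    """One node of a general tree: one parent, any number of ordered children."""

    def __init__(self, label):
        self.label = label
        self.parent = None      # only the root keeps parent == None
        self.children = []      # ordered left to right

    def add_child(self, child):
        """Attach child as the new rightmost child and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def is_leaf(self):
        """A leaf node is a node without children."""
        return not self.children

    def siblings(self):
        """All other children of this node's parent, left to right."""
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

    def left_sibling(self):
        if self.parent is None:
            return None
        i = self.parent.children.index(self)
        return self.parent.children[i - 1] if i > 0 else None

    def right_sibling(self):
        if self.parent is None:
            return None
        kids = self.parent.children
        i = kids.index(self)
        return kids[i + 1] if i + 1 < len(kids) else None


# Build the tree from the diagram above.
one = Node("one")
two = one.add_child(Node("two"))
five = one.add_child(Node("five"))
six = one.add_child(Node("six"))
three = two.add_child(Node("three"))
four = two.add_child(Node("four"))
```

With this tree, five.left_sibling() is two, five.right_sibling() is six, and three.is_leaf() is true. Keeping the children in an ordered list is what makes "leftmost" and "rightmost" meaningful.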

HTML and the ADT tree

The most natural structure for an HTML file (or a document of any SGML DTD) is the tree. An HTML file consists of many different mark types, each of which defines an attribute for the elements nested within that mark. For example, bold italics text within an HTML file would be written as follows:
<html>
<head>
<title>Test Page</title>
</head>
<body>
<b><i>This is a sample of bold italics text</i></b>
</body>
</html>
As a tree structure, the same file would look like:
                         <html>
                          / \
                         /   \
                    <head>   <body>
                       |       |
                    <title>   <b>
                       |       |
                   Test Page  <i>
                              |
                           This is a...
This is precisely the way Bulldozer represents an HTML file that it stores within its edit buffers.
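
As a rough illustration of how such a tree could be built from the sample file, the sketch below uses the HTMLParser class from Python's standard library. It is not Bulldozer's parser: the names Mark, TreeBuilder and outline are hypothetical, the "#document" node is an extra holder added for convenience, and the code assumes every mark is closed in the correct order.

```python
from html.parser import HTMLParser


class Mark:
    """A tree node holding either a mark such as <b> or a run of text."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    """Builds a Mark tree; assumes the input is correctly nested."""

    def __init__(self):
        super().__init__()
        self.root = Mark("#document")   # hypothetical holder above <html>
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # An opening mark becomes a child of the mark we are inside of.
        self.current = Mark("<%s>" % tag, self.current)

    def handle_endtag(self, tag):
        # A closing mark moves us back up to the parent.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if text:                        # skip whitespace between marks
            Mark(text, self.current)


def outline(node, depth=0):
    """Return the subtree as indented lines, one line per node."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


SAMPLE = """<html>
<head>
<title>Test Page</title>
</head>
<body>
<b><i>This is a sample of bold italics text</i></b>
</body>
</html>"""

builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()
html = builder.root.children[0]
```

Calling outline(html) lists the marks in the same parent and child arrangement as the diagram above, with the text runs as leaf nodes.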

Advantages of the tree

The advantage of using the tree format, rather than converting to another format, is that the HTML is represented in memory just as it is in the file (at least ideally, as we shall see). Each mark can therefore be addressed directly, modified, and rearranged as it would be in the resulting HTML file. The tree also allows for some rather crafty algorithms for manipulating the HTML.
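
To make "addressed directly, modified and arranged" concrete, here is a minimal sketch of one manipulation: removing a mark while keeping what is nested inside it, then writing the tree back out as HTML. The names Mark, serialize and unwrap are assumptions made for this example and do not describe Bulldozer's own routines.

```python
class Mark:
    """A mark with children, or a plain text leaf when tag is None."""

    def __init__(self, tag=None, text="", children=None):
        self.tag = tag
        self.text = text
        self.children = list(children or [])


def serialize(node):
    """Write the subtree back out as HTML text."""
    if node.tag is None:
        return node.text
    inner = "".join(serialize(c) for c in node.children)
    return "<%s>%s</%s>" % (node.tag, inner, node.tag)


def unwrap(parent, child):
    """Remove child from parent but keep child's children in its place."""
    i = parent.children.index(child)
    parent.children[i:i + 1] = child.children


italic = Mark("i", children=[Mark(text="sample")])
bold = Mark("b", children=[italic])
body = Mark("body", children=[bold])

before = serialize(body)    # <body><b><i>sample</i></b></body>
unwrap(bold, italic)        # drop the italics mark only
after = serialize(body)     # <body><b>sample</b></body>
```

Because the tree mirrors the file, the edit is a single list operation on one node, and the output stays correctly nested without any text searching.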

Disadvantages

The major disadvantage of this format is that many browsers do not use this representation and read wrongly formatted HTML files without warning. As a result, Bulldozer has to do some extra processing on wrongly formatted files, and the results are not always what the user intended. Furthermore, Bulldozer aims to output only correctly formatted files, so it must attempt to fix bad files. Better algorithms are being designed to allow fixing such files with minimal user interaction. Currently only minimal attempts are made at fixing really bad files.
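
The kind of extra processing involved can be sketched with one naive repair strategy: keep a stack of open marks, close any marks left open when an outer mark is closed, and ignore closing marks that were never opened. This is an assumption-laden illustration and not the algorithm Bulldozer uses. The names Repairer and repair are hypothetical, attributes are dropped, and marks that never take a closing tag (such as <br>) are not treated specially.

```python
from html.parser import HTMLParser


class Repairer(HTMLParser):
    """Re-emits HTML with marks closed in strict nesting order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []     # stack of marks currently open
        self.out = []           # pieces of the repaired output

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        self.out.append("<%s>" % tag)   # attributes dropped in this sketch

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return                      # stray closing mark: ignore it
        while True:
            # Close inner marks first until the requested one is closed.
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)


def repair(source):
    """Return source with every mark properly nested and closed."""
    repairer = Repairer()
    repairer.feed(source)
    repairer.close()                    # flush any buffered text
    while repairer.open_tags:           # close marks left open at the end
        repairer.out.append("</%s>" % repairer.open_tags.pop())
    return "".join(repairer.out)
```

For example, repair("<b><i>text</b></i>") returns "<b><i>text</i></b>". The guess that the inner mark should end where the outer one does is exactly the sort of decision that may not match what the user intended.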