METHOD AND SYSTEM FOR ENCODING A
MARK-UP LANGUAGE DOCUMENT
The present invention relates generally to data compression techniques and, more particularly, to techniques for compactly encoding mark-up language documents.
BACKGROUND OF THE INVENTION
Mark-up languages, such as Hypertext Mark-up Language (HTML) and Extensible Mark-up Language (XML), have been in widespread use for the past several years. Mark-up languages allow software developers to create documents that include a variety of data items, such as text, logos, pictures, and sounds, which can then be rendered by various types of programs, such as web browsers. Mark-up languages use special notations, referred to as tags, to identify data items, and to indicate how the data items are to be processed. These tags also allow computer programs, such as parsers and web browsers, to search, sort, identify and extract data from the document. While mark-up languages make the use and interchange of data easier and more user-configurable, the addition of tags along with the data substantially increases the size of data files. This increase in file size or "bloat" can be considerable, and creates problems when data has to be transmitted quickly or stored compactly.
SUMMARY OF THE INVENTION
In accordance with the foregoing, a method and system for encoding a mark-up language document is provided. According to the invention, the structure of the mark-up language document is condensed by removing those parts of the structure that are fixed, and by expressing the variable parts of the structure in terms of which elements occur, whether elements occur, or how often certain elements occur. This may involve separating the structure of the mark-up language document from its content, and treating the structure and content differently. To encode a block of mark-up language text according to an embodiment of the invention, a template is used to determine which of the elements of the block have a fixed number of occurrences and which of the elements have a variable number of occurrences. The structure of the block is represented with a compact block of text that expresses the number of occurrences of the elements that have a variable number of occurrences, but that does not contain information regarding the elements that have a fixed number of occurrences. In various embodiments of the invention, the content of the mark-up language document is, itself, compressed by grouping similar or related data items together.
Additional features and advantages of the invention will be made apparent from the following detailed description of illustrative embodiments that proceeds with reference to the accompanying figures.
BRIEF DESCRIPTION OF THE DRAWINGS
While the appended claims set forth the features of the present invention with particularity, the invention may be best understood from the following detailed description taken in conjunction with the accompanying drawings of which:
FIG. 1 shows an example of a computer network in which the invention may be practiced;
FIG. 2 shows an example of a computer on which at least some parts of the invention may be implemented;
FIG. 3 shows an example of a procedure that may be used in an embodiment of the invention; and
FIG. 4 shows an example of a tree that is created based on a template in an embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
Prior to proceeding with a description of the various embodiments of the invention, a description of the computer and networking environment in which various embodiments of the invention may be practiced will now be provided. Although it is not required, the present invention may be implemented by program modules that are executed by a computer. Generally, program modules include routines, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The term "program" as used herein may connote a single program module or multiple program modules acting in concert. The invention may be implemented on a variety of types of computers. Accordingly, the terms "device" and "computer" as used herein include personal computers (PCs), hand-held devices, multi-processor systems, microprocessor-based programmable consumer electronics, network PCs, minicomputers, mainframe computers and the like. The invention may also be employed in distributed computing environments, where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, modules may be located in both local and remote memory storage devices.
An example of a networked environment in which the invention may be used will now be described with reference to FIG. 1. The example network includes several computers 100 communicating with one another over a network 151, represented by a cloud. Network 151 may include many well-known components, such as routers, gateways, hubs, etc. and may allow the computers 100 to communicate via wired and/or wireless media.
Referring to FIG. 2, an example of a basic configuration for a computer on which the system described herein may be implemented is shown. In its most basic configuration, the computer 100 typically includes at least one processing unit 112 and memory 114. Depending on the exact configuration and type of the computer 100, the memory 114 may be volatile (such as RAM), non-volatile (such as ROM or flash memory) or some combination of the two. This most basic configuration is illustrated in FIG. 2 by dashed line 106. The computer may also have additional features or functionality. For example, computer 100 may also include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 100. Any such computer storage media may be part of computer 100.
Computer 100 may also contain communications connections that allow the device to communicate with other devices. A communication connection is an example of a communication medium. Communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media. By way of example, and not limitation, communication media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
Computer 100 may also have input devices such as a keyboard, mouse, pen, voice input device, touch input device, etc. Output devices such as a display 118, speakers, a printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.
As used herein, the term "mark-up language" refers to any computer interpretable language that describes the structure of a document. Examples of mark-up languages include Standard Generalized Mark-up Language (SGML) and all of its variants, including Hypertext Mark-up Language (HTML), Extensible Mark-up Language (XML) and Extensible Style Sheet Language (XSL). Furthermore, the term "mark-up language document" refers to any document that contains mark-up language. Finally, the phrase "block of mark-up language" is used to indicate a mark-up language document or a portion thereof.
The invention is generally directed to a method and system for encoding a mark-up language document. According to the invention, the structure of the mark-up language document is condensed by removing those parts of the structure that are fixed, and by expressing the variable parts of the structure in terms of which elements occur, whether elements occur, or how often certain elements occur. In various embodiments of the invention, other steps may be taken to further reduce the size of the mark-up language document. These steps include one or more of: separating the data items of the document from its structure, grouping together data items with related meaning or data-type, and encoding data items in their native format.
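As an illustration of the optional step of grouping data items with related meaning, the following Python sketch collects the text of a small document by element name. It is only a sketch under stated assumptions: the function name group_data_items is hypothetical, grouping by tag name is one possible reading of "related meaning", and the sample fragment is abbreviated (it omits the "authors" element).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated sample fragment (assumption: not a complete document;
# the "authors" element is left out to keep the example short).
FRAGMENT = (
    "<books>"
    "<book><title>Wireless &amp; Networking - I</title><price>99.99</price></book>"
    "<book><title>Wireless &amp; Networking - II</title><price>59.99</price></book>"
    "</books>"
)

def group_data_items(xml_text):
    """Collect data items (element text) into one list per element name.

    Hypothetical helper: grouping by tag name stands in for
    "grouping together data items with related meaning or data-type".
    """
    groups = defaultdict(list)
    # iter() walks the tree in document order, starting at the root.
    for elem in ET.fromstring(xml_text).iter():
        text = (elem.text or "").strip()
        if text:  # skip elements that only contain other elements
            groups[elem.tag].append(text)
    return dict(groups)

grouped = group_data_items(FRAGMENT)
```

With items grouped this way, all titles sit next to one another and all prices sit next to one another, which may make each group easier to compress or to store in a native format such as a decimal.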
According to various embodiments of the invention, both the sender of the mark-up language document and the receiver of the mark-up language document possess a common template that defines a default structure for a document type. This default structure includes fixed elements, which are those elements that are always present in a fixed number of occurrences in a mark-up language document of that type. The default structure also includes variable elements, which are those elements that may or may not be present in a mark-up language document of that type, or are always present, but may have any number of occurrences.
For example, the mark-up language document shown in Table 1 describes a "books" element, which describes a set of books:
<title> Wireless & Networking - I </title>
<firstName> Peter </firstName> <lastName>
<ISBN> 0-201-37928-7 </ISBN>
<price> 99.99 </price>
<title>Wireless & Networking - II </title>
<firstName> Peter </firstName> <lastName>
<firstName> John </firstName> <lastName>
<price> 59.99 </price> </book> </books>
The document of Table 1 is written using the Extensible Mark-up Language (XML). Within the "books" element are "book" elements, each of which describes a book. Each "book" element includes a "title" element, an "authors" element, a "price" element and, optionally, an "ISBN" element. The "title" element includes the title of the book. The "authors" element describes all of the authors of the book. The "authors" element includes one or more "author" elements. Each "author" element has a "firstName" element and a "lastName" element. The "firstName" element contains the first name of the author, while the "lastName" element contains the last name of the author. Each element of the document of Table 1 is bounded by a pair of tags. For example, the first "title" element in Table 1 is <title>Wireless & Networking - I</title>. The data contained in the element is Wireless & Networking - I. The element is bounded by the tags <title> and </title>. Collectively, all of the tags of the mark-up language document of Table 1 constitute the structure of the document.
Referring to Table 2, an example of a template that may be used for the mark-up language document of Table 1 according to an embodiment of the invention will now be described. The template is formatted as an XML Document Type Definition (DTD):
<!ELEMENT book (title, authors, ISBN?, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT authors (author)+>
<!ELEMENT author (firstName, lastName)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT firstName (#PCDATA)>
<!ELEMENT lastName (#PCDATA)>
<!ELEMENT price (#PCDATA)>
As an alternative, the template may be formatted as a schema, as shown in Table 3:
<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
<xsd:element name='authors' type='Authors'/>
<xsd:element name='ISBN' type='xsd:string' minOccurs='0'/>
<xsd:element name='price' type='xsd:decimal'/> </xsd:sequence> </xsd:complexType> </xsd:element>
There are many other possible ways to format the template. For example, other mark-up languages or programming languages besides those shown herein may be used.
The meaning of some of the labels and terms of the DTD document of Table 2 will now be described. The label "#PCDATA" signifies parsed character data (plain text, for example). The symbols ?, + and * are used as follows:
"?" signifies an element that either does not appear or appears one time
"+" signifies an element that appears one or more times
"*" signifies an element that can appear any number of times or not at all
Those elements of the template of Table 2 that have one of the above symbols are variable elements, meaning that the structure of an XML document of this type may vary with respect to these elements, and do so in the way dictated by the symbols. All other elements defined in the template are fixed elements, meaning that the structure of an XML document of this type is exactly as specified in the template with respect to those elements. For example, according to the template of Table 2, each "book" element contains exactly one instance each of a "title" element and a "price" element. However, the "ISBN" element may appear once or not at all. Similarly, the "authors" element contains one or more instances of the "author" element.
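The distinction between fixed and variable elements can be sketched in Python by scanning the declarations of Table 2 for the occurrence symbols. This is a minimal sketch, not a DTD parser: the names classify_children and variable_elements are hypothetical, the regular expression handles only the flat declarations shown in Table 2, and treating a suffix on a whole group, as in (author)+, as applying to each member of that group is a simplifying assumption.

```python
import re

# Template text copied from Table 2 (DTD form).
DTD = """
<!ELEMENT book (title, authors, ISBN?, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT authors (author)+>
<!ELEMENT author (firstName, lastName)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT firstName (#PCDATA)>
<!ELEMENT lastName (#PCDATA)>
<!ELEMENT price (#PCDATA)>
"""

# Matches: <!ELEMENT name (content) optional-suffix>
DECL = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)([?+*]?)\s*>")

def classify_children(dtd_text):
    """Map each parent to {child: symbol}; '' means fixed (exactly once)."""
    result = {}
    for parent, body, group_suffix in DECL.findall(dtd_text):
        if body.strip() == "#PCDATA":
            continue  # text-only element: no child structure to classify
        children = {}
        for item in body.split(","):
            item = item.strip()
            if not item:
                continue
            own_suffix = item[-1] if item[-1] in "?+*" else ""
            name = item.rstrip("?+*")
            # Simplification: a suffix on the group applies to its members.
            children[name] = own_suffix or group_suffix
        result[parent] = children
    return result

def variable_elements(dtd_text):
    """Return sorted (parent, child, symbol) triples for variable elements."""
    return sorted(
        (parent, child, symbol)
        for parent, kids in classify_children(dtd_text).items()
        for child, symbol in kids.items()
        if symbol
    )

variables = variable_elements(DTD)
```

For the template of Table 2, only the "author" child of "authors" and the "ISBN" child of "book" come out as variable; every other child is fixed and carries no information that a receiver holding the same template would lack.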
According to an embodiment of the invention, a sender that wishes to send a mark-up language document to a receiver removes all of the tags that are associated with elements that are fixed, that is, those elements that the sender and receiver both realize will always be in the document and will always appear in a fixed number of instances. Furthermore, the sender need not send the tags associated with variable elements, but only needs to send information indicating how many instances of each variable element exist in the mark-up language document being sent. For example, if a sender and receiver each have a copy of the DTD document of Table 2, and they each understand that the mark-up language document that is to be sent conforms to the DTD document of Table 2, then the entire structure of the mark-up language document of Table 1 (that is, the tags) can be represented as
1 (integer) 1 (single bit), 2 (integer) 0 (single bit)
where the first number in each pair is the number of occurrences of the "author" element and the second number is a single bit that is set high if there is an "ISBN" element and low if there is not.
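The pair representation described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the function names encode_structure and rebuild_skeleton are hypothetical, the pairs are kept as Python tuples rather than packed into bits, and the last names in the sample are placeholders because they are not legible in Table 1.

```python
import xml.etree.ElementTree as ET

# Sample modelled on Table 1. The lastName values are placeholders
# (assumption); the ampersand is escaped so the text parses as XML.
SAMPLE = """
<books>
  <book>
    <title>Wireless &amp; Networking - I</title>
    <authors>
      <author><firstName>Peter</firstName><lastName>PLACEHOLDER-1</lastName></author>
    </authors>
    <ISBN>0-201-37928-7</ISBN>
    <price>99.99</price>
  </book>
  <book>
    <title>Wireless &amp; Networking - II</title>
    <authors>
      <author><firstName>Peter</firstName><lastName>PLACEHOLDER-1</lastName></author>
      <author><firstName>John</firstName><lastName>PLACEHOLDER-2</lastName></author>
    </authors>
    <price>59.99</price>
  </book>
</books>
"""

def encode_structure(xml_text):
    """Sender side: one (author count, ISBN bit) pair per "book" element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for book in root.findall("book"):
        author_count = len(book.findall("authors/author"))
        isbn_bit = 1 if book.find("ISBN") is not None else 0
        pairs.append((author_count, isbn_bit))
    return pairs

def rebuild_skeleton(pairs):
    """Receiver side: regenerate the tags (without data) from the pairs,
    using knowledge of the shared template of Table 2."""
    books = ET.Element("books")
    for author_count, isbn_bit in pairs:
        book = ET.SubElement(books, "book")
        ET.SubElement(book, "title")
        authors = ET.SubElement(book, "authors")
        for _ in range(author_count):
            author = ET.SubElement(authors, "author")
            ET.SubElement(author, "firstName")
            ET.SubElement(author, "lastName")
        if isbn_bit:
            ET.SubElement(book, "ISBN")
        ET.SubElement(book, "price")
    return books

pairs = encode_structure(SAMPLE)
skeleton = ET.tostring(rebuild_skeleton(pairs), encoding="unicode")
```

Here the first book yields one author and an ISBN, and the second yields two authors and no ISBN. Feeding the pairs to rebuild_skeleton gives back every tag of the document in template order, which is why the tags themselves need not be transmitted.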
According to some embodiments of the invention, the data items of a mark-up language document and the structure of the mark-up language document are separated from one another. Separating the data items from the structure allows each to be processed separately. Processing the structure (the tags, for example) involves techniques such as using a template, as described above. Processing the data items (the text
<book> <title> </title> <authors>
<firstName> </firstName> <lastName> </lastName>
<firstName> </firstName> <lastName> </lastName> </author> <author>
<firstName> </firstName> <lastName> </lastName> </author> </authors>
<price> </price> </book> </books>
Wireless & Networking - II