By Andreas Schuster
Copyright © 2007 int for(ensic){blog;}. All rights reserved.
Parsing clear text can be tedious work, especially in the case of XML, which is known for its low entropy. On the premise that XML messages are frequently re-read, converting certain language elements into "tokens" can save a significant amount of computational power and also cut down on storage requirements.
Let's start with an example. We are going to tokenize the following piece of textual XML:
<EventID>1234</EventID>
This is translated into the following sequence of tokens:
- The first angle bracket, which opens the start tag of our container element, becomes the #OpenStartElementTag# token.
- The string "EventID" is left untouched for simplicity (in fact this is turned into an application token, so it can be reused later).
- The next angle bracket closes the start tag of our element, so it becomes the #CloseStartElementTag# token.
- Then the payload follows. If the payload were itself XML, it would be tokenized, too.
- Finally there's the end tag of our container element. It becomes the #EndElementTag# token. Notice that the tag's name is implied by the preceding sequence of #OpenStartElementTag# tokens (it is the most recently opened element that is still open), so there's no need to repeat it.
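To illustrate why the end tag does not need to carry a name, here is a minimal sketch of the reverse direction: turning a token sequence back into text. It is not the actual event log parser. The `Tok` enum, the `detokenize` function and the idea of representing names and payload as plain Python strings are assumptions made purely for illustration. The reader keeps a stack of open element names and pops the top entry whenever an end-element token shows up.

```python
from enum import Enum, auto


class Tok(Enum):
    """Hypothetical symbolic tokens; the real stream uses byte values."""
    OPEN_START_ELEMENT_TAG = auto()
    CLOSE_START_ELEMENT_TAG = auto()
    END_ELEMENT_TAG = auto()


def detokenize(tokens):
    """Rebuild textual XML from a simplified token sequence.

    Assumption: each OPEN_START_ELEMENT_TAG is directly followed by the
    element name as a string, and any other string is payload.
    """
    out = []
    open_names = []  # stack of elements that have been opened but not ended
    it = iter(tokens)
    for tok in it:
        if tok is Tok.OPEN_START_ELEMENT_TAG:
            name = next(it)          # the name follows the token
            open_names.append(name)
            out.append("<" + name)
        elif tok is Tok.CLOSE_START_ELEMENT_TAG:
            out.append(">")
        elif tok is Tok.END_ELEMENT_TAG:
            # The name is not stored; it is the innermost open element.
            out.append("</" + open_names.pop() + ">")
        else:
            out.append(tok)          # payload, left untouched
    return "".join(out)


example = [
    Tok.OPEN_START_ELEMENT_TAG, "EventID",
    Tok.CLOSE_START_ELEMENT_TAG,
    "1234",
    Tok.END_ELEMENT_TAG,
]
print(detokenize(example))
```

The same stack handles nested elements, because each end-element token always closes whatever was opened last.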
While these tokens describe elements, some others mark the beginning and the end of the binary XML stream, define attributes and their values or are simply placeholders for varying data. Here is the complete list:
System Tokens
| Value | Meaning | Example |
|-------|---------|---------|
| 0x00 | EndOfBXmlStream | |
| 0x01 | OpenStartElementTag | < name > |
| 0x02 | CloseStartElementTag | < name > |
| 0x03 | CloseEmptyElementTag | < name /> |
| 0x04 | EndElementTag | </ name > |
| 0x05 | Value | attribute = "value" |
| 0x06 | Attribute | attribute = "value" |
| 0x0c | TemplateInstance | |
| 0x0d | NormalSubstitution | |
| 0x0e | OptionalSubstitution | |
| 0x0f | StartOfBXmlStream | |
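For anyone who wants to experiment, the table can be written down as a small lookup. This is only a sketch: the values come from the table above, while the Python names (`SystemToken`, `describe`) are my own invention for illustration and not part of any existing library.

```python
from enum import IntEnum


class SystemToken(IntEnum):
    """System token values as listed in the table above."""
    END_OF_BXML_STREAM = 0x00
    OPEN_START_ELEMENT_TAG = 0x01
    CLOSE_START_ELEMENT_TAG = 0x02
    CLOSE_EMPTY_ELEMENT_TAG = 0x03
    END_ELEMENT_TAG = 0x04
    VALUE = 0x05
    ATTRIBUTE = 0x06
    TEMPLATE_INSTANCE = 0x0C
    NORMAL_SUBSTITUTION = 0x0D
    OPTIONAL_SUBSTITUTION = 0x0E
    START_OF_BXML_STREAM = 0x0F


def describe(value: int) -> str:
    """Return the token name, or a marker for values missing from the table."""
    try:
        return SystemToken(value).name
    except ValueError:
        return f"unknown (0x{value:02x})"


for candidate in (0x0F, 0x01, 0x07):
    print(f"0x{candidate:02x}: {describe(candidate)}")
```

Values that are not in the table, such as 0x07, are reported as unknown rather than guessed at.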
So this is what the tokenized XML sequence will look like (again leaving out the encoding of the element's name and the contained data):
0x0f 0x01 EventID 0x02 1234 0x04 0x00
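The sequence can be reproduced with a few lines of code. The following is a deliberately simplified sketch, assuming a single element without attributes whose content is plain text. The function names are hypothetical, and the element name and payload are kept as readable strings instead of being encoded the way a real log file would encode them.

```python
import re

# Token values taken from the table of system tokens.
START_OF_BXML_STREAM = 0x0F
OPEN_START_ELEMENT_TAG = 0x01
CLOSE_START_ELEMENT_TAG = 0x02
END_ELEMENT_TAG = 0x04
END_OF_BXML_STREAM = 0x00


def tokenize_simple_element(xml: str) -> list:
    """Tokenize one attribute-free element such as <EventID>1234</EventID>."""
    match = re.fullmatch(r"<(\w+)>([^<]*)</\1>", xml.strip())
    if match is None:
        raise ValueError("only a single attribute-free element is supported")
    name, payload = match.groups()
    return [
        START_OF_BXML_STREAM,
        OPEN_START_ELEMENT_TAG, name,   # name left as text for simplicity
        CLOSE_START_ELEMENT_TAG,
        payload,                        # payload left as text as well
        END_ELEMENT_TAG,                # no name needed here
        END_OF_BXML_STREAM,
    ]


def render(tokens: list) -> str:
    """Format tokens as hex bytes and leave strings as they are."""
    return " ".join(t if isinstance(t, str) else f"0x{t:02x}" for t in tokens)


print(render(tokenize_simple_element("<EventID>1234</EventID>")))
# prints: 0x0f 0x01 EventID 0x02 1234 0x04 0x00
```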
The high nibble of a token byte contains flags. So far I've seen only the value 0x40, which indicates that at least one attribute will follow the tag. This flag is frequently seen in conjunction with the OpenStartElementTag token (0x01), which gives 0x41.
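Separating the flags from the token is a matter of two bit masks. The sketch below assumes that the low nibble always carries the token value and the high nibble the flags, which matches the observations so far. The names `split_token`, `has_attributes` and `HAS_ATTRIBUTES` are made up for this example.

```python
HAS_ATTRIBUTES = 0x40  # the only flag value observed so far


def split_token(byte: int) -> tuple:
    """Split a token byte into (flags, token)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    return byte & 0xF0, byte & 0x0F


def has_attributes(byte: int) -> bool:
    """True if the flag announcing at least one attribute is set."""
    return bool(byte & HAS_ATTRIBUTES)


flags, token = split_token(0x41)
print(f"flags=0x{flags:02x} token=0x{token:02x}")
# prints: flags=0x40 token=0x01
```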
As you can see, some values are missing from the table above. Most likely the "holes" are associated with other XML language elements such as character data (CDATA) sections, character and entity references, and processing instructions. However, I have yet to see those in real-life log files.