|
| Gadget |
|---|
| User Guide |
| Design Principles |
| Browse the code |
| Download |
The XML Specification describes both a syntax and a model. The syntax is based on angle brakets, while the model is basically a tree of nodes, which could be either elements, attributes or text containers. The XML Infoset Specification describes the XML data model in detail.
The 'names' of those nodes are not dictated by the XML model or syntax (XML is, in fact, a metalanguage, a language to describe languages) and, in general, there is no reason why nodes with the same name or content should be reused throughout the same XML tree, but in real life the number of different 'names' for those nodes is very small compared to the amount of data that the XML tree contains. In practice, the amount of different 'names' is a function of the different namespaces used by the XML tree and not a function of the size of the XML tree itself.
This is the basic property upon which XML inspection starts to make sense: when the complexity of the structural patterns exhibited by the XML dataset is not a function of the size of the dataset. One example for all: the web contains billions of documents, and it is possible to convert all of them into XHTML and create a huge XML dataset several TeraBytes big. But the structural complexity of that XML dataset would not be a function of its size, but a function of the schemas used, mainly XHTML.
An example of a dataset that doesn't exhibit this property would be a dataset assembled taking a single document as an example of an instance from collection of schemas, but even in that case, real life usage of XML shows that the number of schemas grows a lot slower than the amount of XML data generated.
Another interesting property that real life XML datasets exhibit is the fact that 'semantic' information (information that shows the user how to understand this) is not encoded in the node name but in the entire sequence of names that leads to that node. In brevity, in its XPath. An example of this is the use of the node "email" that has no meaning by itself, if not indicating that it contains an email address, but the path fragment "person/email" will indicate that the "email" node is contained inside the "person" node and for that reason, is associated to that.
Real life analysis shows that the number of XPaths is a function of the complexity of the schema, therefore, again, not a function of the complexity of the dataset.
XML Inspection
The process of XML inspection is based on the following assumptions:
- the data is encoded in literals, either as textual nodes inside elements or as attribute values
- the metadata is encoded in the XPaths that identify those literals
- mixed content is considered as data
Given those assumptions, which stem from the fact that big datasets are normally encoding databases and therefore mixed content is rare or used as part of descriptions, we would like to "inspect" the dataset to have an overview of how it looks like and normally this means answering the following questions:
- what xpaths were used in this dataset?
- how many times were they used?
Another property that real life XML datasets exhibit is not only the high reusability of nodes and, in general, xpaths, but also the level of reusability of the literals contained in those paths. As an example, a path like "object/@id" will probably have a different literal associated to it for each use, while a path like "song/genre" will probably exhibit a higher reusability of the literals.
How Gadget Works
The key in dealing with large quantities of data is to process them as stream and store the results, rather than attemp to store all data in memory and process it from there. Obviously, this is good in theory, but in practice one must retain state about what happened before in order, for example, to be able to identify if a literal was found previously in the document, without having to scan the document everytime.
Gadget works as a SAX event handler and consumes the SAX events generated by the XML parser. These SAX events are used to generate a minimal tree representation in memory of all the xpaths that were found.
Gadget creates a different b-tree index on disk for each different xpath that it finds, plus another summary index that keeps a list of all the xpath found and their collective value. The collection all the xpath found is used to emerge a tree which is then shown by the web application as a condensed summary of the XML dataset.
The indexer is also responsible for generating the sparklines and the distribution charts since they don't change.
How Value Clustering Works
Gadget performs a primitive yet very scalable solution to find values in the same path that might be variations of the same concept or name. This is meant to be a guide for the data maintainers to find variants that might be considered noisy and misleading to users, especially when dealing with faceted browsing.
Clustering is performed by hashing collisions: each value is pruned and normalized and a key is extracted out of the value, when two different values generate the same key, they belong to the same cluster.
This simple yet very effective algorithm allows us to:
- Reuse the existing b-tree infrastructure used to store the other persistent maps
- Scale linearely with the amount of data to be processed because each key-generation is independent of the other values
The key generation algorithm performs the following steps:
- spaces are removed from the front and back of the value
- the value is transformed into its lowercase counterpart
- all punctuation is removed
- non-identifying tokens (mrs, mr, dr, miss...) are removed
- single char tokens are removed
- isolated digits are removed
- the remaining value is tokenized by spaces then:
- the tokens ordered alphanumerically
- the tokens are appended into one string
The resulting string is the cluster key.
Note: The above algorithm works best for names and it is, by no means, supposed to be a general purpose algorithm for text clustering. We plan to spend more research and implementation effort in this space in the future.


