NGram is a framework for textual distance analysis based on n-gram distribution fingerprinting.
It is used as a research prototype to investigate how to mine equivalences between table columns.
What is this?
NGram is a framework for textual distance analysis based on n-gram distribution fingerprinting as outlined in the paper "N-Gram-Based Text Categorization" by Cavnar and Trenkle.
Although the original method was conceived to classify text by language, this software implements a general-purpose method for evaluating the distance between two strings, with a complexity that is linear in the size of the strings.
How does it work?
Given two strings A and B, we first scan each string and collect all of its character n-grams of length 1 to 5. We then compute the frequency distribution of those n-grams and chop off its tail at a configurable length. The distance between the two strings is then evaluated solely from the distance between the two n-gram profiles (which can be considered 'fingerprints' of the two strings).
A benefit of this approach is that the distance operates only on the fingerprints and is therefore much faster than evaluating, say, the Levenshtein distance between the two strings: Levenshtein is O(n*m) in the lengths n and m of the strings, whereas the fingerprint approach is O(n+m).
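The steps above can be sketched in a few lines of Python. This is only an illustration, not the NGram code itself: the function names, the default profile size of 300, and the use of the "out-of-place" rank distance from the Cavnar and Trenkle paper are all assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text, max_n=5, profile_size=300):
    """Return the most frequent character n-grams of `text`, most frequent first.

    `max_n` and `profile_size` are illustrative defaults, not NGram settings.
    """
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    # Chop off the tail of the distribution at the configured length.
    return ranked[:profile_size]


def profile_distance(profile_a, profile_b):
    """Out-of-place distance (assumed here): sum of rank differences.

    An n-gram of `profile_a` that is missing from `profile_b` gets a fixed
    maximum penalty. Note that this measure is not symmetric.
    """
    rank_b = {gram: rank for rank, gram in enumerate(profile_b)}
    max_penalty = max(len(profile_a), len(profile_b))
    return sum(
        abs(rank - rank_b[gram]) if gram in rank_b else max_penalty
        for rank, gram in enumerate(profile_a)
    )


def string_distance(a, b, max_n=5, profile_size=300):
    """Distance between two strings, computed only from their fingerprints."""
    return profile_distance(
        ngram_profile(a, max_n, profile_size),
        ngram_profile(b, max_n, profile_size),
    )
```

Identical strings yield identical profiles and therefore a distance of 0, while strings that share no n-grams receive the maximum penalty for every profile entry.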
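For comparison, here is a minimal sketch of the classic dynamic-programming Levenshtein distance. It is not part of NGram; it is included only to show where the O(n*m) cost comes from, namely the nested loop over every pair of character positions in the two strings.

```python
def levenshtein(a, b):
    """Edit distance between `a` and `b` using the standard two-row DP table."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]
```

The inner loop runs once for every pair of positions, so doubling the length of both strings roughly quadruples the work, whereas building an n-gram profile only requires a single scan of each string.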
How do I use it?
For now there is nothing you can use, but watch this space.
Licensing and legal issues
NGram is open source software and is licensed under the BSD license.
Contributing
NGram is open source software, built around the spirit of open participation and collaboration. There are several ways you can help:
- Blog about NGram
- Edit, fix or otherwise contribute content on the wiki
- Subscribe to our mailing lists to show your interest and give us feedback
- Report problems and ask for new features through our issue tracking system (but take a look at our todo list first)
- Send us patches or fixes to the code
Credits
Although the code has been completely rewritten, this software wouldn't exist without the inspiration drawn from the TCatNG project by Bruno Martins, to whom we are deeply thankful.
This software was created by the SIMILE project and in particular:

