Cody Boisclair, November–December 2007
The Tokenizer is, as its name suggests, a tool for splitting a string of English text into its component tokens. It is based on a public-domain sed script developed by Robert MacIntyre of the University of Pennsylvania for use in the development of the Penn Treebank (link to original script), with modifications of my own (Cody Boisclair) that tokenize the following more accurately and consistently:
It exists in the form of a class library (DLL) known as
Tokenizer.dll, which consists of the following classes and enumeration, all in the
Tokenizer: The static class which performs the actual tokenization. The
Tokenize(string) method is used to tokenize a string, and returns a list of
Token objects (whose structure is described below). There are also five boolean static properties defining certain aspects of tokenization:
ParsedBrackets: If true, the Tokenize method converts brackets to the form used by taggers such as MXPOST (e.g., "-LRB-" and "-RRB-"); if false, it leaves them verbatim. False by default. This was an option in the original Penn Treebank script, which I have retained for compatibility.
ExpandWhats: If true, expands the words "whaddya" and "whatcha" into separate tokens for "what", "do" and "you"; if false, leaves them as single words. False by default. Like the previous property, this was an option in the original Penn script.
PreserveCase: If true, the Tokenize method leaves tokens in their original case; if false, lowercases them. True by default.
TokenizeOnlyEndPeriod: If true, periods are only tokenized if they occur at the end of a string, as in the Penn tokenizer. This is useful if a document has already been split into its component sentences, as it produces more accurate results in that case. False by default.
PeriodInAbbreviation: If true, leaves the period at the end of certain known abbreviations as part of the same token, rather than splitting it off as a separate token, when TokenizeOnlyEndPeriod is false. True by default.
An additional property named Version returns a System.Version object identifying the version number of the class library.
Token: Represents a single token, with properties for both its string representation (
Content) and the category into which it fits (
TokenType: An enumeration defining the types used in the
Type property of the
Token class. Currently supported types are
Punct, all of whose names should be fairly self-explanatory.
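The interaction of the boolean options described above can be easier to see in code. The following Python sketch is an illustration only, not the library's implementation or API: `sketch_tokenize`, its keyword parameters, and the `KNOWN_ABBREVIATIONS` set are hypothetical stand-ins for the properties, and the splitting logic is deliberately simplified (whitespace, brackets, and trailing periods only). The bracket codes are the standard Penn Treebank ones.

```python
import re

# Penn Treebank bracket codes, as consumed by taggers such as MXPOST.
PTB_BRACKETS = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}

# Hypothetical abbreviation list; the library's real list is not documented here.
KNOWN_ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc."}


def sketch_tokenize(text, parsed_brackets=False, preserve_case=True,
                    only_end_period=False, period_in_abbreviation=True):
    """Simplified stand-in showing how the five options might interact."""
    # Split on whitespace, keeping each bracket as its own piece.
    pieces = re.findall(r"[()\[\]{}]|[^\s()\[\]{}]+", text)
    tokens = []
    for index, piece in enumerate(pieces):
        if not preserve_case:
            piece = piece.lower()
        if piece in PTB_BRACKETS:
            tokens.append(PTB_BRACKETS[piece] if parsed_brackets else piece)
        elif len(piece) > 1 and piece.endswith("."):
            if only_end_period:
                # Only a period closing the whole string is split off.
                split_period = index == len(pieces) - 1
            else:
                is_abbreviation = piece.lower() in KNOWN_ABBREVIATIONS
                split_period = not (period_in_abbreviation and is_abbreviation)
            if split_period:
                tokens.extend([piece[:-1], "."])
            else:
                tokens.append(piece)
        else:
            tokens.append(piece)
    return tokens


print(sketch_tokenize("Dr. Smith (the vet) left."))
print(sketch_tokenize("Dr. Smith (the vet) left.",
                      parsed_brackets=True, preserve_case=False))
```

With the defaults, the abbreviation keeps its period while the sentence-final period becomes its own token; the second call additionally lowercases the words and rewrites the brackets.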
An additional method,
TokenizeToStringList(string), has been added in version 1.1. This method behaves just like
Tokenize(string), except that the output is a list of strings, with each string representing the lexeme for a token, rather than a list of
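The relationship between the two methods can be pictured as follows. This Python sketch is illustrative only: the `Token` and `TokenType` stand-ins mirror the description above, the `WORD` member is a hypothetical placeholder (only `Punct` is named in this document), and `to_string_list` is not the library's code.

```python
from dataclasses import dataclass
from enum import Enum


class TokenType(Enum):
    WORD = "Word"    # hypothetical placeholder member
    PUNCT = "Punct"  # the one type named in this document


@dataclass
class Token:
    content: str     # the token's string representation (its lexeme)
    type: TokenType  # the category into which the token fits


def to_string_list(tokens):
    """Keep only each token's lexeme, discarding its type."""
    return [token.content for token in tokens]


tokens = [
    Token("lazy", TokenType.WORD),
    Token("dogs", TokenType.WORD),
    Token(".", TokenType.PUNCT),
]
print(to_string_list(tokens))
```

The string-list form is convenient when only the lexemes matter, for example when passing the output straight to a tagger.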
I have also included a basic command-line tool, known as
TokenizerTest, which uses the Tokenizer library to tokenize either a string or a file specified as arguments to the command and displays the results with one token per line. A compiled version of the program may be found in
Examples of how to use the program:
TokenizerTest "The quick brown fox jumps over 1,234 lazy dogs."
TokenizerTest -f \foo\bar\baz.txt
It's a very primitive interface right now, since most of the work so far has gone into the class library itself. I may release a graphical interface for the Tokenizer library as an additional download when I have the time.
This is release 1.1 of the Tokenizer class; the build number for the compiled assembly included in this package is 1.1.3002.643.
If you have any comments, suggestions, or potential improvements, don't hesitate to drop me a line at firstname.lastname@example.org.