NLPUtils Tokenizer

Cody Boisclair, November–December 2007

The Tokenizer is, as its name suggests, a tool for splitting a string of English text into its component tokens. It is based on a public-domain sed script developed by Robert MacIntyre of the University of Pennsylvania for use in the development of the Penn Treebank (link to original script). I (Cody Boisclair) have modified it so that it tokenizes the following more accurately and consistently:

The Tokenizer is distributed as a class library (DLL), Tokenizer.dll, which consists of the following classes and enumeration, all in the NLPUtils namespace:

An additional method, TokenizeToStringList(string), was added in version 1.1. It behaves just like Tokenize(string), except that it returns a list of strings, each holding the lexeme of one token, rather than a list of Token objects.
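
The relationship between the two methods can be sketched as follows. This is a Python illustration, not the .NET library itself: the Token fields (lexeme, start), the function names, and the regular expression are all assumptions made for the example, and the pattern is a deliberately simplified stand-in for the Penn Treebank rules that the library actually applies.

```python
import re
from dataclasses import dataclass

# Simplified stand-in pattern (an assumption, NOT the library's rules):
# comma-grouped numbers first, then runs of word characters,
# then any single non-space punctuation character.
_PATTERN = re.compile(r"\d+(?:,\d{3})*|\w+|[^\w\s]")


@dataclass
class Token:
    """Hypothetical token record: the matched text and its start offset."""
    lexeme: str
    start: int


def tokenize(text):
    """Return a list of Token objects, one per match of the stand-in pattern."""
    return [Token(m.group(), m.start()) for m in _PATTERN.finditer(text)]


def tokenize_to_string_list(text):
    """Same tokenization as tokenize(), but keep only each token's lexeme."""
    return [token.lexeme for token in tokenize(text)]


if __name__ == "__main__":
    sentence = "The quick brown fox jumps over 1,234 lazy dogs."
    # Print one lexeme per line.
    for lexeme in tokenize_to_string_list(sentence):
        print(lexeme)
```

The point of the sketch is only that the string-list variant is a projection of the token list onto its lexemes, so both always agree on where the token boundaries fall.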

I have also included a basic command-line tool, TokenizerTest, which uses the Tokenizer library to tokenize either a string or a file given as a command-line argument and displays the results with one token per line. A compiled version of the program may be found in NLPUtils\TokenizerTest\bin\Release.

Examples of how to use the program:

TokenizerTest "The quick brown fox jumps over 1,234 lazy dogs."
Tokenizes the sentence "The quick brown fox jumps over 1,234 lazy dogs." and displays the results.
TokenizerTest -f \foo\bar\baz.txt
Tokenizes the contents of the file \foo\bar\baz.txt and displays the results.

The interface is very primitive right now, as the bulk of the work has gone into the class library itself. I may release a graphical interface for the Tokenizer library as an additional download when I have the time.

This is release 1.1 of the Tokenizer class library; the build number for the compiled assembly included in this package is 1.1.3002.643.

If you have any comments, suggestions, or potential improvements, don't hesitate to drop me a line at