# NLPUtils Tokenizer

Cody Boisclair, November–December 2007

The Tokenizer is, as its name suggests, a tool for splitting a string of English text into its component tokens. It is based on a public-domain sed script written by Robert MacIntyre of the University of Pennsylvania for use in developing the Penn Treebank (link to original script), with modifications by me (Cody Boisclair) so that it tokenizes the following more accurately and consistently:

• periods that are not at the end of a sentence, except those that end certain common abbreviations (both behaviors are optional)
• numbers with decimals and commas in them, and times with colons in them
• single quotation marks at the beginning of a quotation

It exists in the form of a class library (DLL) known as Tokenizer.dll, which consists of the following classes and enumeration, all in the NLPUtils namespace:

• Tokenizer: The static class which performs the actual tokenization. The Tokenize(string) method is used to tokenize a string, and returns a List of Token objects (whose structure is described below). There are also five boolean static properties defining certain aspects of tokenization:

• ParsedBrackets: If true, the Tokenize method converts brackets to the form used by taggers such as MXPOST (e.g. "-LRB-", "-RRB-"); if false, it leaves them verbatim. False by default. This was an option in the original Penn Treebank script which I have maintained for compatibility.
• ExpandWhats: If true, expands the words "whaddya" and "whatcha" into separate tokens for "what", "do" and "you"; if false, leaves them as single words. False by default. Like the previous property, this was an option in the original Penn script.
• PreserveCase: If true, the Tokenize method leaves tokens in their original case; if false, lowercases them. True by default.
• TokenizeOnlyEndPeriod: If true, periods are only tokenized if they occur at the end of a string, as in the Penn tokenizer. This is useful if a document has already been split into its component sentences, as it produces more accurate results in that case. False by default.
• PeriodInAbbreviation: If true, the period at the end of certain known abbreviations is kept as part of the same token, rather than being split off into its own token as it otherwise would be when TokenizeOnlyEndPeriod is false. True by default.

An additional property named Version returns a System.Version object identifying the version number of the class library.

• Token: Represents a single token, with properties for both its string representation (Content) and the category into which it fits (Type).

• TokenType: An enumeration defining the types used in the Type property of the Token class. Currently supported types are Word, Number and Punct, all of whose names should be fairly self-explanatory.
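
As a rough illustration of how these classes fit together, here is a minimal C# sketch. It assumes that the project references Tokenizer.dll, that the static properties are settable, and that the list returned by Tokenize can be enumerated with foreach. The class name TokenizerDemo and the sample sentence are made up for the example.

```csharp
using System;
using NLPUtils; // namespace provided by Tokenizer.dll

class TokenizerDemo
{
    static void Main()
    {
        // Report which build of the library is loaded.
        Console.WriteLine("Tokenizer version: {0}", Tokenizer.Version);

        // The options are static, so they apply to every later Tokenize call.
        Tokenizer.PreserveCase = false;   // lowercase the tokens
        Tokenizer.ParsedBrackets = true;  // emit -LRB- / -RRB- style brackets

        string text = "The fox (a quick one) jumps over 1,234 lazy dogs.";

        // Each Token exposes its text (Content) and its category (Type).
        foreach (Token token in Tokenizer.Tokenize(text))
        {
            if (token.Type == TokenType.Number)
            {
                Console.WriteLine("number: {0}", token.Content);
            }
            else
            {
                Console.WriteLine("{0}: {1}", token.Type, token.Content);
            }
        }
    }
}
```

Because the options are static rather than per-call, code that needs different settings for different inputs would have to reset them before each call.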

An additional method, TokenizeToStringList(string), was added in version 1.1. It behaves just like Tokenize(string), except that it returns a list of strings, each string being the lexeme for one token, rather than a list of Token objects.
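
A minimal sketch of calling TokenizeToStringList on text that has already been split into sentences follows. It assumes that the method is a static member of the Tokenizer class (like Tokenize), that the project references Tokenizer.dll, and that the static properties are settable. The class name StringListDemo and the sample sentences are hypothetical.

```csharp
using System;
using NLPUtils; // namespace provided by Tokenizer.dll

class StringListDemo
{
    static void Main()
    {
        // Each input string is already a single sentence, so ask the
        // tokenizer to split off a period only at the end of the string.
        Tokenizer.TokenizeOnlyEndPeriod = true;

        string[] sentences = {
            "Dr. Smith arrived at 10:30.",
            "He paid 1,234.56 dollars for the ticket."
        };

        foreach (string sentence in sentences)
        {
            // Print one lexeme per line, as plain strings.
            foreach (string lexeme in Tokenizer.TokenizeToStringList(sentence))
            {
                Console.WriteLine(lexeme);
            }
            Console.WriteLine(); // blank line between sentences
        }
    }
}
```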

I have also included a basic command-line tool, TokenizerTest, which uses the Tokenizer library to tokenize either a string or a file given as an argument and displays the results with one token per line. A compiled version of the program may be found in NLPUtils\TokenizerTest\bin\Release.

Examples of how to use the program:

TokenizerTest "The quick brown fox jumps over 1,234 lazy dogs."
Tokenizes the sentence "The quick brown fox jumps over 1,234 lazy dogs." and displays the results.
TokenizerTest -f \foo\bar\baz.txt
Tokenizes the contents of the file \foo\bar\baz.txt and displays the results.

It's a very primitive interface right now, as the bulk of the work has been done on the actual class library itself. I may soon be releasing a graphical interface for the Tokenizer library as an additional download when I have the time.

This is release 1.1 of the Tokenizer class; the build number for the compiled assembly included in this package is 1.1.3002.643.

If you have any comments, suggestions, or potential improvements, don't hesitate to drop me a line at codemanb@uga.edu.