Cody Boisclair, November–December 2007
The Tokenizer is, as its name suggests, a tool for splitting a string of English text into its component tokens. It is based on a public-domain sed script developed by Robert MacIntyre of the University of Pennsylvania for use in the development of the Penn Treebank (link to original script), with modifications of my own (Cody Boisclair) so as to tokenize the following more accurately and consistently:
It exists in the form of a class library (DLL) known as Tokenizer.dll, which consists of the following classes and enumeration, all in the NLPUtils namespace:
Tokenizer: The static class that performs the actual tokenization. The Tokenize(string) method tokenizes a string and returns a List of Token objects (whose structure is described below). There are also five boolean static properties defining certain aspects of tokenization:
    ParsedBrackets: If true, the Tokenize method converts brackets to the form used by taggers such as MXPOST (e.g., "-LRB-", "-RRB-"); if false, it leaves them verbatim. False by default. This was an option in the original Penn Treebank script which I have maintained for compatibility.
    ExpandWhats: If true, expands the words "whaddya" and "whatcha" into separate tokens for "what", "do" and "you"; if false, leaves them as single words. False by default. Like the previous property, this was an option in the original Penn script.
    PreserveCase: If true, the Tokenize method leaves tokens in their original case; if false, it lowercases them. True by default.
    TokenizeOnlyEndPeriod: If true, periods are only tokenized if they occur at the end of a string, as in the Penn tokenizer. This is useful if a document has already been split into its component sentences, as it produces more accurate results in that case. False by default.
    PeriodInAbbreviation: If true, leaves the period at the end of certain known abbreviations as part of the same token, rather than splitting it off as a separate token, when TokenizeOnlyEndPeriod is false. True by default.
An additional property named Version returns a System.Version object identifying the version number of the class library.
Token: Represents a single token, with properties for both its string representation (Content) and the category into which it fits (Type).
TokenType: An enumeration defining the types used in the Type property of the Token class. Currently supported types are Word, Number and Punct, all of whose names should be fairly self-explanatory.
An additional method, TokenizeToStringList(string), was added in version 1.1. It behaves just like Tokenize(string), except that it returns a list of strings, each holding the lexeme for one token, rather than a list of Token objects.
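As a rough illustration of how the pieces described above might fit together, here is a minimal C# sketch. It assumes a project reference to Tokenizer.dll, that the static properties are settable, that List refers to System.Collections.Generic.List, and that Version is exposed on the Tokenizer class; none of these details are confirmed beyond the descriptions above. The class name TokenizerUsageSketch and the sample sentence are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using NLPUtils; // namespace named in this document; requires a reference to Tokenizer.dll

// Hypothetical demo class; not part of the library.
class TokenizerUsageSketch
{
    static void Main()
    {
        // Static options as described above (assumed to be settable).
        Tokenizer.PreserveCase = false;          // lowercase every token
        Tokenizer.ParsedBrackets = true;         // convert brackets to -LRB-/-RRB- style tokens
        Tokenizer.TokenizeOnlyEndPeriod = false; // the documented default, set here for clarity

        string text = "The quick brown fox (allegedly) jumps over 1,234 lazy dogs.";

        // Tokenize returns Token objects exposing Content and Type.
        List<Token> tokens = Tokenizer.Tokenize(text);
        foreach (Token token in tokens)
        {
            if (token.Type == TokenType.Number)
            {
                Console.WriteLine("number: {0}", token.Content);
            }
            else
            {
                Console.WriteLine("{0}: {1}", token.Type, token.Content);
            }
        }

        // TokenizeToStringList (added in 1.1) yields only the lexemes.
        foreach (string lexeme in Tokenizer.TokenizeToStringList(text))
        {
            Console.WriteLine(lexeme);
        }

        // Assumed location of the Version property.
        Console.WriteLine("Library version: {0}", Tokenizer.Version);
    }
}
```

Because the options are static, they apply to every later call to Tokenize, so code that needs different settings in different places would have to set them again before each call.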
I have also included a basic command-line tool, known as TokenizerTest, which uses the Tokenizer library to tokenize either a string or a file given as an argument to the command and displays the results with one token per line. A compiled version of the program may be found in NLPUtils\TokenizerTest\bin\Release.
Examples of how to use the program:
    TokenizerTest "The quick brown fox jumps over 1,234 lazy dogs."
    TokenizerTest -f \foo\bar\baz.txt
It's a very primitive interface right now, as the bulk of the work has gone into the class library itself. I may release a graphical interface for the Tokenizer library as an additional download when I have the time.
This is release 1.1 of the Tokenizer class; the build number for the compiled assembly included in this package is 1.1.3002.643.
If you have any comments, suggestions, or potential improvements, don't hesitate to drop me a line at codemanb@uga.edu.