Cody Boisclair, November–December 2007
The Morphological Parser parses common inflectional forms in English into their component stems and features. Currently it handles the -s form of verbs and nouns; the -ed, -en, and -ing forms of verbs; and the -er and -est forms of adjectives and adverbs, including some common irregular spellings of all these forms.
The algorithm for handling regular morphological forms is based on the Prolog code provided in figure 9.5 of Natural Language Processing for Prolog Programmers (Covington, 1994).
Also embedded in the class library is a lexicon of common irregular forms, as given in The Oxford English Grammar (Greenbaum, 1996) and A Practical English Grammar (second edition, Thomson and Martinet, 1980). The lexicon is consulted during the parse, and a match overrides any analysis of the form as regular. Both the built-in irregular forms and all regular forms may be further overridden by the user, as explained in the description of the IrregularsFilePath property below.
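The override order described above can be sketched as follows. This is a minimal illustration in Python, not the library's own .NET code: the function name, the tuple representation of a parse, and the two sample entries are all assumptions made for the example.

```python
# Illustrative sketch only. The table contents and names below are
# hypothetical; they are not the library's actual data or API.
BUILT_IN_IRREGULARS = {
    "feet": [("foot", "s")],
    "went": [("go", "ed")],
}


def analyze(word, user_irregulars, regular_analysis):
    """Return the parses for `word`, honoring the documented override order."""
    if word in user_irregulars:        # 1. user-supplied forms win outright
        return user_irregulars[word]
    if word in BUILT_IN_IRREGULARS:    # 2. then the embedded lexicon
        return BUILT_IN_IRREGULARS[word]
    return regular_analysis(word)      # 3. otherwise fall back to regular rules


# Stand-in for the regular analysis: treats every word as uninflected.
def uninflected(word):
    return [(word,)]


built_in_hit = analyze("feet", {}, uninflected)
user_override = analyze("feet", {"feet": [("feet",)]}, uninflected)
regular_fallback = analyze("dogs", {}, uninflected)
```

Here a user entry for "feet" replaces the built-in "foot" + "s" analysis, and a word found in neither table is handed to the regular analysis.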
The morphological parser exists in the form of a class library (DLL) known as MorphParser.dll, which consists of the following classes and enumerations, all in the
MorphParser: The static class which performs the actual morphological analysis.
The following public methods are provided:
ParseWord(string): Parses the morphology of the single word specified in the parameter, returning a List<List<Morph>>, in which each component List<Morph> represents one possible parse.
TokenizeAndParse(string): Uses the Tokenizer class library to tokenize the given string, then calls ParseWord on each token that is identifiable as a word (i.e., neither numeric nor punctuation). Returns a List<List<List<Morph>>>, where each top-level element represents the result of running ParseWord on a single token from the input string.
IsIrregularForm(string, SyntacticCat): Determines whether the specified string is an irregular form of the specified syntactic category.
IsIrregularStem(string, SyntacticCat): Determines whether the specified string is a stem of an irregular form of the specified syntactic category.
The following properties control certain aspects of the morphological analysis:
IrregularsFilePath: Optionally specifies the path of an XML file in which additional irregular forms are defined. Irregular forms defined in this external file override both regular forms processed by the parser and irregular forms defined in the built-in lexicon. The format of this file is described in this document; a further example, containing the embedded lexicon of irregular forms, can be found in
IrregularsCaseSensitive: If true, irregular forms are matched from both the embedded list and the optional external list only when the capitalization matches exactly; if false, case is ignored when matching irregular forms. Default is false.
PreserveCase: If true, the stem remains capitalized as it was in the original token; if false, the stem is always lowercased. Default is true.
An additional property named Version returns a System.Version object identifying the version number of the class library.
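The effect of the IrregularsCaseSensitive and PreserveCase settings described above can be sketched as follows. This is a rough Python illustration under stated assumptions, not the library's implementation: the helper names and the dictionary representation of the irregular list are hypothetical, and only the documented defaults (false and true, respectively) come from the description.

```python
def find_irregular(token, irregulars, case_sensitive=False):
    """Look up `token` in a dict of irregular forms.

    `case_sensitive` mirrors the documented IrregularsCaseSensitive
    setting (default false): when false, case is ignored in matching.
    """
    if case_sensitive:
        return irregulars.get(token)
    lowered = {form.lower(): parse for form, parse in irregulars.items()}
    return lowered.get(token.lower())


def stem_spelling(stem, preserve_case=True):
    """Mirror the documented PreserveCase setting (default true)."""
    return stem if preserve_case else stem.lower()


irregulars = {"feet": ("foot", "s")}   # hypothetical sample entry
loose_match = find_irregular("Feet", irregulars)
strict_match = find_irregular("Feet", irregulars, case_sensitive=True)
```

With the defaults, "Feet" still matches the entry for "feet"; with case-sensitive matching turned on it does not, so the token would fall through to the regular analysis.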
Morph: Represents a single morph derived from the analysis. Contains the following properties:
Spelling: The spelling/surface representation of the morph. In the case of feature morphs, the most common English spelling of the feature is used, rather than the spelling originally found in the token (for instance, "feet" is represented by two morphs whose Spellings are "foot" and "s", respectively).
Type: A MorphType identifying whether the morph is the stem, an added feature, or a non-word (e.g., number or punctuation).
Category: A SyntacticCat identifying the syntactic category into which the morph falls, if known, or set to
MorphType: An enumeration defining the types used in the Type property of the Morph class. Currently supported types are Stem (the stem of the word), Feature (an affix added on to the stem), and NonWord (a 'morph' that is not actually a word, e.g., numbers and punctuation).
SyntacticCat: An enumeration defining the categories used in the Category property of the Morph class. Essentially, these represent the part of speech of a stem, or the part of speech onto which a feature is attached. Currently supported categories are
At this time, no lexicon is used to narrow down possible spellings of the stem for regular forms; rather, all possible spellings are generated and returned by the parser. (For instance, the word "quickest" may be analyzed as "quick"+"est", "quicke"+"est", "quic"+"est", or an uninflected "quickest".) Because the results are returned as a List<List<Morph>> object, however, the calling program can easily iterate through the parses and eliminate any it does not want. I am considering eventually adding routines for optionally rejecting parses during the parsing process based on a provided lexicon, but these have not been added yet.
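The over-generation described above, and the kind of lexicon filtering a calling program might apply afterwards, can be sketched as follows. This is a minimal Python illustration, not the library's code or Covington's algorithm: the spelling rules, the function names, and the tuple representation of a morph are assumptions chosen so that "quickest" yields the candidates listed above.

```python
def candidate_stems(word, suffix):
    """Generate possible stem spellings for `word` ending in `suffix`.

    The spelling rules are illustrative assumptions, not the parser's rules.
    """
    if not word.endswith(suffix) or len(word) <= len(suffix):
        return []
    base = word[: -len(suffix)]
    stems = [base, base + "e"]            # quick+est, quicke+est
    if base.endswith("ck"):
        stems.append(base[:-1])           # undo k-insertion: quic+est
    if len(base) >= 2 and base[-1] == base[-2]:
        stems.append(base[:-1])           # undo doubling: bigg -> big
    if base.endswith("i"):
        stems.append(base[:-1] + "y")     # undo y -> i: happi -> happy
    return stems


def parses(word, suffix):
    """Return every candidate parse, plus the uninflected reading."""
    result = [[(stem, "stem"), (suffix, "feature")]
              for stem in candidate_stems(word, suffix)]
    result.append([(word, "stem")])
    return result


def filter_by_lexicon(all_parses, lexicon):
    """Keep only the parses whose stem spelling appears in `lexicon`."""
    return [parse for parse in all_parses if parse[0][0] in lexicon]


all_parses = parses("quickest", "est")
kept = filter_by_lexicon(all_parses, {"quick", "big", "happy"})
```

With this toy lexicon, only the "quick" + "est" parse survives; the "quicke", "quic", and uninflected "quickest" readings are dropped because their stems are not listed.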
I have also included a basic command-line tool, known as MorphParserTest, which uses the MorphParser library to parse either a single token or a string given as arguments to the command and displays the results, one token's analysis per line. A compiled version of the program may be found in
Examples of how to use the program:
MorphParserTest -s "The quick brown fox jumps over the lazy dog."
As with the Tokenizer, the bulk of the work has gone into the class library itself, so I'll admit that this interface is quite primitive; it was mainly designed for my own testing of the class library.
This is a 1.0 beta release; as far as I can tell, it appears to work acceptably, but there may be bugs I have not yet encountered in my own testing. Use at your own discretion. The build number for the compiled assembly included in this package is 1.0.3002.644.
If you have any comments, suggestions, or potential improvements, don't hesitate to drop me a line at firstname.lastname@example.org.