NLPUtils Morphological Parser

Cody Boisclair, November–December 2007


The Morphological Parser is designed to parse common inflectional forms in English into their component stems and features. Currently it handles the -s form of verbs and nouns; the -ed, -en, and -ing forms of verbs; and the -er and -est forms of adjectives and adverbs, including some common irregular spellings of all these forms.

The algorithm for the handling of regular morphological forms is based on the Prolog code provided in figure 9.5 of Natural Language Processing for Prolog Programmers (Covington, 1994).

Also embedded in the class library is a lexicon of common irregular forms, as given in The Oxford English Grammar (Greenbaum, 1996) and A Practical English Grammar (second edition, Thomson and Martinet, 1980). This lexicon is consulted during the parse, and a match in it overrides any analysis of the form as regular. Both the irregular forms and all regular forms may be overridden further by the user, as explained in the description of the IrregularsFilePath property below.
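The order of precedence described above (user overrides first, then the built-in irregular lexicon, then the regular rules) can be sketched roughly as follows. This is an illustration in Python, not the library's C# code: the function names, the dictionary layout, and the sample entries are all assumptions made for the example, and the toy regular rule handles only -ed.

```python
# A minimal sketch of the lookup order, NOT the library's implementation.
# All names and sample entries here are hypothetical.

def toy_regular_rules(word):
    """Stand-in for the regular analysis: only strips a final -ed."""
    if word.endswith("ed") and len(word) > 3:
        return [(word[:-2], "+ed"), (word, "")]
    return [(word, "")]

def analyze(word, user_overrides, builtin_irregulars, regular_rules):
    """Return the analyses for a word, honoring the override order."""
    if word in user_overrides:          # user-supplied entries win
        return user_overrides[word]
    if word in builtin_irregulars:      # then the embedded irregular lexicon
        return builtin_irregulars[word]
    return regular_rules(word)          # otherwise fall back to regular rules

builtin = {"went": [("go", "+ed")]}     # sample irregular entry (assumed)

print(analyze("went", {}, builtin, toy_regular_rules))
print(analyze("walked", {}, builtin, toy_regular_rules))
print(analyze("went", {"went": [("wend", "+ed")]}, builtin, toy_regular_rules))
```

The third call shows a user entry taking priority over the built-in irregular form for the same word.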

The morphological parser exists in the form of a class library (DLL) known as MorphParser.dll, which consists of the following classes and enumerations, all in the NLPUtils namespace:

At this time, no lexicon is used to narrow down the possible spellings of the stem for regular forms; rather, all possible spellings are generated and returned by the parser. (For instance, the word "quickest" may be analyzed as "quick"+"est", "quicke"+"est", "quic"+"est", or an uninflected "quickest".) The results are returned as a List<List<Morph>> object, however, so the program using the parser can easily iterate through the parses and eliminate any it does not want. I am considering eventually adding routines that optionally reject parses during the parsing process based on a provided lexicon, but these have not been added yet.
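Until such routines exist, the filtering has to happen in the calling program. The idea can be sketched as below, in Python for brevity. The list-of-lists shape mirrors List<List<Morph>>, but the (text, kind) pairs, the function name, and the tiny lexicon are assumptions made for this example; the real Morph class may expose its data differently.

```python
# A minimal sketch of post-filtering parses against a lexicon.
# Each parse is a list of (text, kind) pairs; this layout is hypothetical.

def filter_parses(parses, lexicon):
    """Keep only the parses whose stem appears in the lexicon."""
    kept = []
    for parse in parses:
        stem = parse[0][0]  # assumption: the stem is the first morph
        if stem in lexicon:
            kept.append(parse)
    return kept

# The four analyses of "quickest" mentioned above.
quickest = [
    [("quick", "stem"), ("est", "suffix")],
    [("quicke", "stem"), ("est", "suffix")],
    [("quic", "stem"), ("est", "suffix")],
    [("quickest", "stem")],
]

lexicon = {"quick", "baby"}  # hypothetical word list
print(filter_parses(quickest, lexicon))
```

With this word list, only the "quick"+"est" parse survives.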


I have also included a basic command-line tool, known as MorphParserTest, which uses the MorphParser library to analyze either a single token or a string specified as arguments to the command, and displays the results with one token's analysis per line. A compiled version of the program may be found in NLPUtils\TokenizerTest\bin\Release.

Examples of how to use the program:

MorphParserTest babies
    Performs a morphological analysis of the word "babies" and displays the result:
    { (Unknown) baby +s | (Unknown) babie +s | (Unknown) babies }
    (The "Unknown" tag indicates that the parser cannot tell whether the stem is a noun or a verb.)

MorphParserTest -s "The quick brown fox jumps over the lazy dog."
    Tokenizes the specified sentence, performs a morphological analysis of each token, and displays the analysis of each token on a separate line.
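
For readers who want to produce the same one-line display from their own code, the output format shown for "babies" can be imitated with a few lines. This is a Python sketch based only on that sample output; the (tag, stem, suffix) triples and the function name are assumptions, and it is not how MorphParserTest itself is written.

```python
# A minimal sketch that reproduces the display format shown above.
# The (tag, stem, suffix) layout is hypothetical.

def format_analysis(parses):
    """Render a list of parses as '{ a | b | c }'."""
    parts = []
    for tag, stem, suffix in parses:
        text = "(%s) %s" % (tag, stem)
        if suffix:                      # uninflected parses have no suffix
            text += " +" + suffix
        parts.append(text)
    return "{ " + " | ".join(parts) + " }"

babies = [
    ("Unknown", "baby", "s"),
    ("Unknown", "babie", "s"),
    ("Unknown", "babies", ""),
]
print(format_analysis(babies))
```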

As with the Tokenizer, the bulk of the work has gone into the class library itself, so I'll admit that this interface is quite primitive; it was designed mainly for my own testing of the class library.


This is a 1.0 beta release; as far as I can tell, it appears to work acceptably, but there may be bugs I have not yet encountered in my own testing. Use at your own discretion. The build number for the compiled assembly included in this package is 1.0.3002.644.

If you have any comments, suggestions, or potential improvements, don't hesitate to drop me a line at codemanb@uga.edu.