pattern.nl

The pattern.nl module contains a fast, regular expressions-based tagger/chunker for Dutch (identifies nouns, adjectives, verbs, etc. in a sentence) and tools for Dutch verb conjugation and noun singularization & pluralization.

It can be used by itself or with other pattern modules: web, db, en, search, vector, graph.


Documentation

The functions in this module take the same parameters and return the same values as their counterparts in pattern.en. Refer to the documentation there for more details.  

Noun singularization & pluralization

For Dutch nouns there is singularize() and pluralize(). The implementation is slightly less robust than the English version (accuracy 91% for singularization and 80% for pluralization).

>>> from pattern.nl import singularize, pluralize
>>> print(singularize('katten'))
>>> print(pluralize('kat'))

kat
katten 

Verb conjugation

For Dutch verbs there is conjugate(), lemma(), lexeme() and tenses(). The lexicon for verb conjugation contains about 4,000 common Dutch verbs. For verbs that are not in the lexicon, it falls back to a rule-based approach with an accuracy of about 80%.

>>> from pattern.nl import conjugate, INFINITIVE
>>> print(conjugate('was', tense=INFINITIVE))

zijn
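
The lookup-then-fallback behaviour described above can be illustrated with a toy sketch. This is not Pattern's implementation: the three-entry LEXICON, the lemma_sketch() name and the single "-t" rule are all hypothetical stand-ins, chosen only to show why a rule-based fallback is less accurate than a lexicon.

```python
# Hypothetical mini-lexicon of irregular forms (Pattern's real lexicon is far larger).
LEXICON = {"was": "zijn", "ben": "zijn", "is": "zijn"}

def lemma_sketch(verb):
    """Return the infinitive from the lexicon, else guess with one naive rule."""
    if verb in LEXICON:
        return LEXICON[verb]
    # Naive fallback rule (an assumption for illustration): strip a final -t, add -en.
    if verb.endswith("t"):
        return verb[:-1] + "en"
    return verb

print(lemma_sketch("was"))    # found in the lexicon
print(lemma_sketch("werkt"))  # the rule happens to give the right infinitive
print(lemma_sketch("loopt"))  # the rule gives "loopen"; the correct form is "lopen"
```

The last call shows the kind of error a simple rule makes on verbs with a doubled vowel, which is why known verbs are better served by the lexicon.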

Attributive & predicative adjectives 

Dutch adjectives followed by a noun inflect with an -e suffix (e.g., braaf → brave kat). You can get the base form with the predicative() function, or vice versa with attributive(). Accuracy is 99%.

>>> from pattern.nl import attributive, predicative
>>> print(predicative('brave'))
>>> print(attributive('braaf'))

braaf
brave 

Sentiment analysis

For opinion mining there is sentiment(), which returns a (polarity, subjectivity)-tuple, based on a lexicon of adjectives. It has an accuracy of 81% (P 0.77, R 0.84) for book reviews:

>>> from pattern.nl import sentiment
>>> print(sentiment('Een onwijs spannend goed boek!'))

(0.55, 0.90) 
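
A minimal sketch of how the returned tuple might be used, assuming (as in pattern.en) that polarity ranges from -1.0 to +1.0 and subjectivity from 0.0 to 1.0. The label() helper and its 0.1 threshold are hypothetical and not part of Pattern; the tuple is copied from the output above so the snippet runs without the library.

```python
def label(polarity, threshold=0.1):
    """Map a polarity score to a coarse label (threshold is an illustrative choice)."""
    if polarity >= threshold:
        return "positive"
    if polarity <= -threshold:
        return "negative"
    return "neutral"

# Stand-in for sentiment('Een onwijs spannend goed boek!'), taken from the example above.
polarity, subjectivity = (0.55, 0.90)
print(label(polarity))
```

A stricter threshold trades recall for precision, so it is worth tuning on your own data rather than relying on the value used here.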

Parser

For parsing there is parse(), parsetree() and split(). Words processed with parse() are assigned tags such as NN (nouns) or VB (verbs). See the pattern.en documentation for how to manipulate the Sentence objects returned from split().

>>> from pattern.nl import parse, split
>>> s = parse('De kat zit op de mat.')
>>> s = split(s)
>>> print(s.sentences[0])

Sentence('De/DT/B-NP/O kat/NN/I-NP/O zit/VBZ/B-VP/O op/IN/B-PP/B-PNP '
         'de/DT/B-NP/I-PNP mat/NN/I-NP/I-PNP ././O/O')
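
The tagged output above separates words with spaces and attaches the tags to each word with slashes. As a rough sketch, such a string can also be read with plain Python. The read_tokens() helper and the field names (word, pos, chunk, pnp) are my own labels, not part of Pattern, and the string is copied from the output above so the snippet runs without the library.

```python
# Tagged sentence copied from the example output above.
tagged = ("De/DT/B-NP/O kat/NN/I-NP/O zit/VBZ/B-VP/O op/IN/B-PP/B-PNP "
          "de/DT/B-NP/I-PNP mat/NN/I-NP/I-PNP ././O/O")

def read_tokens(tagged):
    """Split a slash-delimited tagged sentence into (word, pos, chunk, pnp) tuples."""
    tokens = []
    for item in tagged.split(" "):
        # Split from the right so only the last three slashes separate tags.
        word, pos, chunk, pnp = item.rsplit("/", 3)
        tokens.append((word, pos, chunk, pnp))
    return tokens

tokens = read_tokens(tagged)
# Collect every word whose part-of-speech tag starts with NN (nouns).
nouns = [word for word, pos, chunk, pnp in tokens if pos.startswith("NN")]
print(nouns)
```

For anything beyond quick inspection, the Sentence objects returned from split() are the more convenient route.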

The parser is built on Jeroen Geertzen's Dutch language model. The reported accuracy is around 92%, but the score for the implementation in Pattern can vary slightly from Geertzen's results, since the original WOTAN tagset is mapped to the Penn Treebank tagset. If you need to work with the original tags, you can call parse() with the optional parameter tagset="WOTAN".

Reference: Geertzen, J. (2010), Brill-NL, http://cosmion.net/jeroen/software/brill_pos/