python 2.7 - Provoke the NLTK part-of-speech tagger to report a plural proper noun


Let's try out Python's renowned part-of-speech tagger in the NLTK package.

    import nltk

    # You might need to run nltk.download('maxent_treebank_pos_tagger')
    # after installing nltk.

    string = 'Buddy Billy went to the moon and came Back with several Vikings.'
    nltk.pos_tag(nltk.word_tokenize(string))

This gives me:

    [('Buddy', 'NNP'), ('Billy', 'NNP'), ('went', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('moon', 'NN'), ('and', 'CC'), ('came', 'VBD'), ('Back', 'NNP'), ('with', 'IN'), ('several', 'JJ'), ('Vikings', 'NNS'), ('.', '.')]

You can interpret the codes here. I'm disappointed that 'Back' got categorized as a proper noun (NNP), although the confusion is understandable. I'm more upset that 'Vikings' got called a simple plural noun (NNS) instead of a plural proper noun (NNPS). Can anyone come up with a single example of brief input that leads to at least one NNPS tag?

There seem to be problems with the tags in the NLTK Brown corpus, which tags NNPS as NPS (possibly the NLTK tagset is an updated/outdated set of tags that differs from https://www.ling.upenn.edu/courses/fall_2003/ling001/penn_treebank_pos.html).
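The mismatch between the two tag names can be made concrete with a small lookup. This is only a minimal sketch and not part of NLTK: `BROWN_TO_PENN_PROPER` and `brown_to_penn_proper` are hypothetical names, and the table covers just the two proper-noun tags discussed here. It also assumes that Brown tags may carry hyphenated suffixes such as `-TL`, which are stripped before the lookup.

```python
# Hypothetical, partial lookup: only the proper-noun tags are covered.
BROWN_TO_PENN_PROPER = {'NP': 'NNP', 'NPS': 'NNPS'}


def brown_to_penn_proper(tag):
    """Return the Penn Treebank proper-noun tag for a Brown tag, or None."""
    # Assumption: suffixes such as -TL or -HL follow the base tag.
    base = tag.split('-')[0]
    return BROWN_TO_PENN_PROPER.get(base)


print(brown_to_penn_proper('NPS'))    # prints NNPS
print(brown_to_penn_proper('NP-TL'))  # prints NNP
print(brown_to_penn_proper('NN'))     # prints None
```

Anything outside the table returns None, so the helper never guesses at a correspondence it does not know.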

Here's an example of plural proper nouns:

    >>> from nltk.corpus import brown
    >>> for sent in brown.tagged_sents():
    ...     if any(pos for word, pos in sent if pos == 'NPS'):
    ...         print sent
    ...         break
    ...
    [(u'Georgia', u'NP'), (u'Republicans', u'NPS'), (u'are', u'BER'), (u'getting', u'VBG'), (u'strong', u'JJ'), (u'encouragement', u'NN'), (u'to', u'TO'), (u'enter', u'VB'), (u'a', u'AT'), (u'candidate', u'NN'), (u'in', u'IN'), (u'the', u'AT'), (u'1962', u'CD'), (u"governor's", u'NN$'), (u'race', u'NN'), (u',', u','), (u'a', u'AT'), (u'top', u'JJS'), (u'official', u'NN'), (u'said', u'VBD'), (u'Wednesday', u'NR'), (u'.', u'.')]

But if you tag it with nltk.pos_tag, you'll get NNPS:

    >>> for sent in brown.tagged_sents():
    ...     if any(pos for word, pos in sent if pos == 'NPS'):
    ...         print " ".join([word for word, pos in sent])
    ...         break
    ...
    Georgia Republicans are getting strong encouragement to enter a candidate in the 1962 governor's race , a top official said Wednesday .
    >>> from nltk import pos_tag
    >>> pos_tag("Georgia Republicans are getting strong encouragement to enter a candidate in the 1962 governor's race , a top official said Wednesday .".split())
    [('Georgia', 'NNP'), ('Republicans', 'NNPS'), ('are', 'VBP'), ('getting', 'VBG'), ('strong', 'JJ'), ('encouragement', 'NN'), ('to', 'TO'), ('enter', 'VB'), ('a', 'DT'), ('candidate', 'NN'), ('in', 'IN'), ('the', 'DT'), ('1962', 'CD'), ("governor's", 'NNS'), ('race', 'NN'), (',', ','), ('a', 'DT'), ('top', 'JJ'), ('official', 'NN'), ('said', 'VBD'), ('Wednesday', 'NNP'), ('.', '.')]
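To check whether some other brief input produces the tag, it may help to filter the tagger's output rather than read it by eye. The helper below is a hypothetical sketch (`words_with_tag` is not an NLTK function). It works on any list of (word, tag) pairs, so the sample data is written out by hand here instead of calling the tagger.

```python
def words_with_tag(tagged, tag):
    """Return the words in a list of (word, tag) pairs that carry `tag`."""
    return [word for word, pos in tagged if pos == tag]


# Hand-written pairs in the shape that pos_tag returns.
tagged = [('Georgia', 'NNP'), ('Republicans', 'NNPS'), ('are', 'VBP')]
print(words_with_tag(tagged, 'NNPS'))  # prints ['Republicans']
```

Passing the result of `pos_tag(...)` in place of the hand-written list should work the same way, assuming the tagger models are installed.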
