python - How to use the confusion matrix module in NLTK?


I followed the NLTK book in using the confusion matrix, but the ConfusionMatrix output looks odd.

    # Empirically examine where the tagger is making mistakes.
    test_tags = [tag for sent in brown.sents(categories='editorial')
                 for (word, tag) in t2.tag(sent)]
    gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
    print nltk.ConfusionMatrix(gold_tags, test_tags)

Can anyone explain how to use the confusion matrix?

Firstly, I assume that you got the code from the old NLTK's chapter 05: https://nltk.googlecode.com/svn/trunk/doc/book/ch05.py, particularly that you're at this section: http://pastebin.com/ec8ffqlu
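For context, the t2 in the question is the backoff tagger chain built in that chapter. A minimal sketch of how it is trained there (assuming the book's news-category training split):

    import nltk
    from nltk.corpus import brown

    # Training data as in the NLTK book's chapter 05.
    brown_tagged_sents = brown.tagged_sents(categories='news')
    size = int(len(brown_tagged_sents) * 0.9)
    train_sents = brown_tagged_sents[:size]

    # Backoff chain: bigram -> unigram -> default tag 'NN'.
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)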

Now, let's look at the confusion matrix in NLTK. Try:

    from nltk.metrics import ConfusionMatrix

    ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
    tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
    cm = ConfusionMatrix(ref, tagged)
    print cm

[out]:

        | D         |
        | E I J N V |
        | T N J N B |
    ----+-----------+
    DET |<3>. . . . |
     IN | .<1>. . . |
     JJ | . .<.>1 . |
     NN | . . .<3>1 |
     VB | . . . .<1>|
    ----+-----------+
    (row = reference; col = test)

The numbers embedded in <> are the true positives (TP). From the example above, you see that one of the JJ from the reference was wrongly tagged as NN in the tagged output. For that instance, it counts as one false positive for NN and one false negative for JJ.
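You can also read individual cells straight off the ConfusionMatrix object, e.g. to confirm the JJ -> NN error above (a small sketch using the cm built earlier):

    # Rows are reference, columns are test output.
    print cm['JJ', 'NN']   # 1, the one JJ mistagged as NN
    print cm['DET', 'DET'] # 3, a diagonal cell, i.e. true positives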

To use the confusion matrix for calculating precision/recall/fscore, you can access the false negatives, false positives and true positives by:

    from collections import Counter

    labels = set('DET NN VB IN JJ'.split())

    true_positives = Counter()
    false_negatives = Counter()
    false_positives = Counter()

    for i in labels:
        for j in labels:
            if i == j:
                true_positives[i] += cm[i,j]
            else:
                false_negatives[i] += cm[i,j]
                false_positives[j] += cm[i,j]

    print "TP:", sum(true_positives.values()), true_positives
    print "FN:", sum(false_negatives.values()), false_negatives
    print "FP:", sum(false_positives.values()), false_positives

[out]:

    TP: 8 Counter({'DET': 3, 'NN': 3, 'VB': 1, 'IN': 1, 'JJ': 0})
    FN: 2 Counter({'NN': 1, 'JJ': 1, 'VB': 0, 'DET': 0, 'IN': 0})
    FP: 2 Counter({'VB': 1, 'NN': 1, 'DET': 0, 'JJ': 0, 'IN': 0})
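From those totals you can also compute micro-averaged scores. A quick sketch (note that in a tagging task where every token gets exactly one tag, total FP equals total FN, so micro precision and micro recall coincide):

    tp = sum(true_positives.values())   # 8
    fp = sum(false_positives.values())  # 2
    fn = sum(false_negatives.values())  # 2

    micro_precision = tp / float(tp + fp)  # 8/10 = 0.8
    micro_recall = tp / float(tp + fn)     # 8/10 = 0.8
    print "micro P/R:", micro_precision, micro_recall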

To calculate the fscore per label:

    for i in sorted(labels):
        if true_positives[i] == 0:
            fscore = 0
        else:
            precision = true_positives[i] / float(true_positives[i]+false_positives[i])
            recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
            fscore = 2 * (precision * recall) / float(precision + recall)
        print i, fscore

[out]:

    DET 1.0
    IN 1.0
    JJ 0
    NN 0.75
    VB 0.666666666667
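As a sanity check, the same per-label numbers can be reproduced with NLTK's set-based metrics by collecting the positions at which each label occurs; a sketch, assuming the same ref, tagged and labels as above:

    from nltk.metrics import f_measure

    # Positions at which each label occurs in the reference and the test output.
    ref_sets, test_sets = {}, {}
    for pos, tag in enumerate(ref):
        ref_sets.setdefault(tag, set()).add(pos)
    for pos, tag in enumerate(tagged):
        test_sets.setdefault(tag, set()).add(pos)

    for label in sorted(labels):
        # f_measure returns None (not 0) when a label is never predicted, as for JJ.
        print label, f_measure(ref_sets.get(label, set()), test_sets.get(label, set()))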

I hope the above will de-confuse the confusion matrix usage in NLTK. Here's the full code for the example above:

    from collections import Counter
    from nltk.metrics import ConfusionMatrix

    ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
    tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
    cm = ConfusionMatrix(ref, tagged)

    print cm

    labels = set('DET NN VB IN JJ'.split())

    true_positives = Counter()
    false_negatives = Counter()
    false_positives = Counter()

    for i in labels:
        for j in labels:
            if i == j:
                true_positives[i] += cm[i,j]
            else:
                false_negatives[i] += cm[i,j]
                false_positives[j] += cm[i,j]

    print "TP:", sum(true_positives.values()), true_positives
    print "FN:", sum(false_negatives.values()), false_negatives
    print "FP:", sum(false_positives.values()), false_positives
    print

    for i in sorted(labels):
        if true_positives[i] == 0:
            fscore = 0
        else:
            precision = true_positives[i] / float(true_positives[i]+false_positives[i])
            recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
            fscore = 2 * (precision * recall) / float(precision + recall)
        print i, fscore
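Note that the code above is written for Python 2 (print statements, integer division worked around with float()). On Python 3 with a recent NLTK, the same computation is a straightforward port, not a different API:

    from collections import Counter
    from nltk.metrics import ConfusionMatrix

    ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
    tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
    cm = ConfusionMatrix(ref, tagged)
    print(cm)

    labels = set(ref)  # derive the label set from the reference tags

    true_positives = Counter()
    false_negatives = Counter()
    false_positives = Counter()

    for i in labels:
        for j in labels:
            if i == j:
                true_positives[i] += cm[i, j]
            else:
                false_negatives[i] += cm[i, j]
                false_positives[j] += cm[i, j]

    for i in sorted(labels):
        if true_positives[i] == 0:
            fscore = 0
        else:
            precision = true_positives[i] / (true_positives[i] + false_positives[i])
            recall = true_positives[i] / (true_positives[i] + false_negatives[i])
            fscore = 2 * (precision * recall) / (precision + recall)
        print(i, fscore)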
