python - How to use the confusion matrix module in NLTK?
I followed the NLTK book in using the confusion matrix, but the ConfusionMatrix output looks odd:
# empirically examine where the tagger is making mistakes
test_tags = [tag for sent in brown.sents(categories='editorial')
             for (word, tag) in t2.tag(sent)]
gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
print nltk.ConfusionMatrix(gold_tags, test_tags)
Can anyone explain how to use the confusion matrix?
Firstly, I assume you got the code from the old NLTK's chapter 05: https://nltk.googlecode.com/svn/trunk/doc/book/ch05.py, particularly, you're at this section: http://pastebin.com/ec8ffqlu
Now, let's look at the confusion matrix in NLTK. Try:
from nltk.metrics import ConfusionMatrix

ref    = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
cm = ConfusionMatrix(ref, tagged)
print cm
[out]:
    |  D           |
    |  E  I  J  N  V |
    |  T  N  J  N  B |
----+----------------+
DET |<3> .  .  .  . |
 IN | . <1> .  .  . |
 JJ | .  . <.> 1  . |
 NN | .  .  . <3> 1 |
 VB | .  .  .  . <1>|
----+----------------+
(row = reference; col = test)
The numbers embedded in <> are the true positives (TP). In the example above, we see that one of the JJ tags from the reference was wrongly tagged as NN in the tagged output. For that instance, it counts as one false positive for NN and one false negative for JJ.
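You can also read individual cells directly with cm[reference_tag, test_tag] indexing (the same indexing used in the loop below), which is handy for checking one particular confusion:

# cm[reference_tag, test_tag] gives the count in that single cell
print cm['JJ', 'NN']   # 1 -> the one JJ that was mistagged as NN
print cm['NN', 'NN']   # 3 -> true positives for NN
print cm['NN', 'VB']   # 1 -> one NN mistagged as VB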
To use the confusion matrix for calculating precision/recall/F-score, you can accumulate the false negatives, false positives and true positives like this:
from collections import Counter

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i,j]
        else:
            false_negatives[i] += cm[i,j]
            false_positives[j] += cm[i,j]

print "TP:", sum(true_positives.values()), true_positives
print "FN:", sum(false_negatives.values()), false_negatives
print "FP:", sum(false_positives.values()), false_positives
[out]:
TP: 8 Counter({'DET': 3, 'NN': 3, 'VB': 1, 'IN': 1, 'JJ': 0})
FN: 2 Counter({'NN': 1, 'JJ': 1, 'VB': 0, 'DET': 0, 'IN': 0})
FP: 2 Counter({'VB': 1, 'NN': 1, 'DET': 0, 'JJ': 0, 'IN': 0})
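Since these totals pool all labels together, you can also read an overall (micro-averaged) score off them. For a tagger that assigns exactly one tag per token, every false positive for one tag is a false negative for another, so micro-precision, micro-recall and token accuracy all coincide; a minimal sketch reusing the counters above:

# micro-averaged scores, pooling the per-label counts from above
tp = sum(true_positives.values())   # 8
fp = sum(false_positives.values())  # 2
fn = sum(false_negatives.values())  # 2

print "micro-precision:", tp / float(tp + fp)   # 8/10 = 0.8
print "micro-recall:   ", tp / float(tp + fn)   # 8/10 = 0.8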
To calculate the F-score per label:
for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i]+false_positives[i])
        recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
        fscore = 2 * (precision * recall) / float(precision + recall)
    print i, fscore
[out]:
DET 1.0
IN 1.0
JJ 0
NN 0.75
VB 0.666666666667
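If you want a single number across labels, the usual choice is the macro-average, i.e. the unweighted mean of the per-label F-scores; a small sketch that continues from the counters above:

# macro-averaged F-score: unweighted mean of the per-label F-scores
total_fscore = 0.0
for i in labels:
    if true_positives[i] == 0:
        fscore = 0.0
    else:
        precision = true_positives[i] / float(true_positives[i] + false_positives[i])
        recall = true_positives[i] / float(true_positives[i] + false_negatives[i])
        fscore = 2 * (precision * recall) / float(precision + recall)
    total_fscore += fscore

print "macro F-score:", total_fscore / len(labels)   # (1.0+1.0+0+0.75+0.666...)/5 ~= 0.683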
I hope the above de-confuses the confusion matrix usage in NLTK. And here's the full code for the example above:
from collections import Counter
from nltk.metrics import ConfusionMatrix

ref    = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()

cm = ConfusionMatrix(ref, tagged)
print cm

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i,j]
        else:
            false_negatives[i] += cm[i,j]
            false_positives[j] += cm[i,j]

print "TP:", sum(true_positives.values()), true_positives
print "FN:", sum(false_negatives.values()), false_negatives
print "FP:", sum(false_positives.values()), false_positives
print

for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i]+false_positives[i])
        recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
        fscore = 2 * (precision * recall) / float(precision + recall)
    print i, fscore
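As a cross-check, NLTK also ships set-based scoring functions in nltk.metrics (precision, recall, f_measure) that operate on sets of item positions. Here's a sketch of how you could compute the same per-label numbers with them; note that precision and f_measure return None for a label that never appears in the test output, like JJ here:

from collections import defaultdict
from nltk.metrics import precision, recall, f_measure

ref    = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()

# collect, per label, the set of token positions carrying that label
ref_sets = defaultdict(set)
test_sets = defaultdict(set)
for i, tag in enumerate(ref):
    ref_sets[tag].add(i)
for i, tag in enumerate(tagged):
    test_sets[tag].add(i)

for tag in sorted(set(ref)):
    # returns None when the corresponding set is empty
    print tag, precision(ref_sets[tag], test_sets[tag]), \
          recall(ref_sets[tag], test_sets[tag]), \
          f_measure(ref_sets[tag], test_sets[tag])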