Calculating high information words

A high information word is a word that is strongly biased towards a single classification label. These are the kinds of words we saw when we called the show_most_informative_features() method on both the NaiveBayesClassifier and the MaxentClassifier. Somewhat surprisingly, the top words are different for both classifiers.

This discrepancy is due to how each classifier calculates the significance of each feature, and it's actually beneficial to have these different methods as they can be combined to improve accuracy, as we will see in the next recipe, Combining classifiers with voting. The low information words are words that are common to all labels. It may be counter-intuitive, but eliminating these words from the training data can actually improve accuracy, precision, and recall.

The reason this works is that using only high information words reduces the noise and confusion of a classifier's internal model. If all the words/features are highly biased one way or the other, it's much easier for the classifier to make a correct guess.

How to do it...

First, we need to calculate the high information words in the movie_reviews corpus. We can do this using the high_information_words() function in featx.py:

import collections
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def high_information_words(labelled_words, score_fn=BigramAssocMeasures.chi_sq, min_score=5):
    word_fd = FreqDist()
    label_word_fd = ConditionalFreqDist()

    # Count each word overall and within its label.
    # (+= 1 is used in place of the older FreqDist.inc() call.)
    for label, words in labelled_words:
        for word in words:
            word_fd[word] += 1
            label_word_fd[label][word] += 1

    n_xx = label_word_fd.N()
    high_info_words = set()

    for label in label_word_fd.conditions():
        n_xi = label_word_fd[label].N()
        word_scores = collections.defaultdict(int)

        # Score every word seen under this label.
        for word, n_ii in label_word_fd[label].items():
            n_ix = word_fd[word]
            score = score_fn(n_ii, (n_ix, n_xi), n_xx)
            word_scores[word] = score

        bestwords = [word for word, score in word_scores.items() if score >= min_score]
        # Accumulate across labels rather than overwriting.
        high_info_words |= set(bestwords)

    return high_info_words

It takes one required argument, which is a list of 2-tuples of the form [(label, words)], where label is the classification label and words is a list of words that occur under that label. It returns the set of high information words, that is, every word whose score meets min_score under at least one label. The set is unordered.

Once we have the high information words, we use the feature detector function bag_of_words_in_set(), also found in featx.py, which will let us filter out all low information words.

def bag_of_words_in_set(words, goodwords):
    return bag_of_words(set(words) & set(goodwords))

With this new feature detector, we can call label_feats_from_corpus() and get a new train_feats and test_feats using split_label_feats(). These two functions were covered in the Training a naive Bayes classifier recipe in this chapter.

>>> from featx import high_information_words, bag_of_words_in_set
>>> labels = movie_reviews.categories()
>>> labeled_words = [(l, movie_reviews.words(categories=[l])) for l in labels]
>>> high_info_words = set(high_information_words(labeled_words))
>>> feat_det = lambda words: bag_of_words_in_set(words, high_info_words)
>>> lfeats = label_feats_from_corpus(movie_reviews, feature_detector=feat_det)
>>> train_feats, test_feats = split_label_feats(lfeats)
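To see the filtering step in isolation, here is a minimal sketch. It assumes that bag_of_words() maps each word to True, the usual shape of an NLTK feature dictionary; that definition and the sample words are assumptions made for illustration and are not copied from featx.py.

```python
def bag_of_words(words):
    # Assumed definition: a presence-only feature dict, {word: True}.
    return dict((word, True) for word in words)

def bag_of_words_in_set(words, goodwords):
    # Keep only the words that also appear in goodwords.
    return bag_of_words(set(words) & set(goodwords))

# Hypothetical inputs, chosen only to show the filtering.
review = ['the', 'plot', 'was', 'dull', 'and', 'the', 'acting', 'was', 'awful']
high_info = set(['dull', 'awful', 'superb'])

feats = bag_of_words_in_set(review, high_info)
print(sorted(feats))
```

Only 'awful' and 'dull' survive: the common words are dropped because they are not in the high information set, and 'superb' is absent because it never occurs in the review.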

Now that we have new training and testing feature sets, let's train and evaluate a NaiveBayesClassifier:

>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
>>> accuracy(nb_classifier, test_feats)
0.91000000000000003
>>> nb_precisions, nb_recalls = precision_recall(nb_classifier, test_feats)
>>> nb_precisions["pos"]
0.89883268482490275
>>> nb_precisions["neg"]
0.92181069958847739
>>> nb_recalls["pos"]
0.92400000000000004
>>> nb_recalls["neg"]
0.89600000000000002

While the neg precision and pos recall have both decreased somewhat, neg recall and pos precision have increased drastically. Accuracy is now a little higher than that of the MaxentClassifier.

How it works...

The high_information_words() function starts by counting the frequency of every word, as well as the conditional frequency for each word within each label. This is why we need the words to be labelled, so we know how often each word occurs in each label. Once we have this FreqDist and ConditionalFreqDist, we can score each word on a per-label basis.
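The counting step can be sketched with the standard library alone. In this sketch, collections.Counter stands in for FreqDist and a defaultdict of Counters stands in for ConditionalFreqDist; both substitutions and the two-label toy corpus are assumptions made for illustration.

```python
import collections

# Hypothetical labelled words in the [(label, words)] form.
labelled_words = [
    ('pos', ['great', 'great', 'movie', 'the']),
    ('neg', ['awful', 'movie', 'the', 'the']),
]

# Stand-in for FreqDist: overall count of each word.
word_counts = collections.Counter()
# Stand-in for ConditionalFreqDist: count of each word within each label.
label_word_counts = collections.defaultdict(collections.Counter)

for label, words in labelled_words:
    for word in words:
        word_counts[word] += 1
        label_word_counts[label][word] += 1

# Total number of word occurrences across all labels.
total = sum(word_counts.values())
print(word_counts['the'], label_word_counts['neg']['the'], total)
```

Here 'the' occurs three times overall and twice under neg, out of eight word occurrences in total. These are exactly the kinds of counts the scoring function needs.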

The default score_fn is nltk.metrics.BigramAssocMeasures.chi_sq(), which calculates the chi-square score for each word using the following parameters:

1. n_ii: The frequency of the word in the label.
2. n_ix: The total frequency of the word across all labels.
3. n_xi: The total frequency of all words that occurred in the label.
4. n_xx: The total frequency for all words in all labels.

The simplest way to think about these numbers is that the closer n_ii is to n_ix, the higher the score. Or, the more often a word occurs in a label, relative to its overall occurrence, the higher the score.

Once we have the scores for each word in each label, we can filter out all words whose score is below the min_score threshold. We keep the words that meet or exceed the threshold, and return all high scoring words in each label.
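To make the effect of these four counts concrete, the following sketch computes a plain Pearson chi-square for the 2x2 table they imply. It is a hand-written stand-in, not NLTK code; we expect it to agree with BigramAssocMeasures.chi_sq, but treat that equivalence as an assumption. The function name chi_sq_2x2 and the counts are hypothetical.

```python
def chi_sq_2x2(n_ii, n_ix, n_xi, n_xx):
    """Pearson chi-square for the 2x2 table implied by the four counts."""
    a = n_ii                         # this word, in this label
    b = n_ix - n_ii                  # this word, in other labels
    c = n_xi - n_ii                  # other words, in this label
    d = n_xx - n_ix - n_xi + n_ii    # other words, in other labels
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return float(n_xx) * (a * d - b * c) ** 2 / denom

# Hypothetical counts: the word occurs 10 times overall, and the label
# holds 100 of the 200 word occurrences in the corpus.
even = chi_sq_2x2(5, 10, 100, 200)        # split evenly across labels
skewed = chi_sq_2x2(8, 10, 100, 200)      # 8 of 10 occurrences in this label
exclusive = chi_sq_2x2(10, 10, 100, 200)  # every occurrence in this label
print(even, round(skewed, 2), round(exclusive, 2))
```

With these counts the evenly split word scores 0, the 8-of-10 word scores about 3.79, and the exclusive word scores about 10.53. Only the last one clears the default min_score of 5, which shows how strict the threshold is for rare words.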

It is recommended to experiment with different values of min_score to see what happens. In some cases, fewer words may improve the metrics even more, while in other cases more words are better.
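One simple way to explore this is to sweep a few thresholds and look at which words survive. The sketch below uses made-up scores rather than real chi-square values, and words_at_or_above() is a hypothetical helper that applies the same keep rule as high_information_words().

```python
# Hypothetical per-word scores; real values would come from score_fn.
word_scores = {'superb': 14.2, 'awful': 11.7, 'dull': 6.3, 'plot': 2.1, 'the': 0.1}

def words_at_or_above(scores, min_score):
    # Same keep rule as high_information_words(): score >= min_score.
    return set(word for word, score in scores.items() if score >= min_score)

for min_score in (1, 5, 10):
    kept = words_at_or_above(word_scores, min_score)
    print(min_score, sorted(kept))
```

Raising min_score can only shrink the kept set, so in practice we would retrain and re-evaluate the classifier at each threshold and compare accuracy, precision, and recall.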
