Keywords: Cost-Sensitive Learning, Imbalanced data, F-measure
In classification problems, we often encounter imbalanced data, meaning that some classes have many samples and others have only a few. Imbalanced data appears especially in extraction problems such as keyword extraction and document summarization. Ordinary classification methods use accuracy to measure performance, which means they are designed to maximize expected accuracy. On imbalanced data such methods do not work well: suppose we have a large number of negative samples and a small number of positive samples in binary classification; then we can obtain high accuracy simply by classifying all samples as negative. Thus we need another measure to cope with this problem. The F1 measure, or F1 score (see http://en.wikipedia.org/wiki/F1_score), is often used in such cases.
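To make the accuracy problem concrete, here is a minimal sketch in plain Python. The label counts (95 negatives, 5 positives) are made up for illustration, and the helper functions `accuracy` and `f1` are hypothetical names written here, not part of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); defined here as 0 when the denominator is 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that labels every sample as negative.
all_negative = [0] * 100

acc_trivial = accuracy(y_true, all_negative)
f1_trivial = f1(y_true, all_negative)
print("accuracy:", acc_trivial)
print("F1:", f1_trivial)
```

The trivial classifier looks good on accuracy but finds no positive sample at all, which is exactly what the F1 measure penalizes.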
However, how do we combine the F1 measure with classification techniques? One answer is Cost-Sensitive Learning. The popular paper is:
Let us consider a binary classification problem using a discriminative model. The paper gives the optimal threshold as,
Usually, . Then we have the optimal threshold as,
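As a hedged sketch, the standard expected-cost argument leads to a threshold of the following form. The notation is an assumption made here and may differ from the paper's: $c_{ij}$ is the cost of predicting class $i$ when the true class is $j$, and $p = P(y=1 \mid x)$ is the probability given by the discriminative model.

```latex
% Predict class 1 when its expected cost is no larger than that of predicting class 0:
(1-p)\,c_{10} + p\,c_{11} \;\le\; (1-p)\,c_{00} + p\,c_{01}

% Rearranging, assuming c_{10} > c_{00} and c_{01} > c_{11}, gives the threshold p^*:
p \;\ge\; p^* = \frac{c_{10} - c_{00}}{(c_{10} - c_{00}) + (c_{01} - c_{11})}

% If correct predictions cost nothing (c_{00} = c_{11} = 0):
p^* = \frac{c_{10}}{c_{10} + c_{01}}
```

Under this assumption the threshold depends only on the ratio of the two misclassification costs, so choosing a cost matrix and choosing a threshold are equivalent.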
But how do we define the cost matrix? Here we can combine the F1 measure with the cost matrix: in this article, we select the cost matrix that maximizes the F1 measure by cross-validation. To check the performance, we conduct a toy experiment, shown below. We used Logistic Regression to estimate the conditional probability. We call our method “Cost-Sensitive Logistic Regression”. As we can see, its performance is better than that of the ordinary one.
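The link between a cost matrix and a decision threshold can be sketched as follows. This assumes correct predictions cost nothing, in which case the expected-cost rule predicts positive when the probability is at least c_fp / (c_fp + c_fn). The names `threshold_from_costs`, `cost_ratio_from_threshold`, `predict_with_costs`, `c_fp` and `c_fn` are hypothetical and chosen only for this illustration.

```python
def threshold_from_costs(c_fp, c_fn):
    """Threshold on P(y=1|x), assuming zero cost for correct predictions.

    c_fp: cost of a false positive, c_fn: cost of a false negative.
    """
    return c_fp / (c_fp + c_fn)


def cost_ratio_from_threshold(p):
    """Ratio c_fn / c_fp implied by a threshold p with 0 < p < 1."""
    return (1.0 - p) / p


def predict_with_costs(probabilities, c_fp, c_fn):
    """Label a sample positive when its probability reaches the cost-based threshold."""
    p_star = threshold_from_costs(c_fp, c_fn)
    return [1 if q >= p_star else 0 for q in probabilities]


# Illustrative costs: a missed positive is three times as costly as a false alarm.
p_star = threshold_from_costs(c_fp=1.0, c_fn=3.0)
labels = predict_with_costs([0.1, 0.3, 0.6], c_fp=1.0, c_fn=3.0)
print("threshold:", p_star)
print("labels:", labels)
print("implied cost ratio:", cost_ratio_from_threshold(p_star))
```

Under this assumption, searching over thresholds by cross-validation is the same as searching over cost ratios, so only one number has to be tuned.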
In this article, we investigated how to cope with imbalanced data. We outperformed the ordinary method using the concept of cost-sensitive learning. The method is quite simple and easy to implement. Next, we will apply it to real datasets such as keyword extraction. Thanks for reading🙂
The code:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold


def execute():
    n = 200
    nte = int(0.1 * n)  # size of the held-out test set
    k = 5               # number of cross-validation folds
    # Imbalanced two-class toy data; regenerate if no positive sample was drawn.
    x, y = make_classification(n_samples=n, n_features=2, n_redundant=0, n_informative=2,
                               n_clusters_per_class=1, n_classes=2, weights=[0.95, 0.05])
    while y.sum() == 0:
        x, y = make_classification(n_samples=n, n_features=2, n_redundant=0, n_informative=2,
                                   n_clusters_per_class=1, n_classes=2, weights=[0.9, 0.1])
    xtr, ytr, xte, yte = x[:-nte], y[:-nte], x[-nte:], y[-nte:]
    # Reshuffle until the test set contains at least one positive sample.
    while yte.sum() == 0:
        idx = np.random.permutation(len(y))
        x, y = x[idx], y[idx]
        xtr, ytr, xte, yte = x[:-nte], y[:-nte], x[-nte:], y[-nte:]
    # Select the threshold p that maximizes the mean F1 over the k folds.
    best_score, best_p = -np.inf, None
    kfold = KFold(n_splits=k)
    for p in np.arange(0.1, 1.0, 0.05):
        scores = np.zeros(k)
        for i, (itr, ite) in enumerate(kfold.split(xtr)):
            clf = LogisticRegression().fit(xtr[itr], ytr[itr])
            ypred = (clf.predict_proba(xtr[ite])[:, 1] >= p).astype(int)
            # zero_division=0 scores a fold as 0 when F1 is undefined.
            scores[i] = f1_score(ytr[ite], ypred, zero_division=0)
        this_score = scores.mean()
        if this_score > best_score:
            best_p, best_score = p, this_score
    # Refit on all training data and compare the tuned threshold with the default one.
    clf = LogisticRegression().fit(xtr, ytr)
    ypred1 = (clf.predict_proba(xte)[:, 1] >= best_p).astype(int)  # cost-sensitive
    ypred2 = clf.predict(xte)                                      # ordinary
    return (f1_score(yte, ypred1, zero_division=0),
            f1_score(yte, ypred2, zero_division=0))


def main(nrun=1):
    scores = np.array([execute() for _ in range(nrun)])
    means, medians, stds = scores.mean(0), np.median(scores, 0), scores.std(0)
    # Column 0 holds the cost-sensitive scores, column 1 the ordinary ones.
    names = ("Cost-Sensitive Logistic Regression", "Ordinary Logistic Regression")
    for j, name in enumerate(names):
        print(name)
        print("mean:", means[j])
        print("median:", medians[j])
        print("std:", stds[j])


if __name__ == '__main__':
    main(nrun=100)
The result (F-measure, 100 runs):
Cost-Sensitive Logistic Regression
mean: 0.695142857143
median: 0.857142857143
std: 0.372622002118
Ordinary Logistic Regression
mean: 0.59380952381
median: 0.666666666667
std: 0.436113790306