# Coping with Imbalanced Data Using the F-measure

Keywords: Cost-Sensitive Learning, Imbalanced data, F-measure

In classification problems, we often encounter imbalanced data, meaning that some classes have a large number of samples while others have only a few. Imbalanced data is especially common in extraction problems such as keyword extraction and document summarization. Ordinary classification methods use accuracy to measure performance, which means they are designed to maximize expected accuracy. Such methods do not work well on imbalanced data: if a binary classification problem has a large number of negative samples and only a few positive samples, we can obtain high accuracy simply by classifying every sample as negative. We therefore need another measure to cope with this problem. The F1 measure, or F1 score (see http://en.wikipedia.org/wiki/F1_score), is often used in such cases.
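To make the accuracy pitfall concrete, here is a minimal sketch in plain Python. The 95/5 class split and the helper names `accuracy` and `f1` are assumptions chosen for illustration; they are not part of the experiment in this article.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1(y_true, y_pred, positive=1):
    """F1 score of the positive class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority (negative) class.
y_all_negative = [0] * 100

acc = accuracy(y_true, y_all_negative)
f1_pos = f1(y_true, y_all_negative)
print("accuracy:", acc)
print("F1 (positive class):", f1_pos)
```

The always-negative classifier looks good by accuracy but scores zero F1 on the positive class, which is the class we actually care about in extraction problems.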

However, how do we combine the F1 measure with classification techniques? One answer is cost-sensitive learning. The popular paper is:

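Let us consider a binary classification problem with a discriminative model, that is, a model that estimates $p(y=1|x)$. Let $c_{ij}$ be the cost of predicting class $i$ when the true class is $j$. The paper gives the optimal threshold on $p(y=1|x)$, at or above which we predict the positive class: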

$p^{*} = \frac{c_{10}-c_{00}}{c_{10}-c_{00}+c_{01}-c_{11}} .$

Usually correct predictions cost nothing, so $c_{00}=0$ and $c_{11}=0$. The optimal threshold then becomes

$p^{*} = \frac{c_{10}}{c_{10}+c_{01}} .$
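This threshold comes from comparing expected costs. With $c_{00}=c_{11}=0$ and $p = p(y=1|x)$, predicting the positive class costs $(1-p)c_{10}$ in expectation and predicting the negative class costs $p\,c_{01}$, so the positive prediction is the cheaper one exactly when $p \ge c_{10}/(c_{10}+c_{01})$. The sketch below checks this numerically. It assumes $c_{10}$ is the false-positive cost and $c_{01}$ the false-negative cost; the function names and the cost values are made up for illustration.

```python
def optimal_threshold(c10, c01):
    """Threshold p* = c10 / (c10 + c01), assuming c00 = c11 = 0."""
    return c10 / (c10 + c01)


def expected_costs(p, c10, c01):
    """Expected cost of each decision when P(y=1|x) = p and c00 = c11 = 0.

    Assumed convention: c10 is the cost of predicting 1 when the truth is 0
    (false positive); c01 is the cost of predicting 0 when the truth is 1
    (false negative).
    """
    cost_predict_1 = (1 - p) * c10
    cost_predict_0 = p * c01
    return cost_predict_1, cost_predict_0


def decide(p, c10, c01):
    """Predict 1 when the estimated probability reaches the threshold."""
    return 1 if p >= optimal_threshold(c10, c01) else 0


# Hypothetical costs: missing a positive is four times as bad as a false alarm.
c10, c01 = 1.0, 4.0
p_star = optimal_threshold(c10, c01)
print("threshold:", p_star)

for p in (0.1, 0.3, 0.5):
    cost_1, cost_0 = expected_costs(p, c10, c01)
    print("p =", p, "decision:", decide(p, c10, c01),
          "cost(predict 1):", cost_1, "cost(predict 0):", cost_0)
```

Making false negatives more expensive lowers the threshold, so the classifier predicts the rare positive class more readily.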

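But how do we define the cost matrix? This is where the F1 measure comes in: in this article, we select the cost matrix that maximizes the F1 measure under cross-validation. Since only the ratio between $c_{10}$ and $c_{01}$ matters, this amounts to searching directly over the threshold $p^{*}$. To check the performance, we run a toy experiment; the code and results are given below. We use logistic regression to estimate the conditional probability $p(y=1|x)$ and call the resulting method "Cost-Sensitive Logistic Regression". As the results show, it outperforms ordinary logistic regression, with a mean F-measure of about 0.70 versus 0.59 over 100 runs.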
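The threshold search itself is simple. The following is a minimal sketch in plain Python, without cross-validation, of picking the threshold that maximizes F1 on a set of predicted probabilities. The helper names, the probabilities, and the labels are invented for illustration and are not the data used in this article.

```python
def f1_positive(y_true, y_pred):
    """F1 of class 1; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(probs, y_true, candidates):
    """Return (threshold, F1) for the candidate with the highest F1.

    Ties keep the earliest candidate, because only a strictly better
    score replaces the current best.
    """
    best_p, best_score = None, float("-inf")
    for p in candidates:
        y_pred = [1 if q >= p else 0 for q in probs]
        score = f1_positive(y_true, y_pred)
        if score > best_score:
            best_p, best_score = p, score
    return best_p, best_score


# Hypothetical predicted probabilities P(y=1|x) and true labels.
probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.60]
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

p_best, f1_best = best_threshold(probs, y_true, candidates)
f1_default = f1_positive(y_true, [1 if q >= 0.5 else 0 for q in probs])
print("best threshold:", p_best, "F1:", f1_best)
print("F1 at the default threshold 0.5:", f1_default)
```

On these made-up numbers the search prefers a threshold below 0.5, because the positive samples receive fairly low probabilities. In practice the score for each candidate should come from held-out folds rather than from the data the model was fitted on.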

In this article, we investigated how to cope with imbalanced data. Using the idea of cost-sensitive learning, we outperformed the ordinary method on a toy problem. The method is quite simple and easy to implement. Next, we will apply it to a real dataset, such as keyword extraction. Thanks for reading 🙂

The code:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import KFold


    def positive_f1(y_true, y_pred):
        # F1 of the positive class (label 1); 0.0 when it is undefined.
        return f1_score(y_true, y_pred, labels=[0, 1], average=None,
                        zero_division=0)[1]


    def make_data(n, weights):
        # Two informative features, one cluster per class, imbalanced classes.
        return make_classification(n_samples=n, n_features=2, n_redundant=0,
                                   n_informative=2, n_clusters_per_class=1,
                                   n_classes=2, weights=weights)


    def execute():
        n = 200
        nte = int(0.1 * n)  # size of the held-out test split
        k = 5               # number of cross-validation folds
        x, y = make_data(n, [0.95, 0.05])
        # Redraw until at least one positive sample exists
        # (the redraw uses weights [0.9, 0.1], as in the original listing).
        while y.sum() == 0:
            x, y = make_data(n, [0.9, 0.1])
        xtr, ytr, xte, yte = x[:-nte], y[:-nte], x[-nte:], y[-nte:]
        # Reshuffle until the test split contains a positive sample.
        while yte.sum() == 0:
            idx = np.random.permutation(len(y))
            x, y = x[idx], y[idx]
            xtr, ytr, xte, yte = x[:-nte], y[:-nte], x[-nte:], y[-nte:]
        # Select the threshold p that maximizes the mean F1 over k folds.
        best_score, best_p = -np.inf, None
        kfold = KFold(n_splits=k)
        for p in np.arange(0.1, 1.0, 0.05):
            scores = np.zeros(k)
            for i, (itr, ite) in enumerate(kfold.split(xtr)):
                clf = LogisticRegression().fit(xtr[itr], ytr[itr])
                ypred = (clf.predict_proba(xtr[ite])[:, 1] >= p).astype(int)
                scores[i] = positive_f1(ytr[ite], ypred)
            this_score = scores.mean()
            if this_score > best_score:
                best_p, best_score = p, this_score
        # Refit on the full training split, then compare the selected
        # threshold with the default decision rule of predict().
        clf = LogisticRegression().fit(xtr, ytr)
        ypred1 = (clf.predict_proba(xte)[:, 1] >= best_p).astype(int)
        ypred2 = clf.predict(xte)
        return positive_f1(yte, ypred1), positive_f1(yte, ypred2)


    def main(nrun=1):
        scores = np.array([execute() for _ in range(nrun)])
        means = scores.mean(axis=0)
        medians = np.median(scores, axis=0)
        stds = scores.std(axis=0)
        for j, name in enumerate(["Cost-Sensitive Logistic Regression",
                                  "Ordinary Logistic Regression"]):
            print(name)
            print("mean:", means[j])
            print("median:", medians[j])
            print("std:", stds[j])


    if __name__ == '__main__':
        main(nrun=100)


The result (F-measure, 100 runs):

| Method | Mean | Median | Std |
|---|---|---|---|
| Cost-Sensitive Logistic Regression | 0.695142857143 | 0.857142857143 | 0.372622002118 |
| Ordinary Logistic Regression | 0.59380952381 | 0.666666666667 | 0.436113790306 |