Imbalanced data problem: rebalancing

Keywords: Imbalanced data, rebalancing, cost-sensitive learning

We investigated how to cope with imbalanced data in the previous article.

In the paper introduced in the previous article, The Foundations of Cost-Sensitive Learning, Theorem 1 states that, given a cost matrix, we can derive the optimal number of negative examples to train on. In other words, we can deal with imbalanced data by rebalancing it. In this article, we check the resulting performance through toy experiments, with a setup almost identical to the previous one.
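To make this concrete, here is a minimal sketch of the cost-matrix-to-rebalancing relationship as I read the paper (assuming zero cost for correct predictions; the cost values c_fp and c_fn below are purely illustrative): the cost-optimal decision threshold is p* = c_fp / (c_fp + c_fn), and multiplying the number of negative training examples by p*/(1-p*) makes the default 0.5 threshold behave like p*.

def optimal_threshold(c_fp, c_fn):
    # Predict positive iff P(y=1|x) >= p* (zero cost for correct predictions).
    return c_fp / (c_fp + c_fn)

def negative_multiplier(p_star, p0=0.5):
    # Multiply the number of negative training examples by this factor so
    # that thresholding at p0 (e.g., the default 0.5) acts like thresholding
    # at the cost-optimal p*.
    return (p_star / (1 - p_star)) * ((1 - p0) / p0)

p_star = optimal_threshold(c_fp=1.0, c_fn=10.0)  # 1/11, roughly 0.09
print(negative_multiplier(p_star))               # 0.1: keep ~10% of negatives

So when false negatives are expensive, the rule says to undersample the negative class, which is exactly the kind of rebalancing the experiment below performs.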

The result shows the effectiveness of rebalancing. However, performance is still low when the data is heavily imbalanced (e.g., p:n = 1:100). These approaches are built on the standard learning framework, i.e., the classifier is still designed to maximize accuracy (more precisely, to minimize expected loss), which is why performance is limited; the small check below makes the problem concrete. What we need are learning techniques that maximize the F-measure or other scores directly, and we will try such approaches starting with the next article. Thanks for reading 🙂
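As a quick illustrative check of this point: at p:n = 1:100, the trivial classifier that always predicts negative reaches about 99% accuracy, yet its F-measure on the positive class is zero.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1:100 imbalanced labels and the trivial "always negative" predictor.
y_true = np.array([1] + [0] * 100)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # ~0.99: looks excellent
print(f1_score(y_true, y_pred))        # 0.0: useless on the positive class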

The code:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

def execute():
    n = 200; nte = int(0.1 * n); k = 5
    # Generate a 2-class toy dataset with a roughly 95:5 class ratio.
    x, y = make_classification(n_samples=n, n_features=2, n_redundant=0,
                               n_informative=2, n_clusters_per_class=1,
                               n_classes=2, weights=[0.95, 0.05])
    while sum(y) == 0:  # resample until at least one positive example exists
        x, y = make_classification(n_samples=n, n_features=2, n_redundant=0,
                                   n_informative=2, n_clusters_per_class=1,
                                   n_classes=2, weights=[0.95, 0.05])
    xtr, ytr, xte, yte = x[:-nte], y[:-nte], x[-nte:], y[-nte:]
    while sum(yte) == 0:  # reshuffle until the test set contains a positive
        idx = np.random.permutation(len(y))
        x, y = x[idx], y[idx]
        xtr, ytr, xte, yte = x[:-nte], y[:-nte], x[-nte:], y[-nte:]
    # Choose the rebalancing ratio p (the fraction of negatives kept, also
    # used as the decision threshold) by k-fold cross-validated F-measure.
    best_score = -np.inf
    best_p = None
    folds = list(KFold(n_splits=k).split(xtr))
    for p in np.arange(0.1, 1.0, 0.1):
        scores = np.zeros(k)
        for i, (itr, ite) in enumerate(folds):
            nn = int(len(ytr[itr][ytr[itr] == 0]) * p)  # negatives to keep
            Xtr = np.r_[xtr[itr][ytr[itr] == 1], xtr[itr][ytr[itr] == 0][:nn]]
            Ytr = np.concatenate([ytr[itr][ytr[itr] == 1],
                                  ytr[itr][ytr[itr] == 0][:nn]])
            clf = LogisticRegression().fit(Xtr, Ytr)
            ypred = (clf.predict_proba(xtr[ite])[:, 1] >= p).astype(int)
            try:
                scores[i] = f1_score(ytr[ite], ypred, average=None)[1]
            except IndexError:  # the validation fold has no positive class
                scores[i] = 0.0
        this_score = np.mean(scores)
        if this_score > best_score:
            best_p = p
            best_score = this_score
    # Retrain on the full training set, rebalanced with the selected p.
    p = best_p
    nn = int(len(ytr[ytr == 0]) * p)
    Xtr = np.r_[xtr[ytr == 1], xtr[ytr == 0][:nn]]
    Ytr = np.concatenate([ytr[ytr == 1], ytr[ytr == 0][:nn]])
    clf = LogisticRegression().fit(Xtr, Ytr)
    ypred1 = (clf.predict_proba(xte)[:, 1] >= p).astype(int)  # tuned threshold
    ypred2 = clf.predict(xte)                                 # default 0.5 threshold
    return (f1_score(yte, ypred1, average=None)[1],
            f1_score(yte, ypred2, average=None)[1])

def main(nrun=1):
    scores = np.array([execute() for _ in range(nrun)])
    means, medians, stds = scores.mean(0), np.median(scores, 0), scores.std(0)
    print("Logistic Regression with rebalancing")
    print("mean:", means[0])
    print("median:", medians[0])
    print("std:", stds[0])
    print("Ordinary Logistic Regression")
    print("mean:", means[1])
    print("median:", medians[1])
    print("std:", stds[1])

if __name__ == '__main__':
    main(nrun=100)

Results (F-measure, 100 runs):

Logistic Regression with rebalancing
mean: 0.734088023088
median: 1.0
std: 0.36292133247
Ordinary Logistic Regression
mean: 0.683904761905
median: 1.0
std: 0.398300687473
