Sunday 15 April 2012

scikit-learn: Difference in SGDClassifier results and statsmodels results for logistic regression with L1

As a check on my work, I've been comparing the output of scikit-learn's SGDClassifier logistic implementation with statsmodels' logistic regression. Once I add L1 regularization in combination with categorical variables, I get very different results. Is this the result of different solution techniques, or am I not using the right parameters?

The differences are much bigger on my own dataset, but still pretty big using mtcars:

import numpy as np
import patsy
import statsmodels.api as sm
from sklearn.linear_model import SGDClassifier

df = sm.datasets.get_rdataset("mtcars", "datasets").data
y, X = patsy.dmatrices('am ~ standardize(wt) + standardize(disp) + C(cyl) - 1', df)

# statsmodels L1-penalized logit
logit = sm.Logit(y, X).fit_regularized(alpha=.0035)

# scikit-learn SGD with log loss and L1 penalty
clf = SGDClassifier(alpha=.0035, penalty='l1', loss='log',
                    l1_ratio=1, n_iter=1000, fit_intercept=False)
clf.fit(X, np.ravel(y))

gives:

sklearn:     [-3.79663192 -1.16145654  0.95744308 -5.90284803 -0.67666106]
statsmodels: [-7.28440744 -2.53098894  3.33574042 -7.50604097 -3.15087396]
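(The post doesn't show the printing code; the vectors above were presumably read off the fitted objects from the snippet above, along these lines:)

# Hypothetical printing code, not in the original post:
print("sklearn:    ", clf.coef_.ravel())
print("statsmodels:", np.asarray(logit.params))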

I've been working through similar issues. I think the short answer might be that SGD doesn't work well with only a few samples, but is (much) more performant with larger data. I'd be interested in hearing from the sklearn devs, though. Compare, for example, the results from LogisticRegression:

from sklearn.linear_model import LogisticRegression

clf2 = LogisticRegression(penalty='l1', C=1/.0035, fit_intercept=False)
clf2.fit(X, np.ravel(y))

which gives results very similar to the L1-penalized logit from statsmodels:

array([[-7.27275526, -2.52638167, 3.32801895, -7.50119041, -3.14198402]])
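One way to sanity-check the small-sample explanation (my own sketch, not from the original exchange) is to tile the 32 mtcars rows and watch the SGD coefficients drift toward the batch solution. Since SGDClassifier's objective averages the loss over samples, replicating the data leaves the optimum unchanged while giving SGD more updates per epoch. This uses the n_iter parameter from scikit-learn versions of that era (later renamed max_iter):

import numpy as np
import patsy
import statsmodels.api as sm
from sklearn.linear_model import SGDClassifier, LogisticRegression

df = sm.datasets.get_rdataset("mtcars", "datasets").data
y, X = patsy.dmatrices('am ~ standardize(wt) + standardize(disp) + C(cyl) - 1', df)
X, y = np.asarray(X), np.ravel(y)

# Batch L1 reference fit on the original 32 rows.
ref = LogisticRegression(penalty='l1', C=1/.0035, fit_intercept=False).fit(X, y)

for reps in (1, 10, 100):
    # Tiling does not change SGD's averaged objective, only the number
    # of stochastic updates per pass over the data.
    Xr, yr = np.tile(X, (reps, 1)), np.tile(y, reps)
    sgd = SGDClassifier(alpha=.0035, penalty='l1', loss='log',
                        n_iter=1000, fit_intercept=False).fit(Xr, yr)
    print(reps, np.abs(sgd.coef_ - ref.coef_).max())

If the small-sample explanation holds, the printed gap between the SGD and batch coefficients should shrink as reps grows.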

Labels: scikit-learn, statsmodels
