python – LogisticRegression from sk_learn and smf.logit() from statsmodels.formula.api return different results

Although you set the C parameter high to minimize the regularization, sklearn by default uses the lbfgs solver to find the optimal parameters, while statsmodels uses newton.
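One option is to make statsmodels use the same optimizer: Logit.fit() accepts a method argument, so a minimal sketch (run against the same default_df from the question) would be:

import statsmodels.formula.api as smf

# Ask statsmodels for L-BFGS instead of its default Newton-Raphson
# optimizer; disp=0 silences the per-fit convergence printout.
lr = smf.logit(formula="default_01 ~ income + balance", data=default_df).fit(method="lbfgs", disp=0)
print(lr.params)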

Going the other way, you can switch sklearn to a Newton-type solver to get similar coefficients:

from sklearn.linear_model import LogisticRegression

def boot_fn2(data):
    X = data[["income", "balance"]]
    y = data.default_01
    # Turn off regularization and use a Newton-type solver, as in statsmodels.
    logit = LogisticRegression(penalty=None, max_iter=1000, solver="newton-cg")
    logit.fit(X, y)
    return logit.coef_

If I run this with the above function:

coef_sk = []
coef_sm = []
for _ in np.arange(50):
    data = boot(default_df)
    coef_sk.append(boot_fn2(data))
    coef_sm.append(boot_fn(data))
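Here boot is the bootstrap-resampling helper from the question; a minimal sketch of what it is assumed to do:

def boot(df):
    # Draw a bootstrap sample: same number of rows, sampled with replacement.
    return df.sample(n=len(df), replace=True)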

You will immediately see that it throws a lot of warnings about failing to converge:

LineSearchWarning: The line search algorithm did not converge

Although the coefficients are now similar, the warnings point to a larger issue with your dataset, similar to this question

np.array(coef_sm)[:,1:].mean(axis=0)
array([2.14570133e-05, 5.68280785e-03])

np.array(coef_sk).mean(axis=0)
array([[2.14352318e-05, 5.68116402e-03]])
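The warnings come from the optimizer's line search. If you want to keep the bootstrap loop readable while you investigate, you can filter just that warning with the standard warnings module (a sketch; this hides the symptom, not the cause):

import warnings

with warnings.catch_warnings():
    # Suppress only the line-search convergence warning inside this block.
    warnings.filterwarnings("ignore", message="The line search algorithm did not converge")
    for _ in np.arange(50):
        data = boot(default_df)
        coef_sk.append(boot_fn2(data))
        coef_sm.append(boot_fn(data))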

Your independent variables are on very large scales, and this poses a problem for the optimization methods available in sklearn. You can simply scale the two predictors down if you want to interpret the coefficients:

default_df[["balance", "income"]] = default_df[["balance", "income"]] / 100
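Rescaling like this only rescales the coefficients, not the model: since beta * x == (100 * beta) * (x / 100), the slopes fitted on the divided data come out exactly 100 times larger. A quick sanity check (sketch, multiplying back to recover the original units):

# Recover the original units, fit on both versions, and compare the slopes.
orig_df = default_df.copy()
orig_df[["balance", "income"]] = orig_df[["balance", "income"]] * 100
lr_scaled = smf.logit("default_01 ~ income + balance", data=default_df).fit(disp=0)
lr_orig = smf.logit("default_01 ~ income + balance", data=orig_df).fit(disp=0)
print(lr_scaled.params[1:] / lr_orig.params[1:])  # approx. [100., 100.]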

Otherwise, it is always good practice to standardize your independent variables first and then run the regression:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
default_df[["balance", "income"]] = scaler.fit_transform(default_df[["balance", "income"]])

def boot_fn(data):
    lr = smf.logit(formula="default_01 ~ income + balance", data=data).fit(disp=0)
    return lr.params

def boot_fn2(data):
    X = data[["income", "balance"]]
    y = data.default_01
    logit = LogisticRegression(penalty=None)  # default lbfgs, no regularization
    logit.fit(X, y)
    return logit.coef_

coef_sk = []
coef_sm = []
for _ in np.arange(50):
    data = boot(default_df)
    #print(data.default_01.mean())
    coef_sk.append(boot_fn2(data))
    coef_sm.append(boot_fn(data))

Now you'll see the coefficients are similar:

np.array(coef_sm)[:,1:].mean(axis=0)    
array([0.26517582, 2.71598194])

np.array(coef_sk).mean(axis=0)
array([[0.26517504, 2.71598548]])
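If you still need the slopes in the original units, divide each standardized coefficient by the corresponding feature's standard deviation, which the fitted StandardScaler stores in scale_ (a sketch; note the scaler was fit on [balance, income] while the regression used [income, balance], so align by label):

import pandas as pd

# x_std = (x - mean) / std  implies  beta_std = beta * std, so dividing
# by the per-feature std recovers the slopes on the original scale.
stds = pd.Series(scaler.scale_, index=["balance", "income"])
beta_std = pd.Series(np.array(coef_sk).mean(axis=0).ravel(), index=["income", "balance"])
print(beta_std / stds)  # aligns on labels; the intercept is not recovered here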
