In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy import stats
from scipy.special import expit
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

sns.set(style='ticks', font_scale=1.5)
blue, green, red, purple = sns.color_palette('deep', n_colors=4)

More regression, classification, etc.

Last week, we looked at how to fit and evaluate linear regression and logistic regression models using the statsmodels package. This week, we will explore the scikit-learn (sklearn) package. The two packages overlap a fair amount, at least with respect to linear and logistic regression, though they take somewhat different approaches to specifying and fitting these sorts of models.
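
To make the contrast concrete, here is a minimal sketch (not from last week's notebook) of the two API styles on toy data; it assumes statsmodels is installed and imported as sm:

In [ ]:
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 2))
y_toy = X_toy @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=100)

# statsmodels: the model object takes the data; fit() returns a results object
print(sm.OLS(y_toy, X_toy).fit().params)

# sklearn: the model object takes the settings; fit() takes the data
print(LinearRegression(fit_intercept=False).fit(X_toy, y_toy).coef_)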

First, we'll generate some random data with known $\beta$ coefficients so that we can evaluate how well the modeling algorithms do recovering the model parameters.

We can use np.random.seed() to make our (pseudo-)random numbers, and thus our results, reproducible. Setting the seed causes the random number generator to produce the same sequence of pseudo-random numbers each time.

So, for example, if we set the seed equal to 6 and then generate five normal random variates, we get:

In [2]:
np.random.seed(seed=6)
np.random.normal(size=5)
Out[2]:
array([-0.31178367,  0.72900392,  0.21782079, -0.8990918 , -2.48678065])

Then if we set the seed to 7, we get:

In [3]:
np.random.seed(seed=7)
np.random.normal(size=5)
Out[3]:
array([ 1.6905257 , -0.46593737,  0.03282016,  0.40751628, -0.78892303])

If we go back and set it to 6 again, we get the same thing we got when we set it to 6 the first time:

In [4]:
np.random.seed(seed=6)
np.random.normal(size=5)
Out[4]:
array([-0.31178367,  0.72900392,  0.21782079, -0.8990918 , -2.48678065])

We will create five random variables with different distributions, just to make things (moderately) interesting. We'll make one normal random variable, one Laplace, one triangular, one cosine (an approximation of the von Mises), and one logistic:

In [5]:
np.random.seed(seed=55555)
nsmp = 150 # number of "observations"

# generate random numbers and stick them together into an array
X = np.stack((stats.norm.rvs(size=nsmp, loc=0, scale=1),
              stats.laplace.rvs(size=nsmp, loc=0, scale=1),
              stats.triang.rvs(loc=-2.5, c=.5, scale=5, size=nsmp),
              stats.cosine.rvs(loc=0,scale=2,size=nsmp),
              stats.logistic.rvs(loc=0,scale=1,size=nsmp)),axis=1)

col_names = ['x1','x2','x3','x4','x5']

# make a data frame
df_X = pd.DataFrame(X,columns=col_names,index=np.arange(nsmp))
df_X.head()
Out[5]:
          x1         x2         x3         x4         x5
0   0.315285  -0.003207  -1.437216  -4.508314  -1.146725
1   0.245975  -0.282873  -1.785265   0.951078  -2.183040
2   2.171328   1.304174   0.368489  -1.879425   1.929869
3  -1.454182   0.156729  -0.391970   4.242391  -0.164466
4   0.101272   0.132085   1.531524  -2.641646   2.310899

Here is what the theoretical probability density functions look like for the random variables we generated:

In [6]:
fig, axes = plt.subplots(1,5,figsize=(20,4))
xax = np.linspace(-5,5,1000)
axes[0].plot(xax,stats.norm.pdf(xax, loc=0, scale=1))
axes[1].plot(xax,stats.laplace.pdf(xax, loc=0, scale=1))
axes[2].plot(xax,stats.triang.pdf(xax, loc=-2.5, c=.5, scale=5))
axes[3].plot(xax,stats.cosine.pdf(xax, loc=0, scale=2)) # scale=2 to match the generated data
axes[4].plot(xax,stats.logistic.pdf(xax, loc=0, scale=1))
[axes[i].set(yticks=[]) for i in range(5)];

We can create a matrix of scatter plots to see what the data look like (and how much they diverge from the theoretical pdfs above). Because we generated each random variable independently of the others, the variables are not highly correlated.

np.corrcoef(array) gives you a correlation matrix for the variables in the input array.

In [7]:
# corrcoef assumes variables are in rows, so we use X.T
print(np.corrcoef(X.T).round(2))
[[ 1.    0.26 -0.16  0.21 -0.03]
 [ 0.26  1.   -0.01 -0.01 -0.13]
 [-0.16 -0.01  1.   -0.08  0.2 ]
 [ 0.21 -0.01 -0.08  1.    0.07]
 [-0.03 -0.13  0.2   0.07  1.  ]]

We can visualize the data with a matrix of scatter plots:

In [8]:
sns.pairplot(df_X,diag_kind='kde');

Next, we will create an array of $\beta$ coefficients and an error standard deviation $\sigma$, and use these with our X matrix to generate a dependent variable y:

In [9]:
B = np.array([17.5, -9.3, 2.4, -8.8, 11.2])
sigma = 3.2
mu_y = X @ B
y = np.random.normal(loc=mu_y, scale=sigma, size=nsmp)

Then we can visualize how y relates to each of our random x variables. Note that the line illustrated in each panel comes from a model with just that variable as a predictor (i.e., it's not the full model we'll fit below):

In [10]:
fig, axes = plt.subplots(1, 5, figsize=(20,4))
axes[0].set_ylabel('y')

for i in range(5):
    sns.regplot(x=X[:,i],y=y,ax=axes[i], n_boot=100)
    axes[i].set_xlabel(col_names[i])
    if i > 0:
        axes[i].set_yticklabels([])

Next, we use the linear models that we imported from sklearn. More specifically, we'll fit linear regression and logistic regression models first, and we'll come back to ridge regression and the lasso model in a bit.

First, we'll fit a standard linear regression model, and check to see how the fitted parameter ($\beta$) estimates compare to the $\beta$ values we used to generate the data.

As with statsmodels, we create a model object (with some input arguments, as appropriate), and then we use the fit() method with data as input arguments:

In [11]:
lin_mod = LinearRegression(fit_intercept=False)
lin_fit = lin_mod.fit(X,y)

Here are the estimated parameters ($\beta$ coefficients):

In [12]:
np.round(lin_fit.coef_,1)
Out[12]:
array([ 17.7,  -9.3,   2.7,  -8.7,  11. ])

And here are the values we used to generate the data above:

In [13]:
B
Out[13]:
array([ 17.5,  -9.3,   2.4,  -8.8,  11.2])

So, in this case, at least, the model did a good job recovering the known parameter values.

We can generate random data with the same parameters repeatedly to measure how much variation there is in fitted model parameters across different random samples:

In [14]:
nsim = 250 # number of simulations
nprd = len(B) # number of predictors
pars = np.zeros((nsim,nprd)) # container for fitted coefficients

for i in range(nsim):
    # generate a new random X matrix for simulation i
    Xt = np.stack((stats.norm.rvs(size=nsmp, loc=0, scale=1),
                   stats.laplace.rvs(size=nsmp, loc=0, scale=1),
                   stats.triang.rvs(loc=-2.5, c=.5, scale=5, size=nsmp),
                   stats.cosine.rvs(loc=0,scale=2,size=nsmp),
                   stats.logistic.rvs(loc=0,scale=1,size=nsmp)),axis=1)
    # generate a new random y vector for simulation i
    yt = np.random.normal(loc=Xt @ B, scale=sigma, size=nsmp)
    # fit a linear model with simulated Xt and yt
    lf_t = lin_mod.fit(Xt,yt)
    # store the fitted parameters/coefficients from the model
    pars[i,:] = lf_t.coef_

We can use seaborn's distplot() to visualize the distributions of the $\beta$ estimation errors (each estimate minus the corresponding true value):

In [16]:
fig, axes = plt.subplots(1,nprd,figsize=(16,6), sharex=True)
[sns.distplot(pars[:,i]-B[i],ax=axes[i],bins=25) for i in range(nprd)]
indices = [r'$\beta_' + str(i) + '$' for i in range(1,nprd+1)]
[axes[i].set(xlabel=indices[i],yticks=[]) for i in range(nprd)];
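
Note that distplot() is deprecated in recent seaborn releases. If it isn't available in your version, a roughly equivalent figure can be produced with histplot() (a sketch, reusing the pars, B, and nprd defined above):

In [ ]:
fig, axes = plt.subplots(1, nprd, figsize=(16, 6), sharex=True)
for i in range(nprd):
    # histogram of estimation errors with a kernel density overlay
    sns.histplot(pars[:, i] - B[i], ax=axes[i], bins=25, kde=True, stat='density')
    axes[i].set(xlabel=r'$\beta_' + str(i + 1) + '$', ylabel='', yticks=[])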

Next, in order to illustrate fitting a logistic regression model, we'll make a dichotomous outcome variable, fit a logistic regression model, and check how the fitted parameters compare to the true parameters:

In [19]:
# expit function for turning X @ B into probabilities
#  expit(x) = 1/(1+np.exp(-x))
Bsc = B/10 # scale the coefficients down so the probabilities aren't all pushed to 0 or 1
p = expit(X @ Bsc) # generate predicted probabilities
b = (p > .57)*1 # dichotomize p
p[:10]
Out[19]:
array([ 0.94748856,  0.04666864,  0.99848485,  0.00122676,  0.9952023 ,
        0.96561392,  0.33271521,  0.32158936,  0.79754037,  0.26000141])
In [20]:
b[:10]
Out[20]:
array([1, 0, 1, 0, 1, 1, 0, 0, 1, 0])
In [21]:
log_mod = LogisticRegression(fit_intercept=False) # logistic regression model
log_fit = log_mod.fit(X,b) # fit the model
In [22]:
log_fit.coef_[0].round(2)
Out[22]:
array([ 2.56, -1.49,  0.54, -1.45,  2.22])
In [23]:
Bsc
Out[23]:
array([ 1.75, -0.93,  0.24, -0.88,  1.12])

Using a sequence of regplots, we can see what the data look like, with fitted univariate logistic model predictions overlaid:

In [24]:
fig, axes = plt.subplots(1,5,figsize=(16,4))
axes[0].set_ylabel('y')
[sns.regplot(x=X[:,i],y=b,ax=axes[i], logistic=True, n_boot=100) for i in range(5)]
[axes[i].set_yticklabels([]) for i in range(1,5)]
[axes[i].set_xlabel(col_names[i]) for i in range(5)];

As with the linear regression model above, we can simulate data and quantify how variable and how consistently biased the logistic regression estimates are:

In [25]:
nsim = 250
nprd = len(B)
pars = np.zeros((nsim,nprd))

for i in range(nsim):
    Xt = np.stack((stats.norm.rvs(size=nsmp, loc=0, scale=1),
                   stats.laplace.rvs(size=nsmp, loc=0, scale=1),
                   stats.triang.rvs(loc=-2.5, c=.5, scale=5, size=nsmp),
                   stats.cosine.rvs(loc=0,scale=2,size=nsmp),
                   stats.logistic.rvs(loc=0,scale=1,size=nsmp)),axis=1)
    # generate and dichotomize probabilities for simulation i
    pt = expit(Xt @ Bsc)
    bt = (pt > .57)*1  # threshold pt (not the original p from above)
    # fit the logistic model with simulated Xt and bt, and store the coefficients
    lf_t = log_mod.fit(Xt,bt)
    pars[i,:] = lf_t.coef_[0]

And we can visualize them as we did above (more or less):

In [26]:
fig, axes = plt.subplots(1,nprd,figsize=(16,6), sharex=True)
xax_lims = (1.1*np.min(pars-Bsc),1.1*np.max(pars-Bsc))
[sns.distplot(pars[:,i]-Bsc[i],ax=axes[i],bins=25) for i in range(nprd)]
indices = [r'$\beta_' + str(i) + '$' for i in range(1,nprd+1)]
[axes[i].set(xlabel=indices[i],yticks=[],xlim=xax_lims) for i in range(nprd)];

Standard linear regression does a good job of recovering the true parameter values, but logistic regression does not: it gets the signs right, and the estimates are quite precise, but they are systematically off. In statistical terminology, the estimates exhibit fairly substantial bias.

Note that this doesn't imply that logistic regression is a bad tool. The way we generated our dichotomous dependent variable was unusual: deterministically thresholding the probabilities means the outcome contains no noise given X, which tends to inflate the coefficient estimates, and it plausibly doesn't reflect any data-generating mechanism we might encounter in the real world.
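
For comparison, here is a sketch (not part of the original analysis) of a more conventional generating mechanism, in which the outcome is drawn from a Bernoulli distribution with probability p rather than produced by a hard threshold; you can refit the model to this outcome and compare the coefficients to Bsc:

In [ ]:
np.random.seed(seed=55555)
p_bern = expit(X @ Bsc)               # same linear predictor as before
b_bern = stats.bernoulli.rvs(p_bern)  # one Bernoulli draw per observation
log_fit_bern = LogisticRegression(fit_intercept=False).fit(X, b_bern)
print(np.round(log_fit_bern.coef_[0], 2))
print(Bsc)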

As discussed above, the random variables we've been dealing with so far are not particularly highly correlated. Let's see what happens when we have correlated independent variables. To do this, we'll generate multivariate normal data with some substantial correlations:

In [28]:
np.random.seed(seed=55562) # set rng seed
nsmp = 150 # number of "observations"
mu = np.zeros(5) # mean vector
S = np.linalg.inv(stats.wishart.rvs(scale=np.eye(5)/10,df=6,size=1)) # random covariance matrix (inverse of a Wishart draw)
X = np.random.multivariate_normal(mean=mu, cov=S, size=nsmp)
df_X = pd.DataFrame(X,columns=col_names,index=np.arange(nsmp))
np.corrcoef(X.T).round(2)
Out[28]:
array([[ 1.  ,  0.71, -0.24,  0.47,  0.71],
       [ 0.71,  1.  ,  0.14,  0.89,  0.86],
       [-0.24,  0.14,  1.  ,  0.43, -0.14],
       [ 0.47,  0.89,  0.43,  1.  ,  0.68],
       [ 0.71,  0.86, -0.14,  0.68,  1.  ]])
In [29]:
sns.pairplot(df_X,diag_kind='kde');

We'll use the same $\beta$ and $\sigma$ parameters that we used above:

In [30]:
B
Out[30]:
array([ 17.5,  -9.3,   2.4,  -8.8,  11.2])
In [31]:
sigma
Out[31]:
3.2
In [32]:
y = np.random.normal(loc=X @ B, scale=sigma, size=nsmp)

We can visualize the relationships between y and each independent variable (again, keeping in mind that the regression line in each panel is fit just to the two variables in the panel, not the full set of predictors):

In [33]:
fig, axes = plt.subplots(1,5,figsize=(20,4))
axes[0].set_ylabel('y')
[sns.regplot(x=X[:,i],y=y,ax=axes[i], n_boot=100) for i in range(5)]
[axes[i].set_yticklabels([]) for i in range(1,5)]
[axes[i].set_xlabel(col_names[i]) for i in range(5)];

As above, the standard linear model does a good job of recovering the true parameter values:

In [35]:
lin_mod = LinearRegression(fit_intercept=False)
lin_fit = lin_mod.fit(X,y)
In [36]:
lin_fit.coef_.round(2)
Out[36]:
array([ 17.9 ,  -9.31,   2.46,  -8.84,  11.15])
In [37]:
B
Out[37]:
array([ 17.5,  -9.3,   2.4,  -8.8,  11.2])

We won't bother simulating multiple data sets to check the variability of the estimates this time.

Lasso regression, ridge regression, and regularization

As briefly discussed last week, lasso regression is a tool for regularization: roughly, putting pressure on a model to constrain its complexity. This can help produce more reliable predictions, though it makes it difficult (if not impossible) to interpret the parameter estimates we get from a fitted model.

In sklearn as in statsmodels, fitting a lasso model is just like fitting a regular linear regression model, but we also need to specify an $\alpha$ parameter. Larger $\alpha$ values put more pressure on the model to keep the size of the coefficients small.
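
For reference, sklearn's documented lasso objective is (up to its scaling conventions)

$$\frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1,$$

where $n$ is the number of observations, while ridge regression minimizes $\lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2$. The penalty terms are what push the coefficient estimates toward zero as $\alpha$ grows, with the $\ell_1$ (lasso) penalty tending to set some coefficients exactly to zero.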

In [38]:
lasso_mod = Lasso(alpha=10, fit_intercept=False)
lasso_fit = lasso_mod.fit(X,y)
In [39]:
np.round(lasso_fit.coef_,1)
Out[39]:
array([ 11. ,  -4.6,  -0. ,  -9.9,   2.8])
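
To see the effect of the penalty more directly, here is a sketch (not from the original notebook) that refits the lasso at a few $\alpha$ values; larger $\alpha$ should shrink the coefficients toward zero, with some hitting exactly zero:

In [ ]:
# refit the lasso across a range of alpha values and print the coefficients
for a in [0.1, 1, 10, 100]:
    fit_a = Lasso(alpha=a, fit_intercept=False).fit(X, y)
    print('alpha =', a, ':', np.round(fit_a.coef_, 1))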