In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

sns.set(style='darkgrid', font_scale=2)
blue, green, red, purple = sns.color_palette('deep', n_colors=4)

Data analysis: linear models, generalized linear models, etc.

There are a number of packages designed for model fitting and evaluation in Python. Two in particular come bundled with Anaconda: statsmodels and scikit-learn. We'll work with parts of each, beginning with statsmodels. We will (probably) also work with pystan later on, when we get to Bayesian data analysis.

First, note the slightly different import statements in the code block above. In order to use statsmodels, we import its API (application programming interface); the standard abbreviation is sm. Note also that we've imported statsmodels.formula.api as smf, which provides the formula-based interface.

We can illustrate some of the useful tools in sm with the consonant acoustics data we have looked at before. Because there are multiple repetitions of each consonant by each speaker, we will just take a subset of the data (in this case, the first repetition).

In [3]:
cons_all = pd.read_csv('consonant_acoustics_subset_labels.csv') # read in data file
cons_all.sort_values(by=['subject','consonant','slab','token'], inplace=True) # sort the data
cons_first = cons_all.groupby(['subject','consonant','slab']).first() # take just the first repetition
cons_first.head(n=10)
Out[3]:
voice place manner prosody condur vowdur token noipow voipow vlab plab mlab
subject consonant slab
1 b coda 1 0 0 1 108 214 1 142.09 117.84 voiced labial stop
onset 1 0 0 0 15 184 1 138.55 119.30 voiced labial stop
d coda 1 1 0 1 96 226 1 125.93 117.03 voiced alveolar stop
onset 1 1 0 0 17 208 1 140.72 119.62 voiced alveolar stop
f coda 0 0 1 1 220 168 1 142.80 111.24 voiceless labial fricative
onset 0 0 1 0 126 209 1 152.67 118.22 voiceless labial fricative
p coda 0 0 0 1 143 168 1 145.16 120.06 voiceless labial stop
onset 0 0 0 0 39 166 1 140.55 120.55 voiceless labial stop
s coda 0 1 1 1 215 193 1 152.83 113.49 voiceless alveolar fricative
onset 0 1 1 0 159 166 1 162.08 109.67 voiceless alveolar fricative
In [4]:
cons_first.reset_index(level=['subject','consonant','slab'], inplace=True) # make index into columns
cons_first.head(n=10)
Out[4]:
subject consonant slab voice place manner prosody condur vowdur token noipow voipow vlab plab mlab
0 1 b coda 1 0 0 1 108 214 1 142.09 117.84 voiced labial stop
1 1 b onset 1 0 0 0 15 184 1 138.55 119.30 voiced labial stop
2 1 d coda 1 1 0 1 96 226 1 125.93 117.03 voiced alveolar stop
3 1 d onset 1 1 0 0 17 208 1 140.72 119.62 voiced alveolar stop
4 1 f coda 0 0 1 1 220 168 1 142.80 111.24 voiceless labial fricative
5 1 f onset 0 0 1 0 126 209 1 152.67 118.22 voiceless labial fricative
6 1 p coda 0 0 0 1 143 168 1 145.16 120.06 voiceless labial stop
7 1 p onset 0 0 0 0 39 166 1 140.55 120.55 voiceless labial stop
8 1 s coda 0 1 1 1 215 193 1 152.83 113.49 voiceless alveolar fricative
9 1 s onset 0 1 1 0 159 166 1 162.08 109.67 voiceless alveolar fricative
In [5]:
cols = ['voice','vlab','place','plab','manner','mlab','prosody','slab','noipow','voipow','condur','vowdur','token']
cons = cons_first[cols].copy() # make a copy, with just the subset of columns in cols
cons.head(n=10)
Out[5]:
voice vlab place plab manner mlab prosody slab noipow voipow condur vowdur token
0 1 voiced 0 labial 0 stop 1 coda 142.09 117.84 108 214 1
1 1 voiced 0 labial 0 stop 0 onset 138.55 119.30 15 184 1
2 1 voiced 1 alveolar 0 stop 1 coda 125.93 117.03 96 226 1
3 1 voiced 1 alveolar 0 stop 0 onset 140.72 119.62 17 208 1
4 0 voiceless 0 labial 1 fricative 1 coda 142.80 111.24 220 168 1
5 0 voiceless 0 labial 1 fricative 0 onset 152.67 118.22 126 209 1
6 0 voiceless 0 labial 0 stop 1 coda 145.16 120.06 143 168 1
7 0 voiceless 0 labial 0 stop 0 onset 140.55 120.55 39 166 1
8 0 voiceless 1 alveolar 1 fricative 1 coda 152.83 113.49 215 193 1
9 0 voiceless 1 alveolar 1 fricative 0 onset 162.08 109.67 159 166 1

The sm module makes it easy to specify and fit linear regression models. You can pass your data directly as numpy arrays, or you can use R-style formula syntax with a pandas data frame. We'll illustrate both, starting with arrays extracted from the data frame defined above.

We will specify and fit a simple linear regression model to analyze how well the duration of consonants predicts their RMS energy.

We'll visualize the data before fitting a model:

In [6]:
sns.jointplot(cons['condur'], cons['noipow'], size=8, stat_func=None);

In order to fit a standard linear regression model, we can use OLS (ordinary least squares) in the sm module to create a model object and then call that object's fit method. OLS requires an array for the dependent variable (endog, for 'endogenous') and an array for the independent variable(s) (exog, for 'exogenous').

The term 'ordinary least squares' refers to the fact that the model is fit (i.e., parameter values are found) such that the sum of the squared deviations of the observed data from the expected values is minimized.
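
In matrix notation, with design matrix $X$ (one row of predictors per observation), the OLS estimates have the familiar closed-form solution:

\begin{equation} \hat{\beta} = \underset{\beta}{\arg\min} \sum_i \left(y_i - x_i^\top \beta \right)^2 = \left(X^\top X\right)^{-1} X^\top y \end{equation}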

Note that you have to include a column of ones if you want the model to have an intercept.
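
Manually stacking a column of ones works (as below); statsmodels also provides sm.add_constant, which prepends the column of ones for you. A minimal sketch, assuming the cons data frame defined above:

In [ ]:
X = sm.add_constant(cons['condur']) # adds a 'const' column of ones alongside 'condur'
X.head()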

The model we'll fit is:

\begin{equation} \hat{y}_i = \beta_0 + \beta_1 x_{i} \end{equation}

In [9]:
y = cons['noipow'].as_matrix() # noipow = dependent/endogenous variable
x0 = np.ones(y.shape[0]) # for just estimating an intercept
x = np.stack((np.ones(y.shape[0]),cons['condur']),axis=1) # ones for intercept, condur = ind/exog variable
x[:5,:]
Out[9]:
array([[   1.,  108.],
       [   1.,   15.],
       [   1.,   96.],
       [   1.,   17.],
       [   1.,  220.]])

Simple, intercept-only model (i.e., calculating the mean and variance of y in an unnecessarily complicated way):

In [10]:
mod_0 = sm.OLS(endog=y, exog=x0) # intercept-only model
fit_0 = mod_0.fit()
print(fit_0.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                      -0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                      -inf
Date:                Fri, 23 Mar 2018   Prob (F-statistic):                nan
Time:                        10:50:29   Log-Likelihood:                -1198.2
No. Observations:                 320   AIC:                             2398.
Df Residuals:                     319   BIC:                             2402.
Df Model:                           0                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        145.1809      0.573    253.458      0.000     144.054     146.308
==============================================================================
Omnibus:                        5.699   Durbin-Watson:                   1.125
Prob(Omnibus):                  0.058   Jarque-Bera (JB):                3.849
Skew:                          -0.105   Prob(JB):                        0.146
Kurtosis:                       2.505   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/Users/noahsilbert/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py:1396: RuntimeWarning: divide by zero encountered in double_scalars
  return self.ess/self.df_model
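
As a quick sanity check (assuming y and fit_0 from above), the estimated intercept is just the sample mean of y:

In [ ]:
print(fit_0.params[0], y.mean()) # both should be about 145.18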

Adding in consonant duration as a predictor to see how much information it gives us about acoustic energy above and beyond just calculating the mean (and variance) of y:

In [8]:
mod_a = sm.OLS(endog=y, exog=x) # OLS model object
fit_a = mod_a.fit()
print(fit_a.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.049
Model:                            OLS   Adj. R-squared:                  0.046
Method:                 Least Squares   F-statistic:                     16.55
Date:                Fri, 23 Mar 2018   Prob (F-statistic):           5.97e-05
Time:                        10:46:40   Log-Likelihood:                -1190.1
No. Observations:                 320   AIC:                             2384.
Df Residuals:                     318   BIC:                             2392.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        140.8704      1.198    117.585      0.000     138.513     143.228
x1             0.0313      0.008      4.069      0.000       0.016       0.046
==============================================================================
Omnibus:                        3.950   Durbin-Watson:                   1.278
Prob(Omnibus):                  0.139   Jarque-Bera (JB):                3.075
Skew:                          -0.116   Prob(JB):                        0.215
Kurtosis:                       2.579   Cond. No.                         333.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Or, somewhat more conveniently, we can use the ols method inside smf. Calling it with a formula and a data frame creates a model object, on which we again call the fit method:

In [20]:
fit_b = smf.ols('noipow ~ condur', data=cons).fit()
print(fit_b.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 noipow   R-squared:                       0.049
Model:                            OLS   Adj. R-squared:                  0.046
Method:                 Least Squares   F-statistic:                     16.55
Date:                Fri, 23 Mar 2018   Prob (F-statistic):           5.97e-05
Time:                        11:04:55   Log-Likelihood:                -1190.1
No. Observations:                 320   AIC:                             2384.
Df Residuals:                     318   BIC:                             2392.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    140.8704      1.198    117.585      0.000     138.513     143.228
condur         0.0313      0.008      4.069      0.000       0.016       0.046
==============================================================================
Omnibus:                        3.950   Durbin-Watson:                   1.278
Prob(Omnibus):                  0.139   Jarque-Bera (JB):                3.075
Skew:                          -0.116   Prob(JB):                        0.215
Kurtosis:                       2.579   Cond. No.                         333.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
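
The fitted results object also exposes the pieces of the summary individually, which is convenient for further computation (assuming fit_b from above):

In [ ]:
print(fit_b.params)     # coefficient estimates
print(fit_b.bse)        # standard errors
print(fit_b.conf_int()) # 95% confidence intervals
print(fit_b.rsquared)   # R-squared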

We can use the seaborn function lmplot to plot the data and a fitted regression line, along with a bootstrap confidence interval around that line; the dotted horizontal line below shows the intercept-only model's estimate of the mean:

In [13]:
y_mean = fit_0.params # estimated mean of noipow from the intercept-only model
In [18]:
sns.lmplot(x='condur', y='noipow', data=cons, size=8)
ax = plt.gca()
ax.plot(ax.get_xlim(),y_mean*np.ones(2),':',color=purple, lw=3);

We could also plot the 'lowess' (locally weighted) regression line, instead:

In [10]:
sns.lmplot(x='condur',y='noipow', data=cons, size=8, lowess=True);

The fitted model also allows us to generate predicted mean values and standard errors around these predicted mean values for values of the independent variable of interest:

In [24]:
# create data frame with values of 'condur' at which we want predictions
x_new_d = pd.DataFrame({'condur':np.arange(0,751,50)})

# make array for plotting
x_new = x_new_d['condur'].as_matrix()

# generate predictions of noise power at new 'condur' values
y_pred = fit_b.get_prediction(x_new_d)

fig, ax = plt.subplots(1,1,figsize=(8,8))
ax.set_xlabel('condur'); ax.set_ylabel('noipow');
ax.plot(cons['condur'],cons['noipow'],'o',color=list(blue) + [.75])

y_pred_mean = y_pred.predicted_mean # predicted mean
y_pred_se = y_pred.se_mean # standard error of the predicted mean
y1 = y_pred_mean-2*y_pred_se; y2 = y_pred_mean+2*y_pred_se # prediction bounds

ax.fill_between(x=x_new,y1=y1,y2=y2,color=list(blue) + [.25])
ax.plot(x_new,y_pred_mean,'-',lw=3,color=list(blue) + [.95]);
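
The prediction object can also bundle the predicted means, standard errors, and confidence interval bounds into a single data frame (assuming y_pred from above):

In [ ]:
y_pred.summary_frame(alpha=0.05).head() # mean, mean_se, and interval bounds at each new 'condur' value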

Multiple regression

Of course, we can also fit more complicated models. Here is a model predicting consonant duration as a function of vowel duration, syllable position (slab: onset vs coda), and the interaction between these two factors.

Note that, because syllable position is a dichotomous variable, the presence of this variable in the model allows the onset and coda data to have different intercepts, and the interaction term allows for vowel duration to have a different slope for onset and coda positions.

If syllable position is indicated by $S$ and encoded with onset = 1 and coda = 0, with estimated consonant duration indicated by $\hat{C}$ and vowel duration by $V$, then the model is:

\begin{equation} \hat{C}_i = \beta_0 + \beta_1 V_i + \beta_2 S_i + \beta_3 V_i \times S_i \end{equation}

For coda consonants, $S_i = 0$, so this reduces to:

\begin{equation} \hat{C}_i = \beta_0 + \beta_1 V_i \end{equation}

Whereas for onset consonants, $S_i = 1$, so the model is:

\begin{align} \hat{C}_i &= \beta_0 + \beta_1 V_i + \beta_2 S_i + \beta_3 V_i \times S_i\\ &= \beta_0 + \beta_1 V_i + \beta_2 + \beta_3 V_i\\ &= \left(\beta_0 + \beta_2 \right) + \left(\beta_1 + \beta_3\right) V_i \end{align}

Here's the model, using a formula and the smf module:

In [28]:
fit_c = smf.ols('condur ~ vowdur + slab + vowdur:slab', data=cons).fit()
print(fit_c.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 condur   R-squared:                       0.442
Model:                            OLS   Adj. R-squared:                  0.437
Method:                 Least Squares   F-statistic:                     83.47
Date:                Fri, 23 Mar 2018   Prob (F-statistic):           8.55e-40
Time:                        11:12:11   Log-Likelihood:                -1732.2
No. Observations:                 320   AIC:                             3472.
Df Residuals:                     316   BIC:                             3487.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept              225.9845     18.838     11.996      0.000     188.921     263.048
slab[T.onset]         -154.6420     28.274     -5.469      0.000    -210.271     -99.013
vowdur                  -0.1658      0.075     -2.213      0.028      -0.313      -0.018
vowdur:slab[T.onset]     0.2382      0.110      2.164      0.031       0.022       0.455
==============================================================================
Omnibus:                        6.439   Durbin-Watson:                   1.529
Prob(Omnibus):                  0.040   Jarque-Bera (JB):                5.942
Skew:                           0.276   Prob(JB):                       0.0513
Kurtosis:                       2.624   Cond. No.                     3.06e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.06e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
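
Following the algebra above, the implied intercept and slope for onset consonants can be recovered from the fitted coefficients (assuming fit_c from above):

In [ ]:
b = fit_c.params
print(b['Intercept'] + b['slab[T.onset]'])     # onset intercept: beta_0 + beta_2
print(b['vowdur'] + b['vowdur:slab[T.onset]']) # onset slope: beta_1 + beta_3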

We can also re-parameterize the model in terms of an overall intercept and slope, with deviations from these for coda and onset consonants.

If we code coda as -0.5 and onset as +0.5, for coda consonants the model will be:

\begin{align} \hat{C}_i &= \beta_0 + \beta_1 V_i - 0.5\beta_2 - 0.5\beta_3 V_i\\ \hat{C}_i &= (\beta_0 - 0.5\beta_2) + (\beta_1 - 0.5\beta_3) V_i \end{align}

and for onset consonants it will be:

\begin{align} \hat{C}_i &= \beta_0 + \beta_1 V_i + 0.5\beta_2 + 0.5\beta_3 V_i\\ \hat{C}_i &= (\beta_0 + 0.5\beta_2) + (\beta_1 + 0.5\beta_3) V_i \end{align}

In [30]:
cons['pros_ec'] = -1*cons['prosody']+.5 # effect code prosody: coda (prosody = 1) -> -0.5, onset (prosody = 0) -> +0.5
cons.head()
Out[30]:
voice vlab place plab manner mlab prosody slab noipow voipow condur vowdur token pros_ec
0 1 voiced 0 labial 0 stop 1 coda 142.09 117.84 108 214 1 -0.5
1 1 voiced 0 labial 0 stop 0 onset 138.55 119.30 15 184 1 0.5
2 1 voiced 1 alveolar 0 stop 1 coda 125.93 117.03 96 226 1 -0.5
3 1 voiced 1 alveolar 0 stop 0 onset 140.72 119.62 17 208 1 0.5
4 0 voiceless 0 labial 1 fricative 1 coda 142.80 111.24 220 168 1 -0.5
In [31]:
fit_c2 = smf.ols('condur ~ vowdur + pros_ec + vowdur:pros_ec', data=cons).fit()
print(fit_c2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 condur   R-squared:                       0.442
Model:                            OLS   Adj. R-squared:                  0.437
Method:                 Least Squares   F-statistic:                     83.47
Date:                Fri, 23 Mar 2018   Prob (F-statistic):           8.55e-40
Time:                        11:20:04   Log-Likelihood:                -1732.2
No. Observations:                 320   AIC:                             3472.
Df Residuals:                     316   BIC:                             3487.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept        148.6635     14.137     10.516      0.000     120.849     176.478
vowdur            -0.0466      0.055     -0.847      0.397      -0.155       0.062
pros_ec         -154.6420     28.274     -5.469      0.000    -210.271     -99.013
vowdur:pros_ec     0.2382      0.110      2.164      0.031       0.022       0.455
==============================================================================
Omnibus:                        6.439   Durbin-Watson:                   1.529
Prob(Omnibus):                  0.040   Jarque-Bera (JB):                5.942
Skew:                           0.276   Prob(JB):                       0.0513
Kurtosis:                       2.624   Cond. No.                     2.38e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.38e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
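
The two parameterizations are just different ways of writing the same model, so they produce identical fitted values (assuming fit_c and fit_c2 from above):

In [ ]:
np.allclose(fit_c.fittedvalues, fit_c2.fittedvalues) # True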

We can visualize the data and model prediction for onset and coda separately by specifying the col and hue arguments in lmplot:

In [32]:
sns.lmplot(x='vowdur', y='condur', data=cons, col='slab',
           hue='slab', size=8, scatter_kws={'s':100});

Multilevel linear regression

The statsmodels package also has the ability to fit 'mixed effects' linear models. I prefer to refer to this kind of model as a multilevel model. The basic idea is that if you have multiple observations within some grouping unit (e.g., multiple acoustic measurements for a given talker), you can simultaneously (a) fit a model to each unit's data and (b) fit a model governing variation across these units.

A more concrete example will probably help clarify. In this case, we'll be predicting consonant duration as a function of each consonant's voicing (voiceless = 0, voiced = 1) and manner of articulation (stop = 0, fricative = 1).

Recall that we just took the first observation of the consonant acoustic data above. The full data set has 10 repetitions of each of 8 consonants from each speaker (in each syllable position). In theory, we could fit a model to each set of repetitions from each speaker. If we did this, we would get beta coefficients for each subject. A multilevel model allows us to do this, while simultaneously modeling the variation in the beta coefficients across speakers.
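
To make the 'fit a model to each unit' idea concrete, here is a minimal sketch of fitting separate (unpooled) OLS models to each speaker's coda data, assuming cons_all as read in above; the multilevel model below does something like this while also modeling how the coefficients vary across speakers:

In [ ]:
# one OLS fit per speaker, collecting the coefficients in a data frame
coda = cons_all[cons_all['slab'] == 'coda']
per_subj = {s: smf.ols('condur ~ voice + manner', data=d).fit().params
            for s, d in coda.groupby('subject')}
pd.DataFrame(per_subj).T.head() # one row of (Intercept, voice, manner) per speaker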

Here's what the full data set looks like:

In [33]:
cons_all.head()
Out[33]:
consonant voice place manner prosody condur vowdur subject token noipow voipow vlab plab mlab slab
2105 b 1 0 0 1 108 214 1 1 142.09 117.84 voiced labial stop coda
3164 b 1 0 0 1 102 229 1 2 135.18 118.60 voiced labial stop coda
2521 b 1 0 0 1 104 222 1 3 141.30 115.59 voiced labial stop coda
1864 b 1 0 0 1 150 248 1 4 142.81 122.65 voiced labial stop coda
2580 b 1 0 0 1 146 266 1 5 137.66 117.91 voiced labial stop coda

We will fit a multilevel model to the coda data. In order to fit this model, we need a formula for the 'fixed effect' model (i.e., the group-level model) and a formula for the 'random effects' (i.e., the individual-level model). Note that the latter only refers to the independent variables, with a 1 indicating the intercept (which is included in the group-level model by default). Note, too, that we specify that subject is the grouping variable.

Note, too, that voice is coded 0 for voiceless and 1 for voiced consonants, and manner is coded 0 for stops and 1 for fricatives. This means that the overall group-level model is:

\begin{equation} \hat{C}_i = \beta_0 + \beta_1 voice + \beta_2 manner \end{equation}

So, for voiceless stops, this reduces to:

\begin{equation} \hat{C}_i = \beta_0 \end{equation}

whereas for voiced stops, it is:

\begin{equation} \hat{C}_i = \beta_0 + \beta_1 \end{equation}

and for voiceless fricatives, it is:

\begin{equation} \hat{C}_i = \beta_0 + \beta_2 \end{equation}

and, finally, for voiced fricatives, it is:

\begin{equation} \hat{C}_i = \beta_0 + \beta_1 + \beta_2 \end{equation}

In [34]:
cons_coda = cons_all.loc[cons_all['slab']=='coda',:].copy() # get just coda data

# specify "random effects" formula
#  this is the model fit to each subject's data
re_formula = '1 + voice + manner'

# specify "fixed effects" formula
#  this is the model governing the subject-level models
fe_formula = 'condur ~ voice + manner'

# create and fit model
fit_e = smf.mixedlm(formula=fe_formula, re_formula=re_formula,
                    groups=cons_coda['subject'], data=cons_coda).fit()
print(fit_e.summary())
                 Mixed Linear Model Regression Results
========================================================================
Model:                  MixedLM      Dependent Variable:      condur    
No. Observations:       1600         Method:                  REML      
No. Groups:             20           Scale:                   1387.9664 
Min. group size:        80           Likelihood:              -8116.2435
Max. group size:        80           Converged:               Yes       
Mean group size:        80.0                                            
------------------------------------------------------------------------
                          Coef.   Std.Err.    z    P>|z|  [0.025  0.975]
------------------------------------------------------------------------
Intercept                 190.318    7.441  25.577 0.000 175.733 204.902
voice                     -52.723    4.351 -12.116 0.000 -61.251 -44.194
manner                     36.257    5.229   6.934 0.000  26.008  46.507
Intercept RE             1055.335    9.699                              
Intercept RE x voice RE  -458.652    5.034                              
voice RE                  309.279    3.317                              
Intercept RE x manner RE -387.915    5.466                              
voice RE x manner RE      120.103    2.899                              
manner RE                 477.509    4.790                              
========================================================================
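
The fitted results object exposes the group-level ('fixed effect') estimates and the estimated covariance matrix of the speaker-level deviations directly (assuming fit_e from above):

In [ ]:
print(fit_e.fe_params) # group-level coefficients
print(fit_e.cov_re)    # covariance of the speaker-level deviations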

We can extract the estimates of the 'random effects', i.e., each individual speaker's deviations from the group-level beta coefficients:

In [35]:
re_df = pd.DataFrame(fit_e.random_effects).T
re_df.shape # n talkers, n model parameters
Out[35]:
(20, 3)
In [36]:
re_df.head()
Out[36]:
Intercept voice manner
1 -30.803304 3.624868 25.643651
2 18.659346 -11.154684 22.345822
3 2.351439 16.664398 31.169308
4 -35.370452 2.400945 30.389889
5 -26.445095 10.216490 -2.533453

We can then add the group-level coefficients back in and plot the distributions of the resulting speaker-level coefficients to visualize how the different speakers vary with respect to how voicing and manner influence consonant duration:

In [37]:
b0, b1, b2 = fit_e.params.loc[['Intercept','voice','manner']]
sns.pairplot(re_df + np.stack([b0,b1,b2]), size=3, plot_kws={'s':100},
             diag_kws={'histtype':'step', 'lw':3}, diag_kind='hist');

Finally, we can visualize the data using violin plots organized by speaker (panel), voicing (x-axis), and manner (color):

In [38]:
fig = plt.figure(figsize=(20,20))
for i in range(5):
    for j in range(4):
        idx = j + 4*i + 1
        ax = plt.subplot(5,4,idx)
        rows = np.in1d(cons_coda['subject'],idx)
        df_s = cons_coda.loc[rows,['condur','vlab','mlab']]
        sns.violinplot(y='condur', x='vlab', data=df_s, hue='mlab', ax=ax, split=True, order=['voiceless','voiced'])
        ax.set_yticks([])
        if idx not in [1,5,9,13,17]:
            ax.set_ylabel('')
        else:
            ax.set_ylabel('Con. Dur.')
        if idx not in [17,18,19,20]:
            ax.set_xlabel('')
            ax.set_xticklabels(['',''])
        else:
            ax.set_xlabel('Voicing')
        if idx < 20:
            ax.legend_.remove()
        else:
            ax.legend_.set_title('')
            
#fig.savefig('condur_by_svm.png',bbox_inches='tight')

The statsmodels package also allows us to get standard ANOVA results from an OLS model fit:

In [20]:
cons_onset = cons.loc[cons['slab']=='onset',:] # pull out onset data
# fit model with main effects and interactions for voicing, place, and manner
# 'condur ~ voice*place*manner' is the same as
#   'condur ~ voice + place + manner
#              + voice:place + voice:manner + place:manner
#              + voice:place:manner'
condur_vpm_fit = smf.ols('condur ~ voice*place*manner', data=cons_onset).fit()
condur_vpm_ANOVA = sm.stats.anova_lm(condur_vpm_fit)
print(np.round(condur_vpm_ANOVA,2))
                       df     sum_sq    mean_sq       F  PR(>F)
voice                 1.0  125160.16  125160.16  118.58    0.00
place                 1.0    7062.31    7062.31    6.69    0.01
voice:place           1.0      31.51      31.51    0.03    0.86
manner                1.0  196070.01  196070.01  185.77    0.00
voice:manner          1.0     500.56     500.56    0.47    0.49
place:manner          1.0    4070.31    4070.31    3.86    0.05
voice:place:manner    1.0     218.56     218.56    0.21    0.65
Residual            152.0  160429.35    1055.46     NaN     NaN
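
By default, anova_lm computes sequential (Type I) sums of squares; Type II or Type III sums of squares can be requested via the typ argument:

In [ ]:
print(np.round(sm.stats.anova_lm(condur_vpm_fit, typ=2), 2)) # Type II sums of squares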

Statsmodels also has functions for estimating group-level effects using generalized estimating equations (GEEs). GEEs provide a robust way to test group-level effects with clustered (i.e., multilevel structured) data even when the covariance structure of the data (conditional on the model) is misspecified. We didn't go into detail about the covariance structure in the 'mixed effect' model above, but such models can produce questionable results when that structure is misspecified.

Fitting a GEE model and viewing the results is very similar to fitting and viewing a mixed effect model:

In [40]:
form = 'condur ~ voice + manner'
fit_f = smf.gee(form, groups=cons_coda['subject'], data=cons_coda).fit()
print(fit_f.summary())
                               GEE Regression Results                              
===================================================================================
Dep. Variable:                      condur   No. Observations:                 1600
Model:                                 GEE   No. clusters:                       20
Method:                        Generalized   Min. cluster size:                  80
                      Estimating Equations   Max. cluster size:                  80
Family:                           Gaussian   Mean cluster size:                80.0
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Fri, 23 Mar 2018   Scale:                        2018.439
Covariance type:                    robust   Time:                         11:55:26
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    190.3175      7.254     26.235      0.000     176.099     204.536
voice        -52.7225      4.242    -12.429      0.000     -61.037     -44.408
manner        36.2575      5.098      7.112      0.000      26.266      46.249
==============================================================================
Skew:                          0.9565   Kurtosis:                       6.3993
Centered skew:                 0.9019   Centered kurtosis:              8.2110
==============================================================================
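
By default, the GEE above uses an independence working correlation structure (as shown in the summary); other structures can be specified with the cov_struct argument, e.g., an exchangeable structure (assuming form and cons_coda from above):

In [ ]:
fit_f2 = smf.gee(form, groups=cons_coda['subject'], data=cons_coda,
                 cov_struct=sm.cov_struct.Exchangeable()).fit()
print(fit_f2.summary())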

Generalized linear models

We can also fit generalized linear models by using the glm method rather than the ols method, in which case we have to specify exactly which kind of GLM we're fitting (via the family argument).

In this case, we will fit a logistic regression model to predict voicing (voiced = 1, voiceless = 0) as a function of vowel duration, syllable position, and the interaction between vowel duration and syllable position. Note that, using the formula syntax, vowdur*prosody is equivalent to vowdur + prosody + vowdur:prosody.

Generalized linear models allow us to fit models to dependent variables that are not normally distributed around the expected value (e.g., the fitted value in the OLS models we fit above). The key additional component of a GLM is the link function, which transforms the linear combination of variables to produce an appropriate expected value for the type of dependent variable in question.

For logistic regression, the dependent variable for each row of the data should be either a Bernoulli random variable (e.g., a variable indicating whether a single coin flip is heads = 1 or tails = 0) or a Binomial random variable, which is a sum of independent Bernoulli random variables (e.g., the number of heads $k$ out of a total of $N$ coin flips).

For a Bernoulli random variable $y$ with probability $\theta$ of a 'heads', we write:

\begin{equation} \Pr(y) = \theta^y (1-\theta)^{1-y} \end{equation}

If $y = 1$, this reduces to:

\begin{equation} \Pr(y) = \theta \end{equation}

And if $y = 0$, this reduces to:

\begin{equation} \Pr(y) = (1-\theta) \end{equation}

The link function for logistic regression is the logit (log odds). Its inverse, the logistic function shown below, takes a linear combination of the independent variables $X_i$ and maps it onto the interval (0,1), ensuring that the expected value of the dependent variable $y_i$ is a reasonable value for a probability:

\begin{equation} \Pr(y_i = 1) = \frac{1}{1+\exp\left(-\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \right)\right)} \end{equation}

Since we know how to make arrays of values and plots to illustrate functions of those values, let's do that with the logistic link function:

In [41]:
def logistic(x):
    return 1/(1+np.exp(-x))

x = np.linspace(-5,5,500)
y1 = logistic(0+1*x); y2 = logistic(-1+1*x); y3 = logistic(3+1*x)
y4 = logistic(0+2*x); y5 = logistic(0+.35*x)

fig, axes = plt.subplots(1,2,figsize=(15,15))
ax1, ax2 = axes; ax1.set_aspect(10); ax2.set_aspect(10)
ax1.plot(x,y1,color=blue,lw=4); ax1.plot(x,y2,color=red,lw=4); ax1.plot(x,y3,color=green,lw=4)
ax2.plot(x,y1,color=blue,lw=4); ax2.plot(x,y4,color=red,lw=4); ax2.plot(x,y5,color=green,lw=4)
ax1.set_ylabel(r'$\theta$'); ax1.set_xlabel('x'); ax2.set_xlabel('x');

We can use the glm method to create and fit a logistic regression model as follows:

In [44]:
cons.head()
Out[44]:
voice vlab place plab manner mlab prosody slab noipow voipow condur vowdur token pros_ec
0 1 voiced 0 labial 0 stop 1 coda 142.09 117.84 108 214 1 -0.5
1 1 voiced 0 labial 0 stop 0 onset 138.55 119.30 15 184 1 0.5
2 1 voiced 1 alveolar 0 stop 1 coda 125.93 117.03 96 226 1 -0.5
3 1 voiced 1 alveolar 0 stop 0 onset 140.72 119.62 17 208 1 0.5
4 0 voiceless 0 labial 1 fricative 1 coda 142.80 111.24 220 168 1 -0.5
In [42]:
fit_d = smf.glm('voice ~ vowdur*prosody', data=cons, family=sm.families.Binomial()).fit()
print(fit_d.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  voice   No. Observations:                  320
Model:                            GLM   Df Residuals:                      316
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -190.39
Date:                Fri, 23 Mar 2018   Deviance:                       380.78
Time:                        12:02:56   Pearson chi2:                     343.
No. Iterations:                     5                                         
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -2.0607      0.826     -2.494      0.013      -3.680      -0.442
vowdur             0.0081      0.003      2.535      0.011       0.002       0.014
prosody           -4.4607      1.381     -3.231      0.001      -7.167      -1.755
vowdur:prosody     0.0187      0.006      3.399      0.001       0.008       0.030
==================================================================================

The coefficients of a logistic regression model are log odds ratios. The odds of a dichotomous event are the ratio of the probability of the event occurring to the probability of it not occurring:

\begin{equation} odds = \frac{p}{1-p} \end{equation}

The odds ratio is a ratio of odds (e.g., for probabilities $p_1$ and $p_2$ at two different levels of a predictor):

\begin{equation} OR = \frac{\frac{p_1}{1-p_1}}{\frac{p_2}{1-p_2}} \end{equation}

Here is a notebook all about why odds ratios are uninterpretable and bad. It also briefly discusses a more intuitive way to interpret logistic regression coefficients.
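
That said, since the coefficients are log odds ratios, exponentiating the estimates (and confidence interval bounds) converts them to odds ratios (assuming fit_d from above):

In [ ]:
print(np.exp(fit_d.params))     # odds ratios
print(np.exp(fit_d.conf_int())) # confidence intervals on the odds ratio scale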

Seaborn can plot data and fitted values for logistic regression, too:

In [46]:
sns.lmplot(x='vowdur', y='voice', data=cons, hue='slab', col='slab',
           logistic=True, size=8, y_jitter=.025, scatter_kws={'s':100});

As with the linear model, we can generate predicted means (probabilities) for unobserved independent variable values:

In [47]:
vowdur_new = np.concatenate((np.<