In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import pystan as ps
from scipy.stats import gaussian_kde as kde
import pickle
from glob import glob

sns.set(style='darkgrid', font_scale=2)
blue, green, red, purple = sns.color_palette('deep', n_colors=4)

Bayesian Model Fitting with Stan

As we saw last week, with very simple models, we can estimate the posterior distribution for a model parameter (or maybe a couple parameters) using grid approximation. With more complex (and realistic and useful) models, grid approximation is too computationally expensive.

Thankfully, there are now multiple powerful software packages that allow us to specify and fit Bayesian models to data. We'll be using Stan, via its Python interface PyStan (you can find installation instructions here).

Like other Bayesian model fitting packages, Stan fits a model by sampling from the posterior distribution of the parameter(s) $\theta$, conditional on data $D$. Recall that Bayes' rule shows how the posterior $\Pr(\theta|D)$ is related to the likelihood $\Pr(D|\theta)$ and the prior $\Pr(\theta)$:

\begin{equation} \Pr(\theta | D ) \propto \Pr(D|\theta)\Pr(\theta) \end{equation}

Different packages use different algorithms for sampling from the posterior. Stan is relatively new, and it is very powerful and efficient. It also has a very active community of developers, so it's very likely to continue improving.

To use Stan, you need to specify a model in a specific format. With PyStan, you put your data in a dictionary in order to fit the model. When the model is fit, you get back (among other things) samples that (hopefully) closely approximate the full posterior distribution of your model's parameters.

You can then use these samples to visualize the fitted model, generate predictions, and calculate summary statistics.

We'll go back to the consonant acoustics data we've looked at a couple times and illustrate a simple linear regression model and a multilevel linear regression model, predicting consonant noise power as a function of consonant duration.

First, we'll read the data in. The full data set has 10 repetitions of each of multiple consonants produced in syllable onset and syllable coda position by each of 20 speakers. For the simple linear regression model, we'll just take the first observation of each coda consonant from each speaker:

In [56]:
cons_all = pd.read_csv('consonant_acoustics_subset_labels.csv') # read in data file
cons_coda = cons_all.loc[cons_all['slab']=='coda',:].copy() # just coda data
cons_coda.sort_values(by=['subject','consonant','token'], inplace=True) # sort the data
cons_first = cons_coda.groupby(['subject','consonant']).first() # take just the first repetition
cons_first.reset_index(level=['subject','consonant'], inplace=True) # make index into columns
cols = ['voice','vlab','place','plab','manner','mlab','noipow','voipow','condur','vowdur','token']
cons = cons_first[cols].copy() # make a copy, with just the subset of columns in cols
cons.head(n=10)
Out[56]:
voice vlab place plab manner mlab noipow voipow condur vowdur token
0 1 voiced 0 labial 0 stop 142.09 117.84 108 214 1
1 1 voiced 1 alveolar 0 stop 125.93 117.03 96 226 1
2 0 voiceless 0 labial 1 fricative 142.80 111.24 220 168 1
3 0 voiceless 0 labial 0 stop 145.16 120.06 143 168 1
4 0 voiceless 1 alveolar 1 fricative 152.83 113.49 215 193 1
5 0 voiceless 1 alveolar 0 stop 155.64 110.30 133 186 1
6 1 voiced 0 labial 1 fricative 133.61 108.47 150 285 1
7 1 voiced 1 alveolar 1 fricative 156.10 120.27 213 243 1
8 1 voiced 0 labial 0 stop 140.56 116.60 210 212 1
9 1 voiced 1 alveolar 0 stop 140.89 118.72 189 265 1

We'll extract and standardize the noise power and consonant duration data as numpy arrays, stacking the latter with a column of ones so that we can fit a model with an intercept term. Let $y$ be the $N_{obs}$ noise power observations, let $x$ be the corresponding $N_{obs}$ consonant duration observations, and let $\boldsymbol{\beta} = \left(\beta_0, \beta_1\right)^T$ be a length 2 vector of regression coefficients.

Then the simplest expression of the model is:

\begin{align} \hat{y} &= \beta_0 + \beta_1 x\\ y &\sim \mathrm{normal}(\hat{y}, \sigma)\\ \beta_0 &\sim \mathrm{normal}(0,1)\\ \beta_1 &\sim \mathrm{normal}(0,1)\\ \sigma &\sim \mathrm{lognormal}(1,2) \end{align}

We will look at this model written for Stan in a separate file.
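
We won't reproduce the actual file here, but a minimal sketch of what lin_reg_model.stan might contain (an assumption based on the model statement above and on the parameter names we'll see in the fitted object) is below; PyStan can also compile such code directly from a string via the model_code argument:

lin_reg_code = """
data {
  int<lower=1> Nobs;            // number of observations
  int<lower=1> Nprd;            // number of predictors (intercept + slope)
  matrix[Nobs, Nprd] X;         // design matrix: column of ones, then condur
  vector[Nobs] y;               // standardized noise power
}
parameters {
  vector[Nprd] beta;            // regression coefficients
  real<lower=0> sigma;          // error SD
}
transformed parameters {
  vector[Nobs] y_hat;
  y_hat = X * beta;             // fitted values
}
model {
  beta ~ normal(0, 1);
  sigma ~ lognormal(1, 2);
  y ~ normal(y_hat, sigma);
}
"""
# model_lin_reg = ps.StanModel(model_code=lin_reg_code)  # alternative to file='lin_reg_model.stan'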

We can visualize the prior on $\sigma$ to get a sense of what this model is assuming. We define our own density function to match the lognormal parameterization that Stan uses (the mean and SD of the log of the variable); scipy.stats.lognorm is the same distribution, but it is parameterized with a shape and a scale instead:

In [57]:
def lognorm(x, mu=1, s=2):
    return (1/(x*s*np.sqrt(2*np.pi)))*np.exp(-((np.log(x)-mu)**2)/(2*s**2))
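
As a quick sanity check on the parameterization (a sketch: scipy's shape argument is the SD of the log, and its scale is exp of the mean of the log), the two densities should agree:

from scipy.stats import lognorm as sp_lognorm
xv = np.linspace(0.01, 5, 50)
np.allclose(lognorm(xv, mu=1, s=2), sp_lognorm.pdf(xv, s=2, scale=np.exp(1))) # True
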
In [58]:
sgv = np.linspace(0.001,2.5,1000)
fig, ax = plt.subplots(1, 1, figsize=(15,5))
ax.plot(sgv, lognorm(sgv, mu=1, s=2));

Standardizing the data:

In [59]:
y = cons['noipow'].as_matrix() # noipow = dependent/endogenous variable
y = (y-y.mean())/y.std() # standardize
x0 = np.ones(y.shape[0]) # for just estimating an intercept
x1 = cons['condur'].as_matrix() # extract consonant duration
x1 = (x1-x1.mean())/x1.std() # standardize
X = np.stack((x0,x1),axis=1) # stack the column of ones (intercept) with condur (ind/exog variable)
X[:5,:]
Out[59]:
array([[ 1.        , -1.43407088],
       [ 1.        , -1.65637154],
       [ 1.        ,  0.64073536],
       [ 1.        , -0.78569393],
       [ 1.        ,  0.54811009]])

Next, we prepare the data for Stan. Note that the keys in the dictionary have to match the names of the variables in the model.

In [60]:
Nobs = len(y)
Nprd = X.shape[1]

dd = {'y':y, 'X':X, 'Nobs':Nobs, 'Nprd':Nprd}

Next, we compile the model. This can take a while, so once we are satisfied with the model, we can save the compiled model and just reload it whenever we want to fit it again (e.g., to a new data set). We can use glob to get a list of the .pkl files in the directory, then check whether the model's pickle name is in that list. If not, we compile the model and pickle it under that name for future checking and use. If so, then we just load it.

In [61]:
pkl_list = glob('*.pkl') # get a list of all .pkl files in the directory

model_lin_reg_pkl = 'mod_lin_reg.pkl' # name of the model file

if model_lin_reg_pkl in pkl_list:
    with open(model_lin_reg_pkl,'rb') as f:
        model_lin_reg = pickle.load(f)
else:
    model_lin_reg = ps.StanModel(file='lin_reg_model.stan') # model compilation
    with open(model_lin_reg_pkl,'wb') as f:
        pickle.dump(model_lin_reg,f)

Next, we set a few variables that will tell Stan how many samples to take from the posterior (n_iter), how many from the beginning of the sampling to throw away (warm_up), how much the chains of samples should be thinned to reduce autocorrelation (thin), how many chains of samples to get (n_chain), and how many cores on the computer to use (n_jobs).

Then we use the sampling() method of the model object we created above to fit the model:

In [62]:
n_iter = 1000
warm_up = 500
thin = 1
n_chain = 4
n_jobs = 1

fit_lin_reg = model_lin_reg.sampling(data=dd, chains=n_chain, iter=n_iter, warmup=warm_up, thin=thin, n_jobs=n_jobs)

Once the model is fit, we can get the samples using the extract() method of the fitted model object. If we specify permuted=False, this will give us all of our chains separately, which is useful for checking various diagnostics (e.g., to make sure that the multiple chains are mixed, i.e., sampling similar distributions of values for each parameter; we'll look at some diagnostics in more detail outside of this notebook):

In [63]:
traces = fit_lin_reg.extract(permuted=False, inc_warmup=False)
par_names = fit_lin_reg.flatnames
par_names[:7]
WARNING:root:`dtypes` ignored when `permuted` is False.
Out[63]:
['beta[0]', 'beta[1]', 'sigma', 'y_hat[0]', 'y_hat[1]', 'y_hat[2]', 'y_hat[3]']

The traces object has 500 rows (1000 iterations minus 500 warm-up samples), four columns (one for each chain), and 164 layers (the first three for the model parameters, the next 160 for the y_hat values, and the last for the log probability of the model):

In [64]:
traces.shape
Out[64]:
(500, 4, 164)
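
The layer count checks out: three model parameters, one y_hat value per observation, and the log probability (a quick check):

3 + Nobs + 1 == traces.shape[2] # 3 + 160 + 1 = 164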

We can extract the intercept, slope, and error SD samples, using the locations of the parameter names we want in par_names:

In [65]:
beta0, beta1, sigma = traces[:,:,0], traces[:,:,1], traces[:,:,2]
beta1.shape
Out[65]:
(500, 4)

Here are some simple trace plots (for visualizing the chains) and kernel density estimates (for visualizing the distribution of the samples within each chain):

In [66]:
fig, axes = plt.subplots(2, 3, figsize=(20,10))
ax_beta0_ch, ax_beta1_ch, ax_sigma_ch = axes[0,:]
ax_beta0_kd, ax_beta1_kd, ax_sigma_kd = axes[1,:]
beta0_xv = np.linspace(-.5,.5,100)
beta1_xv = np.linspace(-.25,.75,100)
sigma_xv = np.linspace(0.5,1.5,100)
for chi in range(n_chain):
    ax_beta0_ch.plot(beta0[:,chi],alpha=.5)
    ax_beta0_ch.set(xlabel='sample')
    ax_beta1_ch.plot(beta1[:,chi],alpha=.5)
    ax_beta1_ch.set(xlabel='sample')
    ax_sigma_ch.plot(sigma[:,chi],alpha=.5)
    ax_sigma_ch.set(xlabel='sample')
    ax_beta0_kd.plot(beta0_xv,kde(beta0[:,chi]).pdf(beta0_xv),alpha=.75)
    ax_beta0_kd.set(xlabel=r'$\hat{\beta}_0$')
    ax_beta1_kd.plot(beta1_xv,kde(beta1[:,chi]).pdf(beta1_xv),alpha=.75)
    ax_beta1_kd.set(xlabel=r'$\hat{\beta}_1$')
    ax_sigma_kd.plot(sigma_xv,kde(sigma[:,chi]).pdf(sigma_xv),alpha=.75)
    ax_sigma_kd.set(xlabel=r'$\hat{\sigma}$')
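
If we want a number to go with the visual check, we can compute a basic Gelman-Rubin statistic from the separated chains (a simplified sketch of the split R-hat that Stan itself reports when you print the fit object):

def gelman_rubin(chains):
    # chains: array of shape (n_samples, n_chains) for a single parameter
    n = chains.shape[0]
    W = chains.var(axis=0, ddof=1).mean()   # mean within-chain variance
    B = n * chains.mean(axis=0).var(ddof=1) # between-chain variance
    var_hat = (n-1)/n * W + B/n             # pooled posterior variance estimate
    return np.sqrt(var_hat / W)             # values near 1 indicate good mixing

[gelman_rubin(p) for p in (beta0, beta1, sigma)]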

If we are satisfied that the chains are well-mixed, we can then extract the samples with the chains pooled together (this is what extract() does by default, with permuted=True). We can use these samples to generate predictions, visualize the fitted model, and analyze the posterior distributions of the parameters to draw inferences about the relationship between consonant duration and noise power.

The object samples is a dictionary, with keys corresponding to variable names:

In [67]:
samples = fit_lin_reg.extract()

There are two $\beta$ parameters - the intercept and the slope - and there are 2000 pooled samples (500 post-warm-up samples from each of the four chains):

In [68]:
beta = samples['beta']
beta.shape
Out[68]:
(2000, 2)

There is just one $\sigma$ parameter, also with 2000 samples:

In [69]:
sigma = samples['sigma']
sigma.shape
Out[69]:
(2000,)
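
Before visualizing, a quick numerical summary of the slope can be useful (a sketch: posterior mean, a 95% credible interval from the sample percentiles, and the posterior probability that the slope is positive):

b1 = beta[:,1] # slope samples
b1.mean(), np.percentile(b1, [2.5, 97.5]), (b1 > 0).mean()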

Let's visualize the posterior distribution of fitted regression lines (i.e., what the model thinks the relationship between consonant duration and noise power looks like):

In [70]:
Xv = np.linspace(-3,3,250) # hypothetical consonant duration values
n_smp = 1000 # number of samples to extract and visualize
y_hat = np.zeros((n_smp,250)) # initialize fitted y value array
s_idx = np.random.permutation(beta.shape[0])[:n_smp] # randomly pick indices for our n_smp samples
for i,s in enumerate(s_idx):
    y_hat[i,:] = beta[s,0] + beta[s,1]*Xv
In [71]:
fig, ax = plt.subplots(1, 1, figsize=(8,8))
ax.plot(Xv, y_hat.T, lw=3, color=red, alpha=.1)
ax.plot(X[:,1], y, 'o', color=blue, ms=8, alpha=.75)
ax.set(xlabel='consonant duration (z)', ylabel='noise power (z)');

Next, we'll generate predictions: for a sample of these fitted lines, we use the line as the expected value of y and draw predicted y values (z-transformed noise power), incorporating our estimates of $\sigma$ as well:

In [79]:
n_smp = 1000 # number of samples to pick out from fitted model
n_pred = 50 # number of x values at which to predict y
X_pred = np.linspace(-3,3,n_pred) # array of x values
y_pred = np.zeros((n_smp,n_pred)) # initialize for storing predictions
s_idx = np.random.permutation(beta.shape[0])[:n_smp]
for i,s in enumerate(s_idx):
    yht = beta[s,0] + beta[s,1]*X_pred
    sgt = sigma[s]
    y_pred[i,:] = np.random.normal(loc=yht, scale=sgt)

For plotting purposes, we will get the middle 80% of the distribution of predicted y values:

In [80]:
y_up, y_lo = np.percentile(y_pred, q=90, axis=0), np.percentile(y_pred, q=10, axis=0)

Then we can plot the results:

In [81]:
fig, ax = plt.subplots(1, 1, figsize=(8,8))
ax.fill_between(x=X_pred, y1=y_lo, y2=y_up, color=red, alpha=.5)
ax.plot(X[:,1], y, 'o', color=blue, ms=8, alpha=.85)
ax.set(xlabel='consonant duration (z)', ylabel='noise power (z)');
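
As a rough check on calibration (a sketch, interpolating the band at the observed durations), we can ask what fraction of the observed points fall inside this 80% interval:

lo_at_x = np.interp(X[:,1], X_pred, y_lo) # lower edge of the band at each observed x
up_at_x = np.interp(X[:,1], X_pred, y_up) # upper edge of the band at each observed x
((y >= lo_at_x) & (y <= up_at_x)).mean()  # should be roughly 0.8 if well calibrated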

Recall that we extracted just the first observation for each subject's production of each consonant above. We can illustrate the specification, fitting, and evaluation of a multilevel model with the full set of coda production data (i.e., all 10 repetitions of each consonant for each speaker). This model is more complicated, but it more accurately represents the structure of the data. The subscript $s$ is an index for speaker, and the subscript $t$ for token (or repetition) within speaker:

\begin{align} \hat{y}_{st} &= \beta_{0s} + \beta_{1s} x_{st}\\ y_{st} &\sim \mathrm{normal}(\hat{y}_{st}, \sigma_{s})\\ \beta_{0s} &\sim \mathrm{normal}(\mu_0,\tau_0)\\ \beta_{1s} &\sim \mathrm{normal}(\mu_1,\tau_1)\\ \sigma_s &\sim \mathrm{lognormal}(1,2)\\ \mu_0 &\sim \mathrm{normal}(0,1)\\ \mu_1 &\sim \mathrm{normal}(0,1)\\ \tau_0 &\sim \mathrm{lognormal}(1,2)\\ \tau_1 &\sim \mathrm{lognormal}(1,2) \end{align}

The basic idea here is that each speaker's data gets its own linear regression model (i.e., a pair of $\beta$ parameters and a $\sigma$ parameter), and then the $\beta$ parameters are themselves modeled as random variables governed by $\mu$ and $\tau$ parameters.
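
As above, the Stan code lives in a separate file (lin_reg_ml_model.stan). A minimal sketch of what it might contain (again an assumption, based on the model statement and on the parameter names in the fitted object, and assuming the data are passed as per-speaker arrays, which is how we prepare them below):

lin_reg_ml_code = """
data {
  int<lower=1> Nsub;              // number of speakers
  int<lower=1> Nobs;              // observations per speaker
  int<lower=1> Nprd;              // predictors (intercept + slope)
  matrix[Nobs, Nprd] X[Nsub];     // one design matrix per speaker
  vector[Nobs] y[Nsub];           // one response vector per speaker
}
parameters {
  vector[Nprd] beta[Nsub];        // per-speaker coefficients
  vector[Nprd] mu;                // group-level means
  vector<lower=0>[Nprd] tau;      // group-level SDs
  real<lower=0> sigma[Nsub];      // per-speaker error SDs
}
transformed parameters {
  vector[Nobs] y_hat[Nsub];
  for (s in 1:Nsub)
    y_hat[s] = X[s] * beta[s];    // per-speaker fitted values
}
model {
  mu ~ normal(0, 1);
  tau ~ lognormal(1, 2);
  sigma ~ lognormal(1, 2);
  for (s in 1:Nsub) {
    beta[s] ~ normal(mu, tau);    // per-speaker coefficients drawn from group-level normal
    y[s] ~ normal(y_hat[s], sigma[s]);
  }
}
"""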

In [21]:
cons_coda.head()
Out[21]:
consonant voice place manner prosody condur vowdur subject token noipow voipow vlab plab mlab slab
2105 b 1 0 0 1 108 214 1 1 142.09 117.84 voiced labial stop coda
3164 b 1 0 0 1 102 229 1 2 135.18 118.60 voiced labial stop coda
2521 b 1 0 0 1 104 222 1 3 141.30 115.59 voiced labial stop coda
1864 b 1 0 0 1 150 248 1 4 142.81 122.65 voiced labial stop coda
2580 b 1 0 0 1 146 266 1 5 137.66 117.91 voiced labial stop coda
In [22]:
cols = ['noipow','voipow','condur','vowdur','subject','token']
cons = cons_coda[cols].copy() # make a copy, with just the subset of columns in cols
cons.head()
Out[22]:
noipow voipow condur vowdur subject token
2105 142.09 117.84 108 214 1 1
3164 135.18 118.60 102 229 1 2
2521 141.30 115.59 104 222 1 3
1864 142.81 122.65 150 248 1 4
2580 137.66 117.91 146 266 1 5

Prepare the data for Stan:

In [23]:
Nobs = np.sum(np.in1d(cons['subject'],1)) # observations per subject (assumes a balanced design)
Nsub = len(cons['subject'].unique())
Nprd = 2

yv = cons['noipow'].as_matrix() # noipow = dependent/endogenous variable
yv = (yv-yv.mean())/yv.std() # standardize
y = yv.reshape((Nsub,Nobs)) # one row per subject (relies on the sort by subject, consonant, token above)

X = np.zeros((Nsub,Nobs,Nprd))
xv = cons['condur'].as_matrix() # extract consonant duration
xv = (xv-xv.mean())/xv.std() # standardize
x1 = xv.reshape((Nsub,Nobs)) # one row per subject
x0 = np.ones(Nobs) # for just estimating an intercept
for i in range(Nsub):
    X[i,:,:] = np.stack((x0,x1[i,:]),axis=1)

dd = {'y':y, 'X':X, 'Nobs':Nobs, 'Nprd':Nprd, 'Nsub':Nsub}

Compile or load the model, as we did above:

In [24]:
model_ml_pkl = 'lin_reg_ml_model.pkl'

if model_ml_pkl in pkl_list:
    with open(model_ml_pkl,'rb') as f:
        model_lr_ml = pickle.load(f)
else:
    model_lr_ml = ps.StanModel(file='lin_reg_ml_model.stan') # model compilation
    with open(model_ml_pkl,'wb') as f:
        pickle.dump(model_lr_ml,f)

Fit the model (note the changes in these parameters):

In [25]:
n_iter = 11000
warm_up = 1000
thin = 10
n_chain = 4
n_jobs = 1

fit_lr_ml = model_lr_ml.sampling(data=dd, chains=n_chain, iter=n_iter, warmup=warm_up, thin=thin, n_jobs=n_jobs)
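
With these settings, each chain keeps (11000 - 1000) / 10 = 1000 post-warm-up samples, which is what we will see in the shape of the traces below.
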
In [26]:
traces = fit_lr_ml.extract(permuted=False)
par_names = fit_lr_ml.flatnames
par_names[37:68]
WARNING:root:`dtypes` ignored when `permuted` is False.
Out[26]:
['beta[17,1]',
 'beta[18,1]',
 'beta[19,1]',
 'mu[0]',
 'mu[1]',
 'sigma[0]',
 'sigma[1]',
 'sigma[2]',
 'sigma[3]',
 'sigma[4]',
 'sigma[5]',
 'sigma[6]',
 'sigma[7]',
 'sigma[8]',
 'sigma[9]',
 'sigma[10]',
 'sigma[11]',
 'sigma[12]',
 'sigma[13]',
 'sigma[14]',
 'sigma[15]',
 'sigma[16]',
 'sigma[17]',
 'sigma[18]',
 'sigma[19]',
 'tau[0]',
 'tau[1]',
 'y_hat[0,0]',
 'y_hat[1,0]',
 'y_hat[2,0]',
 'y_hat[3,0]']
In [27]:
traces.shape
Out[27]:
(1000, 4, 1665)

We can get the variable indices of interest:

In [28]:
beta_idx = [i for i,j in enumerate(par_names) if 'beta' in j]
mu_idx = [i for i,j in enumerate(par_names) if 'mu' in j]
sg_idx = [i for i,j in enumerate(par_names) if 'sigma' in j]
tau_idx = [sg_idx[-1]+1, sg_idx[-1]+2] # tau[0] and tau[1] come right after the last sigma in flatnames

We will only plot the chains for the group-level parameters (i.e., $\mu$ and $\tau$) here, but typically we would do this for all of the parameters (and probably save the plots to a file rather than having a ton of figure windows open):

In [29]:
fig, axes = plt.subplots(2,2,figsize=(15,15))
ax_mu0, ax_mu1, ax_tau0, ax_tau1 = axes.ravel()
for i in range(4):
    ax_mu0.plot(traces[:,i,mu_idx[0]],alpha=.75)
    ax_mu0.set_title(r'$\mu_0$')
    ax_mu1.plot(traces[:,i,mu_idx[1]],alpha=.75)
    ax_mu1.set_title(r'$\mu_1$')
    ax_tau0.plot(traces[:,i,tau_idx[0]],alpha=.75)
    ax_tau0.set_title(r'$\tau_0$')
    ax_tau1.plot(traces[:,i,tau_idx[1]],alpha=.75)
    ax_tau1.set_title(r'$\tau_1$')

Since the chains seem to be well-mixed, we extract the samples:

In [30]:
samples = fit_lr_ml.extract()
samples.keys()
Out[30]:
odict_keys(['beta', 'mu', 'sigma', 'tau', 'y_hat', 'lp__'])
In [31]:
beta = samples['beta']
beta.shape
Out[31]:
(4000, 20, 2)

Visualize the fitted models and data:

In [32]:
n_smp = 250
s_idx = np.random.permutation(beta.shape[0])[:n_smp]
Xv = np.linspace(1.1*X.min(),1.1*X.max(),250)

fig, axes = plt.subplots(5, 4, figsize=(20,20))

for i,axt in enumerate(axes.ravel()):
    for j,s in enumerate(s_idx):
        yht = beta[s,i,0] + beta[s,i,1]*Xv
        axt.plot(Xv,yht,lw=2,color=red,alpha=.1)
    axt.plot(X[i,:,1],y[i,:],'o',ms=8,color=blue,alpha=.75)
    axt.set_xticks([-2,0,2,4,6,8])
    axt.set(ylim=(1.1*y.min(),1.1*y.max()))
    if i >= 16:
        axt.set(xlabel='con dur (z)')
    else:
        axt.set_xticklabels([])
    if i in [0,4,8,12,16]:
        axt.set(ylabel='noi pow (z)')
    else:
        axt.set_yticklabels([])
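
Finally, as with the simple model, we can summarize the group-level slope numerically (a sketch: posterior mean, a 95% credible interval, and the posterior probability that the group-level slope is positive):

mu = samples['mu'] # group-level coefficient samples, shape (4000, 2)
mu[:,1].mean(), np.percentile(mu[:,1], [2.5, 97.5]), (mu[:,1] > 0).mean()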