Modules & Packages

Modules are files (ending in .py) containing Python definitions (functions, classes) and statements. You can import modules into other modules, interactive sessions, notebooks, etc. It can be convenient to put your own custom functions into modules, and there are many useful, freely available modules written by others.

Packages are (roughly speaking) collections of modules.
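
For example (a minimal sketch with hypothetical names), you might put a custom function of your own into a file called mystats.py:

# mystats.py -- a hypothetical module containing one custom function
def mean_of(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

From another script, notebook, or interactive session (run from the same directory), you could then do import mystats and call mystats.mean_of([1, 2, 3]). A package is, roughly, a folder of such modules (usually with an __init__.py file) that you import the same way.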

A large number of packages and modules are installed with the Anaconda distribution that you downloaded. In particular, SciPy, NumPy, Pandas, and Matplotlib all come with Anaconda. We will be using (bits and pieces) of these (and other packages) throughout the semester.

Anaconda also comes with a package manager called Conda, and Python comes with a package manager called pip. Both have extensive online documentation. We will talk about both at various points during the semester.
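
For example, installing (or updating) a package from a terminal typically looks like one of the following (the exact command depends on your setup):

conda install numpy
pip install numpy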

You can import a package (or a module) like this:

In [1]:
import numpy

After doing this, everything inside NumPy is available as an attribute of numpy: a function, a class, or a module nested inside NumPy (sometimes multiple layers down). For example, we can create a NumPy array (technically, an ndarray, for "n-dimensional array"). An array is a very useful, very frequently used data type for containing (typically) numerical data.

To create an array, we can call the function array() and give it a list of numbers as input:

In [2]:
x = numpy.array([10,20,30,40,50])
type(x)
Out[2]:
numpy.ndarray
In [3]:
x
Out[3]:
array([10, 20, 30, 40, 50])

Arrays support indexing and slicing just like lists and tuples:

In [4]:
x[1]
Out[4]:
20
In [5]:
x[1:3]
Out[5]:
array([20, 30])

You can change (typically to abbreviate) the name of an imported module by importing it as whatever name you prefer. The standard abbreviation for numpy is np:

In [6]:
import numpy as np
y = np.array([30,40,50,60])
y
Out[6]:
array([30, 40, 50, 60])

Note that when we import a module as something else, we use that new name to access its contents (functions, submodules, etc.).

Importing functions

If you just want a specific subset of the functions from a package or module, you can import just those functions. For example, we can import just the functions array() (creates an array), zeros() (creates an array filled with zeros), ones() (creates an array filled with ones), arange() (creates an array with a specified sequence of integers), and linspace() (creates an array with equally spaced numbers over a specified range):

In [7]:
from numpy import array, zeros, ones, arange, linspace
%whos
Variable   Type                          Data/Info
--------------------------------------------------
arange     builtin_function_or_method    <built-in function arange>
array      builtin_function_or_method    <built-in function array>
linspace   function                      <function linspace at 0x10400d378>
np         module                        <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
numpy      module                        <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
ones       function                      <function ones at 0x103fbb488>
x          ndarray                       5: 5 elems, type `int64`, 40 bytes
y          ndarray                       4: 4 elems, type `int64`, 32 bytes
zeros      builtin_function_or_method    <built-in function zeros>

Now we can use these functions without prepending np.

The function zeros() creates an array of zeros. It takes one mandatory argument specifying the shape of the array, and it has two optional arguments (use ? and run the code cell or type shift+tab to see the documentation). It can be convenient to use zeros() to create an array that you will fill up with values later on.

In [8]:
zeros?
In [9]:
z = zeros(shape=10)
z
Out[9]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

The function ones() creates an array of ones (it has the same arguments as zeros()). If you want to create a multidimensional array of ones (or zeros), the input argument should be a tuple indicating the size of each dimension.

In [10]:
o = ones((2,10))
print(o)
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

The function arange() creates an array containing numbers beginning at start, ending at (but not including) stop, in steps of size step. Note that you could also specify start, stop, and step using positional arguments (in that order).

In [11]:
arange?
In [12]:
ii = arange(start=0,stop=20,step=2)
ii
Out[12]:
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
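
As noted above, the positional form produces the same array:

arange(0, 20, 2) # same as arange(start=0, stop=20, step=2)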

The function linspace() creates an array of num equally-spaced numbers beginning at start, ending at (and including) stop (it also has some other optional arguments; use ? to see what these are):

In [13]:
linspace?
In [14]:
jj = linspace(start=0,stop=1,num=11)
jj
Out[14]:
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

Importing functions from a nested module

If you just want specific functions from a module nested within a package, you can import them from that nested module. Here we'll import three random number generator functions from the random module inside NumPy:

In [15]:
from numpy.random import random, randint, normal

The random module contains a number of functions for generating different types of (pseudo-)random numbers. Here, we've imported the function random(), which generates uniformly distributed random numbers between 0 and 1 (not including 1), along with the function randint(), which generates random integers over a specified range (from low up to, but not including, high), and the function normal(), which generates normally distributed random numbers.

Here's the function random() in action:

In [16]:
up = 15 # upper limit
lo = 2 # lower limit
r = (up-lo)*random(size=(20,500)) + lo
r
Out[16]:
array([[ 7.94132624, 13.08076223, 12.21719033, ..., 13.34411432,
         4.84929464,  5.69486115],
       [ 2.0036704 , 10.14837733, 14.20263652, ..., 13.88057598,
        14.15165213,  7.99103753],
       [ 7.16143541,  2.958898  ,  6.60069133, ...,  4.18922231,
         3.45008825, 13.69239595],
       ...,
       [ 4.55965038,  6.92329164,  5.81386236, ..., 11.47587142,
        10.27499075,  4.45780261],
       [ 5.80492843,  7.03107547, 14.31097305, ...,  8.25381883,
         9.96045796,  3.57263132],
       [ 6.15557969, 14.62466346, 11.54366317, ...,  7.96616107,
        12.23767425,  2.86757766]])

Here's randint() in action:

In [17]:
lo_int = 5 # lower limit
hi_int = 65 # upper limit
s = randint(low=lo_int, high=hi_int, size=(200,50))
s
Out[17]:
array([[15, 56, 24, ...,  5, 64, 16],
       [20, 63, 31, ..., 56, 23, 35],
       [22, 33, 19, ..., 18,  8,  5],
       ...,
       [36, 46, 54, ...,  9, 16, 51],
       [19, 50, 39, ..., 34, 15, 35],
       [58, 64, 31, ..., 34, 25, 34]])

A brief aside on multidimensional indexing

When we learned about strings, lists, and tuples, we talked about indexing. We saw above that indexing works the same way with arrays, but this was only illustrated with a one-dimensional array. Indexing works for multidimensional arrays, too, but you have to provide indices for each dimension, and you have to understand where the indices for each dimension go.

If we want to pick out the element in the first row and the third column of our array of random numbers r, we should do this:

In [18]:
r[0,2] # row index 0, column index 2
Out[18]:
12.217190328032503

If we want the whole first column, we should do this (note that this returns a 1D array, and displays it as a row vector):

In [19]:
r[:,0]
Out[19]:
array([ 7.94132624,  2.0036704 ,  7.16143541,  2.11388055, 11.59485108,
        2.28026238, 11.89648488, 11.10462996, 14.56255919,  5.47557244,
        5.51596388,  3.18515436,  2.66609675, 12.32925587,  8.30697479,
       14.93791729, 12.81818478,  4.55965038,  5.80492843,  6.15557969])

If we want the second row of our array of random integers s, we should do this (again, this returns a 1D array, displaying it as a row vector):

In [20]:
s[1,:]
Out[20]:
array([20, 63, 31, 12, 12, 18, 11,  7, 33,  8, 34, 29, 33,  6, 58, 60, 44,
       63, 56, 20, 53,  6, 33, 11, 29,  6, 51,  5, 43, 15, 48, 15, 23, 18,
       32,  9, 13, 30, 54,  5, 19, 58, 39, 58, 48, 49, 59, 56, 23, 35])

As you might expect, slicing works for multidimensional arrays, too. Here's r again:

In [21]:
r
Out[21]:
array([[ 7.94132624, 13.08076223, 12.21719033, ..., 13.34411432,
         4.84929464,  5.69486115],
       [ 2.0036704 , 10.14837733, 14.20263652, ..., 13.88057598,
        14.15165213,  7.99103753],
       [ 7.16143541,  2.958898  ,  6.60069133, ...,  4.18922231,
         3.45008825, 13.69239595],
       ...,
       [ 4.55965038,  6.92329164,  5.81386236, ..., 11.47587142,
        10.27499075,  4.45780261],
       [ 5.80492843,  7.03107547, 14.31097305, ...,  8.25381883,
         9.96045796,  3.57263132],
       [ 6.15557969, 14.62466346, 11.54366317, ...,  7.96616107,
        12.23767425,  2.86757766]])

And here's a sliced subset of the elements of r, returning rows 0:3 and columns 1:3 (recall that slicing is left-inclusive and right-exclusive):

In [22]:
r[:3,1:3]
Out[22]:
array([[13.08076223, 12.21719033],
       [10.14837733, 14.20263652],
       [ 2.958898  ,  6.60069133]])

The same basic idea works for arrays with more than two dimensions (note the use of the array method reshape(), which in this case changes the dimensionality of the 1D array produced by arange(60)):

In [23]:
d = arange(60).reshape((3,4,5))
d
Out[23]:
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]],

       [[40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59]]])

You can index and/or slice rows (first), columns (second), and layers (third) with a 3D array like this one:

In [24]:
d[:2,1:3,2:]
Out[24]:
array([[[ 7,  8,  9],
        [12, 13, 14]],

       [[27, 28, 29],
        [32, 33, 34]]])

End brief aside

Here's the function normal() in action. With random() and randint(), we specified the lower and upper limits of the (uniformly distributed) random numbers that these functions generated. With normal(), we specify the location (in this case, the mean), scale (the standard deviation), and the size of the array:

In [25]:
n = normal(loc=23, scale=17, size=10000)
n
Out[25]:
array([45.85792322, 43.21577507, 14.20867407, ..., -3.72154597,
       21.14752302, 22.933308  ])

Sometimes you forget the size or shape of an array. You can easily find this out by using shape (note the lack of parentheses, since it's not a function, but a property of the array). Note, too, that the shape is given in a tuple:

In [26]:
r.shape
Out[26]:
(20, 500)

The shape is a tuple even if the array is 1D:

In [27]:
n.shape
Out[27]:
(10000,)

When we look at a large array (e.g., the 10,000-element array of random numbers n), the notebook gives us a truncated version, showing just the first few and last few elements. We can easily visualize large arrays using functions from Matplotlib.

The first line in the next cell makes figures appear "inline" in the notebook, the second imports the pyplot module from matplotlib and renames it pl, and the third sets the axis tick label sizes.

In [28]:
%matplotlib inline
from matplotlib import pyplot as pl
pl.rc('xtick', labelsize=14); pl.rc('ytick', labelsize=14) # setting axis tick label size

In order to visualize our normally distributed random numbers, we first create figure and axis objects, specifying the number of rows and columns of subplots in the figure (here, just one) and the figure's size (width, height).

Then we can use the axis method hist() (also available directly as pl.hist()) to create a histogram of the elements of n, specifying (optionally) the number of bins, the way the histogram is drawn, the line width, the color, and so on.

In [29]:
fig, ax = pl.subplots(1,1,figsize=(12,6))
hh = ax.hist(n, bins=100, histtype='step', linewidth=3, color='m')

Here is a visualization of the uniformly distributed numbers in the array r:

In [30]:
fig, ax = pl.subplots(1,1,figsize=(12,6))
# note use of ravel method, which flattens multidimensional arrays
print(r.shape)
r_flat = r.ravel()
hh = ax.hist(r_flat, bins=100, histtype='step', lw=3, color='b')
(20, 500)

And here is the array of random integers. In this case, we will play around with the size and location of the bins:

In [31]:
fig, ax = pl.subplots(1,1,figsize=(12,6))
s_flat = s.ravel()
# note specification of bin array, rather than number of bins
int_bins = np.arange(start=lo_int, stop=hi_int+1, step=2)
# unequally spaced bins are possible, too
#int_bins = np.array([lo_int,10,25,hi_int])
hh = ax.hist(s_flat, bins=int_bins, histtype='step', lw=2, color='y')

Some array methods

Arrays have a lot of methods, including some basic statistical functions:

In [32]:
n.sum()
Out[32]:
229546.22846416268

We can specify an axis along which we take the sum:

In [33]:
r.sum(axis=1) # axis=0 sums across rows (one sum per column); axis=1 sums across columns (one sum per row)
Out[33]:
array([4352.88915519, 4275.61821399, 4239.9440771 , 4331.8890251 ,
       4230.57041786, 4247.68158581, 4227.98505474, 4109.41938885,
       4244.45193935, 4322.94417486, 4286.5440044 , 4341.55828319,
       4308.90496784, 4157.56237989, 4156.48774729, 4212.66478281,
       4200.81356028, 4302.30695781, 4151.4043752 , 4300.48781824])

We can get the mean of an array using the method mean() (equivalent to calling the NumPy function mean() on the array):

In [34]:
# could also use np.mean(n)
n.mean()
Out[34]:
22.954622846416267

We can specify the axis along which we calculate the mean, too:

In [35]:
s.mean(axis=0)
Out[35]:
array([35.245, 35.98 , 33.115, 37.145, 33.685, 34.775, 32.655, 36.77 ,
       34.385, 33.57 , 33.285, 35.34 , 34.41 , 33.345, 34.75 , 31.98 ,
       33.89 , 32.85 , 35.585, 33.32 , 34.56 , 34.89 , 35.305, 33.78 ,
       34.83 , 34.08 , 34.67 , 35.015, 34.095, 36.5  , 34.605, 33.94 ,
       34.445, 33.72 , 34.11 , 34.6  , 32.79 , 34.365, 34.54 , 33.02 ,
       35.08 , 34.335, 35.375, 35.135, 34.995, 34.72 , 34.475, 34.455,
       34.67 , 36.09 ])

Arrays also have methods for calculating the standard deviation and the variance of the elements:

In [36]:
n.std()
Out[36]:
16.956979995090997
In [37]:
n.var()
Out[37]:
287.5391705539162

Arrays have a lot of methods (remember that you can hit tab after the dot to see the available methods). Some other useful methods are max(), min(), argmax(), argmin().

In [38]:
n.max() # largest value in n
Out[38]:
92.73630770320565
In [39]:
n.argmax() # index of largest value in n
Out[39]:
8739
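
The companion methods work the same way (a quick sketch, with no output shown since the values depend on the random draw):

n.min() # smallest value in n
n.argmin() # index of the smallest value in n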

A brief aside on linear algebra

In our initial discussion of operators, we saw the operator @, but we didn't use it. This operator, introduced in Python 3.5, performs matrix multiplication. Matrix multiplication was possible before; it just required (slightly) more complicated (and (slightly) less intuitive and less readable) code to perform.

NumPy has a special matrix type, but we can also use @ with 2-dimensional arrays. Suppose you have an $m \times n$ matrix $\mathbf{A}$ and an $n \times p$ matrix $\mathbf{B}$, where $a_{ij}$ indicates the element in the $i^{th}$ row and $j^{th}$ column of $\mathbf{A}$, and similarly for $b_{ij}$. Then the matrix product of $\mathbf{A}$ and $\mathbf{B}$ is $\mathbf{AB} = \mathbf{C}$, where $\mathbf{C}$ is an $m \times p$ matrix whose element $c_{ij}$ is: \begin{align} c_{ij} &= \sum_{k=1}^n a_{ik}b_{kj}\\ &= a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{in}b_{nj} \end{align}

Because it may not be clear exactly what's going on here, and because this is actually a programming class, let's see how this works in Python.

First, we'll create a couple of 2D arrays: one containing just ones, and one that we fill with random normal deviates:

In [40]:
A = ones((1,5)) # array of ones
B = zeros((5,3)) # array of zeros
B[:,0] = normal(loc=0, scale=2, size=5) # fill in column 0 with normal variates with mean 0, SD 2
B[:,1] = normal(loc=2, scale=4, size=5) # fill in column 1 with normal variates with mean 2, SD 4
B[:,2] = normal(loc=-3, scale=8, size=5) # fill in column 2 with normal variates with mean -3, SD 8

Note that $\mathbf{A}$ is a $1 \times 5$ array.

In [41]:
A
Out[41]:
array([[1., 1., 1., 1., 1.]])

While $\mathbf{B}$ is a $5 \times 3$ array.

In [42]:
B
Out[42]:
array([[  0.97707905,   1.01632284,   1.62054943],
       [ -1.1673219 ,  -0.26774133,  -9.77654482],
       [  0.82841598,   1.02749529, -15.60779026],
       [ -0.35531221,  -1.42456215,   1.38852808],
       [ -0.67574747,   3.98031977, -17.13577869]])

Let's see what happens when we code the final line in the equation above explicitly for the first row of $\mathbf{A}$ and the second column of $\mathbf{B}$:

In [43]:
A[0,0]*B[0,1] + A[0,1]*B[1,1] + A[0,2]*B[2,1] + A[0,3]*B[3,1] + A[0,4]*B[4,1]
Out[43]:
4.331834415670764

We could also use the NumPy function sum() and exploit the fact that the operator * multiplies element-by-element when applied to vectors (which are just one dimensional arrays, or, mathematically speaking, 1-by-n or n-by-1 matrices):

In [44]:
A[0,:]*B[:,1]
Out[44]:
array([ 1.01632284, -0.26774133,  1.02749529, -1.42456215,  3.98031977])
In [45]:
np.sum(A[0,:]*B[:,1])
Out[45]:
4.331834415670764

Finally, we can also use @ (and look at the element in the first row and second column) to get the same result:

In [46]:
C = A @ B
C[0,1]
Out[46]:
4.331834415670764

The @ operator does this (very) quickly for every row in $\mathbf{A}$ and every column in $\mathbf{B}$, and it makes the code simpler and easier to read:

In [47]:
C
Out[47]:
array([[ -0.39288656,   4.33183442, -39.51103626]])

Note that, because $\mathbf{A}$ just contains ones, in this case matrix multiplication is giving us the sums of the columns of $\mathbf{B}$. If we divide by the number of rows in $\mathbf{B}$, we get the mean of each column.

In this case, because we know how many rows there are, we could just plug that number in. But if the number of rows is unknown, we can use the shape property that arrays have to get it. As we saw above, shape is a tuple:

In [48]:
B.shape
Out[48]:
(5, 3)

The number of rows is the first element, and the number of columns is the second. So, we can do this to get a vector containing the means of the columns of $\mathbf{B}$:

In [49]:
nr = B.shape[0] # get the number of rows
C/nr # divide C by nr to get column means
Out[49]:
array([[-0.07857731,  0.86636688, -7.90220725]])

As we saw above, arrays have a mean method built in, so we could have used that (and we can use it to verify that matrix multiplication is doing what we think it is doing). In this case, if we just call the mean method, it will give us the mean of the whole array:

In [50]:
B.mean()
Out[50]:
-2.371472560748852

As mentioned above, if we want the column means, we can specify the axis along which we want to calculate the mean. If we specify axis=0, the mean method will calculate the mean for each column (i.e., it will average across the rows - the $0^{th}$ axis - recall that Python uses 0-indexing). If we specify axis=1, it will calculate the mean for each row (i.e., it will average across the columns).

In [51]:
B
Out[51]:
array([[  0.97707905,   1.01632284,   1.62054943],
       [ -1.1673219 ,  -0.26774133,  -9.77654482],
       [  0.82841598,   1.02749529, -15.60779026],
       [ -0.35531221,  -1.42456215,   1.38852808],
       [ -0.67574747,   3.98031977, -17.13577869]])
In [52]:
B.mean(axis=0)
Out[52]:
array([-0.07857731,  0.86636688, -7.90220725])
In [53]:
B.mean(axis=1)
Out[53]:
array([ 1.20465044, -3.73720268, -4.58395967, -0.13044876, -4.61040213])

Also as mentioned above, we can calculate means and sums across higher dimensions if an array has them. Here are the "layer" means of d, the 3D array defined above (i.e., the means across the 3rd dimension, indexed by axis=2):

In [54]:
d.mean(axis=2)
Out[54]:
array([[ 2.,  7., 12., 17.],
       [22., 27., 32., 37.],
       [42., 47., 52., 57.]])

We will come back to linear algebra later in the semester.

Vectorized functions

Many of the functions in NumPy and SciPy can take vectors (or larger arrays) as inputs, perform operations on the vectors (arrays), and return vectors (arrays) as outputs.
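
As a quick preview (a minimal sketch using a NumPy function we know is vectorized):

np.sqrt(np.array([1, 4, 9])) # returns array([1., 2., 3.])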

Consider the pop method that lists have. It operates on the list from which it was called, returning the final element (and removing it from the list):

In [55]:
a = [1,2,3,4,5]
b = a.pop()
print("a is now",a,", while b is",b)
a is now [1, 2, 3, 4] , while b is 5

Or the function len, which takes a multi-element object and tells us how long it is:

In [56]:
len(a)
Out[56]:
4

A brief aside on arrays, scalars, and operators

We discussed basic arithmetic operators last week, illustrating their use with scalars (i.e., individual numbers). We can also use these operators with arrays, or with arrays and scalars together, but we need to be careful to understand how the shapes of the operands affect what the operators do.

If we want to add two arrays, element by element, the arrays need to have the same shape (with some exceptions, discussed below):

In [57]:
X = arange(4)
Y = linspace(0,1,4)
print(X)
print(Y)
print(X+Y)
[0 1 2 3]
[0.         0.33333333 0.66666667 1.        ]
[0.         1.33333333 2.66666667 4.        ]
In [58]:
Z = normal(size=5)
Y + Z
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-58-18eac6880453> in <module>()
      1 Z = normal(size=5)
----> 2 Y + Z

ValueError: operands could not be broadcast together with shapes (4,) (5,) 

The same issue applies if we want to multiply the two arrays element by element:

In [59]:
X * Y
Out[59]:
array([0.        , 0.33333333, 1.33333333, 3.        ])
In [60]:
X * Z
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-60-58ea3f4ce411> in <module>()
----> 1 X * Z

ValueError: operands could not be broadcast together with shapes (4,) (5,) 

The constraint described above is an oversimplification, since NumPy will add and multiply arrays with different shapes if the shapes are "compatible". This is discussed in the NumPy documentation on broadcasting.

Here is an example of broadcasting with the 1D, length 4 array X defined above and a new $3 \times 4$ array W. Because X has four elements and W has three rows, each with four elements, NumPy broadcasts, adding X element-by-element within each row of W:

In [61]:
W = arange(12).reshape((3,4))
In [62]:
X
Out[62]:
array([0, 1, 2, 3])
In [63]:
W
Out[63]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [64]:
X + W
Out[64]:
array([[ 0,  2,  4,  6],
       [ 4,  6,  8, 10],
       [ 8, 10, 12, 14]])

The same constraints apply to subtraction and division.
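
For example, a minimal sketch using the arrays defined above (output not shown):

X - Y # element-by-element subtraction; works because the shapes match
X / (Y + 1) # element-by-element division (the + 1 avoids dividing by zero)
X - Z # raises a ValueError, just like X + Z and X * Z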

Finally, if you add, subtract, multiply, or divide a scalar and an array, the scalar is broadcast across all elements of the array:

In [65]:
X
Out[65]:
array([0, 1, 2, 3])
In [66]:
2 + X
Out[66]:
array([2, 3, 4, 5])
In [67]:
2 - X
Out[67]:
array([ 2,  1,  0, -1])
In [68]:
2 * X
Out[68]:
array([0, 2, 4, 6])
In [69]:
X / 2
Out[69]:
array([0. , 0.5, 1. , 1.5])

End brief aside

Contrast the functions pop() and len() with the NumPy function cos(), which takes a vector of radian values as input and returns a vector containing the cosine of each element of the input.

We will create an array of 1000 time values from 0 (inclusive) to 1 (exclusive) and then convert this to radians by multiplying by $2\pi$, using the NumPy builtin variable pi (we'll come back to radians and cos() and a number of related functions and issues later).

Here's the $\pi$ builtin variable:

In [70]:
np.pi
Out[70]:
3.141592653589793
In [71]:
t = linspace(0,1,1000,endpoint=False) # start=0, end=1, num=1000, exclude the endpoint
f = np.cos(2*np.pi*3*t) # the input argument is array t multiplied by scalar 2*np.pi*3
print(t.shape, f.shape)
(1000,) (1000,)
In [72]:
t[:5]
Out[72]:
array([0.   , 0.001, 0.002, 0.003, 0.004])
In [73]:
f[:5]
Out[73]:
array([1.        , 0.99982235, 0.99928947, 0.99840155, 0.9971589 ])

The function cos() took our array input (multiplied by a scalar) and returned an array. We can plot f to see what it looks like (the 3 in the input argument to cos() gives our sinusoid a frequency of 3 Hz):

In [74]:
alpha = [.75] # opacity parameter
gray = [.25,.25,.25] # RGB color specification
fig, ax = pl.subplots(1,1,figsize=(12,4)) # create figure and axis objects
ax.plot(t,zeros(len(t)),':',color=gray+alpha) # plot zero line for reference
ax.plot(t,f,'r--', linewidth=3) # plot f as a function of t, using a red dashed line (r--)
ax.set_xlabel('t',fontsize=18) # set x label
ax.set_ylabel('cos(2$\pi$3t)',fontsize=18); # set y label; semicolon suppresses output

Many (perhaps most) NumPy functions operate on and return vectors. We can take the absolute value of our variable f, for example:

In [75]:
g = np.abs(f) # absolute value
fig, ax = pl.subplots(1,1,figsize=(12,4))
ax.plot(t,zeros(len(t)),':',color=gray+alpha)
ax.plot(t,f,'r--',lw=3)
ax.plot(t,g,'g-',lw=1.5)
ax.set_xlabel('t',fontsize=18)
ax.set_ylabel('f(t), g(t)',fontsize=18);

We can scale f by multiplying it by a scalar:

In [76]:
fx2 = 2*f # double the amplitude of f
fig, ax = pl.subplots(1,1,figsize=(12,4))
ax.plot(t,zeros(len(t)),':',color=gray+alpha)
ax.plot(t,f,'r--',lw=3)
ax.plot(t,fx2,'b--',lw=2)
ax.set_xlabel('t',fontsize=18)
ax.set_ylabel('f(t), 2f(t)',fontsize=18);

As noted above, adding or subtracting arrays of the same size operates element-by-element.

In [77]:
fpg = f+g # f plus abs(f)
fmg = f-g # f minus abs(f)
fig, ax = pl.subplots(1,1,figsize=(12,4))
#ax.plot(t,zeros(len(t)),':',color=gray+alpha)
ax.plot(t,f,'r--',lw=3) # plot original function f
ax.plot(t,g,'g-',lw=1.5) # plot abs(f)
ax.plot(t,fpg,':',lw=4,color=[1,.5,.2,.8]) # plot f + abs(f) as orange dotted line
ax.plot(t,fmg,'-.',lw=3,color=[.5,0,1,.8]) # plot f - abs(f) as purple dash-dot line
ax.set_xlabel('t',fontsize=18)
ax.set_ylabel('f, g, f+g, f-g',fontsize=18);

As noted above, multiplication and division of arrays of the same size also operates element-by-element.

In [78]:
ftfx2 = f*fx2 # f times 2f
gdfx2 = g/fx2 # abs(f) divided by 2f
fig, axes = pl.subplots(2,1,figsize=(12,8))
ax1, ax2 = axes
# first axis
ax1.plot(t,zeros(len(t)),':',color=gray+alpha)
ax1.plot(t,f,'--',lw=3, color=[1,0,0,.8]) # red
ax1.plot(t,fx2,'-',lw=3, color=[0,0,1,.8]) # blue
ax1.plot(t,ftfx2,'-',lw=3,color=[1,0,1,.8]) # f times 2f in purple
ax1.set_ylabel('f, 2f, f*2f',fontsize=18)
# second axis
ax2.plot(t,zeros(len(t)),':',color=gray+alpha)
ax2.plot(t,g,'--',lw=3, color=[.5,.25,.125,.8]) # abs(f) in brown
ax2.plot(t,fx2,'-',lw=3, color=[0,.25,.5,.8]) # 2f in blue
ax2.plot(t,gdfx2,'o',ms=3, color=[.5,.5,.625,.8]) # abs(f) divided by 2f in gray-blue
ax2.set_xlabel('t',fontsize=18)
ax2.set_ylabel('g, 2f, g/2f',fontsize=18);

Introduction to logical (Boolean) indexing

We've talked about indexing, with lists, tuples, and arrays, and we've done so with numbers (integers, specifically). There is another very useful method for specifying (sub)sets of elements of arrays called logical indexing.

The basic idea of logical indexing is to use operators that return boolean variables to pick out elements that meet desired criteria.

So, for example, if you have a data set consisting of observations (e.g., test results of some sort) from children of different ages, and you want to just pick out the data for kids above a certain age, you can easily do so using logical indexing.
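
For instance, here is a minimal sketch of that scenario (the ages and scores are made up for illustration):

ages = np.array([6, 9, 12, 8, 14, 11]) # hypothetical ages
scores = np.array([55, 61, 80, 58, 91, 77]) # hypothetical test scores
scores[ages >= 10] # scores for kids aged 10 and up: array([80, 91, 77])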

We can illustrate the basic idea with some of the variables we've already defined. Specifically, we can extract just the values of our cos function f that are greater than or equal to zero.

First, we use the >= operator to create a vector of True and False variables indicating where f is greater than or equal to 0 (and visualize a slice of f and of the Boolean array):

In [79]:
fge0 = f >= 0 # booleans indicating when f is non-negative
fge0[75:95] # values around a point at which f goes from positive to negative
Out[79]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False, False, False, False, False, False,
       False, False])

We can visualize the whole array of Booleans, noting that True maps to 1 and False to 0, plotting our Boolean array with f:

In [80]:
fig, ax = pl.subplots(1,1,figsize=(12,4))
ax.plot(zeros(len(f)),':',color=gray+alpha)
ax.plot(f,'r--',lw=3)
ax.plot(fge0,'g.',ms=3);

We can then use our boolean array fge0 to pick out all of the elements of f that are greater than or equal to zero. This is logical indexing.

In [81]:
fig, ax = pl.subplots(1,1,figsize=(12,4))
f_nn = f[fge0] # use logical indexing to pull out values of f that correspond to True in fge0
ax.plot(f_nn, linewidth=3)
f_nn.shape
Out[81]:
(500,)

Note that all of the elements of f that are less than zero are gone. Logical indexing pulls out whichever elements correspond to True values in the boolean array, dropping the elements that correspond to the False elements. Hence, f_nn is shorter than f.
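
Since True counts as 1 and False as 0, you can also count the retained elements directly by summing the Boolean array:

fge0.sum() # number of True values in fge0; matches f_nn.shape[0] (500 here)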

To illustrate another example, let's take just the values of f whose absolute value is greater than or equal to 0.35:

In [82]:
fg35 = np.abs(f) >= .35 # True when abs(f) is greater than or equal to 0.35
fig, ax = pl.subplots(1,1,figsize=(12,4))
ax.plot(zeros(len(f)),':',color=gray+alpha)
ax.plot(f,'r--',lw=3)
ax.plot(fg35,'g.',ms=3);
In [83]:
fig, ax = pl.subplots(1,1,figsize=(12,4))
f_ss = f[fg35] # use logical indexing to pull out values of f that correspond to True in fg35
ax.plot(f_ss,'o',ms=2)
f_ss.shape
Out[83]:
(774,)

Brief notes on SciPy

We've covered some basic uses of (parts of) NumPy and Matplotlib so far. SciPy is another very useful package (with a large number of component parts). We can peek into one tiny corner of SciPy by looking at (part of) the stats module.

Specifically, we'll look at kernel density estimation (roughly, a method for getting a smoothed histogram). We will come back to look at kernel density estimation in more detail later.

In [84]:
from scipy.stats import gaussian_kde

kde_n = gaussian_kde(n,bw_method='silverman')

fig, ax = pl.subplots(1,1,figsize=(12,6))
hh = ax.hist(n,bins=200,histtype='step',lw=2,density=True) # density=True normalizes the histogram so it integrates to 1
x_eval = linspace(-40,80,250)
ax.plot(x_eval,kde_n(x_eval),lw=4,alpha=.65);

Extremely Brief Summary

Modules and packages provide easy-to-access, reusable code. NumPy, SciPy, and Matplotlib are extremely useful packages, with a large number of very useful functions for scientific research. We'll be using all three extensively throughout the semester, along with some other packages (e.g., Pandas, Scikit-Learn).

In addition, we talked about matrix multiplication using the @ operator, broadcasting with other basic operators, and logical indexing. All three will come in handy at various points throughout the semester.