import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
= 'colab'
pio.renderers.defaule
from itables import show
# This stops a few warning messages from showing
= None
pd.options.mode.chained_assignment import warnings
='ignore', category=FutureWarning)
warnings.simplefilter(action
# Machine Learning Packages
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
Introduction to Data Science
Modeling with Multiple Predictors
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours: Duke 209 Click Here for Joanna’s Schedule
Announcements
Come to Lab! If you need help we are here to help!
Day 17 Assignment - same drill.
- Make sure you can Fork and Clone the Day17 repo from Redlands-DATA101
- Open the file Day17-HW.ipynb and start doing the problems.
- You can do these problems as you follow along with the lecture notes and video.
- Get as far as you can before class.
- Submit what you have so far Commit and Push to Git.
- Take the daily check in quiz on Canvas.
- Come to class with lots of questions!
Data: Book weight and volume
The allbacks
data frame gives measurements on the volume and weight of 15 books, some of which are paperback and some of which are hardback
- Volume - cubic centimetres
- Area - square centimetres
- Weight - grams
These books are from the bookshelf of J. H. Maindonald at Australian National University.
This lab follows the Data Science in a Box units “Models with Multiple Predictors” by Mine Çetinkaya-Rundel. It has been updated for our class and translated to Python by Joanna Bieri.
= 'data/allbacks.csv'
file_name = pd.read_csv(file_name) DF
DF
volume | area | weight | cover | |
---|---|---|---|---|
0 | 885 | 382 | 800 | hb |
1 | 1016 | 468 | 950 | hb |
2 | 1125 | 387 | 1050 | hb |
3 | 239 | 371 | 350 | hb |
4 | 701 | 371 | 750 | hb |
5 | 641 | 367 | 600 | hb |
6 | 1228 | 396 | 1075 | hb |
7 | 412 | 0 | 250 | pb |
8 | 953 | 0 | 700 | pb |
9 | 929 | 0 | 650 | pb |
10 | 1492 | 0 | 975 | pb |
11 | 419 | 0 | 350 | pb |
12 | 1010 | 0 | 950 | pb |
13 | 595 | 0 | 425 | pb |
14 | 1034 | 0 | 725 | pb |
Start with a simple linear model
We will try to predict a books weight using the volume. Before we do, what are some things we should consider? Are there variables outside of weight and volume that might affect our results?
DF.columns
Index(['volume', 'area', 'weight', 'cover'], dtype='object')
# Get a subset of the rows
= DF[['weight','volume','cover']]
DF_model
# Get the variables
= DF_model['volume'].values.reshape(-1,1)
X = DF_model['weight']
y
# Do the regression
= LinearRegression()
LM
LM.fit(X,y)
# Save the predicted values to the data frame
'prediction'] = LM.predict(X)
DF_model[
# Plot the results
= px.scatter(DF_model,x='volume',y='weight')
fig
fig.add_trace(='volume', y='prediction',color_discrete_sequence=['black']).data[0]
px.line(DF_model, x
)
fig.show()
LM.coef_
array([0.70863714])
LM.intercept_
107.679310613766
LM.score(X,y)
0.8026345746312898
Interpret this result
For every square centimeter the book is larger we expect the weight of the book to increase by 0.7086 grams.
If a book has zero volume then we expect it to have a weight of 107.68 grams. This does not make sense! Why not? Because we are extrapolating here. Notice none of our books are even close to zero volume!
What else might affect the weight?
Well whether or not the book is hardback or paperback would definitely change the weight! We will color our points by the cover type.
= px.scatter(DF_model,x='volume',y='weight',color='cover')
fig
fig.add_trace(='volume', y='prediction',color_discrete_sequence=['black']).data[0]
px.line(DF_model, x
)
="ggplot2",
fig.update_layout(template='Weight as a function of Volume',
title=0.5,
title_x='cover type')
legend_title
fig.show()
Do we notice a trend here? Well, it seems that more often the hard bound books are above the prediction line! How could we fix this?
Add another explanatory variable!
First we notice that cover is a categorical variable that can take just two values:
'cover'].value_counts() DF_model[
cover
pb 8
hb 7
Name: count, dtype: int64
So we need to decide how to encode this variable. Since there are only two options I we could encode pb=1 and hb=0 in our data frame using a command like
DF_model['cover_enc'] = DF_model['cover'].apply(lambda x: 1 if x == 'pb' else 0)
Instead I am going to show you a new and very usefull command:
pd.get_dummies(DF,columns=[],dtype=float)
The get dummies command creates new columns in your data frame for each possible category in the columns you provide. In our case we are going to say
DF_model = pd.get_dummies(DF_model, columns=['cover'], dtype=float)
and it will create two new columns in our data frame one for cover_hb and the other for cover_pb
= pd.get_dummies(DF_model, columns=['cover'],dtype=float) DF_model
DF_model.dtypes
weight int64
volume int64
prediction float64
cover_hb float64
cover_pb float64
dtype: object
# Redo our variables, this time X has more than one column!
= DF_model[['volume','cover_hb','cover_pb']].values.reshape(-1,3)
X = DF_model['weight']
y
# Do the regression
= LinearRegression()
LM
LM.fit(X,y)
# Save the new prediction to our data frame
'prediction_new'] = LM.predict(X) DF_model[
LM.coef_
array([ 0.71795374, 92.02363569, -92.02363569])
LM.intercept_
105.93920788192202
LM.score(X,y)
0.927477575682168
Interpret the result
How do we interpret this? Well you separate each of the explanatory variables by a plus sign… we add them up. In general now the model looks like
\[ y = b_0 + b_1 X1 + b_2 X2 + b_3 X3 + b_4 X4.... \]
where \(X1, X2, X3, ....\) are each of your explanatory variables. For us
\(X1\) = volume
\(X2\) = cover type = hb
\(X3\) = cover type = pb
so our equation is
\[ weight = 105.939 + 0.718(volume) + 92.023(hardbound) - 92.023 (paperback) \]
The variables \(hardbound\) and \(paperback\) take values of zero or one depending on the book type.
What does this tell me:
- All else held constant, on average paperbacks are 184.046 grams lighter weight than hardback books.
- All else held constant, if you increase the volume by one square centimeter then on average it’s weight will increase by 0.718 grams.
- Hardbound books with zero volume are expected to weigh 197 grams (extrapolated).
- Paperback books with zero volume are expected to weigh 13 grams (extrapolated).
- We also see that we get a better \(R^2\) when we use two explanatory variables!
Let’s look at a plot of this data. Here I plotted two separate lines one for hard bound and the other for paper back.
# Notice here I had to get the color = 'cover' from the original data frame.
= px.scatter(DF_model,x='volume',y='weight',color=DF['cover'])
fig
# Prediction for when cover_enc = 0
= DF_model['cover_hb']==1
mask
fig.add_trace(='volume', y='prediction_new',color_discrete_sequence=['lightblue']).data[0]
px.line(DF_model[mask], x
)
# Prediction for when cover_enc = 1
= DF_model['cover_pb']==1
mask
fig.add_trace(='volume', y='prediction_new',color_discrete_sequence=['pink']).data[0]
px.line(DF_model[mask], x
)
="ggplot2",
fig.update_layout(template='Weight as a function of Volume',
title=0.5,
title_x='cover type')
legend_title
fig.show()
Types of effects
- A main effect refers to the independent impact of one variable on the dependent variable. This means that each of our variables are in linear combination:
\[ weight = 197.963 + 0.718(volume) - 184.047(cover type) \]
so the weight changes at the same rate regardless of whether or not we are modeling paperback or hardback books. In both cases as the volume increase the weight increases by 0.718. Does this make sense?
- An interaction effect occurs when the effect of one variable changes depending on the level of another variable.
Maybe we should have different slopes for theses lines. Do hardback book get heavier per square centimeter?
Should we try to account for this?
Should we add interaction effects?
In pursuit of Occam’s razor
Occam’s Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.
Model selection follows this principle.
We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.
In other words, we prefer the simplest best model.
Try modeling an interaction
We can do this by engineering some new features (variables) in our data. Let’s try adding a variable to our model that is the cover type times the volume. The interaction between cover_hb and volume.
'cov_vol'] = DF_model['cover_hb']*DF_model['volume'] DF_model[
Now we will train a new linear model using three explanatory variables:
\[ weight = b_0 + b_1(volume) +b_2(hardbound) + b_3(paperback) + b_4(hardbound * volume) \]
# This model has THREE explanatory variables
= DF_model[['volume','cover_hb', 'cover_pb','cov_vol']].values.reshape(-1,4)
X = DF_model['weight']
y
= LinearRegression()
LM
LM.fit(X,y)
'prediction_new_int'] = LM.predict(X) DF_model[
=True)
np.set_printoptions(suppress LM.coef_
array([ 0.68585918, 60.10703285, -60.10703285, 0.07573366])
LM.intercept_
101.4795085635086
LM.score(X,y)
0.9297136949284579
= px.scatter(DF_model,x='volume',y='weight',color=DF['cover'])
fig
# Prediction for when cover_enc = 0
= DF_model['cover_hb']==1
mask
fig.add_trace(='volume', y='prediction_new_int',color_discrete_sequence=['lightblue']).data[0]
px.line(DF_model[mask], x
)
# Prediction for when cover_enc = 1
= DF_model['cover_pb']==1
mask
fig.add_trace(='volume', y='prediction_new_int',color_discrete_sequence=['pink']).data[0]
px.line(DF_model[mask], x
)
="ggplot2",
fig.update_layout(template='Weight as a function of Volume',
title=0.5,
title_x='cover type')
legend_title
fig.show()
How much did this help?
- The graphs look very similar. In the second graph the two lines have different slopes with the hardbound slope being slightly larger.
- The two models have similar scores:
- Main Effect Model Score = 0.9274775756821679
- Interaction Effect Model Score = 0.9297136949284579
- The interaction model does a slightly better job of explaining the variability in the response variable but it adds complexity.
What does Occam’s Razor say?
\(R^2\) does a good job at looking at one model, but not a great job of comparing models because more variables will tend to make \(R^2\) better. We need to account for the number of variables and the number of observations.
Adjusted \(R^2\)
\[adjR^2 = 1 – \left( (1-R^2)\frac{(n-1)}{(n-k-1)}\right)\]
where
- R2: The R2 of the model
- n: The number of observations
- k: The number of predictor variables
This assigned a penalty for putting in lots of extra variables
In our case \(n=15\) or len(DF), \(k=3\) for the first model and \(k=4\) for the second model.
# Model 1
= 0.9274775756821679
Rsq = 3
k = len(DF)
n = 1-((1-Rsq)*(n-1)/(n-k-1))
AdjRsq
print(AdjRsq)
0.9076987326863956
# Model 2
= 0.9297136949284579
Rsq = 4
k = len(DF)
n = 1-((1-Rsq)*(n-1)/(n-k-1))
AdjRsq
print(AdjRsq)
0.901599172899841
Adjusted \(R^2\) results
We see that model 2 has a lower adjusted \(R^2\) value so we are probably better sticking with the original model. It is simpler (uses fewer variables) and gets a good \(R^2\) value!