import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
from itables import show

# This stops a few warning messages from showing
import warnings
pd.options.mode.chained_assignment = None
warnings.simplefilter(action='ignore', category=FutureWarning)
# Machine Learning Packages
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
Introduction to Data Science
Classifying Categorical Data - Logistic Regression
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours: Duke 209 (see Joanna's schedule)
Announcements
Come to Lab! If you need help we are here to help!
Day 18 Assignment - same drill.
- Make sure you can Fork and Clone the Day18 repo from Redlands-DATA101
- Open the file Day18-HW.ipynb and start doing the problems.
- You can do these problems as you follow along with the lecture notes and video.
- Get as far as you can before class.
- Submit what you have so far Commit and Push to Git.
- Take the daily check in quiz on Canvas.
- Come to class with lots of questions!
Data: A collection of Emails
- Emails for the first three months of 2012 for an email account
- Data from 3921 emails and 21 variables on them
- Outcome: whether the email is spam or not
- Predictors: number of characters, whether the email had “Re:” in the subject, time at which email was sent, number of times the word “inherit” shows up in the email, etc.
Data Information: https://www.openintro.org/data/index.php?data=email
This lab follows the Data Science in a Box units “Unit 4 - Deck 6: Logistic regression” by Mine Çetinkaya-Rundel. It has been updated for our class and translated to Python by Joanna Bieri.
file_name = 'data/email.csv'
DF = pd.read_csv(file_name)
DF
spam | to_multiple | from | cc | sent_email | time | image | attach | dollar | winner | ... | viagra | password | num_char | line_breaks | format | re_subj | exclaim_subj | urgent_subj | exclaim_mess | number | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 2012-01-01T06:16:41Z | 0 | 0 | 0 | no | ... | 0 | 0 | 11.370 | 202 | 1 | 0 | 0 | 0 | 0 | big |
1 | 0 | 0 | 1 | 0 | 0 | 2012-01-01T07:03:59Z | 0 | 0 | 0 | no | ... | 0 | 0 | 10.504 | 202 | 1 | 0 | 0 | 0 | 1 | small |
2 | 0 | 0 | 1 | 0 | 0 | 2012-01-01T16:00:32Z | 0 | 0 | 4 | no | ... | 0 | 0 | 7.773 | 192 | 1 | 0 | 0 | 0 | 6 | small |
3 | 0 | 0 | 1 | 0 | 0 | 2012-01-01T09:09:49Z | 0 | 0 | 0 | no | ... | 0 | 0 | 13.256 | 255 | 1 | 0 | 0 | 0 | 48 | small |
4 | 0 | 0 | 1 | 0 | 0 | 2012-01-01T10:00:01Z | 0 | 0 | 0 | no | ... | 0 | 2 | 1.231 | 29 | 0 | 0 | 0 | 0 | 1 | none |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3916 | 1 | 0 | 1 | 0 | 0 | 2012-03-31T00:03:45Z | 0 | 0 | 0 | no | ... | 0 | 0 | 0.332 | 12 | 0 | 0 | 0 | 0 | 0 | small |
3917 | 1 | 0 | 1 | 0 | 0 | 2012-03-31T14:13:19Z | 0 | 0 | 1 | no | ... | 0 | 0 | 0.323 | 15 | 0 | 0 | 0 | 0 | 0 | small |
3918 | 0 | 1 | 1 | 0 | 0 | 2012-03-30T16:20:33Z | 0 | 0 | 0 | no | ... | 0 | 0 | 8.656 | 208 | 1 | 0 | 0 | 0 | 5 | small |
3919 | 0 | 1 | 1 | 0 | 0 | 2012-03-28T16:00:49Z | 0 | 0 | 0 | no | ... | 0 | 0 | 10.185 | 132 | 0 | 0 | 0 | 0 | 0 | small |
3920 | 1 | 0 | 1 | 0 | 0 | 2012-03-31T09:20:24Z | 0 | 0 | 2 | yes | ... | 0 | 0 | 2.225 | 65 | 0 | 0 | 1 | 0 | 1 | small |
3921 rows × 21 columns
DF.shape
(3921, 21)
DF.columns
Index(['spam', 'to_multiple', 'from', 'cc', 'sent_email', 'time', 'image',
'attach', 'dollar', 'winner', 'inherit', 'viagra', 'password',
'num_char', 'line_breaks', 'format', 're_subj', 'exclaim_subj',
'urgent_subj', 'exclaim_mess', 'number'],
dtype='object')
DF.describe()
spam | to_multiple | from | cc | sent_email | image | attach | dollar | inherit | viagra | password | num_char | line_breaks | format | re_subj | exclaim_subj | urgent_subj | exclaim_mess | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 | 3921.000000 |
mean | 0.093599 | 0.158123 | 0.999235 | 0.404489 | 0.277990 | 0.048457 | 0.132874 | 1.467228 | 0.038001 | 0.002040 | 0.108136 | 10.706586 | 230.658505 | 0.695231 | 0.261413 | 0.080337 | 0.001785 | 6.584290 |
std | 0.291307 | 0.364903 | 0.027654 | 2.666424 | 0.448066 | 0.450848 | 0.718518 | 5.022298 | 0.267899 | 0.127759 | 0.959931 | 14.645786 | 319.304959 | 0.460368 | 0.439460 | 0.271848 | 0.042220 | 51.479871 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.459000 | 34.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.856000 | 119.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
75% | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 14.084000 | 298.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 4.000000 |
max | 1.000000 | 1.000000 | 1.000000 | 68.000000 | 1.000000 | 20.000000 | 21.000000 | 64.000000 | 9.000000 | 8.000000 | 28.000000 | 190.087000 | 4022.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1236.000000 |
DF.dtypes
spam int64
to_multiple int64
from int64
cc int64
sent_email int64
time object
image int64
attach int64
dollar int64
winner object
inherit int64
viagra int64
password int64
num_char float64
line_breaks int64
format int64
re_subj int64
exclaim_subj int64
urgent_subj int64
exclaim_mess int64
number object
dtype: object
How do we model categorical data for prediction
Today we will learn how to do binary classification: can we predict whether or not an email is spam? In effect, we are building a simple spam filter.
Given this data, what do you think might be predictive of a spam email?
Would you expect longer or shorter emails to be spam?
Would you expect emails that have subjects starting with “Re:”, “RE:”, “re:”, or “rE:” to be spam or not?
Analyze the number of characters in an email
DF[['num_char','spam']].groupby('spam').describe()
num_char summary statistics, grouped by spam:

spam | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
0 | 3554.0 | 11.250517 | 14.510758 | 0.003 | 1.92525 | 6.831 | 15.5075 | 190.087 |
1 | 367.0 | 5.439204 | 14.920101 | 0.001 | 0.47250 | 1.046 | 3.2905 | 173.956 |
fig = px.histogram(DF,
                   x='num_char',
                   color='spam',
                   facet_col='spam',
                   nbins=50,
                   histnorm='probability density',
                   marginal="box",
                   color_discrete_sequence=px.colors.qualitative.Safe)

fig.update_layout(template="ggplot2",
                  bargap=0.02,
                  title='Spam vs. Number of Characters',
                  title_x=0.5,
                  showlegend=False,
                  autosize=False,
                  width=800,
                  height=500)

fig.show()
What do you notice?
- Both distributions are very right skewed.
- Both contain some outliers.
- Spam messages have a much higher probability of being very short.
Analyze the subject title of the emails
Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”
DF['re_subj'].value_counts()
re_subj
0 2896
1 1025
Name: count, dtype: int64
fig = px.histogram(DF,
                   x='re_subj',
                   color='spam',
                   barnorm='percent',
                   color_discrete_sequence=px.colors.qualitative.Safe)

fig.update_layout(template="ggplot2",
                  title='Spam vs "re:" in Subject',
                  title_x=0.5,
                  bargap=0.1,
                  yaxis_title="",
                  xaxis_title='Re in Subject',
                  legend_title='Spam',
                  autosize=False,
                  width=800,
                  height=500)

fig.show()
What do you notice?
- Spam messages tend not to have “Re” in the subject line.
- But a small number of spam messages do have “Re” added to the subject line.
Modeling spam
- Both number of characters and whether the message has “re:” in the subject might be related to whether the email is spam. How do we come up with a model that will let us explore this relationship?
- For simplicity, we’ll focus on the number of characters (num_char) as the predictor, but the model we describe can be expanded to take multiple predictors as well.
- This isn’t something we can reasonably fit a linear model to – we need something different!
Why not?
# Get a subset of the rows
DF_model = DF[['num_char','spam']]

# Get the variables
X = DF_model['num_char'].values.reshape(-1,1)
y = DF_model['spam']

# Do the regression
LM = LinearRegression()
LM.fit(X,y)

# Save the predicted values to the data frame
DF_model['prediction'] = LM.predict(X)

# Plot the results
fig = px.scatter(DF_model, x='num_char', y='spam')
fig.add_trace(
    px.line(DF_model, x='num_char', y='prediction', color_discrete_sequence=['black']).data[0]
)
fig.show()
print(LM.coef_)
print(LM.intercept_)
print(LM.score(X,y))
[-0.00229906]
0.11821360854534321
0.013360530323788145
- What would it even mean to use this model?
- Should the output of the model be numbers other than 0 or 1?
Logistic Regression - Framing the problem
We can treat each outcome (spam and not) as successes and failures arising from separate Bernoulli trials
Bernoulli trial: a random experiment (flipping a coin) with exactly two possible outcomes, “success” and “failure”, in which the probability of success is the same every time the experiment is conducted
Each Bernoulli trial can have a separate probability of success
We can then use the predictor variables (number of characters) to model that probability of success, \(p_i\) (the probability that email \(i\) is spam)
We can’t just use a linear model for \(p_i\) (since \(p_i\) must be between 0 and 1) but we can transform the linear model to have the appropriate range.
Generalized linear models (GLM)
This is a very general way of addressing many problems in regression and the resulting models are called generalized linear models (GLMs)
Logistic regression is just one example
Three characteristics of GLMs
All GLMs have the following three characteristics:
A probability distribution describing a generative model for the outcome variable
A linear model \[\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]
A link function that relates the linear model to the parameter of the outcome distribution. Relate the value of \(\eta\) back to the probability of success.
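To make these three pieces concrete before specializing to logistic regression, here is a sketch of fitting a binary-outcome GLM directly with the statsmodels package (an assumption on our part: statsmodels is available in your environment; it is not imported at the top of these notes). The Binomial family's default link is exactly the logit function defined below.

# Sketch (assumes statsmodels is installed): a GLM with a Bernoulli/Binomial
# outcome distribution, a linear model, and the logit link -- the three pieces above
import statsmodels.api as sm

glm = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
res = glm.fit()
print(res.params)  # intercept and slope; close to the sklearn fit shown later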
Logistic regression
Logistic regression is a GLM used to model a binary categorical outcome (two possible answers) using numerical and categorical predictors
To finish specifying the Logistic model we just need to define a reasonable link function that connects \(\eta_i\) to \(p_i\): logit function
Logit function: For \(0\le p \le 1\)
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]
p = np.arange(0.01, 1, .01)
lp = np.log(p/(1-p))

fig = px.line(x=p, y=lp, range_x=[-.01, 1.01])

fig.update_layout(title='Logit Function',
                  title_x=0.5,
                  template='ggplot2',
                  yaxis_title="logit(p)",
                  xaxis_title='p')

fig.show()
Properties of the logit
The logit function takes a value between 0 and 1 and maps it to a value between \(-\infty\) and \(\infty\)
Inverse logit (logistic) function: \[g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}\]
The inverse logit function takes a value between \(-\infty\) and \(\infty\) and maps it to a value between 0 and 1
This formulation is also useful for interpreting the model, since the logit can be interpreted as the log odds of a success
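A quick numerical check of this round trip (the helper names here are ours):

# logit maps (0,1) to the real line; the logistic function maps it back
def logit(p):
    return np.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

p = 0.8
x = logit(p)            # log odds: log(0.8/0.2) = log(4), about 1.386
print(x, inv_logit(x))  # 1.386..., 0.8 -- we recover the original probability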
x = np.arange(-10, 10, .01)
logistic = 1/(1+np.exp(-x))

fig = px.line(x=x, y=logistic, range_y=[-.01, 1.01])

fig.update_layout(title='Logistic Function',
                  title_x=0.5,
                  template='ggplot2',
                  yaxis_title="logistic(x)",
                  xaxis_title='x')

fig.show()
The logistic regression model
- Based on the three GLM criteria we have
- \(y_i \sim \text{Bern}(p_i)\) - we have an experiment that is a result of a Bernoulli Trial
- \(\eta_i = \beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}\) - We have a linear model
- \(\text{logit}(p_i) = \eta_i\) - We have a linking function
- From which we get
\[p_i = \frac{e^{(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}{1+e^{(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}\]
This looks a little scary, but all we did was plug our linear model into the logistic function! This means we can predict the probability of our \(i^{th}\) experiment.
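As a small sketch of that formula for one predictor (the function name is ours):

# The logistic regression model with a single predictor:
# plug the linear model eta = beta0 + beta1*x into the logistic function
def model_probability(x, beta0, beta1):
    eta = beta0 + beta1 * x
    return np.exp(eta) / (1 + np.exp(eta))

Once we estimate \(\beta_0\) and \(\beta_1\) from data, plugging any number of characters into this function returns a probability between 0 and 1.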
Building a Logistic Regression Model in Python
# Get a subset of the rows
DF_model = DF[['num_char','spam']]

# Get the variables
X = DF_model['num_char'].values.reshape(-1,1)
y = DF_model['spam']

# Do the regression
LM = LogisticRegression()
LM.fit(X,y)
print('Classes:')
print(LM.classes_)
print('Coefficients:')
print(LM.coef_)
print('Intercept:')
print(LM.intercept_)
Classes:
[0 1]
Coefficients:
[[-0.06207082]]
Intercept:
[-1.79876282]
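One caution when reading these numbers: by default scikit-learn's LogisticRegression applies L2 regularization (penalty='l2' with C=1.0), so the coefficients are shrunk slightly compared to a plain maximum-likelihood GLM fit. If you want the unregularized fit, you can turn the penalty off; a sketch (recent scikit-learn versions accept penalty=None, older ones use penalty='none'):

# Assumption: scikit-learn >= 1.2, where penalty=None disables regularization
LM_unreg = LogisticRegression(penalty=None)
LM_unreg.fit(X, y)
print(LM_unreg.coef_, LM_unreg.intercept_)  # close to, but not identical to, the values above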
What does this output mean?
Well we have a slope and an intercept for our linear model:
\[\eta = -1.80-0.0621(numchar)\]
plug this into our logit function
\[\log\left(\frac{p}{1-p}\right) = -1.80-0.0621(numchar)\]
solve for the probability - exponentiate to get rid of the log()
\[\frac{p}{1-p} = e^{-1.80-0.0621(numchar)}\]
\[p = \frac{e^{-1.80-0.0621(numchar)}}{1+e^{-1.80-0.0621(numchar)}}\]
What is the probability that an email with 2000 characters is spam?
num = 2            # num_char is measured in thousands of characters
intercept = -1.80
slope = -0.0621

eta = intercept + slope*num

P = np.exp(eta)/(1+np.exp(eta))
print(P)
0.1273939433505534
What is the probability that an email with 40000 characters is spam?
num = 40           # 40,000 characters
intercept = -1.80
slope = -0.0621

eta = intercept + slope*num

P = np.exp(eta)/(1+np.exp(eta))
print(P)
0.013599894814523065
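As a sanity check, these hand computations should agree (up to the rounding of the intercept and slope) with the probabilities the fitted model reports:

# Column 1 of predict_proba is the probability of class 1 (spam)
# num_char is in thousands, so 2 -> 2000 characters and 40 -> 40,000 characters
print(LM.predict_proba(np.array([[2], [40]]))[:, 1])  # approximately [0.127, 0.014]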
Model probability as a function of number of characters
intercept = -1.80
slope = -0.0621

P = []

num_characters = list(np.arange(0,200,10))

for num in num_characters:
    eta = intercept + slope*num
    P.append(np.exp(eta)/(1+np.exp(eta)))

# Plot the results
fig = px.scatter(DF_model, x='num_char', y='spam', opacity=.5)
fig.add_trace(
    px.line(x=num_characters, y=P, color_discrete_sequence=['black']).data[0]
)
fig.show()
What do we see here
- The probability of being spam decreases as the number of characters increases.
- The probability of spam overall is really low!
- What should the cutoff be?
Would you prefer an email with 2000 characters to be labeled as spam or not? How about 40,000 characters?
- There is a 12.7% probability that an email with 2000 characters is spam
- There is a 1.4% probability that an email with 40,000 characters is spam
Should any of these skip the inbox?
Sensitivity and specificity
False positive and negative
Email is spam | Email is not spam | |
---|---|---|
Email labelled spam | True positive | False positive (Type 1 error) |
Email labelled not spam | False negative (Type 2 error) | True negative |
False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)
False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN)
Sensitivity and specificity
Email is spam | Email is not spam | |
---|---|---|
Email labelled spam | True positive | False positive (Type 1 error) |
Email labelled not spam | False negative (Type 2 error) | True negative |
Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)
Sensitivity = 1 − False negative rate
Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)
Specificity = 1 − False positive rate
If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the trade-offs associated with each decision?
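One way to feel out that trade-off is to sweep the cutoff and watch sensitivity and specificity move in opposite directions; a sketch using the model fit above (the variable names are ours):

# Sensitivity and specificity at several cutoffs, using the definitions above
for cutoff in [0.05, 0.1, 0.3, 0.5]:
    labeled_spam = LM.predict_proba(X)[:, 1] > cutoff
    sensitivity = (labeled_spam & (y == 1)).sum() / (y == 1).sum()   # P(labeled spam | spam)
    specificity = (~labeled_spam & (y == 0)).sum() / (y == 0).sum()  # P(labeled not spam | not spam)
    print(cutoff, round(sensitivity, 3), round(specificity, 3))

Raising the cutoff makes the filter label fewer emails as spam: specificity rises while sensitivity falls.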
Python Confusion Matrix
Classes = [0,1] = [not spam, spam]
Labeled 0 (not spam) | Labeled 1 (spam) | |
---|---|---|
Email is 0 (not spam) | True Positive | False positive |
Email is 1 (spam) | False negative | True negative |
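Before reading values off of cnf_matrix below, keep sklearn's orientation straight: in metrics.confusion_matrix(y_true, y_pred) the rows are the true classes and the columns are the predicted labels, in sorted order ([0, 1] here). A tiny made-up example:

from sklearn import metrics

y_true = [0, 0, 1, 1]   # made-up true classes
y_pred = [0, 1, 0, 1]   # made-up predicted labels
print(metrics.confusion_matrix(y_true, y_pred))
# [[1 1]   row 0 = truly class 0: predicted 0 once, predicted 1 once
#  [1 1]]  row 1 = truly class 1: predicted 0 once, predicted 1 once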
# import the metrics class
from sklearn import metrics
DF_model['prediction'] = LM.predict(X)

y_pred = DF_model['prediction'].values

cnf_matrix = metrics.confusion_matrix(y, y_pred)
cnf_matrix
array([[3554, 0],
[ 367, 0]])
True positive = 3554 | False positive = 0 |
---|---|
False negative = 367 | True negative = 0 |
TP = cnf_matrix[0,0] # The email was not spam and was correctly labeled 0
FP = cnf_matrix[0,1] # The email was not spam but was incorrectly labeled 1
FN = cnf_matrix[1,0] # The email was spam but was incorrectly labeled 0
TN = cnf_matrix[1,1] # The email was spam and was correctly labeled 1
What does this mean
- This means we predicted that nothing was spam.
- By default LogisticRegression() and .predict() use a 50% cutoff!
Is this what we want?
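To see that 50% cutoff explicitly: .predict() gives the same answer as thresholding the spam column of .predict_proba() at 0.5. A quick sketch (the variable name is ours):

# .predict() is the same as asking: is P(spam) greater than 0.5?
manual_prediction = (LM.predict_proba(X)[:, 1] > 0.5).astype(int)
print((manual_prediction == LM.predict(X)).all())  # True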
Change your decision probabilities
We can look into .predict_proba() and use some programming to classify things ourselves.
.predict_proba() returns the probability of each class for each observation. In this case there will be two columns: the first is the probability of class 0 (not spam) and the second is the probability of class 1 (spam), with one row per row of the data frame.
# Create new columns in your data frame
DF_model[['prob not spam','prob spam']] = LM.predict_proba(X)
# Look at the model DF
DF_model
num_char | spam | prediction | prob not spam | prob spam | |
---|---|---|---|---|---|
0 | 11.370 | 0 | 0 | 0.924457 | 0.075543 |
1 | 10.504 | 0 | 0 | 0.920617 | 0.079383 |
2 | 7.773 | 0 | 0 | 0.907311 | 0.092689 |
3 | 13.256 | 0 | 0 | 0.932237 | 0.067763 |
4 | 1.231 | 0 | 0 | 0.867056 | 0.132944 |
... | ... | ... | ... | ... | ... |
3916 | 0.332 | 1 | 0 | 0.860491 | 0.139509 |
3917 | 0.323 | 1 | 0 | 0.860423 | 0.139577 |
3918 | 8.656 | 0 | 0 | 0.911819 | 0.088181 |
3919 | 10.185 | 0 | 0 | 0.919157 | 0.080843 |
3920 | 2.225 | 1 | 0 | 0.874008 | 0.125992 |
3921 rows × 5 columns
# Choose your cutoff
cutoff = .1 # if prob spam is 10% or higher it is labeled spam

# Use a lambda on the new column to create your new prediction
DF_model['new_prediction'] = DF_model['prob spam'].apply(lambda x: 1 if x > cutoff else 0)
# Look at the model DF
DF_model
num_char | spam | prediction | prob not spam | prob spam | new_prediction | |
---|---|---|---|---|---|---|
0 | 11.370 | 0 | 0 | 0.924457 | 0.075543 | 0 |
1 | 10.504 | 0 | 0 | 0.920617 | 0.079383 | 0 |
2 | 7.773 | 0 | 0 | 0.907311 | 0.092689 | 0 |
3 | 13.256 | 0 | 0 | 0.932237 | 0.067763 | 0 |
4 | 1.231 | 0 | 0 | 0.867056 | 0.132944 | 1 |
... | ... | ... | ... | ... | ... | ... |
3916 | 0.332 | 1 | 0 | 0.860491 | 0.139509 | 1 |
3917 | 0.323 | 1 | 0 | 0.860423 | 0.139577 | 1 |
3918 | 8.656 | 0 | 0 | 0.911819 | 0.088181 | 0 |
3919 | 10.185 | 0 | 0 | 0.919157 | 0.080843 | 0 |
3920 | 2.225 | 1 | 0 | 0.874008 | 0.125992 | 1 |
3921 rows × 6 columns
# Redo the confusion matrix
y_pred = DF_model['new_prediction'].values

cnf_matrix = metrics.confusion_matrix(y, y_pred)
cnf_matrix
array([[1831, 1723],
[ 62, 305]])
TP = cnf_matrix[0,0]
FP = cnf_matrix[0,1]
FN = cnf_matrix[1,0]
TN = cnf_matrix[1,1]
print('False Negative Rate:')
print(FN/ (TP+FN))
print('------')
print('False Positive Rate:')
print(FP/ (FP+TN))
print('------')
print('Sensitivity:')
print(1 - (FN/ (TP+FN)))
print('------')
print('Specificity:')
print(1- (FP/ (FP+TN)))
False Negative Rate:
0.032752245113576335
------
False Positive Rate:
0.8496055226824457
------
Sensitivity:
0.9672477548864237
------
Specificity:
0.1503944773175543
Remember to interpret your results!
A false negative here is a spam email that was labeled not spam. This happened in 3.3% of the cases where an email was labeled not spam.
A false positive here is an email that was not spam but was labeled spam. This happened in 85% of the cases where an email was labeled spam.
If an email was labeled not spam, there is a 96.7% probability that it really was not spam.
If an email was labeled spam, there is only a 15.1% probability that it really was spam.
(Because of how TP, FP, FN, and TN were read off of the sklearn confusion matrix above, these rates condition on the label the filter assigned, not on whether the email actually was spam.)
Exercise 1 Logistic Regression with ONE explanatory variable.
Choose another variable from the data set to use as your explanatory variable and create a Logistic Regression model to predict if an email is spam or not. You should do all of the following:
- Say what variable you are using to predict spam messages (do some analysis, at minimum a value_counts()). Why do you think this is a good variable to use in predicting whether an email is spam?
- Create and fit a Logistic Regression model.
- Show the results: intercept, coefficient, and the confusion matrix for the default prediction.
- What do you think the decision cutoff should be? Update the cutoff and redo the confusion matrix.
- Explain your results in words. You should talk about False Negative and False positive rates and what they mean in terms of the variables you chose.
Exercise 2 - challenge Logistic Regression with MORE THAN ONE explanatory variable.
Try redoing the analysis, but this time add a few more explanatory variables. Again, do some analysis of the variables you are choosing and state why they are a good choice. Then answer questions 1-5 again.