Introduction to Data Science

Classifying Categorical Data - Logistic Regression

Author

Joanna Bieri
DATA101

Important Information

Email: joanna_bieri@redlands.edu
Office Hours: Duke 209 Click Here for Joanna’s Schedule

Announcements

Come to Lab! If you need help we are here to help!

Day 18 Assignment - same drill.

Make sure you can Fork and Clone the Day18 repo from Redlands-DATA101
Open the file Day18-HW.ipynb and start doing the problems.
- You can do these problems as you follow along with the lecture notes and video.
Get as far as you can before class.
Submit what you have so far: Commit and Push to Git.
Take the daily check in quiz on Canvas.
Come to class with lots of questions!

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Machine Learning Packages
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression

Data: A collection of Emails

Emails for the first three months of 2012 for an email account
Data from 3921 emails and 21 variables on them
Outcome: whether the email is spam or not
Predictors: number of characters, whether the email had “Re:” in the subject, time at which email was sent, number of times the word “inherit” shows up in the email, etc.

Data Information: https://www.openintro.org/data/index.php?data=email

This lab follows the Data Science in a Box units “Unit 4 - Deck 6: Logistic regression” by Mine Çetinkaya-Rundel. It has been updated for our class and translated to Python by Joanna Bieri.

file_name = 'data/email.csv'
DF = pd.read_csv(file_name)

DF

	spam	to_multiple	from	cc	sent_email	time	image	attach	dollar	winner	...	viagra	password	num_char	line_breaks	format	re_subj	exclaim_subj	urgent_subj	exclaim_mess	number
0	0	0	1	0	0	2012-01-01T06:16:41Z	0	0	0	no	...	0	0	11.370	202	1	0	0	0	0	big
1	0	0	1	0	0	2012-01-01T07:03:59Z	0	0	0	no	...	0	0	10.504	202	1	0	0	0	1	small
2	0	0	1	0	0	2012-01-01T16:00:32Z	0	0	4	no	...	0	0	7.773	192	1	0	0	0	6	small
3	0	0	1	0	0	2012-01-01T09:09:49Z	0	0	0	no	...	0	0	13.256	255	1	0	0	0	48	small
4	0	0	1	0	0	2012-01-01T10:00:01Z	0	0	0	no	...	0	2	1.231	29	0	0	0	0	1	none
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3916	1	0	1	0	0	2012-03-31T00:03:45Z	0	0	0	no	...	0	0	0.332	12	0	0	0	0	0	small
3917	1	0	1	0	0	2012-03-31T14:13:19Z	0	0	1	no	...	0	0	0.323	15	0	0	0	0	0	small
3918	0	1	1	0	0	2012-03-30T16:20:33Z	0	0	0	no	...	0	0	8.656	208	1	0	0	0	5	small
3919	0	1	1	0	0	2012-03-28T16:00:49Z	0	0	0	no	...	0	0	10.185	132	0	0	0	0	0	small
3920	1	0	1	0	0	2012-03-31T09:20:24Z	0	0	2	yes	...	0	0	2.225	65	0	0	1	0	1	small

3921 rows × 21 columns

DF.shape

(3921, 21)

DF.columns

Index(['spam', 'to_multiple', 'from', 'cc', 'sent_email', 'time', 'image',
       'attach', 'dollar', 'winner', 'inherit', 'viagra', 'password',
       'num_char', 'line_breaks', 'format', 're_subj', 'exclaim_subj',
       'urgent_subj', 'exclaim_mess', 'number'],
      dtype='object')

DF.describe()

	spam	to_multiple	from	cc	sent_email	image	attach	dollar	inherit	viagra	password	num_char	line_breaks	format	re_subj	exclaim_subj	urgent_subj	exclaim_mess
count	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000	3921.000000
mean	0.093599	0.158123	0.999235	0.404489	0.277990	0.048457	0.132874	1.467228	0.038001	0.002040	0.108136	10.706586	230.658505	0.695231	0.261413	0.080337	0.001785	6.584290
std	0.291307	0.364903	0.027654	2.666424	0.448066	0.450848	0.718518	5.022298	0.267899	0.127759	0.959931	14.645786	319.304959	0.460368	0.439460	0.271848	0.042220	51.479871
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.001000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.459000	34.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	5.856000	119.000000	1.000000	0.000000	0.000000	0.000000	1.000000
75%	0.000000	0.000000	1.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	14.084000	298.000000	1.000000	1.000000	0.000000	0.000000	4.000000
max	1.000000	1.000000	1.000000	68.000000	1.000000	20.000000	21.000000	64.000000	9.000000	8.000000	28.000000	190.087000	4022.000000	1.000000	1.000000	1.000000	1.000000	1236.000000

DF.dtypes

spam              int64
to_multiple       int64
from              int64
cc                int64
sent_email        int64
time             object
image             int64
attach            int64
dollar            int64
winner           object
inherit           int64
viagra            int64
password          int64
num_char        float64
line_breaks       int64
format            int64
re_subj           int64
exclaim_subj      int64
urgent_subj       int64
exclaim_mess      int64
number           object
dtype: object

How do we model categorical data for prediction?

Today we will learn how to do binary classification. Can we predict whether or not an email is spam - we are building a simple spam filter.

Given this data, what do you think might be predictive of a spam email?
Would you expect longer or shorter emails to be spam?
Would you expect emails that have subjects starting with “Re:”, “RE:”, “re:”, or “rE:” to be spam or not?

Analyze the number of characters in an email

DF[['num_char','spam']].groupby('spam').describe()

	num_char
	count	mean	std	min	25%	50%	75%	max
spam
0	3554.0	11.250517	14.510758	0.003	1.92525	6.831	15.5075	190.087
1	367.0	5.439204	14.920101	0.001	0.47250	1.046	3.2905	173.956

fig = px.histogram(DF,
                   x='num_char',
                   color='spam',
                   facet_col='spam',
                   nbins=50,
                   histnorm='probability density',
                   marginal="box",
                   color_discrete_sequence=px.colors.qualitative.Safe)


fig.update_layout(template="ggplot2",
                  bargap=0.02,
                  title='Spam vs. Number of Characters',
                  title_x=0.5,
                  showlegend=False,
                  autosize=False,
                  width=800,
                  height=500)

fig.show()

What do you notice?

Both distributions are very right skewed.
Both contain some outliers.
Spam messages have a much higher probability of being very short.

Analyze the subject title of the emails

Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”

DF['re_subj'].value_counts()

re_subj
0    2896
1    1025
Name: count, dtype: int64

fig = px.histogram(DF,
                   x='re_subj',
                   color='spam',
                   barnorm='percent',
                   color_discrete_sequence=px.colors.qualitative.Safe)

fig.update_layout(template="ggplot2",
                  title='Spam vs "re." in Subject',
                  title_x=0.5,
                  bargap = 0.1,
                  yaxis_title="",
                  xaxis_title='Re in Subject',
                  legend_title='Spam',
                  autosize=False,
                  width=800,
                  height=500)

fig.show()

What do you notice?

Spam messages tend not to have ‘Re’ in the subject line.
But a small number of spam messages do have ‘Re’ added to the subject line.

Modeling spam

Both number of characters and whether the message has “re:” in the subject might be related to whether the email is spam. How do we come up with a model that will let us explore this relationship?
For simplicity, we’ll focus on the number of characters (num_char) as predictor, but the model we describe can be expanded to take multiple predictors as well.
This isn’t something we can reasonably fit a linear model to – we need something different!

Why not?

# Get a subset of the columns (copy so we can safely add new columns later)
DF_model = DF[['num_char','spam']].copy()

# Get the variables
X = DF_model['num_char'].values.reshape(-1,1)
y = DF_model['spam']

# Do the regression
LM = LinearRegression()
LM.fit(X,y)

# Save the predicted values to the data frame
DF_model['prediction'] = LM.predict(X)

# Plot the results
fig = px.scatter(DF_model,x='num_char',y='spam')

fig.add_trace(
    px.line(DF_model, x='num_char', y='prediction',color_discrete_sequence=['black']).data[0]
)

fig.show()

print(LM.coef_)
print(LM.intercept_)
print(LM.score(X,y))

[-0.00229906]
0.11821360854534321
0.013360530323788145

What would it even mean to use this model?
Should the output of the model be numbers other than 0 or 1?
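
One way to see the problem is to evaluate the fitted line at the largest num_char value in the data. The sketch below hard-codes the slope and intercept printed above; the helper name linear_prediction is ours and is not part of scikit-learn.

```python
# Coefficients copied from the LinearRegression output above
slope = -0.00229906
intercept = 0.11821360854534321

def linear_prediction(num_char):
    """Value of the fitted straight line at a given num_char."""
    return intercept + slope * num_char

# The longest email in the data has num_char = 190.087
longest = linear_prediction(190.087)
print(longest)  # negative, so it cannot be read as a probability
```

A straight line is unbounded, so for long enough emails it drops below 0, which makes no sense as a probability of spam.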

Logistic Regression - Framing the problem

We can treat each outcome (spam and not) as successes and failures arising from separate Bernoulli trials
Bernoulli trial: a random experiment (flipping a coin) with exactly two possible outcomes, “success” and “failure”, in which the probability of success is the same every time the experiment is conducted
Each Bernoulli trial can have a separate probability of success
We can then use the predictor variables (number of characters) to model that probability of success, \(p_i\), (it is spam)
We can’t just use a linear model for \(p_i\) (since \(p_i\) must be between 0 and 1) but we can transform the linear model to have the appropriate range.
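
To make the idea of separate Bernoulli trials concrete, here is a small simulation sketch. The probabilities in p_values are made up for illustration; they are not estimated from the email data.

```python
import random

def bernoulli_trial(p, rng):
    """Return 1 (success) with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical success probabilities, one per email
p_values = [0.05, 0.13, 0.50, 0.90]

# Repeat each trial many times; the success fraction should approach p_i
n_repeats = 10_000
for p in p_values:
    successes = sum(bernoulli_trial(p, rng) for _ in range(n_repeats))
    print(p, successes / n_repeats)
```

Each email gets its own probability of success. The job of logistic regression is to estimate that probability from the predictors.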

Generalized linear models (GLM)

This is a very general way of addressing many problems in regression and the resulting models are called generalized linear models (GLMs)
Logistic regression is just one example

Three characteristics of GLMs

All GLMs have the following three characteristics:

A probability distribution describing a generative model for the outcome variable
A linear model \[\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]
A link function that relates the linear model to the parameter of the outcome distribution. Here it relates the value of \(\eta\) back to the probability of success.

Logistic regression

Logistic regression is a GLM used to model a binary categorical outcome (two possible answers) using numerical and categorical predictors
To finish specifying the logistic model we just need to define a reasonable link function that connects \(\eta_i\) to \(p_i\): the logit function
Logit function: For \(0 < p < 1\)

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]

p = np.arange(0.01,1,.01)
lp = np.log(p/(1-p))

fig = px.line(x=p,y=lp,range_x=[-.01, 1.01])

fig.update_layout(title='Logit Function',
                  title_x=0.5,
                  template = 'ggplot2',
                  yaxis_title="logit(p)",
                  xaxis_title='p'
                  )
                  

fig.show()

Properties of the logit

The logit function takes a value between 0 and 1 and maps it to a value between \(-\infty\) and \(\infty\)
Inverse logit (logistic) function: \[g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}\]
The inverse logit function takes a value between \(-\infty\) and \(\infty\) and maps it to a value between 0 and 1
This formulation is also useful for interpreting the model, since the logit can be interpreted as the log odds of a success

# The input to the logistic function is any real number x, not a probability
x = np.arange(-10,10,.01)
logistic = 1/(1+np.exp(-x))

fig = px.line(x=x,y=logistic,range_y=[-.01, 1.01])

fig.update_layout(title='Logistic Function',
                  title_x=0.5,
                  template = 'ggplot2',
                  yaxis_title="logistic(x)",
                  xaxis_title='x'
                  )
                  

fig.show()
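
A quick numerical check that the logit and the logistic function undo each other. This is a minimal sketch using only the standard library; the function names logit and inverse_logit are ours.

```python
import math

def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(x):
    """Logistic function: maps any real x to a value in (0, 1)."""
    return 1 / (1 + math.exp(-x))

for p in [0.1, 0.5, 0.9]:
    x = logit(p)
    print(p, x, inverse_logit(x))  # the last value recovers p (up to rounding)
```

Notice that p = 0.5 corresponds to log odds of 0: success and failure are equally likely.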

The logistic regression model

Based on the three GLM criteria we have
- \(y_i \sim \text{Bern}(p_i)\) - we have an experiment that is a result of a Bernoulli Trial
- \(\eta_i = \beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}\) - We have a linear model
- \(\text{logit}(p_i) = \eta_i\) - We have a linking function
From which we get

\[p_i = \frac{e^{(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}{1+e^{(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}\]

This looks a little scary, but all we did was plug our linear model into the logistic function! This means we can predict the probability of success (spam) for our \(i^{th}\) observation.
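
The formula can be written as a short function. This is a sketch only: the coefficients and predictor values below are hypothetical, chosen so the arithmetic is easy to follow, and success_probability is our own helper name.

```python
import math

def success_probability(beta0, betas, xs):
    """Plug the linear model eta into the logistic function."""
    eta = beta0 + sum(b * x for b, x in zip(betas, xs))
    return math.exp(eta) / (1 + math.exp(eta))

# Hypothetical coefficients and predictor values, for illustration only
beta0 = -2.0
betas = [-0.05, 1.5]
xs = [10.0, 1.0]   # e.g. one numeric predictor and one 0/1 indicator

# eta = -2.0 + (-0.05)(10.0) + (1.5)(1.0) = -1.0
print(success_probability(beta0, betas, xs))
```

Whatever values the predictors take, the result always lands strictly between 0 and 1.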

Building a Logistic Regression Model in Python

# Get a subset of the columns (copy so we can safely add new columns later)
DF_model = DF[['num_char','spam']].copy()

# Get the variables
X = DF_model['num_char'].values.reshape(-1,1)
y = DF_model['spam']

# Do the regression
LM = LogisticRegression()
LM.fit(X,y)

print('Classes:')
print(LM.classes_)
print('Coefficients:')
print(LM.coef_)
print('Intercept:')
print(LM.intercept_)

Classes:
[0 1]
Coefficients:
[[-0.06207082]]
Intercept:
[-1.79876282]

What does this output mean?

Well we have a slope and an intercept for our linear model:

\[\eta = -1.80-0.0621(numchar)\]

Set the logit of the probability equal to this linear model

\[\log\left(\frac{p}{1-p}\right) = -1.80-0.0621(numchar)\]

Solve for the probability - exponentiate both sides to get rid of the log()

\[\frac{p}{1-p} = e^{-1.80-0.0621(numchar)}\]

\[p = \frac{e^{-1.80-0.0621(numchar)}}{1+e^{-1.80-0.0621(numchar)}}\]

What is the probability that an email with 2000 characters is spam? (num_char is measured in thousands of characters, so we use num = 2.)

num = 2
intercept = -1.80
slope = -0.0621

eta = intercept + slope*num

P = np.exp(eta)/(1+np.exp(eta))
print(P)

0.1273939433505534

What is the probability that an email with 40000 characters is spam? (Again in thousands of characters, so num = 40.)

num = 40
intercept = -1.80
slope = -0.0621

eta = intercept + slope*num

P = np.exp(eta)/(1+np.exp(eta))
print(P)

0.013599894814523065
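
Because the model is linear on the log-odds scale, the slope has a direct reading: each one-unit increase in num_char (that is, 1000 more characters) multiplies the odds of spam by exp(slope). A minimal sketch using the rounded coefficients from above; the helper names are ours.

```python
import math

intercept = -1.80
slope = -0.0621

def spam_probability(num_char):
    """Predicted probability of spam; num_char is in thousands of characters."""
    eta = intercept + slope * num_char
    return 1 / (1 + math.exp(-eta))

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

# Odds ratio for one extra unit of num_char
ratio = odds(spam_probability(3)) / odds(spam_probability(2))
print(ratio, math.exp(slope))  # the two values agree, roughly 0.94
```

So under this model every additional 1000 characters lowers the odds of spam by about 6%, no matter how long the email already is.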

Model probability as a function of number of characters

intercept = -1.80
slope = -0.0621

P = []

num_characters = list(np.arange(0,200,10))

for num in num_characters:
    eta = intercept + slope*num
    P.append(np.exp(eta)/(1+np.exp(eta)))


# Plot the results
fig = px.scatter(DF_model,x='num_char',y='spam',opacity=.5)

fig.add_trace(
    px.line(x=num_characters,y=P,color_discrete_sequence=['black']).data[0]
)

fig.show()

What do we see here?

The probability of being spam decreases as the number of characters increases.
The probability of spam overall is really low!
What should the cutoff be?

Would you prefer an email with 2000 characters to be labeled as spam or not? How about 40,000 characters?

There is a 12.7% probability that an email with 2000 characters is spam
There is a 1.4% probability that an email with 40,000 characters is spam

Should any of these skip the inbox?

Sensitivity and specificity

False positive and negative

	Email is spam	Email is not spam
Email labelled spam	True positive	False positive (Type 1 error)
Email labelled not spam	False negative (Type 2 error)	True negative

False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)
False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN)

Sensitivity and specificity

	Email is spam	Email is not spam
Email labelled spam	True positive	False positive (Type 1 error)
Email labelled not spam	False negative (Type 2 error)	True negative

Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)
Sensitivity = 1 − False negative rate
Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)
Specificity = 1 − False positive rate
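
The four rates are easy to compute once you have the four counts. The sketch below uses made-up counts (100 emails, 20 of them spam) purely for illustration; classification_rates is our own helper name.

```python
def classification_rates(tp, fp, fn, tn):
    """Return the four rates defined above from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_negative_rate": 1 - sensitivity,
        "false_positive_rate": 1 - specificity,
    }

# Hypothetical counts for 100 emails, 20 of which are spam
rates = classification_rates(tp=15, fp=8, fn=5, tn=72)
print(rates)
```

Sensitivity and the false negative rate both use only the emails that really are spam, while specificity and the false positive rate both use only the emails that really are not spam.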

If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the trade-offs associated with each decision?

Python Confusion Matrix

Classes = [0,1] = [not spam, spam]

In the matrix returned by confusion_matrix(), rows are the actual classes and columns are the predicted classes. Treating spam (class 1) as the positive class:

	Labeled 0 (not spam)	Labeled 1 (spam)
Email is 0 (not spam)	True negative	False positive
Email is 1 (spam)	False negative	True positive

# import the metrics class
from sklearn import metrics

DF_model['prediction'] = LM.predict(X)

y_pred = DF_model['prediction'].values

cnf_matrix = metrics.confusion_matrix(y, y_pred)
cnf_matrix

array([[3554,    0],
       [ 367,    0]])

	True negative = 3554	False positive = 0
	False negative = 367	True positive = 0

TN = cnf_matrix[0,0] # The email was not spam and was correctly labeled 0
FP = cnf_matrix[0,1] # The email was not spam and was incorrectly labeled 1
FN = cnf_matrix[1,0] # The email was spam and was incorrectly labeled 0
TP = cnf_matrix[1,1] # The email was spam and was correctly labeled 1

What does this mean

This means we predicted that nothing was spam.
By default LogisticRegression() and .predict() use a 50% cutoff!

Is this what we want?

Change your decision probabilities

We can look into the .predict_proba() and use some programming to classify things.

.predict_proba() returns the probability of each class for each observation. In this case there will be two columns: the first is the probability that the class is 0 (not spam) and the second is the probability that the class is 1 (spam), for each row in the data frame.

# Create new columns in your data frame
DF_model[['prob not spam','prob spam']] = LM.predict_proba(X)

# Look at the model DF
DF_model

	num_char	spam	prediction	prob not spam	prob spam
0	11.370	0	0	0.924457	0.075543
1	10.504	0	0	0.920617	0.079383
2	7.773	0	0	0.907311	0.092689
3	13.256	0	0	0.932237	0.067763
4	1.231	0	0	0.867056	0.132944
...	...	...	...	...	...
3916	0.332	1	0	0.860491	0.139509
3917	0.323	1	0	0.860423	0.139577
3918	8.656	0	0	0.911819	0.088181
3919	10.185	0	0	0.919157	0.080843
3920	2.225	1	0	0.874008	0.125992

3921 rows × 5 columns

# Choose your cutoff

cutoff = .1 # if prob spam is greater than 10% it is labeled spam

# Use a lambda on the new columns to create your new prediction
DF_model['new_prediction'] = DF_model['prob spam'].apply(lambda x: 1 if x>cutoff else 0)

# Look at the model DF
DF_model

	num_char	spam	prediction	prob not spam	prob spam	new_prediction
0	11.370	0	0	0.924457	0.075543	0
1	10.504	0	0	0.920617	0.079383	0
2	7.773	0	0	0.907311	0.092689	0
3	13.256	0	0	0.932237	0.067763	0
4	1.231	0	0	0.867056	0.132944	1
...	...	...	...	...	...	...
3916	0.332	1	0	0.860491	0.139509	1
3917	0.323	1	0	0.860423	0.139577	1
3918	8.656	0	0	0.911819	0.088181	0
3919	10.185	0	0	0.919157	0.080843	0
3920	2.225	1	0	0.874008	0.125992	1

3921 rows × 6 columns

# Redo the confusion matrix

y_pred = DF_model['new_prediction'].values

cnf_matrix = metrics.confusion_matrix(y, y_pred)
cnf_matrix

array([[1831, 1723],
       [  62,  305]])

# Rows are the actual classes, columns are the predicted classes
TN = cnf_matrix[0,0] # not spam, labeled not spam
FP = cnf_matrix[0,1] # not spam, labeled spam
FN = cnf_matrix[1,0] # spam, labeled not spam
TP = cnf_matrix[1,1] # spam, labeled spam

print('False Negative Rate:')
print(f'{FN / (TP + FN):.4f}')
print('------')
print('False Positive Rate:')
print(f'{FP / (FP + TN):.4f}')
print('------')
print('Sensitivity:')
print(f'{TP / (TP + FN):.4f}')
print('------')
print('Specificity:')
print(f'{TN / (FP + TN):.4f}')

False Negative Rate:
0.1689
------
False Positive Rate:
0.4848
------
Sensitivity:
0.8311
------
Specificity:
0.5152

Remember to interpret your results!

False negative means an email was spam but was labeled not spam. This happened in 16.9% of the cases where the email was spam.
False positive means an email was not spam but was labeled spam. This happened in 48.5% of the cases where the email was not spam.
If the email was spam it has an 83.1% probability of being labeled spam.
If the email was not spam it has a 51.5% probability of being labeled not spam.
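
To get a feel for how the cutoff trades sensitivity against specificity, we can try several cutoffs on a handful of emails. The sketch below uses only the ten rows displayed in the DF_model table above (their "prob spam" and "spam" values), so the numbers it prints describe those ten rows and not the full data set. The helper rates_at is ours.

```python
# (prob spam, actual spam) for the ten rows displayed in the table above
rows = [
    (0.075543, 0), (0.079383, 0), (0.092689, 0), (0.067763, 0), (0.132944, 0),
    (0.139509, 1), (0.139577, 1), (0.088181, 0), (0.080843, 0), (0.125992, 1),
]

def rates_at(cutoff):
    """Sensitivity and specificity when prob > cutoff is labeled spam."""
    tp = sum(1 for p, y in rows if p > cutoff and y == 1)
    fn = sum(1 for p, y in rows if p <= cutoff and y == 1)
    fp = sum(1 for p, y in rows if p > cutoff and y == 0)
    tn = sum(1 for p, y in rows if p <= cutoff and y == 0)
    return tp / (tp + fn), tn / (fp + tn)

for cutoff in [0.05, 0.10, 0.13, 0.50]:
    sensitivity, specificity = rates_at(cutoff)
    print(cutoff, round(sensitivity, 3), round(specificity, 3))
```

A very low cutoff labels everything spam (perfect sensitivity, no specificity) and a very high cutoff labels nothing spam (the reverse). Choosing a cutoff means deciding which kind of mistake you would rather make.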

Exercise 1 Logistic Regression with ONE explanatory variable.

Choose another variable from the data set to use as your explanatory variable and create a Logistic Regression model to predict if an email is spam or not. You should do all of the following:

Say what variable you are using to predict spam messages (do some analysis, at minimum a value_counts()). Why do you think this is a good variable to use in predicting if an email is spam?
Create and fit a Logistic Regression model.
Show the results: intercept, coefficient, basic confusion matrix prediction.
What do you think the decision cutoff should be? Update the cutoff and redo the confusion matrix.
Explain your results in words. You should talk about False Negative and False positive rates and what they mean in terms of the variables you chose.

Exercise 2 - challenge Logistic Regression with MORE THAN ONE explanatory variable.

Try redoing the analysis, but this time add a few more explanatory variables. Again do some analysis of the variables you are choosing and state why they are a good choice. Then answer again questions 1-5.