Introduction to Data Science

Classifying Categorical Data - Logistic Regression

Author

Joanna Bieri
DATA101

Important Information

Announcements

Come to Lab! If you need help we are here to help!

Day 18 Assignment - same drill.

  1. Make sure you can Fork and Clone the Day18 repo from Redlands-DATA101
  2. Open the file Day18-HW.ipynb and start doing the problems.
    • You can do these problems as you follow along with the lecture notes and video.
  3. Get as far as you can before class.
  4. Submit what you have so far: commit and push to Git.
  5. Take the daily check in quiz on Canvas.
  6. Come to class with lots of questions!
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Machine Learning Packages
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression 

Data: A collection of Emails

  • Emails for the first three months of 2012 for an email account
  • Data from 3921 emails and 21 variables on them
  • Outcome: whether the email is spam or not
  • Predictors: number of characters, whether the email had “Re:” in the subject, time at which email was sent, number of times the word “inherit” shows up in the email, etc.

Data Information: https://www.openintro.org/data/index.php?data=email

This lab follows the Data Science in a Box units “Unit 4 - Deck 6: Logistic regression” by Mine Çetinkaya-Rundel. It has been updated for our class and translated to Python by Joanna Bieri.

file_name = 'data/email.csv'
DF = pd.read_csv(file_name)
DF
spam to_multiple from cc sent_email time image attach dollar winner ... viagra password num_char line_breaks format re_subj exclaim_subj urgent_subj exclaim_mess number
0 0 0 1 0 0 2012-01-01T06:16:41Z 0 0 0 no ... 0 0 11.370 202 1 0 0 0 0 big
1 0 0 1 0 0 2012-01-01T07:03:59Z 0 0 0 no ... 0 0 10.504 202 1 0 0 0 1 small
2 0 0 1 0 0 2012-01-01T16:00:32Z 0 0 4 no ... 0 0 7.773 192 1 0 0 0 6 small
3 0 0 1 0 0 2012-01-01T09:09:49Z 0 0 0 no ... 0 0 13.256 255 1 0 0 0 48 small
4 0 0 1 0 0 2012-01-01T10:00:01Z 0 0 0 no ... 0 2 1.231 29 0 0 0 0 1 none
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3916 1 0 1 0 0 2012-03-31T00:03:45Z 0 0 0 no ... 0 0 0.332 12 0 0 0 0 0 small
3917 1 0 1 0 0 2012-03-31T14:13:19Z 0 0 1 no ... 0 0 0.323 15 0 0 0 0 0 small
3918 0 1 1 0 0 2012-03-30T16:20:33Z 0 0 0 no ... 0 0 8.656 208 1 0 0 0 5 small
3919 0 1 1 0 0 2012-03-28T16:00:49Z 0 0 0 no ... 0 0 10.185 132 0 0 0 0 0 small
3920 1 0 1 0 0 2012-03-31T09:20:24Z 0 0 2 yes ... 0 0 2.225 65 0 0 1 0 1 small

3921 rows × 21 columns

DF.shape
(3921, 21)
DF.columns
Index(['spam', 'to_multiple', 'from', 'cc', 'sent_email', 'time', 'image',
       'attach', 'dollar', 'winner', 'inherit', 'viagra', 'password',
       'num_char', 'line_breaks', 'format', 're_subj', 'exclaim_subj',
       'urgent_subj', 'exclaim_mess', 'number'],
      dtype='object')
DF.describe()
spam to_multiple from cc sent_email image attach dollar inherit viagra password num_char line_breaks format re_subj exclaim_subj urgent_subj exclaim_mess
count 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000 3921.000000
mean 0.093599 0.158123 0.999235 0.404489 0.277990 0.048457 0.132874 1.467228 0.038001 0.002040 0.108136 10.706586 230.658505 0.695231 0.261413 0.080337 0.001785 6.584290
std 0.291307 0.364903 0.027654 2.666424 0.448066 0.450848 0.718518 5.022298 0.267899 0.127759 0.959931 14.645786 319.304959 0.460368 0.439460 0.271848 0.042220 51.479871
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.001000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.459000 34.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.856000 119.000000 1.000000 0.000000 0.000000 0.000000 1.000000
75% 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 14.084000 298.000000 1.000000 1.000000 0.000000 0.000000 4.000000
max 1.000000 1.000000 1.000000 68.000000 1.000000 20.000000 21.000000 64.000000 9.000000 8.000000 28.000000 190.087000 4022.000000 1.000000 1.000000 1.000000 1.000000 1236.000000
DF.dtypes
spam              int64
to_multiple       int64
from              int64
cc                int64
sent_email        int64
time             object
image             int64
attach            int64
dollar            int64
winner           object
inherit           int64
viagra            int64
password          int64
num_char        float64
line_breaks       int64
format            int64
re_subj           int64
exclaim_subj      int64
urgent_subj       int64
exclaim_mess      int64
number           object
dtype: object

How do we model categorical data for prediction?

Today we will learn how to do binary classification. Can we predict whether or not an email is spam? In other words, we are building a simple spam filter.

  • Given this data, what do you think might be predictive of a spam email?

  • Would you expect longer or shorter emails to be spam?

  • Would you expect emails that have subjects starting with “Re:”, “RE:”, “re:”, or “rE:” to be spam or not?

Analyze the number of characters in an email

DF[['num_char','spam']].groupby('spam').describe()
num_char
count mean std min 25% 50% 75% max
spam
0 3554.0 11.250517 14.510758 0.003 1.92525 6.831 15.5075 190.087
1 367.0 5.439204 14.920101 0.001 0.47250 1.046 3.2905 173.956
fig = px.histogram(DF,
                   x='num_char',
                   color='spam',
                   facet_col='spam',
                   nbins=50,
                   histnorm='probability density',
                   marginal="box",
                   color_discrete_sequence=px.colors.qualitative.Safe)


fig.update_layout(template="ggplot2",
                  bargap=0.02,
                  title='Spam vs. Number of Characters',
                  title_x=0.5,
                  showlegend=False,
                  autosize=False,
                  width=800,
                  height=500)

fig.show()

What do you notice?

  • Both distributions are very right skewed.
  • Both contain some outliers.
  • Spam messages have a much higher probability of being very short.

Analyze the subject title of the emails

Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”

DF['re_subj'].value_counts()
re_subj
0    2896
1    1025
Name: count, dtype: int64
fig = px.histogram(DF,
                   x='re_subj',
                   color='spam',
                   barnorm='percent',
                   color_discrete_sequence=px.colors.qualitative.Safe)

fig.update_layout(template="ggplot2",
                  title='Spam vs "re." in Subject',
                  title_x=0.5,
                  bargap = 0.1,
                  yaxis_title="",
                  xaxis_title='Re in Subject',
                  legend_title='Spam',
                  autosize=False,
                  width=800,
                  height=500)

fig.show()

What do you notice?

  • Spam messages tend not to have “Re:” in the subject line.
  • But a small number of spam messages do have “Re:” added to the subject line.

Modeling spam

  • Both number of characters and whether the message has “re:” in the subject might be related to whether the email is spam. How do we come up with a model that will let us explore this relationship?
  • For simplicity, we’ll focus on the number of characters (num_char) as predictor, but the model we describe can be expanded to take multiple predictors as well.
  • This isn’t something we can reasonably fit a linear model to – we need something different!

Why not?

# Get a subset of the columns (copy so we can safely add new columns later)
DF_model = DF[['num_char','spam']].copy()

# Get the variables
X = DF_model['num_char'].values.reshape(-1,1)
y = DF_model['spam']

# Do the regression
LM = LinearRegression()
LM.fit(X,y)

# Save the predicted values to the data frame
DF_model['prediction'] = LM.predict(X)

# Plot the results
fig = px.scatter(DF_model,x='num_char',y='spam')

fig.add_trace(
    px.line(DF_model, x='num_char', y='prediction',color_discrete_sequence=['black']).data[0]
)

fig.show()

print(LM.coef_)
print(LM.intercept_)
print(LM.score(X,y))
[-0.00229906]
0.11821360854534321
0.013360530323788145
  • What would it even mean to use this model?
  • Should the output of the model be numbers other than 0 or 1?
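
One way to see the problem is to plug a very long email into the fitted line. The sketch below reuses the slope and intercept printed above and the largest num_char value from the DF.describe() output (190.087). The helper name linear_prediction is only for illustration and is not part of the original notebook.

```python
# Coefficients copied from the LinearRegression output above
intercept = 0.11821360854534321
slope = -0.00229906

def linear_prediction(num_char):
    """Value of the fitted straight line at a given num_char."""
    return intercept + slope * num_char

# Longest email in the data set (max of num_char in DF.describe())
longest = 190.087

print(linear_prediction(0))        # prediction for the shortest possible email
print(linear_prediction(longest))  # prediction for the longest email
```

The second value is negative, so it cannot be read as a probability, and it is not one of the labels 0 or 1 either.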

Logistic Regression - Framing the problem

  • We can treat each outcome (spam and not) as successes and failures arising from separate Bernoulli trials

  • Bernoulli trial: a random experiment (flipping a coin) with exactly two possible outcomes, “success” and “failure”, in which the probability of success is the same every time the experiment is conducted

  • Each Bernoulli trial can have a separate probability of success

  • We can then use the predictor variables (number of characters) to model the probability of success \(p_i\) (the probability that email \(i\) is spam)

  • We can’t just use a linear model for \(p_i\) (since \(p_i\) must be between 0 and 1) but we can transform the linear model to have the appropriate range.

Generalized linear models (GLM)

  • This is a very general way of addressing many problems in regression and the resulting models are called generalized linear models (GLMs)

  • Logistic regression is just one example

Three characteristics of GLMs

All GLMs have the following three characteristics:

  1. A probability distribution describing a generative model for the outcome variable

  2. A linear model \[\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]

  3. A link function that relates the linear model to the parameter of the outcome distribution. In our case it relates the value of \(\eta\) back to the probability of success.

Logistic regression

  • Logistic regression is a GLM used to model a binary categorical outcome (two possible answers) using numerical and categorical predictors

  • To finish specifying the Logistic model we just need to define a reasonable link function that connects \(\eta_i\) to \(p_i\): logit function

  • Logit function: For \(0 < p < 1\)

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]

p = np.arange(0.01,1,.01)
lp = np.log(p/(1-p))

fig = px.line(x=p,y=lp,range_x=[-.01, 1.01])

fig.update_layout(title='Logit Function',
                  title_x=0.5,
                  template = 'ggplot2',
                  yaxis_title="logit(p)",
                  xaxis_title='p'
                  )
                  

fig.show()

Properties of the logit

  • The logit function takes a value between 0 and 1 and maps it to a value between \(-\infty\) and \(\infty\)

  • Inverse logit (logistic) function: \[g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}\]

  • The inverse logit function takes a value between \(-\infty\) and \(\infty\) and maps it to a value between 0 and 1

  • This formulation is also useful for interpreting the model, since the logit can be interpreted as the log odds of a success

# The input to the logistic function is any real number x (a log odds), not a probability
x = np.arange(-10,10,.01)
logistic = 1/(1+np.exp(-x))

fig = px.line(x=x,y=logistic,range_y=[-.01, 1.01])

fig.update_layout(title='Logistic Function',
                  title_x=0.5,
                  template = 'ggplot2',
                  yaxis_title="logistic(x)",
                  xaxis_title='x'
                  )
                  

fig.show()
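
To check that the logit and the logistic function really undo each other, here is a minimal sketch using only Python's math module. The function names logit and logistic are chosen for this example.

```python
import math

def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def logistic(x):
    """Inverse logit: maps any real number to a value between 0 and 1."""
    return 1 / (1 + math.exp(-x))

# Go from a probability to its log odds and back again
for p in [0.1, 0.5, 0.9]:
    x = logit(p)
    print(p, round(x, 4), round(logistic(x), 4))
```

Each row should end with the probability it started with, and a probability of 0.5 corresponds to log odds of 0.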

The logistic regression model

  • Based on the three GLM criteria we have
    • \(y_i \sim \text{Bern}(p_i)\) - each outcome is the result of a Bernoulli trial
    • \(\eta_i = \beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}\) - we have a linear model
    • \(\text{logit}(p_i) = \eta_i\) - we have a link function
  • From which we get

\[p_i = \frac{e^{(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}{1+e^{(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}\]

This looks a little scary, but all we did was plug our linear model into the logistic function! This means we can predict the probability of success for our \(i^{th}\) observation.

Building a Logistic Regression Model in Python

# Get a subset of the columns (copy so we can safely add new columns later)
DF_model = DF[['num_char','spam']].copy()

# Get the variables
X = DF_model['num_char'].values.reshape(-1,1)
y = DF_model['spam']

# Do the regression
LM = LogisticRegression()
LM.fit(X,y)

print('Classes:')
print(LM.classes_)
print('Coefficients:')
print(LM.coef_)
print('Intercept:')
print(LM.intercept_)
Classes:
[0 1]
Coefficients:
[[-0.06207082]]
Intercept:
[-1.79876282]

What does this output mean?

Well, we have a slope and an intercept for our linear model:

\[\eta = -1.80-0.0621(numchar)\]

Set this equal to the logit of the probability:

\[\log\left(\frac{p}{1-p}\right) = -1.80-0.0621(numchar)\]

Solve for the probability. First exponentiate both sides to get rid of the log():

\[\frac{p}{1-p} = e^{-1.80-0.0621(numchar)}\]

\[p = \frac{e^{-1.80-0.0621(numchar)}}{1+e^{-1.80-0.0621(numchar)}}\]

What is the probability that an email with 2000 characters is spam? (num_char is measured in thousands of characters, so we use num = 2.)

num = 2
intercept = -1.80
slope = -0.0621

eta = intercept + slope*num

P = np.exp(eta)/(1+np.exp(eta))
print(P)
0.1273939433505534

What is the probability that an email with 40000 characters is spam?

num = 40
intercept = -1.80
slope = -0.0621

eta = intercept + slope*num

P = np.exp(eta)/(1+np.exp(eta))
print(P)
0.013599894814523065
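
Since the logit is the log odds, the slope can also be read as an odds ratio. The sketch below uses the same rounded intercept and slope as the calculations above and compares the odds of spam for two emails that differ by one unit of num_char (1000 characters). The helper odds_of_spam is introduced only for this example.

```python
import math

intercept = -1.80
slope = -0.0621

def odds_of_spam(num):
    """Odds p/(1-p) of spam for an email with num thousand characters."""
    return math.exp(intercept + slope * num)

# Ratio of the odds for two emails that differ by 1 (thousand characters)
ratio = odds_of_spam(3) / odds_of_spam(2)

print(ratio)
print(math.exp(slope))  # the same number, computed directly from the slope
```

Both values are about 0.94, so under this model each additional 1000 characters multiplies the odds of spam by roughly 0.94.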

Model probability as a function of number of characters

intercept = -1.80
slope = -0.0621

P = []

num_characters = list(np.arange(0,200,10))

for num in num_characters:
    eta = intercept + slope*num
    P.append(np.exp(eta)/(1+np.exp(eta)))


# Plot the results
fig = px.scatter(DF_model,x='num_char',y='spam',opacity=.5)

fig.add_trace(
    px.line(x=num_characters,y=P,color_discrete_sequence=['black']).data[0]
)

fig.show()

What do we see here?

  • The probability of being spam decreases as the number of characters increases.
  • The probability of spam overall is really low!
  • What should the cutoff be?

Would you prefer an email with 2000 characters to be labeled as spam or not? How about 40,000 characters?

  • There is a 12.7% probability that an email with 2000 characters is spam
  • There is a 1.4% probability that an email with 40,000 characters is spam

Should any of these skip the inbox?

Sensitivity and specificity

False positive and negative

                          Email is spam                   Email is not spam
Email labelled spam       True positive                   False positive (Type 1 error)
Email labelled not spam   False negative (Type 2 error)   True negative
  • False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)

  • False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN)

Sensitivity and specificity

                          Email is spam                   Email is not spam
Email labelled spam       True positive                   False positive (Type 1 error)
Email labelled not spam   False negative (Type 2 error)   True negative
  • Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)

  • Sensitivity = 1 − False negative rate

  • Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)

  • Specificity = 1 − False positive rate
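
The four rates above can be collected in one small function. This is a sketch with spam as the positive class, matching the table above. The function name classification_rates and the counts passed to it are hypothetical and are not taken from the email data.

```python
def classification_rates(tp, fp, fn, tn):
    """Return the four rates defined above, with spam as the positive class."""
    return {
        "false_negative_rate": fn / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }

# Hypothetical counts, made up for illustration only
rates = classification_rates(tp=40, fp=5, fn=10, tn=45)

for name, value in rates.items():
    print(name, value)
```

Notice that sensitivity and the false negative rate share a denominator (all emails that really are spam), and so do specificity and the false positive rate (all emails that really are not spam).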

If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the trade-offs associated with each decision?

Python Confusion Matrix

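Classes = [0,1] = [not spam, spam]. Here class 0 (not spam) is treated as the “positive” result. In the matrix returned by confusion_matrix, rows are the actual classes and columns are the predicted labels.
                  Labeled 0        Labeled not 0
Result is 0       True positive    False negative
Result is not 0   False positive   True negative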
# import the metrics class
from sklearn import metrics

DF_model['prediction'] = LM.predict(X)

y_pred = DF_model['prediction'].values

cnf_matrix = metrics.confusion_matrix(y, y_pred)
cnf_matrix
array([[3554,    0],
       [ 367,    0]])
True positive = 3554 False negative = 0
False positive = 367 True negative = 0
TP = cnf_matrix[0,0] # The email was not spam and was correctly labeled 0
FN = cnf_matrix[0,1] # The email was not spam and was incorrectly labeled 1
FP = cnf_matrix[1,0] # The email was spam and was incorrectly labeled 0
TN = cnf_matrix[1,1] # The email was spam and was correctly labeled 1

What does this mean?

  • This means we predicted that nothing was spam.
  • By default LogisticRegression() and .predict() use a 50% cutoff!
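
Why does nothing reach the 50% cutoff? Because the slope is negative, the predicted probability of spam is largest for the shortest emails. The sketch below evaluates the model at num_char = 0 using the intercept and coefficient printed earlier. The helper prob_spam is introduced only for this example.

```python
import math

# Values copied from LM.intercept_ and LM.coef_ printed earlier
intercept = -1.79876282
slope = -0.06207082

def prob_spam(num_char):
    """Modeled probability of spam for a given num_char."""
    eta = intercept + slope * num_char
    return 1 / (1 + math.exp(-eta))

# The probability can only go down from here as num_char grows
largest = prob_spam(0)

print(largest)
print(largest > 0.5)
```

The largest probability the model can produce is about 0.14, so with a 50% cutoff no email is ever labeled spam.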

Is this what we want?

Change your decision probabilities

We can look into the .predict_proba() and use some programming to classify things.

.predict_proba() returns the probability of each class for each observation. In this case there will be two columns: the first is the probability that we got 0 and the second is the probability that we got 1, for each row in the data frame.

# Create new columns in your data frame
DF_model[['prob not spam','prob spam']] = LM.predict_proba(X)
# Look at the model DF
DF_model
num_char spam prediction prob not spam prob spam
0 11.370 0 0 0.924457 0.075543
1 10.504 0 0 0.920617 0.079383
2 7.773 0 0 0.907311 0.092689
3 13.256 0 0 0.932237 0.067763
4 1.231 0 0 0.867056 0.132944
... ... ... ... ... ...
3916 0.332 1 0 0.860491 0.139509
3917 0.323 1 0 0.860423 0.139577
3918 8.656 0 0 0.911819 0.088181
3919 10.185 0 0 0.919157 0.080843
3920 2.225 1 0 0.874008 0.125992

3921 rows × 5 columns
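
These probabilities are the logistic function applied to the fitted linear model. As a rough check, the sketch below recomputes the first row of the table by hand from the intercept and coefficient printed earlier. The helper spam_probabilities is a name made up for this example.

```python
import math

# Values copied from LM.intercept_ and LM.coef_ printed earlier
intercept = -1.79876282
slope = -0.06207082

def spam_probabilities(num_char):
    """Return (prob not spam, prob spam) from the fitted logistic model."""
    eta = intercept + slope * num_char
    p_spam = 1 / (1 + math.exp(-eta))
    return 1 - p_spam, p_spam

# First row of the table: num_char = 11.370
p_not, p_spam = spam_probabilities(11.370)

print(round(p_not, 6), round(p_spam, 6))
```

The result should agree with the first row of the table (0.924457 and 0.075543) up to the rounding of the printed coefficients.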

# Choose your cutoff

cutoff = .1 #if prob spam is above 10% it is labeled spam

# Use a lambda on the new columns to create your new prediction
DF_model['new_prediction'] = DF_model['prob spam'].apply(lambda x: 1 if x>cutoff else 0)

# Look at the model DF
DF_model
num_char spam prediction prob not spam prob spam new_prediction
0 11.370 0 0 0.924457 0.075543 0
1 10.504 0 0 0.920617 0.079383 0
2 7.773 0 0 0.907311 0.092689 0
3 13.256 0 0 0.932237 0.067763 0
4 1.231 0 0 0.867056 0.132944 1
... ... ... ... ... ... ...
3916 0.332 1 0 0.860491 0.139509 1
3917 0.323 1 0 0.860423 0.139577 1
3918 8.656 0 0 0.911819 0.088181 0
3919 10.185 0 0 0.919157 0.080843 0
3920 2.225 1 0 0.874008 0.125992 1

3921 rows × 6 columns
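
With only one predictor, a probability cutoff is the same thing as a cutoff on num_char. The sketch below solves logit(cutoff) = intercept + slope × num_char for num_char, using the coefficients printed earlier. The names threshold and new_prediction are chosen for this example.

```python
import math

# Values copied from LM.intercept_ and LM.coef_ printed earlier
intercept = -1.79876282
slope = -0.06207082
cutoff = 0.1

# Solve logit(cutoff) = intercept + slope * num_char for num_char
threshold = (math.log(cutoff / (1 - cutoff)) - intercept) / slope
print(threshold)

def new_prediction(num_char):
    """1 (spam) when the modeled probability exceeds the cutoff, else 0."""
    p = 1 / (1 + math.exp(-(intercept + slope * num_char)))
    return 1 if p > cutoff else 0

# Rows 2 and 4 of the table: num_char = 7.773 and 1.231
print(new_prediction(7.773), new_prediction(1.231))
```

The threshold comes out at about 6.4, so with a 10% cutoff this model labels every email shorter than roughly 6,400 characters as spam, which matches the labels in rows 2 and 4 of the table.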

# Redo the confusion matrix

y_pred = DF_model['new_prediction'].values

cnf_matrix = metrics.confusion_matrix(y, y_pred)
cnf_matrix
array([[1831, 1723],
       [  62,  305]])
# Rows are the actual classes, columns are the predicted labels
TP = cnf_matrix[0,0] # not spam, labeled not spam
FN = cnf_matrix[0,1] # not spam, labeled spam
FP = cnf_matrix[1,0] # spam, labeled not spam
TN = cnf_matrix[1,1] # spam, labeled spam

print('False Negative Rate:')
print(round(FN/ (TP+FN), 4))
print('------')
print('False Positive Rate:')
print(round(FP/ (FP+TN), 4))
print('------')
print('Sensitivity:')
print(round(1 - (FN/ (TP+FN)), 4))
print('------')
print('Specificity:')
print(round(1- (FP/ (FP+TN)), 4))
False Negative Rate:
0.4848
------
False Positive Rate:
0.1689
------
Sensitivity:
0.5152
------
Specificity:
0.8311

Remember to interpret your results!

Here class 0 (not spam) is the “positive” class.

  • False negative means an email was not spam and it was labeled spam. This happened in 48.5% of cases where the email was not spam.

  • False positive means that an email was spam and was labeled not spam. This happened in 16.9% of the cases where the email was spam.

  • If the email was not spam it has a 51.5% probability of being labeled not spam.

  • If the email was spam it has an 83.1% probability of being labeled spam.
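
It can help to redo this calculation with spam (class 1) as the positive class, as in the table from the Sensitivity and specificity section. The sketch below starts from the confusion matrix printed above, written as a plain nested list, and relies on the scikit-learn convention that rows are the actual classes and columns are the predicted labels.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted label
cnf = [[1831, 1723],
       [62, 305]]

# Treat spam (class 1) as the positive class
tn, fp = cnf[0]   # actual not spam: labeled not spam, labeled spam
fn, tp = cnf[1]   # actual spam: labeled not spam, labeled spam

sensitivity = tp / (tp + fn)   # share of spam emails labeled spam
specificity = tn / (tn + fp)   # share of non-spam emails labeled not spam

print(round(sensitivity, 4), round(specificity, 4))
```

With a 10% cutoff the filter catches about 83% of the spam, but only about 52% of the legitimate emails make it to the inbox.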

Exercise 1 Logistic Regression with ONE explanatory variable.

Choose another variable from the data set to use as your explanatory variable and create a Logistic Regression model to predict if an email is spam or not. You should do all of the following:

  1. Say what variable you are using to predict spam messages (do some analysis, at minimum a value_counts()). Why do you think this is a good variable to use in predicting if an email is spam?
  2. Create and fit a Logistic Regression model.
  3. Show the results: intercept, coefficient, basic confusion matrix prediction.
  4. What do you think the decision cutoff should be? Update the cutoff and redo the confusion matrix.
  5. Explain your results in words. You should talk about False Negative and False positive rates and what they mean in terms of the variables you chose.

Exercise 2 - challenge Logistic Regression with MORE THAN ONE explanatory variable.

Try redoing the analysis, but this time add a few more explanatory variables. Again do some analysis of the variables you are choosing and state why they are a good choice. Then answer questions 1-5 again.