import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Introduction to Data Science
Introduction to Modeling and Algorithms
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours: Duke 209 Click Here for Joanna’s Schedule
Announcements
If you are behind it is time to get 100% caught up! I will expect to see you at one of the labs!
Day 15 Assignment - same drill.
- Make sure you can Fork and Clone the Day15 repo from Redlands-DATA101
- Open the file Day15-HW.ipynb and start doing the problems.
- You can do these problems as you follow along with the lecture notes and video.
- Get as far as you can before class.
- Submit what you have so far Commit and Push to Git.
- Take the daily check in quiz on Canvas.
- Come to class with lots of questions!
————————
Today we start venturing into the world of Algorithms and Models. During our ethics week we had lots of discussions about ethical issues with algorithmic bias and decision making using data. Now you are going to get a peek behind the scenes to see how these models are built.
These words are all related!
Mathematical Model – Data Science Prediction – Algorithm – Machine Learning – Artificial Intelligence
What is a Mathematical Model?
A model is used to explain the relationship between variables so that we can make predictions. Ideally we start by doing some exploratory data analysis (EDA) and some basic visualizations to get a feel for our data. Then we start asking some questions like:
- Is there a relationship between A and B?
- If I knew A could I predict B?
This lab follows the Data Science in a Box units “The Language of Models and Fitting and Interpreting Models” by Mine Çetinkaya-Rundel. It has been updated for our class and translated to Python by Joanna Bieri.
Linear vs Nonlinear Models:
A linear model can be described by a straight line.
\[ Y = mX + b \]
Here our goal is to use \(X\) to predict \(Y\). Another example of a linear model is training a machine to make a decision based on crossing a straight line boundary.
Code for creating the linear example graph
# Parameters for the linear relationship
m = 2.5            # slope
b = 5.0            # intercept
num_points = 100   # number of data points
noise_level = 2.0  # standard deviation of noise

# Generate random independent variable (x) values
x = np.random.uniform(0, 10, num_points)

# Generate the dependent variable (y) values using the linear equation
# Adding some random noise (epsilon) to simulate real-world data
epsilon = np.random.normal(0, noise_level, num_points)
y = m * x + b + epsilon
y_nonoise = m * x + b

DF = pd.DataFrame()
DF['x'] = x
DF['y'] = y
DF['yn'] = y_nonoise

DF = DF.sort_values('x')

fig = px.scatter(DF, x='x', y='y')

# Add a line to the plot
fig.add_trace(
    px.line(DF, x='x', y='yn', color_discrete_sequence=['red']).data[0]
)

fig.show()
A nonlinear model cannot be described by a straight line. It might be modeled by a wide range of other functions!
\[ Y = f(X) \]
So that we can use \(X\) to predict \(Y\) and \(f\) is a nonlinear function, or training a machine to make a decision based on crossing a nonlinear boundary.
Code for creating the nonlinear example graph
# Parameters for the nonlinear relationship (cubic)
a = 1.5            # coefficient for x^3
b = -2.0           # coefficient for x
c = 3.0            # constant term
num_points = 100   # number of data points
noise_level = 1.0  # standard deviation of noise

# Generate random independent variable (x) values
x = np.random.uniform(-2, 2, num_points)

# Generate the dependent variable (y) values using the cubic equation
# Adding some random noise (epsilon) to simulate real-world data
epsilon = np.random.normal(0, noise_level, num_points)
y = a * x**3 + b * x + c + epsilon
y_nonoise = a * x**3 + b * x + c

DF = pd.DataFrame()
DF['x'] = x
DF['y'] = y
DF['yn'] = y_nonoise

DF = DF.sort_values('x')

fig = px.scatter(DF, x='x', y='y')

# Add a line to the plot
fig.add_trace(
    px.line(DF, x='x', y='yn', color_discrete_sequence=['red']).data[0]
)

fig.show()
Paris Paintings Data
To explore the ideas of modeling data we will use the Paris Paintings dataset.
- Source: Printed catalogs of 28 auction sales in Paris, 1764 - 1780 (Historical Data)
- Data curators Sandra van Ginhoven and Hilary Coe Cronheim (who were PhD students in the Duke Art, Law, and Markets Initiative at the time of putting together this dataset) translated and tabulated the catalogs
- 3393 paintings, their prices, and descriptive details from the sales catalogs, spread across more than 60 variables
They looked at individual catalogs and entered the data by hand. This means that there are some inconsistencies.
n/a vs NA vs empty - all of these should be considered NaN.
Here is an example of what an art auction looks like today. The people at this auction are looking at pamphlets that even include color photos of the painting that is up for sale. Today there are all sorts of ways of getting info about what is up for auction.
In historical times, this was not possible. Instead the information was communicated in printed catalogs without images. This is what the auction of that time might look like.
Pierre-Antoine de Machy, Public Sale at the Hôtel Bullion, Musée Carnavalet, Paris (18th century)
The researchers were interested in the Art market. They were students of economics and art history. Their question: Were the auctioneers doing anything to motivate the large increase in sales between the years of 1764-1780?
Plot credit: Sandra van Ginhoven
Historical Example
Two paintings very rich in composition, of a beautiful execution, and whose merit is very remarkable, each 17 inches 3 lines high, 23 inches wide; the first, painted on wood, comes from the Cabinet of Madame la Comtesse de Verrue; it represents a departure for the hunt: it shows in the front a child on a white horse, a man who gives the horn to gather the dogs, a falconer and other figures nicely distributed across the width of the painting; two horses drinking from a fountain; on the right in the corner a lovely country house topped by a terrace, on which people are at the table, others who play instruments; trees and fabriques pleasantly enrich the background.
There is data in this description!
Variables in Paris Paintings Data
file_location = 'https://joannabieri.com/introdatascience/data/paris-paintings.csv'
DF_raw_paintings = pd.read_csv(file_location, na_filter=False)
show(DF_raw_paintings)
[Interactive table of the full Paris Paintings dataset: one row per painting, with columns for the sale, dealer, year, price, author, dimensions, materials, and subject-matter indicators.]
Let's look at the information for the painting described above.
mask = DF_raw_paintings['name'] == 'R1777-89a'
show(DF_raw_paintings[mask])
[Interactive table showing the single row for painting R1777-89a.]
Some Data Science Questions
- Are any of these variables linearly related?
- Does the size of the painting matter?
- Did the auctioneers do something to drive up sales prices?
- Can we model the prices of the paintings?
- Are higher prices associated with certain features highlighted in the description?
# Make a copy of the data that we can start working on
DF = DF_raw_paintings.copy()

# Do something about all those different NaNs
DF.replace('', np.nan, inplace=True)
DF.replace('n/a', np.nan, inplace=True)
DF.replace('NaN', np.nan, inplace=True)
Modeling - Relationship Between Variables
Is there a linear relationship between the height and width of a painting?
Q Make histograms of the height and width of all the paintings in the data set. You should be able to recreate the plots below without looking at the code.
# Look at the data types in these columns
my_cols = ['Height_in', 'Width_in']
DF[my_cols].dtypes
Height_in object
Width_in object
dtype: object
# Update the types - these should be floats
DF['Height_in'] = DF['Height_in'].apply(lambda x: float(x))
DF['Width_in'] = DF['Width_in'].apply(lambda x: float(x))
DF[my_cols].dtypes
Height_in float64
Width_in float64
dtype: object
Histogram of Height
fig = px.histogram(DF, x='Height_in')

fig.update_layout(template="ggplot2",
                  xaxis_title="Height in inches",
                  yaxis_title="")

fig.show()
Histogram of Width
fig = px.histogram(DF, x='Width_in')

fig.update_layout(template="ggplot2",
                  xaxis_title="Width in inches",
                  yaxis_title="")

fig.show()
Q Explain in words what these plots tell you about the data.
Models as Functions
- We can represent relationships between variables using functions
- A function is a mathematical concept: the relationship between an output and one or more inputs
Plug in the inputs and receive back the output
Example: The formula \[y = 3x + 7\] is a function with input \(x\) and output \(y\).
If \(x\) is \(5\), then \(y\) is \(22\), because \(y = 3 \times (5) + 7 = 22\)
flowchart LR A(Inputs) --> B[Function] --> C(Output)
flowchart LR A(x=5) --> B[3 * 5 + 7 = 22] --> C(y=22)
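The same function written as Python code (a tiny illustration, not from the original notes):

# A model is just a function: plug in the input x, get back the output y
def model(x):
    return 3 * x + 7

print(model(5))   # prints 22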
Can we find a function that describes the relationship between height and width of painting?
Start by looking at a scatter plot. Is there a relationship at all? Remember, we are assuming that the data is roughly linear, that is, that we can model it with a straight line.
Q Make a scatter plot of the width vs the height like the one below. You should be able to recreate the plots here without looking at the code.
Scatterplot Width vs Height
fig = px.scatter(DF, x='Width_in', y='Height_in', color_discrete_sequence=['black'])

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)")

fig.show()
Looking at this data there might be a linear relationship. We can add a linear fit line to this plot using the command:
trendline='ols'
This uses Ordinary Least Squares fitting to find a reasonable line. We will learn later how we actually find these lines, but here you will just see how to make them appear in the plot.
fig = px.scatter(DF,
                 x='Width_in',
                 y='Height_in',
                 color_discrete_sequence=['black'],
                 trendline='ols',
                 trendline_scope='overall',
                 trendline_color_override='blue')

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)")

fig.show()
results = px.get_trendline_results(fig)
info = results.iloc[0]["px_fit_results"].summary()
print(info)
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.683
Model: OLS Adj. R-squared: 0.683
Method: Least Squares F-statistic: 6749.
Date: Fri, 08 Nov 2024 Prob (F-statistic): 0.00
Time: 10:41:38 Log-Likelihood: -11083.
No. Observations: 3135 AIC: 2.217e+04
Df Residuals: 3133 BIC: 2.218e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.6214 0.254 14.265 0.000 3.124 4.119
x1 0.7808 0.010 82.150 0.000 0.762 0.799
==============================================================================
Omnibus: 1258.323 Durbin-Watson: 1.707
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27933.347
Skew: 1.375 Prob(JB): 0.00
Kurtosis: 17.363 Cond. No. 45.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
So the line that “fits” this data based on the code we ran is
\[ H = 0.7808 W + 3.6214 \]
Q Where do we think this prediction is most accurate? Where is there the most error? Explain why you think this.
Adding uncertainty to the plot - advanced code
fig = px.scatter(DF,
                 x='Width_in',
                 y='Height_in',
                 color_discrete_sequence=['black'],
                 trendline='ols',
                 trendline_scope='overall',
                 trendline_color_override='blue')

# Slope, intercept, and their standard errors from the OLS summary above
m_se = 0.010
b_se = 0.254
m = 0.7808
b = 3.6214

DF_error = pd.DataFrame()
DF_error['w'] = DF['Width_in']
DF_error['ym'] = (m - m_se)*DF_error['w'] + b - b_se
DF_error['yp'] = (m + m_se)*DF_error['w'] + b + b_se
DF_error = DF_error.sort_values('w')

# Add a lower uncertainty line
fig.add_trace(
    px.line(DF_error, x='w', y='ym', color_discrete_sequence=['lightblue']).data[0]
)

# Add an upper uncertainty line
fig.add_trace(
    px.line(DF_error, x='w', y='yp', color_discrete_sequence=['lightblue']).data[0]
)

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)")

fig.show()
Did we have to pick this trendline?
No, there are other choices:
- ‘ols’: Ordinary Least Squares regression line (linear regression).
- ‘lowess’: Locally Weighted Scatterplot Smoothing (non-linear regression).
- ‘rolling’: Rolling window calculations (e.g., rolling average, rolling median).
- ‘expanding’: Expanding window calculations (e.g., expanding average, expanding sum).
- ‘ewm’: Exponentially Weighted Moving Average (EWMA).
Some of these we will cover in the next few weeks, but some will wait for DATA 201.
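For example, swapping in a LOWESS smoother only means changing the trendline argument. A minimal sketch (plotly computes its trendlines with the statsmodels package, so that needs to be installed):

# Same scatter plot, but with a nonlinear LOWESS smoother instead of a straight line
fig = px.scatter(DF,
                 x='Width_in',
                 y='Height_in',
                 color_discrete_sequence=['black'],
                 trendline='lowess',
                 trendline_color_override='blue')
fig.show()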
Vocab for Modeling
Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis
Explanatory variables: Other variables that you want to use to explain the variation in the response, on the x-axis
Predicted value: Output of the model function
- The model function gives the typical (expected) value of the response variable conditioning on the explanatory variables
Residuals: A measure of how far each case is from its predicted value (based on a particular model)
- Residual = Observed value - Predicted value
- Tells how far above/below the expected value each case is
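Using the fitted line \(H = 0.7808 W + 3.6214\) from above, here is a quick sketch (not part of the original notes) of computing every painting's residual:

# Residual = Observed value - Predicted value
predicted = 0.7808 * DF['Width_in'] + 3.6214
residual = DF['Height_in'] - predicted
print(residual.describe())   # the residuals average out to roughly zero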
Adding residual color to the plot - advanced code
# Add some conditional coloring - advanced Python list comprehension
DF['residual_hw'] = ['Positive residual' if H > 0.7808*W + 3.6214 else 'Negative residual'
                     for H, W in zip(DF['Height_in'], DF['Width_in'])]

fig = px.scatter(DF,
                 x='Width_in',
                 y='Height_in',
                 color='residual_hw',
                 color_discrete_sequence=['purple', 'black'],
                 trendline='ols',
                 trendline_scope='overall',
                 trendline_color_override='blue')

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)",
                  legend_title='')

fig.show()
Investigate the clustering of points
When creating scatter plots, sometimes it is hard to see when there are lots of points on top of each other. Changing the opacity (or alpha value) can help! This will make the points in the plot more transparent.
Scatterplot Width vs Height - Opacity=0.2
fig = px.scatter(DF,
                 x='Width_in',
                 y='Height_in',
                 color_discrete_sequence=['black'],
                 opacity=0.2)

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)")

fig.show()
What starts to be apparent?
- We see a V in the plot!
- How many straight lines would you draw to model this data?
- What is going on here?
Painting types
- Landscape painting is the depiction in art of landscapes – natural scenery such as mountains, valleys, trees, rivers, and forests, especially where the main subject is a wide view – with its elements arranged into a coherent composition.
- Landscape paintings tend to be wider than they are tall.
- Portrait painting is a genre in painting, where the intent is to depict a human subject.
- Portrait paintings tend to be taller than they are wide.
Source: Wikipedia, Landscape painting
Source: Wikipedia, Portrait painting
# Look at the landscape data
DF['landsALL'].value_counts().rename_axis('Landscape').reset_index(name='Counts')
| Landscape | Counts |
|---|---|
| 0 | 2119 |
| 1 | 1274 |
# We want to think of this as categorical!
# 0 = not landscape, 1= landscape
# One way to do this is to turn the data into a string
DF['landsALL'] = DF['landsALL'].apply(lambda x: str(x))
Scatterplot Width vs Height - Opacity=0.2
fig = px.scatter(DF,
                 x='Width_in',
                 y='Height_in',
                 color='landsALL',
                 opacity=0.2,
                 trendline='ols')

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)")

fig.show()
We added another variable by coloring by whether or not there were landscape features in the painting. With trendline='ols' we now get two lines, one for each group. Here there is some evidence that if a painting is a landscape, it tends to be wider than it is tall.
Why is one line shorter? We could extend this line; this is called extrapolation. Be very careful with this, because extrapolation can introduce a lot of uncertainty. We really don't have enough data to confidently extend this line.
If we wanted to look at the data for two trendlines we would run:
results = px.get_trendline_results(fig)
info = results.iloc[0]["px_fit_results"].summary()
print(info)
results = px.get_trendline_results(fig)
info = results.iloc[1]["px_fit_results"].summary()
print(info)
Models - upsides and downsides
- Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data.
- There is a real risk, however, that a model is imposing structure that is not really there on the scatter of data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted.
- Just because you fit a model does not mean you found a true relationship.
Variation around the model.
Variation or Uncertainty is just as important as the model, if not more!
Statistics is the explanation of variation in the context of what remains unexplained.
- The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role.
- Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model.
How do we use models?
TWO MAIN USES:
- Explanation: Characterize the relationship between \(y\) and \(x\) via slopes for numerical explanatory variables or differences for categorical explanatory variables
- Prediction: Plug in \(x\), get the predicted \(y\)
Exercise 1 Redo the plot above except color by some other variable that takes values of zero or one. Describe what the ols trendline tells you about the height and width of that type of painting. For example, does a painting being described as pastoral mean it is taller or wider than if it is not pastoral?
- Choose a column that has just 0 and 1 as entries
- Change those values into strings using apply and lambda
- Do a value counts and talk about the balance in the data
- Create a scatter plot with an ols trendline colored by your focal column
- Describe in words what your plot is telling you.
————————
Predicting the Height of a Painting given the Width - Regression
Now we are ready to build a model for our paintings using Python Machine Learning. We will assume a linear model.
\[ height_i = \beta_1 width_i + \beta_0 \]
We need to install some really powerful packages to help us with modeling and machine learning.
Scikit-Learn
Scikit Learn: https://scikit-learn.org/stable/
!conda install -y scikit-learn
Linear (Least Squares) Regression - Supervised Machine Learning
In a simple two variable case we can actually just solve for the slope and intercept using the properties of linear regression:
- The regression line goes through the center of mass point, the coordinates corresponding to average \(x\) and average \(y\), \((\bar{x}, \bar{y})\):
\[\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}\]
- The slope has the same sign as the correlation coefficient: \(b_1 = r \frac{s_y}{s_x}\)
- The sum of the residuals is zero: \(\sum_{i = 1}^n e_i = 0\)
- The residuals and \(x\) values are uncorrelated
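As a quick check of the first two properties, we can compute the slope and intercept directly from the sample statistics (a sketch, not from the original notes; it assumes the cleaned DF with float Height_in and Width_in from earlier):

# b1 = r * (s_y / s_x)   and   b0 = y_bar - b1 * x_bar
sub = DF[['Width_in', 'Height_in']].dropna()
r = sub['Width_in'].corr(sub['Height_in'])                   # correlation coefficient
b1 = r * sub['Height_in'].std() / sub['Width_in'].std()     # slope
b0 = sub['Height_in'].mean() - b1 * sub['Width_in'].mean()  # intercept
print(b1, b0)   # roughly 0.78 and 3.62, matching the OLS fit above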
In general the linear regression computation works by assuming there exists a straight line that can predict our data. The process for finding this line needs some more advanced mathematics but here is the idea for our example:
- Choose a straight line \(\hat{y}_i=b_0 + b_1 x_i\) - here \(b_0\) and \(b_1\) are things we hope to learn from our sample statistics.
- See how well that line fits the sample data by calculating the error or cost or mean squared error \[cost = \frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2 \] where \(y_i\) is the value from our sample data and \(\hat{y}_i\) is the value predicted from our model.
- We want this cost to be small so that all of the model predictions are as close as possible to our data. The cost can’t be zero unless our data is in a straight line.
- We change \(b_0\) and \(b_1\) using something called gradient descent to learn what the best possible \(b\) values are.
- Once we can’t make our cost any smaller we stop.
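Here is a toy sketch of that idea (for illustration only; scikit-learn's LinearRegression, which we use below, solves the least squares problem directly rather than iterating):

# Toy gradient descent for y_hat = b0 + b1*x, minimizing the mean squared error cost
x = np.linspace(0, 10, 50)
y = 2.5 * x + 5           # data that really does lie on a straight line

b0, b1 = 0.0, 0.0         # start from a bad guess
learning_rate = 0.01
for step in range(5000):
    y_hat = b0 + b1 * x                      # current predictions
    grad_b0 = 2 * (y_hat - y).mean()         # d(cost)/d(b0)
    grad_b1 = 2 * ((y_hat - y) * x).mean()   # d(cost)/d(b1)
    b0 = b0 - learning_rate * grad_b0
    b1 = b1 - learning_rate * grad_b1

print(b1, b0)   # approaches the true slope 2.5 and intercept 5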
Train a Linear Regression Model
This happens in three steps
- Data Preprocessing
- Model Training
- Analyze the output - predictions
# A new package import!
from sklearn.linear_model import LinearRegression
Preprocessing the Data
Before you can build a model you need to do some cleaning and preprocessing of your data. Here are some important steps:
- Select the variables that you want to use (columns)
- Decide what to do about NaNs or other strange data
- (advanced) Think about rescaling and standardizing
- Create the inputs and outputs (sometimes encode)
- (advanced) Test - Train split
# Select the variables - I want to predict height given width
DF_model = DF[['Height_in','Width_in']]
DF_model.dtypes
Height_in float64
Width_in float64
dtype: object
# Check out the NaNs
# How many Nans
print('Number of NaNs:')
print(DF_model.isna().sum().sum())
print('----------------------')
# What percent of the data is this?
print('Percent NaNs:')
print(DF_model.isna().sum().sum()/len(DF))
print('----------------------')
# I am going to drop these! This is a choice!
DF_model.dropna(inplace=True)
print('Number of NaNs after drop:')
print(DF_model.isna().sum().sum())
print('----------------------')
Number of NaNs:
508
----------------------
Percent NaNs:
0.14972001178897731
----------------------
Number of NaNs after drop:
0
----------------------
# Create the inputs X (explanatory variable) and the outputs y (response variable)
# .values just gets the numbers that we need
# .reshape(-1,1) puts the X data in the right shape.
X = DF_model['Width_in'].values.reshape(-1, 1)
y = DF_model['Height_in'].values
Train the model
- Create the base model, in this case LinearRegression()
- Train the model using the training data
- Look at the results.
# Create linear regression object - a random straight line
LM = LinearRegression()
# Train the model using the data
LM.fit(X, y)
LinearRegression()
# The coefficient is the slope
LM.coef_
array([0.78079641])
# We can also see the intercept
LM.intercept_
3.6214055418381896
This means that linear regression found the line
\[ height_i = 3.6214055418381896 + 0.78079641 * width_i \]
so given a width we could predict a height. There appears to be a positive relationship between height and width - in general as the width increases the height increases.
Now we can use our model to make a prediction
Let’s say we know a painting had certain width, we can predict the height
width = 33
width = np.array(width).reshape(-1,1)
LM.predict(width)
array([29.38768703])
width = 0
width = np.array(width).reshape(-1,1)
LM.predict(width)
array([3.62140554])
Slope: For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.781 inches.
Intercept: Paintings that are 0 inches wide are expected to be 3.62 inches high, on average. (Does this make sense?)
Correlation does not imply causation
Source: XKCD, Cell phones
fig = px.scatter(DF,
                 x='Width_in',
                 y='Height_in',
                 color_discrete_sequence=['black'],
                 opacity=0.2)

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)")

# Add the line our model would predict
DF_model = pd.DataFrame()
DF_model['w'] = DF['Width_in']
DF_model['y'] = LM.coef_*DF_model['w'] + LM.intercept_
DF_model = DF_model.sort_values('w')

fig.add_trace(
    px.line(DF_model, x='w', y='y', color_discrete_sequence=['blue']).data[0]
)

fig.show()
Check the accuracy of your model
LM.score() uses the \(R^2\) score.
The \(R^2\) score is calculated using the formula \[R^{2}=1-\frac{RSS}{TSS}\] where:
- \(R^{2}\) is the coefficient of determination
- \(RSS\) is the sum of squares of residuals
- \(TSS\) is the total sum of squares
Here's what the \(R^2\) score means:
- \(R^2 = 0\): The model is not able to predict the outcome any better than using the average value. There is no linear relationship between the variables.
- \(R^2\) between 0 and 1: The model can partially predict the outcome, but the estimates are not perfect.
- \(R^2 = 1\): The model can perfectly predict the outcome. All points lie exactly on a straight line.
NOTE In a true machine learning workflow you would first separate your data into training and testing sets, so you have some data that your model never saw during training. Then you can see how accurate it is on the testing set after training is complete.
LM.score(X, y)
0.6829467672722757
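As a sanity check (a sketch, not part of the original notes), we can compute \(R^2\) from the residuals ourselves, and also hold out a test set so the model is scored on data it never saw during training:

# R^2 by hand: 1 - RSS/TSS (should match LM.score(X, y) above)
y_hat = LM.predict(X)
RSS = ((y - y_hat)**2).sum()
TSS = ((y - y.mean())**2).sum()
print(1 - RSS/TSS)

# Train/test split sketch: fit on 80% of the data, score on the held-out 20%
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
LM2 = LinearRegression()
LM2.fit(X_train, y_train)
print(LM2.score(X_test, y_test))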
Exercise 2 Redo the analysis to see if you can use the size (‘Surface’) of the painting to predict the price.
Create a data frame with only the price and surface columns
Look at the data types
Preprocess the data - remove NaNs and change the Surface values to floats.
Train a linear regression model that takes as an input X = Surface Area and gives as an output y=price
X = DF_model['Surface'].values.reshape(-1,1)
y = DF_model['price'].values
Plot a scatter plot of Price vs Surface add in your predicted line
Look at the score
Interpret your results. Should you use a linear model to predict the price of a painting using the surface area?
Below you can see the scatter plot and score that I got.
# The score from my model
LM.score(X,y)
0.011141075251450694
What about Categorical Data?
We can use categorical data as our explanatory variable in Linear Regression but this adds one more step to the process. Whenever we are doing linear regression we are trying to find an equation like
\[ \hat{y} = b_0 + b_1 x \]
But what if x has categorical values? For example, does the height of a painting depend on whether or not it is a landscape?
x = “landscape” or “not landscape”
y = height of painting.
The equation
\[ \hat{y} = b_0 + b_1 landscape \]
does not make sense. So we need to turn the categorical data into a number somehow. In our case we see that we have zeros and ones saved as strings in the ‘landsALL’ column. This means that
0 = “not landscape” 1 = “landscape”
# Change those strings back into ints
DF['landsALL'] = DF['landsALL'].apply(lambda x: int(x))
# Select the variables - I want to predict height given landscape status
DF_model = DF[['landsALL','Height_in']]
DF_model.dtypes
landsALL int64
Height_in float64
dtype: object
# Check out the NaNs
# How many Nans
print('Number of NaNs:')
print(DF_model.isna().sum().sum())
print('----------------------')
# What percent of the data is this?
print('Percent NaNs:')
print(DF_model.isna().sum().sum()/len(DF))
print('----------------------')
# I am going to drop these! This is a choice!
DF_model.dropna(inplace=True)
print('Number of NaNs after drop:')
print(DF_model.isna().sum().sum())
print('----------------------')
Number of NaNs:
252
----------------------
Percent NaNs:
0.07427055702917772
----------------------
Number of NaNs after drop:
0
----------------------
# Create the inputs X (explanatory variable) and the outputs y (response variable)
X = DF_model['landsALL'].values.reshape(-1,1)
y = DF_model['Height_in'].values
# Create linear regression object - a random straight line
LM = LinearRegression()
# Train the model using the data
LM.fit(X, y)
LinearRegression()
print(LM.coef_)
print(LM.intercept_)
[-5.64512764]
22.67991790661878
How do we interpret this result?
\[ Height = 22.68 - 5.65 (landsALL) \]
So if landsALL = 0 (not a landscape) the paintings have an average height of 22.68 inches. If landsALL = 1 (it is a landscape) then the height on average is reduced by 5.65 inches. We found that landscapes tend to be shorter.
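With a single 0/1 explanatory variable, the fitted line just reproduces the two group means. A quick check (a sketch, not in the original notes):

# The intercept matches the mean height of non-landscapes (landsALL = 0),
# and intercept + coefficient matches the mean height of landscapes (landsALL = 1)
print(DF_model.groupby('landsALL')['Height_in'].mean())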
More than two categories - One Hot Encoding
In the case above we could easily code the categorical variable as just 0 or 1, because there were only two options. If there are more than two options we have to choose a way to encode the data. There is more than one way to do this; we will use something called one-hot encoding.
How does the school of the painting affect the price?
(A = Austrian, D/FL = Dutch/Flemish, F = French, G = German, I = Italian, S = Spanish, X = Unknown)
from sklearn.preprocessing import OneHotEncoder
my_columns = ['school_pntg','price']
DF_model = DF[my_columns]
DF_model
| | school_pntg | price |
|---|---|---|
| 0 | F | 360.0 |
| 1 | I | 6.0 |
| 2 | D/FL | 12.0 |
| 3 | F | 6.0 |
| 4 | F | 6.0 |
| ... | ... | ... |
| 3388 | F | 18.0 |
| 3389 | F | 25.0 |
| 3390 | F | 80.0 |
| 3391 | F | 5.0 |
| 3392 | F | 5.0 |
3393 rows × 2 columns
DF_model['school_pntg'].value_counts()
school_pntg
D/FL 1492
F 1386
I 428
X 69
G 9
S 7
A 2
Name: count, dtype: int64
# Get the data we want to predict
X = DF_model['school_pntg'].values.reshape(-1,1)
y = DF_model['price'].values

# Now because X has lots of categories, we need to encode it:
encoder = OneHotEncoder()
X = encoder.fit_transform(X)
What does the encoded data look like?
The encoded data puts a one in a location representing each category in the data
['A', 'D/FL', 'F', 'G', 'I', 'S', 'X']
so
[1,0,0,0,0,0,0] = A
Here is code to look at the categories:
categories = encoder.categories_[0]
categories
array(['A', 'D/FL', 'F', 'G', 'I', 'S', 'X'], dtype=object)
# Look at each category and encoding
result = DF_model.groupby('school_pntg', as_index=False).first()
encoded_data = encoder.transform(result['school_pntg'].values.reshape(-1,1))

for i, e in enumerate(encoded_data.toarray()):
    print(categories[i])
    print(e)
    print('---------------------------')
A
[1. 0. 0. 0. 0. 0. 0.]
---------------------------
D/FL
[0. 1. 0. 0. 0. 0. 0.]
---------------------------
F
[0. 0. 1. 0. 0. 0. 0.]
---------------------------
G
[0. 0. 0. 1. 0. 0. 0.]
---------------------------
I
[0. 0. 0. 0. 1. 0. 0.]
---------------------------
S
[0. 0. 0. 0. 0. 1. 0.]
---------------------------
X
[0. 0. 0. 0. 0. 0. 1.]
---------------------------
Now we can train the linear regression just like we did above
# Create linear regression object - a random straight line
LM = LinearRegression()
# Train the model using the data
LM.fit(X, y)
# Look at the information we get
print(LM.coef_)
print('------------------------------')
print(LM.intercept_)
[-607.0226627 462.24135005 -265.17807213 -650.46710718 -304.23294435
2045.54876584 -680.88932952]
------------------------------
715.0226600517791
What the heck does this mean?!?!?
Well, the output coefficients line up with the labels:
| Label | Coef |
|---|---|
| A | -607.0226627 |
| D/FL | 462.24135005 |
| F | -265.17807213 |
| G | -650.46710718 |
| I | -304.23294435 |
| S | 2045.54876584 |
| X | -680.88932952 |
We can interpret this as: the average base cost of a painting was about 715.02, but if the painting was Austrian then on average it sold for 607.02 less, and if it was Spanish then on average it sold for 2045.55 more. The equation looks something like this:
\[ Price = 715.02 - 607.02\,(A) + 462.24\,(D/FL) - 265.18\,(F) - 650.47\,(G) - 304.23\,(I) + 2045.55\,(S) - 680.89\,(X) \]
Where only one of these can be equal to one at a time, all others are zero.
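If we want a price prediction for a particular school, we pass the label through the same fitted encoder before calling predict. A small sketch (assuming the encoder and LM objects from above):

# Predict the average price for a Spanish (S) painting
new_school = np.array([['S']])                # 2D shape: one row, one column
new_encoded = encoder.transform(new_school)   # one-hot encode with the fitted encoder
print(LM.predict(new_encoded))                # roughly 715.02 + 2045.55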
Exercise 3 Redo the analysis above, except this time see if the school of the painting affects the overall size ('Surface') of the painting. Don't forget to drop the NaNs where we don't know the surface size and change the surface variables to floats.
Create a data frame with only the school and surface columns
Look at the data types
Preprocess the data - remove NaNs and change the Surface values to floats.
Train a linear regression model that takes as an input X = School and gives as an output y= surface area, now you need to one hot encode the X values!
X = DF_model['school_pntg'].values.reshape(-1,1)
y = DF_model['Surface'].values
Look at the output, what does it mean? Which school of paintings is largest on average? Which is smallest?
You can see the outputs of my code below.
['A' 'D/FL' 'F' 'G' 'I' 'S' 'X']
------------------------------
[-392.07405544 -266.70347449 122.28771743 -393.21155591 142.65081971
1051.64022987 -264.58968117]
------------------------------
686.0740419935794