Introduction to Data Science

Modeling Nonlinear Relationships

Author

Joanna Bieri
DATA101

Important Information

Announcements

Come to Lab! If you need help we are here to help!

Day 16 Assignment - same drill.

  1. Make sure you can Fork and Clone the Day16 repo from Redlands-DATA101
  2. Open the file Day16-HW.ipynb and start doing the problems.
    • You can do these problems as you follow along with the lecture notes and video.
  3. Get as far as you can before class.
  4. Submit what you have so far Commit and Push to Git.
  5. Take the daily check in quiz on Canvas.
  6. Come to class with lots of questions!
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Machine Learning Packages
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression 

Paris Paintings Data

To explore the ideas of modeling data we will use the Paris Paintings dataset.

  • Source: Printed catalogs of 28 auction sales in Paris, 1764 - 1780 (Historical Data)
  • Data curators Sandra van Ginhoven and Hilary Coe Cronheim (who were PhD students in the Duke Art, Law, and Markets Initiative at the time the dataset was assembled) translated and tabulated the catalogs
  • 3,393 paintings, their prices, and descriptive details from the sales catalogs, spread across more than 60 variables

Variables in Paris Paintings Data

file_location = 'https://joannabieri.com/introdatascience/data/paris-paintings.csv'
DF_raw_paintings = pd.read_csv(file_location,na_filter=False)
show(DF_raw_paintings)
(Interactive table of the raw data. Variables include: name, sale, lot, position, dealer, year, origin_author, origin_cat, school_pntg, diff_origin, logprice, price, count, subject, authorstandard, artistliving, authorstyle, author, winningbidder, winningbiddertype, endbuyer, Interm, type_intermed, Height_in, Width_in, Surface_Rect, Diam_in, Surface_Rnd, Shape, Surface, material, mat, materialCat, quantity, nfigures, engraved, original, prevcoll, othartist, paired, figures, finished, lrgfont, relig, landsALL, lands_sc, lands_elem, lands_figs, lands_ment, arch, mytho, peasant, othgenre, singlefig, portrait, still_life, discauth, history, allegory, pastorale, other.)
Get the data and fix NaNs
# Make a copy of the data that we can start working on
DF = DF_raw_paintings.copy()

# Do something about all those different NaNs
DF.replace('',np.nan,inplace=True)
DF.replace('n/a',np.nan,inplace=True)
DF.replace('NaN',np.nan,inplace=True)
Scatterplot Width vs Height
# Update the types - these should be floats
DF['Height_in'] = DF['Height_in'].apply(lambda x: float(x))
DF['Width_in'] = DF['Width_in'].apply(lambda x: float(x))

fig = px.scatter(DF,x='Width_in',y="Height_in",color_discrete_sequence=['black'])

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)")

fig.show()
Preprocessing
# Get the data I want to model
my_columns = ['Height_in','Width_in']
DF_model = DF[my_columns].copy()

# Check out the NaNs

# How many Nans
print('Number of NaNs:')
print(DF_model.isna().sum())
print('----------------------')

# What percent of the data is this?
print('Percent total NaNs:')
print(DF_model.isna().sum().sum()/len(DF))
print('----------------------')

# I am going to drop these! This is a choice!
DF_model.dropna(inplace=True)
print('Number of NaNs after drop:')
print(DF_model.isna().sum().sum())
print('----------------------')
Number of NaNs:
Height_in    252
Width_in     256
dtype: int64
----------------------
Percent total NaNs:
0.14972001178897731
----------------------
Number of NaNs after drop:
0
----------------------
Linear Regression
# Create the inputs X (explanatory variable) and the outputs y (response variable)

X = DF_model['Width_in'].values.reshape(-1, 1)
y = DF_model['Height_in'].values
# Create linear regression object - a random straight line
LM = LinearRegression()

# Train the model using the data
LM.fit(X, y)
LinearRegression()
Print out the coefficients for the model
# The coefficient is the slope
print(LM.coef_)
print(LM.intercept_)

print(LM.score(X,y))
[0.78079641]
3.6214055418381896
0.6829467672722757
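
Putting the slope and intercept together, the fitted line is approximately:

\[ \widehat{\text{Height}} = 3.62 + 0.78 \times \text{Width} \]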

Testing for Linearity

How do we know if this is good?

We can always look at the score, \(R^2\), but this only tells us how much of the variation in the response the model explains; it does not tell us why the fit is poor. What could cause the metric to be low?

  • Data having a high amount of scatter
  • Data not actually being linear

How do we tell the difference?

\[ \text{Residual} = \text{Data Value} - \text{Predicted Value} \]

Add the prediction and the residual to our data frame!

Note that LM.predict(X) is equivalent to computing LM.intercept_ + LM.coef_[0]*DF_model['Width_in'] by hand.
DF_model['Height_predicted'] = LM.predict(X)
DF_model['Residual'] = DF_model['Height_in']-DF_model['Height_predicted']
DF_model
Height_in Width_in Height_predicted Residual
0 37.0 29.5 26.654900 10.345100
1 18.0 14.0 14.552555 3.447445
2 13.0 16.0 16.114148 -3.114148
3 14.0 18.0 17.675741 -3.675741
4 14.0 18.0 17.675741 -3.675741
... ... ... ... ...
3388 18.0 21.5 20.408528 -2.408528
3389 13.0 16.5 16.504546 -3.504546
3390 24.0 30.0 27.045298 -3.045298
3391 27.0 23.0 21.579723 5.420277
3392 27.0 23.0 21.579723 5.420277

3135 rows × 4 columns

Plot the residual

Now we can plot the residuals - this gives us information about whether or not the linear model was appropriate, even if there is a lot of scatter in our data.

  • Do a scatter plot of the Residual vs. the Predicted Value (Height).
  • Add a line at y=0, to make the residual plot easier to interpret.
fig = px.scatter(DF_model,x='Height_predicted',y='Residual')

# Update layout to show axis line at y=0
fig.update_layout(
    yaxis={'zeroline':True, 'zerolinewidth':1.5, 'zerolinecolor':"black"}
)

fig.update_layout(template="ggplot2",
                  title='Residual as a Function of Predicted Height',
                  title_x=0.5
                 )

fig.show()

Here we see that the residual still looks like a function of the predicted height (and therefore of the width)! The spread gets wider as the predicted height increases. This is not good. If our data were truly linear, the residuals after linear regression should not show any functional dependence. They should only reflect random scatter in our data!

Interpreting Residual Plots

What we are looking for

  • Residuals distributed randomly around 0
  • With no visible pattern along the x or y axes
Plot with Randomly Generated Data
df = pd.DataFrame({
    'fake_resid': np.random.normal(0, 30, 1000),  # Random normal data for residuals
    'fake_predicted': np.random.uniform(-100, 100, 1000)  # Random uniform data for predicted values
})

fig = px.scatter(df,x='fake_predicted',y='fake_resid')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='',
                  title_x=0.5
                 )

fig.show()

What we don’t want

Fan shapes

Plot with Randomly Generated Data
# Set random seed for reproducibility
np.random.seed(12346)

# Generate the 'fake_resid' values (a series of normal distributions with different standard deviations)
fake_resid = np.concatenate([
    np.random.normal(0, 1, 100),   # Normal distribution with mean 0, sd 1
    np.random.normal(0, 15, 100),  # Normal distribution with mean 0, sd 15
    np.random.normal(0, 25, 100),  # Normal distribution with mean 0, sd 25
    np.random.normal(0, 20, 100),  # Normal distribution with mean 0, sd 20
    np.random.normal(0, 25, 100),  # Normal distribution with mean 0, sd 25
    np.random.normal(0, 50, 100),  # Normal distribution with mean 0, sd 50
    np.random.normal(0, 35, 100),  # Normal distribution with mean 0, sd 35
    np.random.normal(0, 40, 100),  # Normal distribution with mean 0, sd 40
    np.random.normal(0, 80, 200)   # Normal distribution with mean 0, sd 80
])

# Generate the 'fake_predicted' values (a sequence from 0.2 to 200 with step 0.2)
fake_predicted = np.arange(0.2, 200.2, 0.2)

# Create the DataFrame
df = pd.DataFrame({
    'fake_resid': fake_resid,
    'fake_predicted': fake_predicted
})

fig = px.scatter(df,x='fake_predicted',y='fake_resid')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='',
                  title_x=0.5
                 )

fig.show()

What we don’t want

Groups of patterns

Plot with Randomly Generated Data
# Set random seed for reproducibility
np.random.seed(12346)

# Generate the 'fake_predicted' values (a sequence from 0.2 to 200 with step 0.2)
fake_predicted = np.arange(0.2, 200.2, 0.2)

# Generate the 'fake_resid' values (two normal distributions)
fake_resid = np.concatenate([
    np.random.normal(-20, 10, 500),  # Normal distribution with mean -20, sd 10
    np.random.normal(10, 10, 500)    # Normal distribution with mean 10, sd 10
])

# Create the DataFrame
df = pd.DataFrame({
    'fake_predicted': fake_predicted,
    'fake_resid': fake_resid
})

fig = px.scatter(df,x='fake_predicted',y='fake_resid')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='',
                  title_x=0.5
                 )

fig.show()

What we don’t want

Residuals correlated with predicted values

Plot with Randomly Generated Data
# Set random seed for reproducibility

np.random.seed(12346)

# Generate the 'fake_predicted' values (a sequence from 0.2 to 200 with step 0.2)
fake_predicted = np.arange(0.2, 200.2, 0.2)

# Generate 'fake_resid' values by adding normal noise to 'fake_predicted'
fake_resid = fake_predicted + np.random.normal(0, 50, len(fake_predicted))

# Create the DataFrame
df = pd.DataFrame({
    'fake_predicted': fake_predicted,
    'fake_resid': fake_resid
})

fig = px.scatter(df,x='fake_predicted',y='fake_resid')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='',
                  title_x=0.5
                 )

fig.show()

What we don’t want

Any patterns!

Plot with Randomly Generated Data
# Set random seed for reproducibility
np.random.seed(12346)

# Generate the 'fake_predicted' values (a sequence from -100 to 100 with step 0.4)
fake_predicted = np.arange(-100, 100.4, 0.4)

# Calculate 'fake_resid' values based on the formula with added noise
fake_resid = -5 * fake_predicted**2 - 3 * fake_predicted + 20000 + np.random.normal(0, 10000, len(fake_predicted))

# Create the DataFrame
df = pd.DataFrame({
    'fake_predicted': fake_predicted,
    'fake_resid': fake_resid
})

fig = px.scatter(df,x='fake_predicted',y='fake_resid')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='',
                  title_x=0.5
                 )

fig.show()

What does the residual plot tell us?

What patterns does the residuals plot reveal that should make us question whether a linear model is a good fit for modeling the relationship between height and width of paintings?

  • We don’t want to see any patterns whatsoever.
  • All interesting relationships should have been captured by the linear model.
  • Any remaining pattern means that the linear model may not be the best fit - there is still something going on here.
Same plot from above
fig = px.scatter(DF_model,x='Height_predicted',y='Residual')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='Residual as a Function of Predicted Height',
                  title_x=0.5
                 )

fig.show()

We still see a pattern here! The residual is increasing as our predicted height increases. So this means maybe there is something nonlinear going on here. We need to explore our linear assumption!

How do we explore linearity?

At the end of last class you tried to fit a model for price as a function of size. We tried to use a linear model for this, but you should have found that the model got a very low \(R^2\) value. Today we will investigate…

  • Is this because the model is nonlinear?
  • Is this because the data is noisy?
  • Could it be a little bit of both?

Let’s redo that analysis except this time focus only on paintings with area of less than 10,000 inches squared.

Answer to Exercise 2 Day 15 for paintings < 10000 in^2
# Get the columns I care about
my_columns = ['Surface','price']
DF_model2 = DF[my_columns].copy()

# Do some preprocessing - drop NA and make the Surface variable a float
DF_model2.dropna(inplace=True)
DF_model2['Surface'] = DF_model2['Surface'].apply(lambda x: float(x))

# Mask the data for surface area less than 10,000
mask = DF_model2['Surface'] <= 10_000
DF_model2 = DF_model2[mask]

# Make a Scatter plot
fig = px.scatter(DF_model2,
                 x='Surface',
                 y='price',
                 color_discrete_sequence=['black'],
                 opacity=0.2)

fig.update_layout(template="ggplot2",
                  title='Price as a function of Surface Area - total area < 10,000 sq in',
                  title_x=0.5,
                  xaxis_title='Surface Area',
                  yaxis_title='Price')


fig.show()

# Create the X and y variables for Linear Regression
X = DF_model2['Surface'].values.reshape(-1,1)
y = DF_model2['price'].values

# Create linear regression object - a random straight line
LM = LinearRegression()
# Train the model using the data
LM.fit(X, y)

# The score from my model
print('Model Score:')
print(LM.score(X,y))
Model Score:
0.014875957473786783

Exercise 1 Create a Residual plot for the model above.

  1. Get the predictions - store these in a column in the data frame
LM.predict(X)
  2. Calculate the residuals - store these in a column in the data frame
'Residual' = 'Real Value in the Data' - 'Value Predicted by LM'
  3. Plot the result
px.scatter(df,x='Value Predicted by LM',y='Residual')

See if you can recreate the plot shown in the lecture.

Q What do you see here? Do the residuals seem randomly scattered around zero?
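
One possible sketch of these steps, assuming the DF_model2, X, and LM fit on Surface from above (the column names are just for illustration):

# Store the predictions and residuals in the data frame
DF_model2['price_predicted'] = LM.predict(X)
DF_model2['Residual'] = DF_model2['price'] - DF_model2['price_predicted']

# Plot the residual vs. the predicted price, with a line at y=0
fig = px.scatter(DF_model2, x='price_predicted', y='Residual')

fig.update_layout(
    yaxis={'zeroline': True, 'zerolinewidth': 1.5, 'zerolinecolor': "black"}
)

fig.update_layout(template="ggplot2",
                  title='Residual as a Function of Predicted Price',
                  title_x=0.5)

fig.show()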

What do we do if our data is not linear?

Sometimes we can apply a transformation to our response variable (in this case price) that will help us “unpack” the nonlinearity. Let’s look at a histogram of the prices.

Histogram of Prices
fig = px.histogram(DF_model2,x='price',nbins =30)

fig.show()

The price data is very skewed: most of the data is piled up on one side of the histogram. In other words, most paintings sold for less than 5000 livres (the currency of the time). When we see extremely skewed data like this, which looks like a decaying exponential, it suggests applying a natural log function to the price data!

Use the log()

Remember:

\[\log(e^x) = x\]

The natural log removes the exponential dependence. For our purposes, all of these notations mean the same thing:

\[\log(x) = \ln(x) = \log_e(x)\]
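
As a quick check in code, note that numpy’s np.log is the natural log (np.log10 and np.log2 give the other common bases):

print(np.log(np.e))      # 1.0 - the natural log undoes e^x
print(np.log10(100.0))   # 2.0 - base-10 log, not what we want here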

Below we will apply the natural log to the price column and store that data in a new column in our data frame:

DF_model2['log_price'] = np.log(DF_model2['price'])

Then redo the histogram using the log prices and a scatter plot using the log prices

# Use numpy to take the natural log of the data - removing the exponential decay
DF_model2['log_price'] = np.log(DF_model2['price'])
Histogram of Log Prices
fig = px.histogram(DF_model2,x='log_price',nbins =30)

fig.show()
Scatter Plot of Log Prices
fig = px.scatter(DF_model2,
                 x='Surface',
                 y='log_price',
                 color_discrete_sequence=['black'],
                 opacity=0.2)

fig.update_layout(template="ggplot2",
                  title='log(Price) as a function of Surface Area - total area < 10,000 sq in',
                  title_x=0.5,
                  xaxis_title='Surface Area',
                  yaxis_title='log(Price)')


fig.show()

Q What is different about the histogram and the scatter plots after taking the natural log?

Redo linear regression on the log() data

Now let’s try to fit a linear model again!

Exercise 2: Redo the linear regression analysis except this time use the log_price.

  • Find the linear regression model (LM)
  • Calculate the residual
  • Plot the Residual as a function of the predicted log price
  • Plot a scatter plot of the data with the linear regression line added.

HINT You can see my results in the lecture notes!

LinearRegression()
Coefficient:
[0.0002376]
Intercept:
4.911544880415433
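
The coefficient and intercept above come from a fit along these lines (a minimal sketch, assuming DF_model2 already has the log_price column from above):

# Same explanatory variable, but now the response is the log price
X = DF_model2['Surface'].values.reshape(-1, 1)
y = DF_model2['log_price'].values

LM = LinearRegression()
LM.fit(X, y)

print('Coefficient:')
print(LM.coef_)
print('Intercept:')
print(LM.intercept_)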

What is the model telling me?

\[ \widehat{\log(\text{price})} = 4.912 + 0.00024 \times \text{Surface Area}\]

How can we interpret this result so it actually makes sense in the context of our problem?


Properties of exponents and logs:

\[e^{\log(x)} = x\]

\[\log(a) - \log(b) = \log(a/b)\]


If our surface area increases by 1 square inch, how much should our price increase? Let’s look at the difference in predicted log prices when the surface area increases by one square inch. Write \(\text{price}_1\) for the predicted price at surface area \(SA\) and \(\text{price}_2\) for the predicted price at \(SA+1\).

Plugging into our formula:

\[\log(\text{price}_2) - \log(\text{price}_1) = [4.912 + 0.00024(SA +1)] - [4.912 + 0.00024(SA)]\]

doing algebra on the right hand side

\[\log(\text{price}_2) - \log(\text{price}_1) = 0.00024\]

using the log subtraction rule

\[ \log\left(\frac{\text{price}_2}{\text{price}_1}\right)= 0.00024\]

using the exponent undoing the log rule

\[ \frac{\text{price}_2}{\text{price}_1} = e^{0.00024}\]

calculating the exponent on the right hand side

\[ \frac{\text{price}_2}{\text{price}_1} \approx 1.00024\]

solving for \(\text{price}_2\)

\[ \text{price}_2 \approx 1.00024 \times \text{price}_1\]

So this tells us that increasing the area of the painting by one square inch increases the predicted price by a factor of about 1.00024, or about 0.024%.
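
We can check this factor directly from the fitted slope, assuming LM is the regression on log_price from Exercise 2:

# e^(slope) is the multiplicative change in price for each extra square inch
print(np.exp(LM.coef_[0]))   # approximately 1.00024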

What did we learn…

There is a small positive increase in the price as the surface area increases, on average. Can we predict the price using the surface area?

LM.score(X,y)
0.013268243020669424

It does not appear that our linear regression on the log prices is a good predictor of the price. Even though it looks like we captured a real linear relationship, we do not have a good predictor. The scatter is still very large!

BUT - we are still able to see a linear trend in the model. There is a relationship here even though the data is very noisy!

Recap

  • Non-constant variance is one of the most common model violations; however, it is usually fixable by transforming the response (y) variable.

  • The most common transformation is the log transform, \(\log(y)\), which is especially useful when the response variable is (extremely) right skewed.

  • This transformation is also useful for variance stabilization.

  • When using a log transformation on the response variable the interpretation of the slope changes:

“For each unit increase in x, y is expected on average to be higher/lower by a factor of \(e^{b_1}\).”

  • Another useful transformation is the square root, \(\sqrt{y}\), which is especially useful when the response variable is a count - see the short sketch below.
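
A minimal sketch of that transformation, using a hypothetical sqrt_price column just for illustration:

# Square-root transform of the response - often helpful when the response is a count
DF_model2['sqrt_price'] = np.sqrt(DF_model2['price'])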

Exercise 3 Redo the full analysis except this time try using just height to predict price.

  • Do a standard linear regression using height to predict price (without the log) and discuss the results. This should include a plot of the residuals and a plot showing the linear fit. You should also talk about what the score, intercept, and coefficient of the model are telling you, e.g. as the height increases by 1 in, the price…
  • Do a linear regression using height to predict log_price and discuss the results. This should include a plot of the residuals and a plot showing the linear fit. You should also talk about what the score, intercept, and coefficient of the model are telling you, e.g. as the height increases by 1 in, the price… Remember that in this case you have to use the rules of logs and exponents to interpret the results.

Which of these models do you think is doing a better job of capturing the functional relationships in the data? Why?