Introduction to Data Science

Modeling Nonlinear Relationships

Author

Joanna Bieri
DATA101

Important Information

Announcements

Come to Lab! If you need help we are here to help!

Day 16 Assignment - same drill.

  1. Make sure you can Fork and Clone the Day16 repo from Redlands-DATA101
  2. Open the file Day16-HW.ipynb and start doing the problems.
    • You can do these problems as you follow along with the lecture notes and video.
  3. Get as far as you can before class.
  4. Submit what you have so far Commit and Push to Git.
  5. Take the daily check in quiz on Canvas.
  6. Come to class with lots of questions!
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Machine Learning Packages
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression 

Paris Paintings Data

To explore the ideas of modeling data we will use the Paris Paintings dataset.

  • Source: Printed catalogs of 28 auction sales in Paris, 1764 - 1780 (Historical Data)
  • Data curators Sandra van Ginhoven and Hilary Coe Cronheim (who were PhD students in the Duke Art, Law, and Markets Initiative at the time the dataset was assembled) translated and tabulated the catalogs
  • 3,393 paintings, their prices, and descriptive details from the sales catalogs, spread across more than 60 variables

Variables in Paris Paintings Data

file_location = 'https://joannabieri.com/introdatascience/data/paris-paintings.csv'
DF_raw_paintings = pd.read_csv(file_location,na_filter=False)
show(DF_raw_paintings)
(Interactive table of the raw data. Variables include: name, sale, lot, position, dealer, year, origin_author, origin_cat, school_pntg, diff_origin, logprice, price, count, subject, authorstandard, artistliving, authorstyle, author, winningbidder, winningbiddertype, endbuyer, Interm, type_intermed, Height_in, Width_in, Surface_Rect, Diam_in, Surface_Rnd, Shape, Surface, material, mat, materialCat, quantity, nfigures, engraved, original, prevcoll, othartist, paired, figures, finished, lrgfont, relig, landsALL, lands_sc, lands_elem, lands_figs, lands_ment, arch, mytho, peasant, othgenre, singlefig, portrait, still_life, discauth, history, allegory, pastorale, other.)
Get the data and fix NaNs
# Make a copy of the data that we can start working on
DF = DF_raw_paintings.copy()

# Do something about all those different NaNs
DF.replace('',np.nan,inplace=True)
DF.replace('n/a',np.nan,inplace=True)
DF.replace('NaN',np.nan,inplace=True)
Scatterplot Width vs Height
# Update the types - these should be floats
DF['Height_in'] = DF['Height_in'].apply(lambda x: float(x))
DF['Width_in'] = DF['Width_in'].apply(lambda x: float(x))

fig = px.scatter(DF,x='Width_in',y="Height_in",color_discrete_sequence=['black'])

fig.update_layout(template="ggplot2",
                  title='Height vs Width of Paris Paintings <br><sup> Paris auctions, 1764-1780</sup>',
                  title_x=0.5,
                  xaxis_title="Width (inches)",
                  yaxis_title="Height (inches)")

fig.show()
Preprocessing
# Get the data I want to model
my_columns = ['Height_in','Width_in']
DF_model = DF[my_columns].copy()

# Check out the NaNs

# How many Nans
print('Number of NaNs:')
print(DF_model.isna().sum())
print('----------------------')

# What percent of the data is this?
print('Percent total NaNs:')
print(DF_model.isna().sum().sum()/len(DF))
print('----------------------')

# I am going to drop these! This is a choice!
DF_model.dropna(inplace=True)
print('Number of NaNs after drop:')
print(DF_model.isna().sum().sum())
print('----------------------')
Number of NaNs:
Height_in    252
Width_in     256
dtype: int64
----------------------
Percent total NaNs:
0.14972001178897731
----------------------
Number of NaNs after drop:
0
----------------------
Linear Regression
# Create the inputs X (explanatory variable) and the outputs y (response variable)

X = DF_model['Width_in'].values.reshape(-1, 1)
y = DF_model['Height_in'].values
# Create linear regression object - a random straight line
LM = LinearRegression()

# Train the model using the data
LM.fit(X, y)
LinearRegression()
Print out the coefficients for the model
# The coefficient is the slope
print(LM.coef_)
print(LM.intercept_)

print(LM.score(X,y))
[0.78079641]
3.6214055418381896
0.6829467672722757
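
Putting the slope and intercept together, the fitted line is approximately:

\[ \widehat{\text{Height}} = 3.62 + 0.78 \times \text{Width} \]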

Testing for Linearity

How do we know if this is good?

We can always look at the score, \(R^2\), but this only tells us how much of the variation in the response the model explains; it does not tell us why the fit is poor. What could cause the metric to be low?

  • Data having a high amount of scatter
  • Data not actually being linear

How do we tell the difference?

\[ \text{Residual} = \text{Data Value} - \text{Predicted Value} \]

Add the prediction and the residual to our data frame!

Note that LM.predict(X) is equivalent to computing LM.intercept_ + LM.coef_[0]*DF_model['Width_in'] by hand.
DF_model['Height_predicted'] = LM.predict(X)
DF_model['Residual'] = DF_model['Height_in']-DF_model['Height_predicted']
DF_model
Height_in Width_in Height_predicted Residual
0 37.0 29.5 26.654900 10.345100
1 18.0 14.0 14.552555 3.447445
2 13.0 16.0 16.114148 -3.114148
3 14.0 18.0 17.675741 -3.675741
4 14.0 18.0 17.675741 -3.675741
... ... ... ... ...
3388 18.0 21.5 20.408528 -2.408528
3389 13.0 16.5 16.504546 -3.504546
3390 24.0 30.0 27.045298 -3.045298
3391 27.0 23.0 21.579723 5.420277
3392 27.0 23.0 21.579723 5.420277

3135 rows × 4 columns

Plot the residual

Now we can plot the residuals - this gives us information about whether or not the linear model was appropriate, even if there is a lot of scatter in our data.

  • Do a scatter plot of the Residual vs. the Predicted Value (Height).
  • Add a line at y=0, to make the residual plot easier to interpret.
fig = px.scatter(DF_model,x='Height_predicted',y='Residual')

# Update layout to show axis line at y=0
fig.update_layout(
    yaxis={'zeroline':True, 'zerolinewidth':1.5, 'zerolinecolor':"black"}
)

fig.update_layout(template="ggplot2",
                  title='Residual as a Function of Predicted Height',
                  title_x=0.5
                 )

fig.show()

Here we see that the residual still looks like a function of the predicted height (and therefore of the width)! The spread gets wider as the predicted height increases. This is not good. If our data were truly linear, the residuals after linear regression should not show any functional dependence. They should only reflect random scatter in our data!

Interpreting Residual Plots

What we are looking for

  • Residuals distributed randomly around 0
  • With no visible pattern along the x or y axes
Plot with Randomly Generated Data
df = pd.DataFrame({
    'fake_resid': np.random.normal(0, 30, 1000),  # Random normal data for residuals
    'fake_predicted': np.random.uniform(-100, 100, 1000)  # Random uniform data for predicted values
})

fig = px.scatter(df,x='fake_predicted',y='fake_resid')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='',
                  title_x=0.5
                 )

fig.show()

What we don’t want

Fan shapes

Plot with Randomly Generated Data
# Set random seed for reproducibility
np.random.seed(12346)

# Generate the 'fake_resid' values (a series of normal distributions with different standard deviations)
fake_resid = np.concatenate([
    np.random.normal(0, 1, 100),   # Normal distribution with mean 0, sd 1
    np.random.normal(0, 15, 100),  # Normal distribution with mean 0, sd 15
    np.random.normal(0, 25, 100),  # Normal distribution with mean 0, sd 25
    np.random.normal(0, 20, 100),  # Normal distribution with mean 0, sd 20
    np.random.normal(0, 25, 100),  # Normal distribution with mean 0, sd 25
    np.random.normal(0, 50, 100),  # Normal distribution with mean 0, sd 50
    np.random.normal(0, 35, 100),  # Normal distribution with mean 0, sd 35
    np.random.normal(0, 40, 100),  # Normal distribution with mean 0, sd 40
    np.random.normal(0, 80, 200)   # Normal distribution with mean 0, sd 80
])

# Generate the 'fake_predicted' values (a sequence from 0.2 to 200 with step 0.2)
fake_predicted = np.arange(0.2, 200.2, 0.2)

# Create the DataFrame
df = pd.DataFrame({
    'fake_resid': fake_resid,
    'fake_predicted': fake_predicted
})

fig = px.scatter(df,x='fake_predicted',y='fake_resid')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='',
                  title_x=0.5
                 )

fig.show()

What we don’t want

Groups of patterns

Plot with Randomly Generated Data
# Set random seed for reproducibility
np.random.seed(12346)

# Generate the 'fake_predicted' values (a sequence from 0.2 to 200 with step 0.2)
fake_predicted = np.arange(0.2, 200.2, 0.2)

# Generate the 'fake_resid' values (two normal distributions)
fake_resid = np.concatenate([
    np.random.normal(-20, 10, 500),  # Normal distribution with mean -20, sd 10
    np.random.normal(10, 10, 500)    # Normal distribution with mean 10, sd 10
])

# Create the DataFrame
df = pd.DataFrame({
    'fake_predicted': fake_predicted,
    'fake_resid': fake_resid
})

fig = px.scatter(df,x='fake_predicted',y='fake_resid')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='',
                  title_x=0.5
                 )

fig.show()

What we don’t want

Residuals correlated with predicted values

Plot with Randomly Generated Data
# Set random seed for reproducibility

np.random.seed(12346)

# Generate the 'fake_predicted' values (a sequence from 0.2 to 200 with step 0.2)
fake_predicted = np.arange(0.2, 200.2, 0.2)

# Generate 'fake_resid' values by adding normal noise to 'fake_predicted'
fake_resid = fake_predicted + np.random.normal(0, 50, len(fake_predicted))

# Create the DataFrame
df = pd.DataFrame({
    'fake_predicted': fake_predicted,
    'fake_resid': fake_resid
})

fig = px.scatter(df,x='fake_predicted',y='fake_resid')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='',
                  title_x=0.5
                 )

fig.show()

What we don’t want

Any patterns!

Plot with Randomly Generated Data
# Set random seed for reproducibility
np.random.seed(12346)

# Generate the 'fake_predicted' values (a sequence from -100 to 100 with step 0.4)
fake_predicted = np.arange(-100, 100.4, 0.4)

# Calculate 'fake_resid' values based on the formula with added noise
fake_resid = -5 * fake_predicted**2 - 3 * fake_predicted + 20000 + np.random.normal(0, 10000, len(fake_predicted))

# Create the DataFrame
df = pd.DataFrame({
    'fake_predicted': fake_predicted,
    'fake_resid': fake_resid
})

fig = px.scatter(df,x='fake_predicted',y='fake_resid')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='',
                  title_x=0.5
                 )

fig.show()

What does the residual plot tell us?

What patterns does the residuals plot reveal that should make us question whether a linear model is a good fit for modeling the relationship between height and width of paintings?

  • We don’t want to see any patterns whatsoever.
  • All interesting relationships should have been captured by the linear model.
  • Any remaining pattern means that the linear model may not be the best fit - there is still something going on here.
Same plot from above
fig = px.scatter(DF_model,x='Height_predicted',y='Residual')

# Update layout to show axis lines
fig.update_layout(
    yaxis=dict(zeroline=True, zerolinewidth=1.5, zerolinecolor="black")
)

fig.update_layout(template="ggplot2",
                  title='Residual as a Function of Predicted Height',
                  title_x=0.5
                 )

fig.show()

We still see a pattern here! The residual is increasing as our predicted height increases. So this means maybe there is something nonlinear going on here. We need to explore our linear assumption!

How do we explore linearity?

At the end of last class you tried to fit a model for price as a function of size. We tried to use a linear model for this, but you should have found that the model got a very low \(R^2\) value. Today we will investigate…

  • Is this because the model is nonlinear?
  • Is this because the data is noisy?
  • Could it be a little bit of both?

Let’s redo that analysis except this time focus only on paintings with area of less than 10,000 inches squared.

Answer to Exercise 2 Day 15 for paintings < 10000 in^2
# Get the columns I care about
my_columns = ['Surface','price']
DF_model2 = DF[my_columns].copy()

# Do some preprocessing - drop NA and make the Surface variable a float
DF_model2.dropna(inplace=True)
DF_model2['Surface'] = DF_model2['Surface'].apply(lambda x: float(x))

# Mask the data for surface area less than 10,000
mask = DF_model2['Surface'] <= 10_000
DF_model2 = DF_model2[mask]

# Make a Scatter plot
fig = px.scatter(DF_model2,
                 x='Surface',
                 y='price',
                 color_discrete_sequence=['black'],
                 opacity=0.2)

fig.update_layout(template="ggplot2",
                  title='Price as a function of Surface Area - total area < 10,000 sq in',
                  title_x=0.5,
                  xaxis_title='Surface Area',
                  yaxis_title='Price')


fig.show()

# Create the X and y variables for Linear Regression
X = DF_model2['Surface'].values.reshape(-1,1)
y = DF_model2['price'].values

# Create linear regression object - a random straight line
LM = LinearRegression()
# Train the model using the data
LM.fit(X, y)

# The score from my model
print('Model Score:')
print(LM.score(X,y))
Model Score:
0.014875957473786783

Exercise 1 Create a Residual plot for the model above.

  1. Get the predictions - store these in a column in the data frame
LM.predict(X)
  2. Calculate the residuals - store these in a column in the data frame
'Residual' = 'Real Value in the Data' - 'Value Predicted by LM'
  3. Plot the result
px.scatter(df,x='Value Predicted by LM',y='Residual')

See if you can recreate the plot shown in the lecture.

Q What do you see here? Do the residuals seem randomly scattered around zero?
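
One possible sketch of these steps, assuming the DF_model2, X, and LM fit on Surface from above (the column names are just for illustration):

# Store the predictions and residuals in the data frame
DF_model2['price_predicted'] = LM.predict(X)
DF_model2['Residual'] = DF_model2['price'] - DF_model2['price_predicted']

# Plot the residual vs. the predicted price, with a line at y=0
fig = px.scatter(DF_model2, x='price_predicted', y='Residual')

fig.update_layout(
    yaxis={'zeroline': True, 'zerolinewidth': 1.5, 'zerolinecolor': "black"}
)

fig.update_layout(template="ggplot2",
                  title='Residual as a Function of Predicted Price',
                  title_x=0.5)

fig.show()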

What do we do if our data is not linear?

Sometimes we can apply a transformation to our response variable (in this case price) that will help us “unpack” the nonlinearity. Let’s look at a histogram of the prices.

Histogram of Prices
fig = px.histogram(DF_model2,x='price',nbins =30)

fig.show()

The price data is very skewed: most of the data is piled up on one side of the histogram. In other words, most paintings sold for less than 5000 livres (the currency of the time). When we see extremely skewed data like this, which looks like a decaying exponential, it suggests applying a natural log function to the price data!

Use the log()

Remember:

\[\log(e^x) = x\]

The natural log removes the exponential dependence. For our purposes, all of these notations mean the same thing:

\[\log(x) = \ln(x) = \log_e(x)\]
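
As a quick check in code, note that numpy’s np.log is the natural log (np.log10 and np.log2 give the other common bases):

print(np.log(np.e))      # 1.0 - the natural log undoes e^x
print(np.log10(100.0))   # 2.0 - base-10 log, not what we want here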

Below we will apply the natural log to the price column and store that data in a new column in our data frame:

DF_model2['log_price'] = np.log(DF_model2['price'])

Then redo the histogram using the log prices and a scatter plot using the log prices

# Use numpy to take the natural log of the data - removing the exponential decay
DF_model2['log_price'] = np.log(DF_model2['price'])
Histogram of Log Prices
fig = px.histogram(DF_model2,x='log_price',nbins =30)

fig.show()
Scatter Plot of Log Prices
fig = px.scatter(DF_model2,
                 x='Surface',
                 y='log_price',
                 color_discrete_sequence=['black'],
                 opacity=0.2)

fig.update_layout(template="ggplot2",
                  title='log(Price) as a function of Surface Area - total area < 10,000 sq in',
                  title_x=0.5,
                  xaxis_title='Surface Area',
                  yaxis_title='log(Price)')


fig.show()

Q What is different about the histogram and the scatter plots after taking the natural log?

Redo linear regression on the log() data

Now let’s try to fit a linear model again!

Exercise 2: Redo the linear regression analysis except this time use the log_price.

  • Find the linear regression model (LM)
  • Calculate the residual
  • Plot the Residual as a function of the predicted log price
  • Plot a scatter plot of the data with the linear regression line added.

HINT You can see my results in the lecture notes!

LinearRegression()
Coefficient:
[0.0002376]
Intercept:
4.911544880415433
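
The coefficient and intercept above come from a fit along these lines (a minimal sketch, assuming DF_model2 already has the log_price column from above):

# Same explanatory variable, but now the response is the log price
X = DF_model2['Surface'].values.reshape(-1, 1)
y = DF_model2['log_price'].values

LM = LinearRegression()
LM.fit(X, y)

print('Coefficient:')
print(LM.coef_)
print('Intercept:')
print(LM.intercept_)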

What is the model telling me?

\[ \widehat{\log(\text{price})} = 4.912 + 0.00024 \times \text{Surface Area}\]

How can we interpret this result so it actually makes sense in the context of our problem?


Properties of exponents and logs:

\[e^{\log(x)} = x\]

\[\log(a) - \log(b) = \log(a/b)\]


If our surface area increases by 1 square inch, how much should our price increase? Let’s look at the difference in predicted log prices when the surface area increases by one square inch. Write \(\text{price}_1\) for the predicted price at surface area \(SA\) and \(\text{price}_2\) for the predicted price at \(SA+1\).

Plugging into our formula:

\[\log(\text{price}_2) - \log(\text{price}_1) = [4.912 + 0.00024(SA +1)] - [4.912 + 0.00024(SA)]\]

doing algebra on the right hand side

\[\log(\text{price}_2) - \log(\text{price}_1) = 0.00024\]

using the log subtraction rule

\[ \log\left(\frac{\text{price}_2}{\text{price}_1}\right)= 0.00024\]

using the exponent undoing the log rule

\[ \frac{\text{price}_2}{\text{price}_1} = e^{0.00024}\]

calculating the exponent on the right hand side

\[ \frac{\text{price}_2}{\text{price}_1} \approx 1.00024\]

solving for \(\text{price}_2\)

\[ \text{price}_2 \approx 1.00024 \times \text{price}_1\]

So this tells us that increasing the area of the painting by one square inch increases the predicted price by a factor of about 1.00024, or about 0.024%.
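
We can check this factor directly from the fitted slope, assuming LM is the regression on log_price from Exercise 2:

# e^(slope) is the multiplicative change in price for each extra square inch
print(np.exp(LM.coef_[0]))   # approximately 1.00024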

What did we learn…

There is a small positive increase in the price as the surface area increases, on average. Can we predict the price using the surface area?

LM.score(X,y)
0.013268243020669424

It does not appear that our linear regression on the log prices is a good predictor of the price. Even though it looks like we captured a real linear relationship, we do not have a good predictor. The scatter is still very large!

BUT - we are still able to see a linear trend in the model. There is a relationship here even though the data is very noisy!

Recap

  • Non-constant variance is one of the most common model violations; however, it is usually fixable by transforming the response (y) variable.

  • The most common transformation is the log transform, \(\log(y)\), which is especially useful when the response variable is (extremely) right skewed.

  • This transformation is also useful for variance stabilization.

  • When using a log transformation on the response variable the interpretation of the slope changes:

“For each unit increase in x, y is expected on average to be higher/lower by a factor of \(e^{b_1}\).”

  • Another useful transformation is the square root, \(\sqrt{y}\), which is especially useful when the response variable is a count - see the short sketch below.
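
A minimal sketch of that transformation, using a hypothetical sqrt_price column just for illustration:

# Square-root transform of the response - often helpful when the response is a count
DF_model2['sqrt_price'] = np.sqrt(DF_model2['price'])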

Exercise 3 Redo the full analysis except this time try using just height to predict price.

  • Do a standard linear regression using height to predict price (without the log) and discuss the results. This should include a plot of the residuals and a plot showing the linear fit. You should also talk about what the score, intercept, and coefficient of the model are telling you, e.g. as the height increases by 1 in, the price…
  • Do a linear regression using height to predict log_price and discuss the results. This should include a plot of the residuals and a plot showing the linear fit. You should also talk about what the score, intercept, and coefficient of the model are telling you, e.g. as the height increases by 1 in, the price… Remember that in this case you have to use the rules of logs and exponents to interpret the results.

Which of these models do you think is doing a better job of capturing the functional relationships in the data? Why?