Introduction to Data Science

Reading Data and Data Types

Author

Joanna Bieri
DATA101

Important Information

Announcements

Exam 1 - what to expect…

Please come to office hours to get help!

Day 8 Assignment - same drill.

  1. Make sure you can Fork and Clone the Day8 repo from Redlands-DATA101
  2. Open the file Day8-HW.ipynb and start doing the problems.
    • You can do these problems as you follow along with the lecture notes and video.
  3. Get as far as you can before class.
  4. Submit what you have so far - Commit and Push to Git.
  5. Take the daily check-in quiz on Canvas.
  6. Come to class with lots of questions!
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

Reading in Your Own Data

A big part of data science is finding your own data and then analyzing it. To be able to use Python to analyze data you first have to learn how to load data into a JupyterLab Notebook. There are lots of different ways, but in this class we will use Pandas (pd) to read data.

  • pd.read_csv This command lets you read files that end in .csv. You can create a file like this using a spreadsheet program and saving the file as a .csv. Often you can download data in .csv format. You will need to make sure that your data is in the same folder as the code you are writing or that you can tell your code EXACTLY where the file is.
  • pd.read_excel This command lets you read files created by Excel, typically with .xlsx file types. You will need to make sure that your data is in the same folder as the code you are writing or that you can tell your code EXACTLY where the file is.
  • pd.read_html This command will try to automatically download table type data from a given website. You will need an internet connection to use this.
  • pd.read_json This command lets you read .json files. JSON stands for JavaScript Object Notation. It is a very popular way to save data. You will need to make sure that your data is in the same folder as the code you are writing or that you can tell your code EXACTLY where the file is.
  • pd.read_table If other methods fail, sometimes you can force Pandas to read a file by using the table method. You will need to make sure that your data is in the same folder as the code you are writing or that you can tell your code EXACTLY where the file is. You will also have to tell Pandas what your delimiter is - the character that separates the columns (sometimes this is a comma like in .csv, but sometimes it is something else like a colon or a space).
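Here is a minimal sketch of what these readers look like in practice. The file names and the website below are just placeholders - swap in your own data and make sure the path points at where your file actually lives.

df_csv = pd.read_csv('my_data.csv')                       # comma separated values
df_excel = pd.read_excel('my_data.xlsx')                  # Excel file (may need the openpyxl package)
df_json = pd.read_json('my_data.json')                    # JSON file
df_list = pd.read_html('https://example.com/some-table')  # returns a LIST of tables from a web page
df_table = pd.read_table('my_data.txt', sep=':')          # generic reader - sep sets the delimiter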

Try reading in some data - csv

Go to the Cal Fire Website and scroll to the bottom to see the Incident Data. We will download the file named ALL DATA AS CSV - this should put the data file into your Downloads folder.

Next you need to move the file mapdataall.csv from your Downloads folder into your Day8 folder where you are doing your homework. You can open your Downloads folder and drag the file into the JupyterLab side bar. Then I can run the command

DF_raw = pd.read_csv('mapdataall.csv')

to load the data and look at the data frame.

DF = pd.read_csv('mapdataall.csv')
show(DF)
[Interactive table of mapdataall.csv. Columns include: incident_name, incident_is_final, incident_date_last_update, incident_date_created, incident_administrative_unit, incident_county, incident_location, incident_acres_burned, incident_containment, incident_cooperating_agencies, incident_longitude, incident_latitude, incident_type, incident_id, incident_url, incident_date_extinguished, is_active, calfire_incident, and notification_desired.]

Answer some questions about this data

Q How many variables and observations?

Q How many different incident types are there?

Q Make a histogram of the acres burned and color the bars by whether or not the incident was a calfire incident. You will probably need to make a mask to remove very small and very large fires. How many fires burned more than 100,000 acres? What is the largest fire in the data?

Q EXTRA - CHALLENGE - See if you can create a graph that answers the question: Are fires getting bigger or more frequent over time? You get complete creative control on how to answer this question!
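If you are not sure where to start on the histogram question, here is one possible pattern - just a sketch, the cutoff values are only examples and you should pick your own:

# Mask out very small and very large fires so the histogram is readable
mask = (DF['incident_acres_burned'] > 10) & (DF['incident_acres_burned'] < 100000)
fig = px.histogram(DF[mask], x='incident_acres_burned', color='calfire_incident')
fig.show()

# Count the very large fires and find the biggest one
print((DF['incident_acres_burned'] > 100000).sum())
print(DF.loc[DF['incident_acres_burned'].idxmax(), 'incident_name'])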

Try reading some data - html

Now we will define a variable:

my_website = 'https://www.fire.ca.gov/incidents'

and see what happens when we use the pd.read_html() command.

First we need to install the lxml module… you can always install packages by running: conda install -y package-name

!conda install -y lxml
Retrieving notices: ...working... done
Channels:
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/bellajagu/anaconda3/envs/DATA101

  added / updated specs:
    - lxml


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libxslt-1.1.39             |       h76b75d6_0         248 KB  conda-forge
    lxml-5.3.0                 |  py312he28fd5a_1         1.3 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         1.6 MB

The following NEW packages will be INSTALLED:

  libxslt            conda-forge/linux-64::libxslt-1.1.39-h76b75d6_0 
  lxml               conda-forge/linux-64::lxml-5.3.0-py312he28fd5a_1 

The following packages will be UPDATED:

  openssl                                  3.3.1-hb9d3cd8_3 --> 3.3.2-hb9d3cd8_0 



Downloading and Extracting Packages:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
my_website = 'https://www.fire.ca.gov/incidents'
DF = pd.read_html(my_website)
DF
[          Incident                     Counties    Started  Acres Containment
 0        Soda Fire              San Luis Obispo  9/30/2024     80          0%
 1       Ridge Fire                Santa Barbara  9/29/2024     17         30%
 2     Airport Fire            Orange, Riverside  9/09/2024  23526         95%
 3      Bridge Fire  Los Angeles, San Bernardino  9/08/2024  54878         98%
 4        Line Fire               San Bernardino  9/05/2024  43251         80%
 5  Coffee Pot Fire                       Tulare  8/03/2024  14104         93%]

Notice that this returns a LIST of DataFrames. I know this because it is not printed like a DataFrame and I can see the square brackets. To get the data frame I need to tell Python which item in the list I want to see. Since there is only one item in the list I can use

DF[0]

to get the DataFrame.

DF[0]
          Incident                     Counties    Started  Acres Containment
0        Soda Fire              San Luis Obispo  9/30/2024     80          0%
1       Ridge Fire                Santa Barbara  9/29/2024     17         30%
2     Airport Fire            Orange, Riverside  9/09/2024  23526         95%
3      Bridge Fire  Los Angeles, San Bernardino  9/08/2024  54878         98%
4        Line Fire               San Bernardino  9/05/2024  43251         80%
5  Coffee Pot Fire                       Tulare  8/03/2024  14104         93%

read_html() opens the website and then tries to read in any table it finds there. You will see that we get less data this way than we would if we could find a .csv to download.
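Since read_html() always returns a list, a quick way to see what came back is to check how many tables were found and how big each one is (a small sketch, assuming DF holds the result of pd.read_html):

print(len(DF))             # how many tables were found on the page
for i, table in enumerate(DF):
    print(i, table.shape)  # (rows, columns) for each table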

Try reading in some data from Wikipedia

Here we will explore academy award winning films. Go to the Wiki for the List of Academy Award Winning Films. Look at what type of data is there. How many tables? Any weird looking data?

Now read the html data into Python

my_website = "https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films"
DF = pd.read_html(my_website)

More than one table

You will notice that this results in a list of two different data frames! DF[0] contains all the movies from the main table on the website and DF[1] contains information about other aspects of the awards.

Let’s try a little analysis of the data in DF[0]

DF_raw = DF[0]
show(DF_raw)
[Interactive table with columns: Film, Year, Awards, Nominations]
DF_raw.describe()
                Film  Year Awards Nominations
count           1373  1373   1373        1373
unique          1359    96     22          24
top     Little Women  1945      1           1
freq               3    21    979         550

Notice that we are getting very limited statistical information!! What is going wrong here?

Something is wrong with our data!

What should the data types be:

  • Film - should be categorical (a word or string)
  • Year - should be discrete numerical (a number or integer)
  • Awards - should be discrete numerical (a number or integer)
  • Nominations - should be discrete numerical (a number or integer)

So we should be able to get good statistics on ‘Year’, ‘Awards’, and ‘Nominations’. Let’s sub select just these columns.

my_columns = ['Year', 'Awards', 'Nominations']
DF_raw[my_columns].describe()
        Year Awards Nominations
count   1373   1373        1373
unique    96     22          24
top     1945      1           1
freq      21    979         550

This still is not making sense. What if I wanted to know the average number of awards?

Something is really wrong with our data - I get an error!

DF_raw['Awards'].mean()
TypeError: Could not convert string '7421111111111742111111111136211111111111132212221121111142111111223111113433111111111114322221111111110 (1)[1]36221111111111126311111111111114431111111111111373221111111343222111111111552111111111111443222111111163222111111111832211111111143221111111111111432221111111111133331111111111114522111111111111122111111111632211111111111442211111111111154411111111115422111111111175311111111111222111111192111111111111152222111111111110 (1)6222111111111111733211111111433211111111111542221111111173211111111110 (1)432221111111111433 (1)11111111111119311111111110 (1)43321111111111117221111111111118322111111110 (1)544211111110 (1)842111111111144 (1)3321111114322211 (1)1111154221111111110 (1)532111111111110 (1)46 (1)31 (1)111111113442111111110 (1)0 (1)5431111111110 (2)632111 (1)11111117221111111111381 (1)1111111115322211111111721111111111111343211111111115 (1)3211111111111110 (1)532221111111116532221111115532111111110 (1)321311111162131161112145115115211211111117 (1)10 (1)1120 (1)1112111111111121211711111111113113121111130 (1)111411111211111111111111112111113111111211111131111110 (2)211211111110 (1)0 (1)1180 (1)21 (1)13119111171118 (2)2111112121113211411114110 (1)2114111110 (1)1531311211111120 (1)4111510 (1)22 (1)111122111512111110 (1)1111117111231110 (2)11111224311111114511211113210 (1)11111211161111811221111111111111211181211111112221112610 (1)1111111111112110 (1)1121221323120 (1)141111 (1)1112121111110 (1)1111110 (1)1111241 (1)111140 (1)111121111 (1)13114111331110 (1)1110 (1)1111131111112111111111111134111111131111211111111111110 (1)1111101111111111521111211111111312123' to numeric

Help my data is broken!

What you see happening here is that the data is not in the format that you expected. Now as a data scientist you have to decide what to do!

Data Types

In Python there are four data types that we will work with most:

  • String - this is a word, sentence, letter… surrounded by quotes.
  • Integer - this is a whole number with no decimal point - fractions are not allowed (converting to an integer drops the decimal part)
  • Float - this is a number with decimal places - fractions allowed
  • Boolean - this is True or False

Let’s decipher the error message:

“Could not convert string — to numeric.”

This is telling us that when Python tried to take that average it ran into strings (words). It then tried to convert those words into numbers, but ran into problems. So it cannot take the average here.
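You can reproduce the same kind of failure on a tiny example (this sketch is not from the original notes). With a column of strings, pandas tries to "add up" the values, which glues the strings together, and then it cannot turn that glued-together string into a number - that is exactly the giant string in the error above. The exact message depends on your pandas version, but it fails the same way.

small = pd.Series(['7', '4', '2 (1)'])  # strings that look (mostly) like numbers
small.mean()                            # raises TypeError: Could not convert string ... to numeric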

Data Type Exploration

Below are some variables that contain different data - we will use the type() command to see what data type they are.

num1 = 1
num2 = 1.0
num3 = '1'
num4 = 1.6
num5 = '1.2'
print(type(num1))
print(type(num2))
print(type(num3))
print(type(num4))
print(type(num5))
<class 'int'>
<class 'float'>
<class 'str'>
<class 'float'>
<class 'str'>

Does this result match our definitions above?

Changing Data Types

Now what if we wanted to change these data types from one type to another? We can use the commands:

  • str() - to change to a string
  • int() - to change to an integer
  • float() - to change to a float

Intentional Converting Between Types:

num1_str = str(num1)
print(type(num1_str))
num1_str
<class 'str'>
'1'
num2_int = int(num2)
print(type(num2_int))
num2_int
<class 'int'>
1
num3_int = int(num3)
print(type(num3_int))
num3_int
<class 'int'>
1
# int() drops the decimal part - it truncates toward zero rather than rounding!
num4_int = int(num4)
print(type(num4_int))
num4_int
<class 'int'>
1
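A quick extra check (not in the original notes) of what int() actually does with the decimal part: it simply drops it, so negative numbers move toward zero rather than down.

print(int(1.6))   # 1  - the .6 is dropped
print(int(-1.6))  # -1 - also dropped, so this is truncation, not rounding down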
num5_float = float(num5)
print(type(num5_float))
num5_float
<class 'float'>
1.2

Unintentional Converting Between Types

Sometimes Python functions will automatically convert between types or make type decisions for you!

Here is an example of adding an integer and a float. Python returns the result as a float.

num1 = 1
num2 = 1.0

print(type(num1+num2))
num1+num2
<class 'float'>
2.0

Here we see that even though we send in integers, when we take the average it returns a float.

my_numbers = [2,2,2,2]
np.mean(my_numbers)
2.0

Fun with Strings!

Many times your data issues will arise when someone enters data in a wrong format. This often means that you have to deal with strings that need to be fixed. Here are some useful commands

  • .lstrip() - removes white space from the left (leading) side
  • .rstrip() - removes white space from the right (trailing) side
  • .replace(string1, string2) - replaces every occurrence of string1 with string2 (as long as string1 appears in the original string)
  • .split(character) - splits the string into chunks based on the character provided

There are lots more, but this is enough to learn for now!

my_string = '   Year: 2024/25   '
len(my_string)
19
my_string = my_string.lstrip()
len(my_string)
16
my_string = my_string.rstrip()
len(my_string)
13
my_string = my_string.replace(':',',')
my_string
'Year, 2024/25'
my_broken_string = my_string.split('/')
my_broken_string
['Year, 2024', '25']
# Now we can select just one part of the original string
my_broken_string[1]
'25'

So How Can I Fix my Data?

Remember that this entire problem started when we were exploring the movie data stored in DF_raw. Let’s learn a new command to help us see what type of data is in the DataFrame. To report the type of data in each variable use the command

DF_raw.dtypes
DF_raw.dtypes
Film           object
Year           object
Awards         object
Nominations    object
dtype: object

object means that the data is a string or that there is at least one string in the column. Does it make sense for year to be a string?
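One quick way to see which entries are keeping a column stuck as object - not something we use below, just a handy check - is pd.to_numeric with errors='coerce', which turns anything it cannot convert into NaN:

converted = pd.to_numeric(DF_raw['Year'], errors='coerce')
DF_raw.loc[converted.isna(), 'Year'].head()  # shows the values that refuse to become numbers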

year_data = DF_raw['Year'].value_counts()
DF_year = year_data.reset_index().rename(columns={"index": "value", 0: "count"})
DF_year
Year count
0 1945 21
1 1949 20
2 1942 20
3 1950 19
4 1948 19
... ... ...
91 1931/32 10
92 1932/33 9
93 1928/29 7
94 1929/30 6
95 1930/31 6

96 rows × 2 columns

There are some years that look like normal integer years and others that are formatted like 1931/32. How can we fix this so we can do our analysis?

This is more of an art than a science - as a data scientist you need to make these types of decisions and keep track of why and how you make the decision. For the data above, my plan is to remove the extra data /32 and just use the first part of the string for year.

.apply() and lambda

  • The .apply() command lets you apply a function to every entry in a column (or every row of a DataFrame).

  • lambda is a special way to write a small, unnamed function: you first write the word lambda and choose a variable name, followed by a colon

    lambda x:

    then you give an operation to do to every x (in this case every entry in the “Year” column)

    x.split('/')[0]

    this will split up the string so that 1930/31 becomes ['1930', '31'], then the [0] just selects the first part of the list (see the small example below).
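Here is a tiny standalone example of .apply() with a lambda, using made-up values so you can see the pattern before we touch the real column:

toy = pd.Series(['1945', '1931/32', '1928/29'])
toy.apply(lambda x: x.split('/')[0])   # gives '1945', '1931', '1928' (still strings!)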

We can apply this operation and save the information over the “Year” column so the data will be in a better format!

DF_raw['Year'] = DF_raw['Year'].apply(lambda x: x.split('/')[0])
year_data = DF_raw['Year'].value_counts()
DF_year = year_data.reset_index().rename(columns={"index": "value", 0: "count"})
DF_year
Year count
0 1945 21
1 1949 20
2 1942 20
3 1950 19
4 1948 19
... ... ...
91 1931 10
92 1932 9
93 1928 7
94 1929 6
95 1930 6

96 rows × 2 columns

Now we will turn all the data in the Year column into integers. There are lots of ways to do this, but let's practice the .apply() with a lambda method.

DF_raw['Year'] = DF_raw['Year'].apply(lambda x: int(x))
DF_raw.dtypes
Film           object
Year            int64
Awards         object
Nominations    object
dtype: object

Notice that we fixed the Year column! Let's start working on the other columns.

Do the same type of thing to the Awards column

Look at what kind of data is in the “Awards” column.

award_data = DF_raw['Awards'].value_counts()
DF_award = award_data.reset_index().rename(columns={"index": "value", 0: "count"})
DF_award
Awards count
0 1 979
1 2 156
2 3 77
3 4 47
4 0 (1) 32
5 5 25
6 7 12
7 6 12
8 1 (1) 8
9 8 7
10 11 3
11 9 3
12 0 (2) 3
13 0 (1)[1] 1
14 3 (1) 1
15 4 (1) 1
16 6 (1) 1
17 5 (1) 1
18 7 (1) 1
19 8 (2) 1
20 2 (1) 1
21 10 1

What the heck are these extra () and [] numbers?

To understand this better I had to go back to the website. It seems that things in parentheses are special awards and things in square brackets are citations.

So I will choose to remove this information. THIS IS JUST ONE CHOICE - alternatively we could choose to add another column just for special awards if these were important to our analysis.
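If the special awards did matter to you, one way to keep them - just a sketch of that alternative, not what we do below, and the column name Special is made up - would be to pull the number inside the parentheses into its own column before cleaning Awards:

# Pull the number inside ( ) into a new column; rows with no ( ) become 0
DF_raw['Special'] = DF_raw['Awards'].str.extract(r'\((\d+)\)')[0].fillna(0).astype(int)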

Looking at the data, I only want the numbers to the left of a ( so I will use this character to split my strings. I will also turn all the data into integer types.

DF_raw['Awards'] = DF_raw['Awards'].apply(lambda x: int(x.split('(')[0]))
award_data = DF_raw['Awards'].value_counts()
DF_award = award_data.reset_index().rename(columns={"index": "value", 0: "count"})
DF_award
Awards count
0 1 987
1 2 157
2 3 78
3 4 48
4 0 36
5 5 26
6 7 13
7 6 13
8 8 8
9 11 3
10 9 3
11 10 1
DF_raw.dtypes
Film           object
Year            int64
Awards          int64
Nominations    object
dtype: object

You Try - Nominations column

Q Now it is your turn. Fix the Nominations column so that all the data is in integer format.

First look at the data and notice that the [ character seems to split the number of nominations from something that looks like a reference.

nom_data = DF_raw['Nominations'].value_counts()
DF_nom = nom_data.reset_index().rename(columns={"index": "value", 0: "count"})
show(DF_nom)
print('I can see that I want the data to the left of the [ character')

DF_raw['Nominations'] = DF_raw['Nominations'].apply(lambda x: int(x.split('[')[0]))
nom_data = DF_raw['Nominations'].value_counts()
DF_nom = nom_data.reset_index().rename(columns={"index": "value", 0: "count"})
show(DF_nom)

DF_raw.dtypes

If everything went to plan you should see:

DF_raw.dtypes
Film           object
Year            int64
Awards          int64
Nominations     int64
dtype: object

Now we can do some statistics and continue our analysis!

DF_raw.describe()
Year Awards Nominations
count 1373.000000 1373.000000 1373.000000
mean 1975.864530 1.573197 3.606701
std 27.416755 1.392371 3.125371
min 1927.000000 0.000000 0.000000
25% 1952.000000 1.000000 1.000000
50% 1975.000000 1.000000 2.000000
75% 2000.000000 2.000000 5.000000
max 2023.000000 11.000000 14.000000
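And the question that broke everything earlier - the average number of awards - now works:

DF_raw['Awards'].mean()   # about 1.573, the same value as the mean row in the table above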

Warning - Be Kind!

Fixing messy data can be a challenging and complicated process. Please be nice to yourself!

What is expected in this class:

  1. You can find some data online and load it into Python
  2. You can figure out what types of data are in each column
  3. You can say in words what format you would like the data to be in.
  4. You can patiently TRY to fix the data.
  5. You come get help from your Professor, TA, or friends before you get too frustrated!

Try to find some data of your own.

Your homework today will be to see if you can find some data of your own. This can be the first step you take toward your final project.