Introduction to Data Science

Exam 2 Python Commands Cheat Sheet

Author

Joanna Bieri
DATA101

Basic Imports
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

Entering Your Own Data

It is useful to be able to enter your own data. This helps with using Python in your science classes and in redoing plots when you suspect misrepresentation. Here is the process:

  1. Create an empty data frame

     DF = pd.DataFrame()
  2. Enter the data in lists

     column1 = [-,-,-,-]
     column2 = [-,-,-,-]
  3. Add the columns to the data frame

     DF['column1'] = column1
     DF['column2'] = column2

Code examples

Entering data into empty data frame
df = pd.DataFrame()
    
names =['Alice', 'Bob', 'Eve', 'Eve', 'Alice', 'Bob']
examnums = ['one', 'one', 'one', 'two', 'two','two']
grades = [92, 95, 70, 86, 90, 80]

df['name'] = names
df['exam'] = examnums
df['grade'] = grades

df
name exam grade
0 Alice one 92
1 Bob one 95
2 Eve one 70
3 Eve two 86
4 Alice two 90
5 Bob two 80

Renaming Columns

Sometimes it is nice to give your columns easier-to-use names. If your data is downloaded with hard-to-remember or hard-to-type names, it is more likely that you will make errors using those names later. I usually try to remove spaces in names and reduce the length of the names. We can use the .rename() function to do this:

DF.rename( columns={ 'old name 1':'new name 1' , 'old name 2':'new name 2' , 'old name 3':'new name 3', ... }, inplace=True )

Here we use curly brackets {} to surround the names. You list the old name first, then a colon, then the new name. The flag inplace=True tells Python to change the data frame directly, not just in the printout. You can rename as many or as few columns as you want.

Code Example

Renaming columns
df.rename( columns={'name' : 'student_name'}, inplace=True )
df
student_name exam grade
0 Alice one 92
1 Bob one 95
2 Eve one 70
3 Eve two 86
4 Alice two 90
5 Bob two 80

Pivots and Melts - Rearranging Data

Often the data we are given is in an order that is hard to use. To rearrange the data in our data frame we have two options:

  • The .pivot() method reshapes your data so that one of the columns becomes the row labels. To use the pivot method in Pandas, you need to specify three parameters:
    • index: Which column should be used to identify and order your rows vertically.
    • columns: Which column should be used to create the new columns in our reshaped DataFrame.
    • values: Which column(s) should be used to fill the values in the cells of our DataFrame.
  • The pd.melt() function can take column labels and melt them into your table values. To use pd.melt() in Pandas you pass the DataFrame that you are melting plus three parameters:
    • id_vars: the list of columns that should remain in your new data frame - used to identify the unique rows.
    • var_name: the name of the new column that will hold the old column labels.
    • value_name: the name of the new column that will hold the values.

Code Example - Pivot

Here I take the data frame above and I want the student names to be the index (index = left-hand column). I want the exam numbers to appear as the columns (columns = categorical data used to label each column). I want the grade to be the values (values = data that is inside of the data frame).

code for doing a pivot
df_new = df.pivot(index='student_name', columns='exam', values='grade')
df_new
exam one two
student_name
Alice 92 90
Bob 95 80
Eve 70 86

Code Example - Melt

Here is some new data for exploring the pd.melt() function.

Entering new data
data = {'Name': ['Alice', 'Bob', 'Eve'],
        'Math_2022': [85, 90, 78],
        'Math_2023': [92, 88, 95],
        'Science_2022': [70, 82, 75],
        'Science_2023': [75, 85, 80]}

df = pd.DataFrame(data)
df
Name Math_2022 Math_2023 Science_2022 Science_2023
0 Alice 85 92 70 75
1 Bob 90 88 82 85
2 Eve 78 95 75 80

In this data we want to melt the column labels into the data frame so that they can become categorical information.

To do this we need to first choose the Name to be the id_vars (id_vars = the columns that will remain in the new data set). In this case we still want a column called Name. The other two variables are what the remaining columns will be named. var_name is the name of the column for the melted data. In this case the column “Subject_Year” will contain the categorical data: Math_2022, Math_2023, Science_2022, Science_2023. The value_name is the name of the column that will hold the data that was inside the old data frame. In this case the column ‘Score’ will be the grades (numbers).

Code for doing a melt
df_new = pd.melt(df, id_vars=['Name'], var_name='Subject_Year', value_name='Score')
df_new
Name Subject_Year Score
0 Alice Math_2022 85
1 Bob Math_2022 90
2 Eve Math_2022 78
3 Alice Math_2023 92
4 Bob Math_2023 88
5 Eve Math_2023 95
6 Alice Science_2022 70
7 Bob Science_2022 82
8 Eve Science_2022 75
9 Alice Science_2023 75
10 Bob Science_2023 85
11 Eve Science_2023 80

Dealing with NaNs

When you find NaNs in your data, there are some really nice commands for dealing with them:

  • First, NaNs are a strange data type (np.nan): they are considered a float - like a decimal. In most raw data sets NaN means no data was given for that observation and variable, but be careful: NaN can also appear if a calculation goes wrong, for example 0/0 in NumPy.

  • .isna() creates a mask for whether or not each entry in the data is NaN.

  • .fillna() will replace each NaN in your data set with whatever you put inside the parentheses.

  • .dropna() will drop all rows that contain NaN - be careful with this command. You want to keep as much data as possible and .dropna() might delete too much!

I will add some NaNs to our data frame above.

Adding NaNs to the data
for i, s in enumerate(df_new['Score']):
    if s <= 80:
        # use .loc to avoid pandas' chained-assignment warning
        df_new.loc[i, 'Score'] = np.nan
df_new
Name Subject_Year Score
0 Alice Math_2022 85.0
1 Bob Math_2022 90.0
2 Eve Math_2022 NaN
3 Alice Math_2023 92.0
4 Bob Math_2023 88.0
5 Eve Math_2023 95.0
6 Alice Science_2022 NaN
7 Bob Science_2022 82.0
8 Eve Science_2022 NaN
9 Alice Science_2023 NaN
10 Bob Science_2023 85.0
11 Eve Science_2023 NaN
Counting NaNs in the data
# This creates a mask of whether or not an entry has NaN
print(df_new['Score'].isna())

print('-------------------------')
# We can add these up
print(sum(df_new['Score'].isna()))
0     False
1     False
2      True
3     False
4     False
5     False
6      True
7     False
8      True
9      True
10    False
11     True
Name: Score, dtype: bool
-------------------------
5
Dropping NaNs in the data overall - be careful
df_no_na = df_new.dropna()
df_no_na
Name Subject_Year Score
0 Alice Math_2022 85.0
1 Bob Math_2022 90.0
3 Alice Math_2023 92.0
4 Bob Math_2023 88.0
5 Eve Math_2023 95.0
7 Bob Science_2022 82.0
10 Bob Science_2023 85.0
Masking out NaNs in the data - more focused
# Notice that this does not work!
print(df_new['Score'] != np.nan)

print('-------------------------')
# I prefer to fill those NaNs with something I can search for and then do a normal mask
df_new['Score'].fillna(0,inplace=True)
print(df_new['Score'] != 0)


print('-------------------------')
# Here is the code for the mask
mask = df_new['Score'] != 0
df_new[mask]
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
Name: Score, dtype: bool
-------------------------
0      True
1      True
2     False
3      True
4      True
5      True
6     False
7      True
8     False
9     False
10     True
11    False
Name: Score, dtype: bool
-------------------------
Name Subject_Year Score
0 Alice Math_2022 85.0
1 Bob Math_2022 90.0
3 Alice Math_2023 92.0
4 Bob Math_2023 88.0
5 Eve Math_2023 95.0
7 Bob Science_2022 82.0
10 Bob Science_2023 85.0
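An alternative to filling the NaNs with a sentinel value is to build the mask directly from .isna() and negate it with ~ (since comparisons with np.nan are always False). A small sketch with made-up scores:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Eve'],
                   'Score': [85.0, np.nan, 92.0]})

# np.nan never compares equal to anything, even itself,
# so use .isna() and flip it with ~ to keep the non-NaN rows
mask = ~df['Score'].isna()
print(df[mask])
```

This keeps the original scores intact instead of overwriting the NaNs with 0.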

New Plotting and Visualization Tools

Reverse the order of the categories

Here we see that to read this alphabetically we have to read up from the bottom. Mathematically it makes sense to count along the y-axis starting from zero, but with words it seems a little bit strange. We can fix this in the figure layout. We just add:

yaxis={'categoryorder': 'category descending'}

We can also change ‘category descending’ to ‘category ascending’ if we want to force ascending order. And we could change yaxis to xaxis if we had categories along the x-axis.

Fixing category orders

Then we can fix the category order using the command

category_orders={'category name' : [write the order of the categories here]}

Try some new colors

We can add a built in color sequence using the command:

color_discrete_sequence=px.colors.qualitative.????

Here you can use tab completion to see what options you have for ????

Apply Templates

You use the command

template="template name"

Here are some built in options: [“plotly”, “plotly_white”, “plotly_dark”, “ggplot2”, “seaborn”, “simple_white”]

Moving the Legend

Plotly will let you place the legend exactly where you want it using the legend layout information:

legend={'orientation':"h",'yanchor':"bottom",'y':1.05, 'xanchor':"right",'x':0.95}
  • ‘orientation’:“h” - “h” means horizontal, “v” means vertical
  • xanchor and yanchor tell plotly which point of the legend the x and y coordinates refer to. So if we want to position our legend by its bottom-right corner we would use:
    • ‘yanchor’:“bottom”
    • ‘xanchor’:“right”
  • Then we can move the legend around using x and y:
    • ‘y’:1.05 - this would shift the legend up above the plot
    • ‘x’:0.95 - this would place the legend’s right edge just inside the right edge of the plot.

For the graph below I used the Religious Income Data from Day 9.

Plot with lots of options added
fig = px.histogram(DF_new,
             y='religion',
             x='proportion',
             color='income',
             barnorm='percent',
             color_discrete_sequence=px.colors.qualitative.Vivid,
             category_orders={'income' : ['Less than 30,000', '30,000-49,999', '50,000-99,999', '100,000 or more']})

fig.update_layout(template="plotly_white",
                  title='Income Distribution by religious group <br><sup> Data Source: Pew Research Center, Religious Landscape Study</sup>',
                  title_x=0.5,
                  yaxis={'categoryorder': 'category descending'},
                  xaxis_title="Proportion",
                  yaxis_title="",
                  legend_title='Income',
                  legend={'orientation':"h",'yanchor':"bottom",'y':-0.2, 'xanchor':"right",'x':1.05},
                  font={'family':"Times",'size':14,'color':"Darkblue"},
                  autosize=False,
                  width=800,
                  height=600)

fig.show()

Using Beautiful Soup to get HTML code

Websites are built using HTML code. That code tells the web browser (Firefox, Chrome, etc.) what to display. Websites range from very simple (just HTML) to much more complicated (JavaScript and more). When you load a website you can always see the source code.

This source code is what Beautiful Soup works with. For static (simple) sites the code is immediately available. More complicated sites might require Python to open the webpage, let the content render, and then download the code.

How to get data from static sites:

import requests
from bs4 import BeautifulSoup

website = 'name of your website'

raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')

Use developer tools!

Once you have the soup you can use .find_all() to search for things in the html code.

The .find_all() function searches through the information in soup and returns only the sections that match. Here are some important things you might search for:

  • ‘h2’ - this is a heading

  • ‘div’ - this divides a block of information

  • ‘span’ - this divides inline information

  • ‘a’ - this specifies a hyperlink

  • ‘li’ - this is a list item

  • class_= - many things have the class label (notice the underscore!)

  • string= - you can also search by strings.

Code Example

Let’s say I want the country information, here is an example of the code that contains the country information:

    <div class="col-md-4 country">
        <h3 class="country-name">
            <i class="flag-icon flag-icon-ad"></i>
            Andorra
        </h3>
        <div class="country-info">
            <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
            <strong>Population:</strong> <span class="country-population">84000</span><br>
            <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
        </div>
    </div>
Beautiful soup web scraping code
import requests
from bs4 import BeautifulSoup

# Get the data
website = 'https://www.scrapethissite.com/pages/simple/'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')

# Search the soup
result = soup.find_all('h3',class_="country-name")
DF = pd.DataFrame()
DF['country'] = result
DF['country'] = DF['country'].apply(lambda x: x.text.strip())
DF
country
0 Andorra
1 United Arab Emirates
2 Afghanistan
3 Antigua and Barbuda
4 Anguilla
... ...
245 Yemen
246 Mayotte
247 South Africa
248 Zambia
249 Zimbabwe

250 rows × 1 columns