Introduction to Data Science

Exam 2 Python Commands Cheat Sheet

Author

Joanna Bieri
DATA101

Basic Imports
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

Entering Your Own Data

It is useful to be able to enter your own data. This helps with using Python in your science classes and in redoing plots when you suspect misrepresentation. Here is the process:

  1. Create an empty data frame

     DF = pd.DataFrame()
  2. Enter the data in lists

     column1 = [-,-,-,-]
     column2 = [-,-,-,-]
  3. Add the columns to the data frame

     DF['column1'] = column1
     DF['column2'] = column2

Code examples

Entering data into empty data frame
df = pd.DataFrame()
    
names =['Alice', 'Bob', 'Eve', 'Eve', 'Alice', 'Bob']
examnums = ['one', 'one', 'one', 'two', 'two','two']
grades = [92, 95, 70, 86, 90, 80]

df['name'] = names
df['exam'] = examnums
df['grade'] = grades

df
name exam grade
0 Alice one 92
1 Bob one 95
2 Eve one 70
3 Eve two 86
4 Alice two 90
5 Bob two 80

Renaming Columns

Sometimes it is nice to give your columns easier-to-use names. If your data is downloaded with hard-to-remember or hard-to-type names, it is more likely that you will make errors using those names later. I usually try to remove spaces in names and reduce the length of the names. We can use the .rename() function to do this:

DF.rename( columns={ 'old name 1':'new name 1' , 'old name 2':'new name 2' , 'old name 3':'new name 3', ... }, inplace=True )

Here we use curly brackets {} to surround the names. You list the old name first, then a colon, then the new name. The flag inplace=True tells Python to change the data frame directly, not just in the printout. You can rename as many or as few columns as you want.

Code Example

Renaming columns
df.rename( columns={'name' : 'student_name'}, inplace=True )
df
student_name exam grade
0 Alice one 92
1 Bob one 95
2 Eve one 70
3 Eve two 86
4 Alice two 90
5 Bob two 80

Pivots and Melts - Rearranging Data

Often the data we are given is in an order that is hard to use. To rearrange the data in our data frame we have two options:

  • The .pivot() method reshapes your data so that one of the columns becomes the row labels. To use the pivot method in Pandas, you need to specify three parameters:
    • index: Which column should be used to identify and order your rows vertically.
    • columns: Which column should be used to create the new columns in our reshaped DataFrame.
    • values: Which column(s) should be used to fill the values in the cells of our DataFrame.
  • The pd.melt() function can take column labels and melt them into your table values. To use pd.melt() in Pandas you pass the DataFrame that you are melting plus three parameters:
    • id_vars: the list of columns that should remain in your new data frame - used to identify the unique rows.
    • var_name: the name of the new column that will hold the old column labels.
    • value_name: the name of the new column that will hold the values.

Code Example - Pivot

Here I take the data frame above and I want the student names to be the index (index = left-hand column). I want the exam numbers to appear as the columns (columns = categorical data used to label each column). I want the grade to be the values (values = data that is inside of the data frame).

code for doing a pivot
df_new = df.pivot(index='student_name', columns='exam', values='grade')
df_new
exam one two
student_name
Alice 92 90
Bob 95 80
Eve 70 86

Code Example - Melt

Here is some new data for exploring the pd.melt() function.

Entering new data
data = {'Name': ['Alice', 'Bob', 'Eve'],
        'Math_2022': [85, 90, 78],
        'Math_2023': [92, 88, 95],
        'Science_2022': [70, 82, 75],
        'Science_2023': [75, 85, 80]}

df = pd.DataFrame(data)
df
Name Math_2022 Math_2023 Science_2022 Science_2023
0 Alice 85 92 70 75
1 Bob 90 88 82 85
2 Eve 78 95 75 80

In this data we want to melt the column labels into the data frame so that they can become categorical information.

To do this we need to first choose the Name to be the id_vars (id_vars = the columns that will remain in the new data set). In this case we still want a column called Name. The other two variables are what the remaining columns will be named. var_name is the name of the column for the melted data. In this case the column “Subject_Year” will contain the categorical data: Math_2022, Math_2023, Science_2022, Science_2023. The value_name is the name of the column that will hold the data that was inside the old data frame. In this case the column ‘Score’ will be the grades (numbers).

Code for doing a melt
df_new = pd.melt(df, id_vars=['Name'], var_name='Subject_Year', value_name='Score')
df_new
Name Subject_Year Score
0 Alice Math_2022 85
1 Bob Math_2022 90
2 Eve Math_2022 78
3 Alice Math_2023 92
4 Bob Math_2023 88
5 Eve Math_2023 95
6 Alice Science_2022 70
7 Bob Science_2022 82
8 Eve Science_2022 75
9 Alice Science_2023 75
10 Bob Science_2023 85
11 Eve Science_2023 80

Dealing with NaNs

When you find NaNs in your data, there are some really nice commands for dealing with them:

  • First, NaNs are a strange data type (np.nan): they are considered a float - like a decimal. In most raw data sets NaN means no data was given for that observation and variable, but be careful: NaN can also appear if a calculation goes wrong, for example 0/0 in NumPy.

  • .isna() creates a mask for whether or not each entry in the data is NaN.

  • .fillna() will replace each NaN in your data set with whatever you put inside the parentheses.

  • .dropna() will drop all rows that contain NaN - be careful with this command. You want to keep as much data as possible and .dropna() might delete too much!

I will add some NaNs to our data frame above.

Adding NaNs to the data
for i, s in enumerate(df_new['Score']):
    if s <= 80:
        # use .loc to avoid pandas' chained-assignment warning
        df_new.loc[i, 'Score'] = np.nan
df_new
Name Subject_Year Score
0 Alice Math_2022 85.0
1 Bob Math_2022 90.0
2 Eve Math_2022 NaN
3 Alice Math_2023 92.0
4 Bob Math_2023 88.0
5 Eve Math_2023 95.0
6 Alice Science_2022 NaN
7 Bob Science_2022 82.0
8 Eve Science_2022 NaN
9 Alice Science_2023 NaN
10 Bob Science_2023 85.0
11 Eve Science_2023 NaN
Counting NaNs in the data
# This creates a mask of whether or not an entry has NaN
print(df_new['Score'].isna())

print('-------------------------')
# We can add these up
print(sum(df_new['Score'].isna()))
0     False
1     False
2      True
3     False
4     False
5     False
6      True
7     False
8      True
9      True
10    False
11     True
Name: Score, dtype: bool
-------------------------
5
Dropping NaNs in the data overall - be careful
df_no_na = df_new.dropna()
df_no_na
Name Subject_Year Score
0 Alice Math_2022 85.0
1 Bob Math_2022 90.0
3 Alice Math_2023 92.0
4 Bob Math_2023 88.0
5 Eve Math_2023 95.0
7 Bob Science_2022 82.0
10 Bob Science_2023 85.0
Masking out NaNs in the data - more focused
# Notice that this does not work!
print(df_new['Score'] != np.nan)

print('-------------------------')
# I prefer to fill those NaNs with something I can search for and then do a normal mask
df_new['Score'].fillna(0,inplace=True)
print(df_new['Score'] != 0)


print('-------------------------')
# Here is the code for the mask
mask = df_new['Score'] != 0
df_new[mask]
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
Name: Score, dtype: bool
-------------------------
0      True
1      True
2     False
3      True
4      True
5      True
6     False
7      True
8     False
9     False
10     True
11    False
Name: Score, dtype: bool
-------------------------
Name Subject_Year Score
0 Alice Math_2022 85.0
1 Bob Math_2022 90.0
3 Alice Math_2023 92.0
4 Bob Math_2023 88.0
5 Eve Math_2023 95.0
7 Bob Science_2022 82.0
10 Bob Science_2023 85.0
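An alternative to filling the NaNs with a sentinel value is to build the mask directly from .isna() and negate it with ~ (since comparisons with np.nan are always False). A small sketch with made-up scores:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Eve'],
                   'Score': [85.0, np.nan, 92.0]})

# np.nan never compares equal to anything, even itself,
# so use .isna() and flip it with ~ to keep the non-NaN rows
mask = ~df['Score'].isna()
print(df[mask])
```

This keeps the original scores intact instead of overwriting the NaNs with 0.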

New Plotting and Visualization Tools

Reverse the order of the categories

Here we see that to read this alphabetically we have to read up from the bottom. Mathematically it makes sense to count along the y-axis starting from zero, but with words it seems a little bit strange. We can fix this in the figure layout. We just add:

yaxis={'categoryorder': 'category descending'}

We can also change ‘category descending’ to ‘category ascending’ if we want to force ascending order. And we could change yaxis to xaxis if we had categories along the x-axis.

Fixing category orders

Then we can fix the category order using the command

category_orders={'category name' : [write the order of the categories here]}

Try some new colors

We can add a built in color sequence using the command:

color_discrete_sequence=px.colors.qualitative.????

Here you can use tab completion to see what options you have for ????

Apply Templates

You use the command

template="template name"

Here are some built in options: [“plotly”, “plotly_white”, “plotly_dark”, “ggplot2”, “seaborn”, “simple_white”]

Moving the Legend

Plotly will let you place the legend exactly where you want it using the legend layout information:

legend={'orientation':"h",'yanchor':"bottom",'y':1.05, 'xanchor':"right",'x':0.95}
  • ‘orientation’:“h” - “h” means horizontal, “v” means vertical
  • xanchor and yanchor tell plotly which point of the legend the x and y coordinates refer to. So if we want to position our legend by its bottom-right corner we would use:
    • ‘yanchor’:“bottom”
    • ‘xanchor’:“right”
  • Then we can move the legend around using x and y:
    • ‘y’:1.05 - this would shift the legend up above the plot
    • ‘x’:0.95 - this would place the legend’s right edge just inside the right edge of the plot.

For the graph below I used the Religious Income Data from Day 9.

Plot with lots of options added
fig = px.histogram(DF_new,
             y='religion',
             x='proportion',
             color='income',
             barnorm='percent',
             color_discrete_sequence=px.colors.qualitative.Vivid,
             category_orders={'income' : ['Less than 30,000', '30,000-49,999', '50,000-99,999', '100,000 or more']})

fig.update_layout(template="plotly_white",
                  title='Income Distribution by religious group <br><sup> Data Source: Pew Research Center, Religious Landscape Study</sup>',
                  title_x=0.5,
                  yaxis={'categoryorder': 'category descending'},
                  xaxis_title="Proportion",
                  yaxis_title="",
                  legend_title='Income',
                  legend={'orientation':"h",'yanchor':"bottom",'y':-0.2, 'xanchor':"right",'x':1.05},
                  font={'family':"Times",'size':14,'color':"Darkblue"},
                  autosize=False,
                  width=800,
                  height=600)

fig.show()

Using Beautiful Soup to get HTML code

Websites are built using HTML code. That code tells the web browser (Firefox, Chrome, etc.) what to display. Websites range from very simple (just HTML) to much more complicated (JavaScript and more). When you load a website you can always see the source code.

This source code is what Beautiful Soup works with. For static (simple) sites the code is immediately available. More complicated sites might require Python to open the webpage, let the content render, and then download the code.

How to get data from static sites:

import requests
from bs4 import BeautifulSoup

website = 'name of your website'

raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')

Use developer tools!

Once you have the soup you can use .find_all() to search for things in the html code.

The .find_all() function searches through the information in soup and returns only the sections that match. Here are some important things you might search for:

  • ‘h2’ - this is a heading

  • ‘div’ - this divides a block of information

  • ‘span’ - this divides inline information

  • ‘a’ - this specifies a hyperlink

  • ‘li’ - this is a list item

  • class_= - many things have the class label (notice the underscore!)

  • string= - you can also search by strings.

Code Example

Let’s say I want the country information, here is an example of the code that contains the country information:

    <div class="col-md-4 country">
        <h3 class="country-name">
            <i class="flag-icon flag-icon-ad"></i>
            Andorra
        </h3>
        <div class="country-info">
            <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
            <strong>Population:</strong> <span class="country-population">84000</span><br>
            <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
        </div>
    </div>
Beautiful soup web scraping code
import requests
from bs4 import BeautifulSoup

# Get the data
website = 'https://www.scrapethissite.com/pages/simple/'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')

# Search the soup
result = soup.find_all('h3',class_="country-name")
DF = pd.DataFrame()
DF['country'] = result
DF['country'] = DF['country'].apply(lambda x: x.text.strip())
DF
country
0 Andorra
1 United Arab Emirates
2 Afghanistan
3 Antigua and Barbuda
4 Anguilla
... ...
245 Yemen
246 Mayotte
247 South Africa
248 Zambia
249 Zimbabwe

250 rows × 1 columns