Exam2 Python Commands Cheat Sheet

Basic Imports

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
from itables import show
```
It is useful to be able to enter your own data. This helps when using Python in your science classes and when redoing plots that you suspect misrepresent the data. Here is the process:

1. Create an empty data frame: `DF = pd.DataFrame()`
2. Enter the data in lists: `column1 = [-,-,-,-]`, `column2 = [-,-,-,-]`
3. Add the columns to the data frame: `DF['column1'] = column1`, `DF['column2'] = column2`
```python
df = pd.DataFrame()
names = ['Alice', 'Bob', 'Eve', 'Eve', 'Alice', 'Bob']
examnums = ['one', 'one', 'one', 'two', 'two', 'two']
grades = [92, 95, 70, 86, 90, 80]
df['name'] = names
df['exam'] = examnums
df['grade'] = grades
df
```
 | name | exam | grade |
---|---|---|---|
0 | Alice | one | 92 |
1 | Bob | one | 95 |
2 | Eve | one | 70 |
3 | Eve | two | 86 |
4 | Alice | two | 90 |
5 | Bob | two | 80 |
Sometimes it is nice to give your columns easier-to-use names. If your data is downloaded with hard-to-remember or hard-to-type column names, you are more likely to make errors using those names later. I usually try to remove spaces and shorten long names. We can use the .rename() function to do this:

```python
DF.rename( columns={ 'old name 1':'new name 1' , 'old name 2':'new name 2' , 'old name 3':'new name 3', ... }, inplace=True )
```

Here we use curly brackets {} to surround the names. You list the old name first, then a colon, then the new name. The flag inplace=True tells Python to change the data frame directly, not just in the printout. You can rename as many or as few columns as you want.
```python
df.rename( columns={'name' : 'student_name'}, inplace=True )
df
```
 | student_name | exam | grade |
---|---|---|---|
0 | Alice | one | 92 |
1 | Bob | one | 95 |
2 | Eve | one | 70 |
3 | Eve | two | 86 |
4 | Alice | two | 90 |
5 | Bob | two | 80 |
Often the data we are given is in a shape that is hard to use. To rearrange the data in our data frame we have two options: .pivot() and .melt().

Here I take the data frame above and I want the student names to be the index (index = the left-hand column), the exam numbers to appear as the columns (columns = the categorical data used to label each column), and the grades to be the values (values = the data inside of the data frame).
```python
df_new = df.pivot(index='student_name', columns='exam', values='grade')
df_new
```
exam | one | two |
---|---|---|
student_name | | |
Alice | 92 | 90 |
Bob | 95 | 80 |
Eve | 70 | 86 |
Here is some new data for exploring the pd.melt() function.
```python
data = {'Name': ['Alice', 'Bob', 'Eve'],
        'Math_2022': [85, 90, 78],
        'Math_2023': [92, 88, 95],
        'Science_2022': [70, 82, 75],
        'Science_2023': [75, 85, 80]}
df = pd.DataFrame(data)
df
```
 | Name | Math_2022 | Math_2023 | Science_2022 | Science_2023 |
---|---|---|---|---|---|
0 | Alice | 85 | 92 | 70 | 75 |
1 | Bob | 90 | 88 | 82 | 85 |
2 | Eve | 78 | 95 | 75 | 80 |
In this data we want to melt the column labels into the data frame so that they can become categorical information.
To do this we first choose Name as the id_vars (id_vars = the columns that will remain in the new data set); in this case we still want a column called Name. The other two arguments name the remaining columns. var_name is the name of the column that will hold the melted column labels; in this case the column ‘Subject_Year’ will contain the categorical data: Math_2022, Math_2023, Science_2022, Science_2023. value_name is the name of the column that will hold the data that was inside the old data frame; in this case the column ‘Score’ will hold the grades (numbers).
```python
df_new = pd.melt(df, id_vars=['Name'], var_name='Subject_Year', value_name='Score')
df_new
```
 | Name | Subject_Year | Score |
---|---|---|---|
0 | Alice | Math_2022 | 85 |
1 | Bob | Math_2022 | 90 |
2 | Eve | Math_2022 | 78 |
3 | Alice | Math_2023 | 92 |
4 | Bob | Math_2023 | 88 |
5 | Eve | Math_2023 | 95 |
6 | Alice | Science_2022 | 70 |
7 | Bob | Science_2022 | 82 |
8 | Eve | Science_2022 | 75 |
9 | Alice | Science_2023 | 75 |
10 | Bob | Science_2023 | 85 |
11 | Eve | Science_2023 | 80 |
When you find NaNs in your data, there are some really nice commands for dealing with them:

First, NaNs are a strange data type (np.nan): they are considered a float, like a decimal. In most raw data sets a NaN means no data was given for that observation and variable, but be careful: NaN can also appear if you do a calculation and accidentally divide by zero.
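A quick sanity check in plain NumPy confirms both of these facts (nothing here depends on our data frame):

```python
import numpy as np

# NaN is stored as a float
print(type(np.nan))        # <class 'float'>

# NaN is never equal to anything, not even itself
print(np.nan == np.nan)    # False

# An invalid division produces NaN (NumPy warns instead of crashing)
print(np.float64(0.0) / np.float64(0.0))   # nan
```

This "not equal to itself" behavior is why we cannot search for NaNs with a normal comparison mask, as shown further down.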
.isna() creates a mask for whether or not there is a NaN in each row of the data.
.fillna() will replace each NaN in your data set with whatever you put inside the parentheses.
.dropna() will drop all rows that contain a NaN. Be careful with this command: you want to keep as much data as possible, and .dropna() might delete too much!
I will add some NaNs to our data frame above
```python
# Replace low scores with NaN (.loc avoids pandas' chained-assignment warning)
for i, s in enumerate(df_new['Score']):
    if s <= 80:
        df_new.loc[i, 'Score'] = np.nan
df_new
```
 | Name | Subject_Year | Score |
---|---|---|---|
0 | Alice | Math_2022 | 85.0 |
1 | Bob | Math_2022 | 90.0 |
2 | Eve | Math_2022 | NaN |
3 | Alice | Math_2023 | 92.0 |
4 | Bob | Math_2023 | 88.0 |
5 | Eve | Math_2023 | 95.0 |
6 | Alice | Science_2022 | NaN |
7 | Bob | Science_2022 | 82.0 |
8 | Eve | Science_2022 | NaN |
9 | Alice | Science_2023 | NaN |
10 | Bob | Science_2023 | 85.0 |
11 | Eve | Science_2023 | NaN |
```python
# This creates a mask of whether or not an entry has a NaN
print(df_new['Score'].isna())
print('-------------------------')
# We can add these up to count the NaNs
print(sum(df_new['Score'].isna()))
```
```
0     False
1     False
2      True
3     False
4     False
5     False
6      True
7     False
8      True
9      True
10    False
11     True
Name: Score, dtype: bool
-------------------------
5
```
```python
df_no_na = df_new.dropna()
df_no_na
```
 | Name | Subject_Year | Score |
---|---|---|---|
0 | Alice | Math_2022 | 85.0 |
1 | Bob | Math_2022 | 90.0 |
3 | Alice | Math_2023 | 92.0 |
4 | Bob | Math_2023 | 88.0 |
5 | Eve | Math_2023 | 95.0 |
7 | Bob | Science_2022 | 82.0 |
10 | Bob | Science_2023 | 85.0 |
```python
# Notice that this does not work! NaN is not equal to anything, even itself,
# so 'Score' != np.nan is True for every single row.
print(df_new['Score'] != np.nan)
print('-------------------------')
# I prefer to fill the NaNs with something I can search for and then build a normal mask
df_new['Score'].fillna(0, inplace=True)
print(df_new['Score'] != 0)
print('-------------------------')
# Here is the code for the mask
mask = df_new['Score'] != 0
df_new[mask]
```
```
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
Name: Score, dtype: bool
-------------------------
0      True
1      True
2     False
3      True
4      True
5      True
6     False
7      True
8     False
9     False
10     True
11    False
Name: Score, dtype: bool
-------------------------
```
 | Name | Subject_Year | Score |
---|---|---|---|
0 | Alice | Math_2022 | 85.0 |
1 | Bob | Math_2022 | 90.0 |
3 | Alice | Math_2023 | 92.0 |
4 | Bob | Math_2023 | 88.0 |
5 | Eve | Math_2023 | 95.0 |
7 | Bob | Science_2022 | 82.0 |
10 | Bob | Science_2023 | 85.0 |
Here we see that to read this alphabetically we have to read up from the bottom. Mathematically it makes sense to count along the y-axis starting from zero, but with words it seems a little bit strange. We can fix this in the figure layout. We just add:

```python
yaxis={'categoryorder': 'category descending'}
```

We can also change 'category descending' to 'category ascending' if we want to force ascending order. And we could change yaxis to xaxis if we had categories along the x-axis.
Then we can fix the category order using the command

```python
category_orders={'category name' : [write the order of the categories here]}
```
We can add a built-in color sequence using the command:

```python
color_discrete_sequence=px.colors.qualitative.????
```

Here you can use tab completion to see what options you have for ????.
You use the command

```python
template="template name"
```
Here are some built in options: [“plotly”, “plotly_white”, “plotly_dark”, “ggplot2”, “seaborn”, “simple_white”]
Plotly will let you place the legend exactly where you want it using the legend layout information:

```python
legend={'orientation':"h", 'yanchor':"bottom", 'y':1.05, 'xanchor':"right", 'x':0.95}
```
For the graph below I used the Religious Income Data from Day 9.
```python
fig = px.histogram(DF_new,
                   y='religion',
                   x='proportion',
                   color='income',
                   barnorm='percent',
                   color_discrete_sequence=px.colors.qualitative.Vivid,
                   category_orders={'income' : ['Less than 30,000', '30,000-49,999', '50,000-99,999', '100,000 or more']})

fig.update_layout(template="plotly_white",
                  title='Income Distribution by religious group <br><sup> Data Source: Pew Research Center, Religious Landscape Study</sup>',
                  title_x=0.5,
                  yaxis={'categoryorder': 'category descending'},
                  xaxis_title="Proportion",
                  yaxis_title="",
                  legend_title='Income',
                  legend={'orientation':"h", 'yanchor':"bottom", 'y':-0.2, 'xanchor':"right", 'x':1.05},
                  font={'family':"Times", 'size':14, 'color':"Darkblue"},
                  autosize=False,
                  width=800,
                  height=600)

fig.show()
```
Websites are built using HTML code. That code tells the web browser (Firefox, Chrome, etc.) what to display. Websites range from very simple (just HTML) to much more complicated (JavaScript and more). When you load a website you can always view the source code.

This source code is what we download and hand to Beautiful Soup. For static (simple) sites the code is immediately available. More complicated sites might require Python to open the webpage, let the content render, and then download the code.
```python
import requests
from bs4 import BeautifulSoup

website = 'name of your website'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')
```
Once you have the soup you can use .find_all() to search for things in the HTML code.

The .find_all() function searches through the information in soup and returns only the sections that match. Here are some important things you might search for:

`'h2'` - a heading
`'div'` - divides a block of information
`'span'` - divides inline information
`'a'` - specifies a hyperlink
`'li'` - a list item
`class_=` - many tags carry a class label (notice the underscore!)
`string=` - you can also search by the text a tag contains
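Here is a tiny self-contained sketch of the last two search styles, run on a hand-written snippet of HTML (made up for illustration, not from a real site):

```python
from bs4 import BeautifulSoup

# A hypothetical snippet of HTML to practice on
html_doc = '<ul><li class="item">apple</li><li class="item">banana</li><li>cherry</li></ul>'
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all('li', class_="item"))   # matches the two <li class="item"> tags
print(soup.find_all(string="banana"))       # matches the text 'banana'
```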
Let’s say I want the country information. Here is an example of the HTML that contains the country information:

```html
<div class="col-md-4 country">
  <h3 class="country-name">
    <i class="flag-icon flag-icon-ad"></i>
    Andorra
  </h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
    <strong>Population:</strong> <span class="country-population">84000</span><br>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
  </div>
</div>
```
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Get the data
website = 'https://www.scrapethissite.com/pages/simple/'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')

# Search the soup
result = soup.find_all('h3', class_="country-name")

# Put the matches into a data frame and strip the surrounding whitespace
DF = pd.DataFrame()
DF['country'] = result
DF['country'] = DF['country'].apply(lambda x: x.text.strip())
DF
```
 | country |
---|---|
0 | Andorra |
1 | United Arab Emirates |
2 | Afghanistan |
3 | Antigua and Barbuda |
4 | Anguilla |
... | ... |
245 | Yemen |
246 | Mayotte |
247 | South Africa |
248 | Zambia |
249 | Zimbabwe |
250 rows × 1 columns