# Some basic package imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
Intermediate Data Science
Pandas - Advanced Topics
Welcome to Intermediate Data Science!
Important Information
- Email: joanna_bieri@redlands.edu
- Office Hours take place in Duke 209 – Office Hours Schedule
- Class Website
- Syllabus
Review of DATA 101
In completing the review of DATA 101 you have already started working with Pandas! You actually know a lot about how Pandas works from our last class. The goal for today is to dig a bit deeper into the type of data that is stored in a Pandas DataFrame and to see some of the main capabilities of Pandas.
Data Structures in Pandas
The fundamental data types in Pandas are:
- Series: a one-dimensional object containing a sequence of values of the same type and an associated array of data labels.
- DataFrame: a rectangular table of data that contains an ordered, named collection of columns, each of which can have a different data type. It has both a row and a column index and can be thought of as a dictionary of Series (a short sketch follows below).
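As a quick illustration of the "dictionary of Series" idea, here is a minimal sketch (the state names and column labels are made up for the example):
pop_series = pd.Series([1.5, 2.4], index=["Ohio", "Nevada"])        # one column of data
area_series = pd.Series([44825, 110572], index=["Ohio", "Nevada"])  # a second column with the same row labels
sketch_df = pd.DataFrame({"pop": pop_series, "area": area_series})  # the DataFrame aligns the Series by index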
Series Basics
example_series = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
example_series
d 4
b 7
a -5
c 3
dtype: int64
# You can get the labels
example_series.index
Index(['d', 'b', 'a', 'c'], dtype='object')
# Getting values from the series using the labels
print(example_series['a'])
# Getting multiple values
print(example_series[['a','b']])
-5
a -5
b 7
dtype: int64
# You can grab all the data
example_series.array
<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64
# You can send it to a dictionary
example_series.to_dict()
{'d': 4, 'b': 7, 'a': -5, 'c': 3}
# You can filter the data and use boolean tests
# Show only positive values
example_series[example_series>0]
d 4
b 7
c 3
dtype: int64
# Check index for membership
'b' in example_series
True
# If you have a dictionary you can send it into a series.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
example_series = pd.Series(sdata)
example_series
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
You Try:
Using series methods and python code:
- Get a list of the states that are the index.
- Check if Idaho is in the index.
- Get just the states with numbers less than 20,000.
- Run the provided code and explain the results. Why do we end up with NaN and what does NaN mean?
# Your code here 1
# Your code here 2
# Your code here 3
# Explain the results 4
states = ["California", "Ohio", "Oregon", "Texas"]
new_series = pd.Series(sdata, index=states)
new_series
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
Enter your words here:
Series NaNs
new_series.isna()
California True
Ohio False
Oregon False
Texas False
dtype: bool
new_series.notna()
California False
Ohio True
Oregon True
Texas True
dtype: bool
# Drop the NA data one time:
new_series.dropna()
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
# Notice that the original data is still in memory
new_series
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
# Drop the NA data from memory
new_series.dropna(inplace=True)
# Now California is no longer there!
new_series
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
DataFrames - Basics
Most of the time we will be working with DataFrames! Dictionaries work really well for defining new DataFrames, and you can access information using the column labels/keys or the row indexes.
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
"year": [2000, 2001, 2002, 2001, 2002, 2003],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data)
df
|  | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
df.keys()
Index(['state', 'year', 'pop'], dtype='object')
# Notice that each column is a Series with the index being whatever the row labels are
print(type(df['year']))
df['year']
<class 'pandas.core.series.Series'>
0 2000
1 2001
2 2002
3 2001
4 2002
5 2003
Name: year, dtype: int64
# If your column labels are well behaved you can access the data as a dot attribute
df.year
0 2000
1 2001
2 2002
3 2001
4 2002
5 2003
Name: year, dtype: int64
# Getting multiple columns
columns = ['state','pop']
df[columns]
|  | state | pop |
|---|---|---|
| 0 | Ohio | 1.5 |
| 1 | Ohio | 1.7 |
| 2 | Ohio | 3.6 |
| 3 | Nevada | 2.4 |
| 4 | Nevada | 2.9 |
| 5 | Nevada | 3.2 |
One of the first things you will want to do is see what your data looks like and get basic information about the data frame. This is especially true when you read in big data sets.
df.head() # first five rows
|  | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
df.tail() # last five rows
|  | state | year | pop |
|---|---|---|---|
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
df.shape # get the number of (rows, columns)
(6, 3)
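Two other quick checks are worth knowing at this point (a hedged aside, not one of the original cells): info() summarizes the column types and non-null counts, and describe() gives summary statistics for the numeric columns (it appears again later in these notes).
df.info()       # column names, dtypes, non-null counts, and memory usage
df.describe()   # count, mean, std, min, quartiles, and max for numeric columns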
DataFrames - Data Manipulation
# We can add new columns! Just give the column a name and some data!
# Here I am using the numpy random sample command.
df['debt'] = np.random.random_sample(len(df))
df
|  | state | year | pop | debt |
|---|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 | 0.926616 |
| 1 | Ohio | 2001 | 1.7 | 0.247992 |
| 2 | Ohio | 2002 | 3.6 | 0.258766 |
| 3 | Nevada | 2001 | 2.4 | 0.916836 |
| 4 | Nevada | 2002 | 2.9 | 0.388147 |
| 5 | Nevada | 2003 | 3.2 | 0.302799 |
# We can delete a column
del df['debt']
df
|  | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
You Try
- How does the following command work?
- See if you can add a column that checks whether the year is greater than 2001.
# How does this work 1
df['eastern'] = df['state'] == 'Ohio'
df
|  | state | year | pop | eastern |
|---|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 | True |
| 1 | Ohio | 2001 | 1.7 | True |
| 2 | Ohio | 2002 | 3.6 | True |
| 3 | Nevada | 2001 | 2.4 | False |
| 4 | Nevada | 2002 | 2.9 | False |
| 5 | Nevada | 2003 | 3.2 | False |
Enter your words here:
# Your code here 2
DataFrames - Functionality
There are lots of things you will want to be able to do to make your data cleaner and more readable. This is just a short list of some of the most popular commands and operations.
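One cleaning step that is not shown below but comes up constantly is renaming labels. A minimal hedged sketch, using the state/year/pop frame from the previous section (the new names are only examples):
df.rename(columns={'state': 'State', 'pop': 'Population'})  # returns a relabeled copy; the original df is unchanged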
# Reindexing updates the index (row labels)
df = pd.DataFrame(np.arange(9).reshape((3, 3)),
index=["a", "c", "d"],
columns=["Ohio", "Texas", "California"])
df
|  | Ohio | Texas | California |
|---|---|---|---|
| a | 0 | 1 | 2 |
| c | 3 | 4 | 5 |
| d | 6 | 7 | 8 |
# Notice that the original index is missing "b" - reindexing adds it and fills it with NaN
df = df.reindex(index=["a", "b", "c", "d"])
df
|  | Ohio | Texas | California |
|---|---|---|---|
| a | 0.0 | 1.0 | 2.0 |
| b | NaN | NaN | NaN |
| c | 3.0 | 4.0 | 5.0 |
| d | 6.0 | 7.0 | 8.0 |
# We can also transpose the data so the columns become the index
df = df.T
df
|  | a | b | c | d |
|---|---|---|---|---|
| Ohio | 0.0 | NaN | 3.0 | 6.0 |
| Texas | 1.0 | NaN | 4.0 | 7.0 |
| California | 2.0 | NaN | 5.0 | 8.0 |
# Once you have your index set you can get the row information
# .loc locates the row by the index name
df.loc['Ohio']
a 0.0
b NaN
c 3.0
d 6.0
Name: Ohio, dtype: float64
# .iloc uses an integer to find the location
df.iloc[0]
a 0.0
b NaN
c 3.0
d 6.0
Name: Ohio, dtype: float64
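Both selectors also slice ranges of rows. A small hedged sketch on the transposed frame above: .loc label slices include the endpoint, while .iloc integer slices do not.
df.loc['Ohio':'Texas']   # rows Ohio and Texas by label (endpoint included)
df.iloc[0:2]             # the same two rows by position (endpoint excluded)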
# You can drop rows
df = df.drop('Ohio')
df
|  | a | b | c | d |
|---|---|---|---|---|
| Texas | 1.0 | NaN | 4.0 | 7.0 |
| California | 2.0 | NaN | 5.0 | 8.0 |
# You can drop columns
df = df.drop(columns='d')
df
|  | a | b | c |
|---|---|---|---|
| Texas | 1.0 | NaN | 4.0 |
| California | 2.0 | NaN | 5.0 |
# You can drop NaNs but be careful it drops every row that has a NaN
df = df.dropna()
df
|  | a | b | c |
|---|---|---|---|
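If dropping every row with a NaN is too aggressive, two gentler options are worth knowing. A hedged sketch, where df_with_gaps is just a hypothetical stand-in for a frame that still has its missing values:
df_with_gaps = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})  # hypothetical data with NaNs
df_with_gaps.dropna(how='all')  # drop only rows where every value is NaN
df_with_gaps.fillna(0)          # or keep the rows and fill the gaps instead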
DataFrames - Filtering and Masking
df = pd.DataFrame(np.arange(9).reshape((3, 3)),
index=["a", "c", "d"],
columns=["Ohio", "Texas", "California"])
df
|  | Ohio | Texas | California |
|---|---|---|---|
| a | 0 | 1 | 2 |
| c | 3 | 4 | 5 |
| d | 6 | 7 | 8 |
# Boolean comparison: are the entries greater than three?
df>3
|  | Ohio | Texas | California |
|---|---|---|---|
| a | False | False | False |
| c | False | True | True |
| d | True | True | True |
# Return data frame values
df[df>3]
|  | Ohio | Texas | California |
|---|---|---|---|
| a | NaN | NaN | NaN |
| c | NaN | 4.0 | 5.0 |
| d | 6.0 | 7.0 | 8.0 |
# Set values based on comparison
df[df<3] = 0
df
|  | Ohio | Texas | California |
|---|---|---|---|
| a | 0 | 0 | 0 |
| c | 3 | 4 | 5 |
| d | 6 | 7 | 8 |
# Masking returns only the rows where the condition is True
mask = df['Texas']>3
df[mask]
|  | Ohio | Texas | California |
|---|---|---|---|
| c | 3 | 4 | 5 |
| d | 6 | 7 | 8 |
# Get a specific row and column combination
df.loc['a',['Texas','California']]
Texas 0
California 0
Name: a, dtype: int64
You Try
- Get rows a and d for just Ohio and Texas.
# Your code here
DataFrames - Arithmetic
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4], "A": [5, 6]})
df1
|  | A |
|---|---|
| 0 | 1 |
| 1 | 2 |
df2
|  | B | A |
|---|---|---|
| 0 | 3 | 5 |
| 1 | 4 | 6 |
# Basic operations are component-wise
df1*2
|  | A |
|---|---|
| 0 | 2 |
| 1 | 4 |
# Division works too
1/df1
|  | A |
|---|---|
| 0 | 1.0 |
| 1 | 0.5 |
# Adding two data frames works, but only on columns that have the same label
df1+df2
|  | A | B |
|---|---|---|
| 0 | 6 | NaN |
| 1 | 8 | NaN |
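If those NaN entries are not what you want, the method form of the operator accepts a fill value (a minimal sketch, not part of the original cells):
df1.add(df2, fill_value=0)  # missing entries are treated as 0 before adding, so column B survives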
DataFrames - Functions
Often we want to apply a special function to parts of our DataFrame. We can do this by calling NumPy functions directly or by passing our own functions to apply!
df = pd.DataFrame(np.random.standard_normal((4, 3)),
columns=list("bde"),
index=["Utah", "Ohio", "Texas", "Oregon"])df| b | d | e | |
|---|---|---|---|
| Utah | -2.199103 | 0.205690 | -1.908214 |
| Ohio | 2.281752 | -2.050707 | 0.006077 |
| Texas | 0.417250 | -0.500769 | -0.658412 |
| Oregon | -0.213584 | -0.595468 | 0.681829 |
# Many simple functions can be applied directly
np.abs(df)
|  | b | d | e |
|---|---|---|---|
| Utah | 2.199103 | 0.205690 | 1.908214 |
| Ohio | 2.281752 | 2.050707 | 0.006077 |
| Texas | 0.417250 | 0.500769 | 0.658412 |
| Oregon | 0.213584 | 0.595468 | 0.681829 |
# We can define our own functions and apply them
def max_min(x):
return x.max()-x.min()
# Now apply this to each of the columns
df.apply(max_min)
b 4.480856
d 2.256396
e 2.590043
dtype: float64
# Apply it to each of the rows
df.apply(max_min, axis='columns')
Utah 2.404793
Ohio 4.332459
Texas 1.075662
Oregon 1.277297
dtype: float64
# You can also use a lambda to do this - a quick way to define a one-time-use function.
df.apply(lambda x: x.max()-x.min())
b 4.480856
d 2.256396
e 2.590043
dtype: float64
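apply works a column (or row) at a time. If you want a function applied to every individual element instead, there is an element-wise counterpart; a hedged sketch (DataFrame.map is the name in pandas 2.1+, older versions use applymap):
df.map(lambda x: f"{x:.2f}")  # format every entry as a 2-decimal string (use df.applymap on older pandas)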
DataFrames - Sorting and Grouping
df
|  | b | d | e |
|---|---|---|---|
| Utah | -2.199103 | 0.205690 | -1.908214 |
| Ohio | 2.281752 | -2.050707 | 0.006077 |
| Texas | 0.417250 | -0.500769 | -0.658412 |
| Oregon | -0.213584 | -0.595468 | 0.681829 |
df.sort_values(by='b', ascending=False)
|  | b | d | e |
|---|---|---|---|
| Ohio | 2.281752 | -2.050707 | 0.006077 |
| Texas | 0.417250 | -0.500769 | -0.658412 |
| Oregon | -0.213584 | -0.595468 | 0.681829 |
| Utah | -2.199103 | 0.205690 | -1.908214 |
# Multi Sort
df.sort_values(by=['b','e'], ascending=False)
|  | b | d | e |
|---|---|---|---|
| Ohio | 2.281752 | -2.050707 | 0.006077 |
| Texas | 0.417250 | -0.500769 | -0.658412 |
| Oregon | -0.213584 | -0.595468 | 0.681829 |
| Utah | -2.199103 | 0.205690 | -1.908214 |
# You can also sort by the index values
df.sort_index()
|  | b | d | e |
|---|---|---|---|
| Ohio | 2.281752 | -2.050707 | 0.006077 |
| Oregon | -0.213584 | -0.595468 | 0.681829 |
| Texas | 0.417250 | -0.500769 | -0.658412 |
| Utah | -2.199103 | 0.205690 | -1.908214 |
df.sort_index(ascending=False)
|  | b | d | e |
|---|---|---|---|
| Utah | -2.199103 | 0.205690 | -1.908214 |
| Texas | 0.417250 | -0.500769 | -0.658412 |
| Oregon | -0.213584 | -0.595468 | 0.681829 |
| Ohio | 2.281752 | -2.050707 | 0.006077 |
df = pd.DataFrame(np.random.standard_normal((4, 3)),
columns=list("bde"),
index=["Utah", "Ohio", "Texas", "Oregon"])
df['location'] = ['West', 'Midwest', 'South', 'West']
df
|  | b | d | e | location |
|---|---|---|---|---|
| Utah | -0.284299 | -2.377612 | -0.466475 | West |
| Ohio | -0.333410 | 0.946558 | 0.844442 | Midwest |
| Texas | 0.171615 | -0.677120 | -0.176359 | South |
| Oregon | 1.135900 | -0.358185 | 1.003728 | West |
# You can use groupby for categorical data
groups = df.groupby('location')
for g in groups:
print(g[0])
display(g[1])
Midwest
South
West
|  | b | d | e | location |
|---|---|---|---|---|
| Ohio | -0.33341 | 0.946558 | 0.844442 | Midwest |
|  | b | d | e | location |
|---|---|---|---|---|
| Texas | 0.171615 | -0.67712 | -0.176359 | South |
|  | b | d | e | location |
|---|---|---|---|---|
| Utah | -0.284299 | -2.377612 | -0.466475 | West |
| Oregon | 1.135900 | -0.358185 | 1.003728 | West |
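Iterating over the groups is handy for inspection, but most of the time a groupby is followed by an aggregation. A minimal hedged sketch using the frame above:
df.groupby('location')[['b','d','e']].mean()  # one row per location, with column means within each group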
DataFrames - Quick Descriptive Statistics
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=["a", "b", "c", "d"],
columns=["one", "two"])
df
|  | one | two |
|---|---|---|
| a | 1.40 | NaN |
| b | 7.10 | -4.5 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |
# Reduction methods: min, max, sum, mean - require at least one non NaN value
# It skips NaN values - and this can be misleading
df.sum(axis='index') # index is default
one 9.25
two -5.80
dtype: float64
df.sum(axis='columns')
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
df.sum(axis='columns', skipna=False)
a NaN
b 2.60
c NaN
d -0.55
dtype: float64
df.describe()
|  | one | two |
|---|---|---|
| count | 3.000000 | 2.000000 |
| mean | 3.083333 | -2.900000 |
| std | 3.493685 | 2.262742 |
| min | 0.750000 | -4.500000 |
| 25% | 1.075000 | -3.700000 |
| 50% | 1.400000 | -2.900000 |
| 75% | 4.250000 | -2.100000 |
| max | 7.100000 | -1.300000 |
# Let's read in some data and look at some statistics
price = pd.read_pickle("data/yahoo_price.pkl")
volume = pd.read_pickle("data/yahoo_volume.pkl")
price.head()
|  | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| Date |  |  |  |  |
| 2010-01-04 | 27.990226 | 313.062468 | 113.304536 | 25.884104 |
| 2010-01-05 | 28.038618 | 311.683844 | 111.935822 | 25.892466 |
| 2010-01-06 | 27.592626 | 303.826685 | 111.208683 | 25.733566 |
| 2010-01-07 | 27.541619 | 296.753749 | 110.823732 | 25.465944 |
| 2010-01-08 | 27.724725 | 300.709808 | 111.935822 | 25.641571 |
volume.head()
|  | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| Date |  |  |  |  |
| 2010-01-04 | 123432400 | 3927000 | 6155300 | 38409100 |
| 2010-01-05 | 150476200 | 6031900 | 6841400 | 49749600 |
| 2010-01-06 | 138040000 | 7987100 | 5605300 | 58182400 |
| 2010-01-07 | 119282800 | 12876600 | 5840600 | 50559700 |
| 2010-01-08 | 111902700 | 9483900 | 4197200 | 51197400 |
# Let's apply a built-in time series function to this data
# This gives us the percent change in the price from one date to the next
returns = price.pct_change()
returns.head()
|  | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| Date |  |  |  |  |
| 2010-01-04 | NaN | NaN | NaN | NaN |
| 2010-01-05 | 0.001729 | -0.004404 | -0.012080 | 0.000323 |
| 2010-01-06 | -0.015906 | -0.025209 | -0.006496 | -0.006137 |
| 2010-01-07 | -0.001849 | -0.023280 | -0.003462 | -0.010400 |
| 2010-01-08 | 0.006648 | 0.013331 | 0.010035 | 0.006897 |
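For reference, pct_change computes (today - yesterday) / yesterday for each column. A hedged sanity check, not in the original cells, computes the same quantity by hand with shift:
manual_returns = price / price.shift(1) - 1  # same calculation written out explicitly
manual_returns.head()                        # should match returns.head() above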
returns.describe()
|  | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| count | 1713.000000 | 1713.000000 | 1713.000000 | 1713.000000 |
| mean | 0.000972 | 0.000671 | 0.000236 | 0.000595 |
| std | 0.016641 | 0.015830 | 0.012102 | 0.014667 |
| min | -0.123558 | -0.083775 | -0.082790 | -0.113995 |
| 25% | -0.007516 | -0.006904 | -0.006049 | -0.007376 |
| 50% | 0.000886 | 0.000270 | 0.000234 | 0.000312 |
| 75% | 0.010422 | 0.008462 | 0.006806 | 0.008162 |
| max | 0.088741 | 0.160524 | 0.056652 | 0.104522 |
# Is the percent change in AAPL correlated with IBM
returns['AAPL'].corr(returns['IBM'])
np.float64(0.3868174361139099)
# We might ask if any of these returns are correlated or which are most highly correlated
returns.corr()
|  | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| AAPL | 1.000000 | 0.407919 | 0.386817 | 0.389695 |
| GOOG | 0.407919 | 1.000000 | 0.405099 | 0.465919 |
| IBM | 0.386817 | 0.405099 | 1.000000 | 0.499764 |
| MSFT | 0.389695 | 0.465919 | 0.499764 | 1.000000 |
# We might want to look at a covariance matrix
returns.cov()
|  | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| AAPL | 0.000277 | 0.000107 | 0.000078 | 0.000095 |
| GOOG | 0.000107 | 0.000251 | 0.000078 | 0.000108 |
| IBM | 0.000078 | 0.000078 | 0.000146 | 0.000089 |
| MSFT | 0.000095 | 0.000108 | 0.000089 | 0.000215 |
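Correlations against a second DataFrame are also a one-liner. As a hedged aside, corrwith matches columns up by label, so with the volume frame loaded above:
returns.corrwith(volume)  # correlation of each ticker's daily return with its trading volume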
THERE ARE LOTS OF FUNCTIONS PROVIDED! TRY . [tab]!
DataFrames - Value Counts and Membership
df = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
"Qu2": [2, 3, 1, 2, 3],
"Qu3": [1, 5, 2, 4, 4]})
df
|  | Qu1 | Qu2 | Qu3 |
|---|---|---|---|
| 0 | 1 | 2 | 1 |
| 1 | 3 | 3 | 5 |
| 2 | 4 | 1 | 2 |
| 3 | 3 | 2 | 4 |
| 4 | 4 | 3 | 4 |
# Quickly count observations
df['Qu1'].value_counts().sort_index()
Qu1
1 1
3 2
4 2
Name: count, dtype: int64
# Look for Unique values
df['Qu1'].unique()
array([1, 3, 4])
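To get counts for every column at once, one common trick (a hedged sketch, not one of the cells above) is to apply value_counts column by column:
df.apply(pd.Series.value_counts).fillna(0)  # rows are the observed values; one column of counts per question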
Day 2 - Homework - Quick analysis!
Your goal is to do a quick analysis of the Titanic data! You can answer any questions that you find interesting but here are some things to start with:
- How many variables and observations? Which are Numerical/Categorical?
- Do any of the columns have NaNs in them?
- How many passengers survived?
- Is survival correlated with Fare?
- How many passengers were alone vs. traveling with family?
- Were people traveling alone more or less likely to survive?
and so on… see if you can come up with some questions of your own!
# !conda install -y kagglehub
import kagglehub
# Download latest version
path = kagglehub.dataset_download("yasserh/titanic-dataset")
print("Path to dataset files:", path)Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.8), please consider upgrading to the latest version (0.3.13).
Path to dataset files: /home/bellajagu/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1
Variable Notes
- PassengerId: Unique ID of the passenger
- Survived: Survived (1) or died (0)
- Pclass: Passenger's class (1st, 2nd, or 3rd)
- Name: Passenger's name
- Sex: Passenger's sex
- Age: Passenger's age
- SibSp: Number of siblings/spouses aboard the Titanic
- Parch: Number of parents/children aboard the Titanic
- Ticket: Ticket number
- Fare: Fare paid for the ticket
- Cabin: Cabin number
- Embarked: Where the passenger got on the ship (C = Cherbourg, S = Southampton, Q = Queenstown)
file = '/home/bellajagu/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1/Titanic-Dataset.csv'
df = pd.read_csv(file)
df
|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns