Titanic: Machine Learning from Disaster¶
Kaggle Competition¶
Competition Description¶
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Data Ingestion¶
Let's start out by loading the data. The Kaggle competition supplies the training and test data in two .csv files. Download the data and point Pandas' read_csv at the file locations. After loading the data, I like to look at the first few lines to get an idea of the schema.
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns #visualizations
import scipy as scipy
trainpath = "/train.csv"
testpath = "/test.csv"
traindf = pd.read_csv(trainpath)
testdf = pd.read_csv(testpath)
traindf.head(5)
As you can see, some of the feature names could be confusing if there wasn't any other documentation. Thankfully the Kaggle competition contained a data dictionary that explains what each feature represents.
FEATURE DESCRIPTIONS:¶
Feature | Description |
---|---|
survived | Survival (0 = No; 1 = Yes) |
pclass | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) |
name | Name |
sex | Sex |
age | Age |
sibsp | Number of Siblings/Spouses Aboard |
parch | Number of Parents/Children Aboard |
ticket | Ticket Number |
fare | Passenger Fare |
cabin | Cabin |
embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
SPECIAL NOTES: Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.
Age is in years and is fractional if less than one (1). If the age is estimated, it is in the form xx.5.
With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiancés Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.
Initial Exploration¶
After loading up a new dataset it is often helpful to do some initial exploration of the data to get an idea of how "clean" the data is and note any obvious patterns. One of the easiest ways to do this in Pandas is through the .describe method. The describe method generates various summary statistics, excluding any missing/null values.
traindf.describe()
testdf.describe()
By simply running the describe method we can learn quite a bit about our training and test sets. First, we can tell that our training set is about twice as large as our test set. Second, by looking at the "counts" we know that there are missing values that will need to be addressed in order to run any classifier algorithms. We can also get an idea of the average age, fare, and family size of the passengers.
Let's create some initial visualizations of our training set to see if any obvious patterns of survival emerge:
patagonia_colors = ["#222343", "#bbca42", "#7abbdb", "#eca935", "#ec7869", "#332212"]
current_palette = sns.color_palette(patagonia_colors)
sns.palplot(current_palette)
sns.set_palette(current_palette)
sns.set_context("talk")
sns.set_style("white")
sns.set_style("ticks")
Now that we have our plots formatted, let's get started by first comparing how many passengers in our training data survived.
sns.countplot(x="Survived", data=traindf, color="#222343")
sns.despine(bottom=True)
As expected, there are more people who died than survived in the training data. Let's look a little deeper to see if there are any features that increase a passenger's chances of surviving. Even if everything you know about the Titanic disaster comes from the movie, you would know that they tried to save women and children first. Let's take a look at the number of survivors by gender and age:
sns.barplot(x="Sex", y="Survived", capsize=.25, color="#222343", errcolor="#bbca42", data=traindf)
sns.despine(bottom=True)
The plot above shows the percentage of male and female passengers in the training set that survived. We can see that being female drastically increases a passenger's survival chances (~75% of females live vs. ~20% of males). Let's see how age affects survival within the two sexes:
sns.swarmplot(x="Age", y="Sex", hue="Survived", data=traindf)
sns.despine(trim=True, left=True)
The swarmplot above is a useful way of looking at data when you want to see the individual observations along with some representation of the underlying distribution. This plot is particularly useful for examining this dataset, because it allows us to quickly determine survival rate by both Age and Sex. For example, we can see that the chances of a male surviving are very low when compared to a female. However, this does not apply to young males: it looks like being a child (no matter the Sex) drastically increased a passenger's chances of survival. We can also see that there are more males than females in our dataset and that the distribution of ages is older for males than females. By looking at the proportion of green to blue dots, it is easy to see why a simple gender model scores so well for this challenge!
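That gender-only baseline takes just a couple of lines; here is a minimal sketch, assuming testdf is loaded as above (the output filename is arbitrary):
#Gender-only baseline: predict survival for every female, death for every male
gender_model = pd.DataFrame(testdf["PassengerId"])
gender_model["Survived"] = (testdf["Sex"] == "female").astype(int)
gender_model.to_csv("gender_model.csv", index=False)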
#Lines on markers represent 95% ci
sns.pointplot(x="Pclass", y="Survived", hue="Sex", palette=["#222343","#ec7869"], order=[3,2,1], data=traindf)
sns.despine(bottom=True, right=False)
The slopegraph above shows how the average survival rate changes with Passenger Class. From the graph, it appears that being a female in either first or second class all but guaranteed your survival. The survival rate of females in third class is about half that of females in first or second class. However, females in third class still have more than double the survival rate of males in third class. While male survival does improve as you move from third class up to first, the likelihood of a male surviving is still below 50%.
Data Wrangling / Munging¶
We know from running the describe method earlier that there are some missing values in our dataset that need to be addressed. Let's look at those missing values a little closer:
print traindf.isnull().sum()
print "\n"
print testdf.isnull().sum()
combined = traindf.append(testdf)
combined.describe()
combined.tail(5)
combined.isnull().sum()
Because the training dataset is relatively small, I combined the training and test datasets into a new "combined" dataset that will be used to impute the missing values. I believe this gives the best chance at an accurate result for filling in the missing values. This step may not be necessary for problems with a large training set.
From the table above, we can see that the Cabin feature has the highest number of missing values. Because this is a categorical variable, I decided to simply create a new category, "U", for unknown. I didn't see the importance of noting the individual room of each passenger, so to simplify the feature I took just the first letter of the Cabin, which should indicate the deck of the ship where the passenger's room was located. The code below fills missing values with "U" and then maps a lambda function that replaces each cabin entry with its first letter, e.g. "C105" becomes "C".
#Replaces value of missing cabin with U for unknown
combined.Cabin.fillna("U",inplace=True)
# mapping each Cabin value with the cabin letter
combined['Cabin'] = combined['Cabin'].map(lambda c : c[0])
combined.tail(5)
Now that we have our Cabin feature completed, let's see what the survival rate of passengers is broken down by cabin:
sns.barplot(x="Cabin",y="Survived", capsize=.25, color="#222343", errcolor="#bbca42", data = combined)
sns.despine(trim=True, bottom=True)
From the graph above it looks like the survival rate of known Cabins is about twice as high as that of the unknown ("U") cabin. Furthermore, the survival rate of known cabins looks to be somewhere between .5 and .8. Because the biggest difference in survival rate appears to be between unknown and any known Cabin, I debated collapsing the feature into a simple boolean of known vs. unknown cabin. While I ultimately decided to leave the Cabin information broken out, combining the cabins could simplify the model without losing much information; a quick sketch of that alternative is shown below.
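Just for reference, the boolean version would look something like this (CabinKnown is a hypothetical column name that isn't used later in this notebook):
#Alternative: flag whether the cabin is known (anything other than "U")
combined["CabinKnown"] = (combined["Cabin"] != "U").astype(int)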
Now that we have the Cabin feature finalized, let's look at Sex. The code below changes the "male", "female" strings into simple 1s and 0s which are more friendly to machine learning models.
combined = combined.replace(["male","female"],[1,0])
combined.tail(5)
combined.isnull().sum()
We're getting closer; let's look at the two missing Embarked values next.
sns.countplot(x="Embarked", data=combined, color="#222343")
sns.despine(trim=True, bottom=True)
Since we are only missing the Embarked value in 2 out of the 1309 entries in the combined dataset, I chose to simply fill the missing values with the most common value. The graph above shows that the vast majority of passengers embarked at Southampton ("S"), so let's fill the missing values with "S":
combined.Embarked = combined.Embarked.fillna("S")
combined.isnull().sum()
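If you would rather not hard-code the port, the same fill can be derived from the data itself; a one-line sketch that is equivalent here, since "S" is the most common value:
#Fill missing Embarked values with the most common port instead of a hard-coded "S"
combined.Embarked = combined.Embarked.fillna(combined.Embarked.mode()[0])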
So far we have cleaned up the values for categorical variables, but we still have to tackle the continuous variables of fare and age. Let's look at the one missing Fare value next. Because there is only one missing value, there isn't a need to get super fancy on filling in this value (since it will not have a large effect on our data overall). However, we know that fare is probably closely correlated to the demographics of the passenger and the passenger class of the ticket. Let's look at the passenger who does not have Fare information:
combined[combined['Fare'].isnull()]
We now know that the missing Fare belongs to Mr. Thomas Storey, a 60-year-old male who bought a third-class ticket and embarked at Southampton. Using this information we can see what the median ticket price is for similar passengers.
grouped = combined.groupby(['Sex','Pclass','Embarked'])
grouped.median()
By grouping by Sex, Pclass, and Embarked, we can see that the median ticket price paid by a third class male that departed from Southampton was $8.05. We'll use that value to fill in the missing fare:
combined.Fare = combined.Fare.fillna(8.05)
combined.isnull().sum()
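Rather than hard-coding 8.05, the same value can be pulled straight from the grouped median table; a sketch that is equivalent here (median_fare is just a throwaway variable name):
#Look up the median fare for a third-class male (Sex = 1) who embarked at Southampton
median_fare = grouped.median().loc[(1, 3, "S"), "Fare"]
combined.Fare = combined.Fare.fillna(median_fare)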
We now have filled in values for everything except Age. From the swarmplot earlier, we know that Age is an important variable in determining whether or not a passenger survives (most male children survive, most adult males perish). Therefore, we should try our best to fill in age using all the information available to us. To do this we can use K nearest neighbors to impute the missing age values. In order to implement a KNN regressor, we first have to change our categorical variables into dummy variables (since that is what the model expects as input). The code below gets dummy values for Cabin, Embarked, and Pclass:
col = ["Cabin","Embarked","Pclass"]
combined = pd.get_dummies(combined,columns=col)
The final feature engineering I did was to combine the sibling/spouses (SibSp) aboard and the parent/children aboard (Parch) features into one "Party Size" feature. My thought was that families might have a greater chance of surviving than people traveling alone. Combining the two columns is pretty simple in Pandas as shown below:
combined["Party_Size"] = combined["SibSp"] + combined["Parch"] + 1
combined.head(5)
I then binned the party size variable into solo travelers (0, 1], small parties (1, 4], and large parties (4, 12]:
party_bins = [0,1,4,12]
party_dummies = pd.get_dummies(pd.cut(combined["Party_Size"], bins=party_bins))
combined = pd.concat([combined, party_dummies], axis=1)
combined.head(5)
combined.rename(columns={'(0, 1]':'Solo', '(1, 4]':'SmallFam', '(4, 12]':'LargeFam'}, inplace = True)
combined.columns
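One caveat: in newer versions of pandas, pd.cut produces Interval-typed categories, so the dummy column names may not be the plain strings the rename above expects, and the rename can silently do nothing. A sketch that sidesteps this by naming the bins directly in pd.cut (party_dummies_named is a hypothetical variable; labels is a standard pd.cut parameter):
#Name the bins up front so the dummy columns come out as Solo/SmallFam/LargeFam directly
party_labels = ["Solo", "SmallFam", "LargeFam"]
party_dummies_named = pd.get_dummies(pd.cut(combined["Party_Size"], bins=party_bins, labels=party_labels))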
sns.barplot(x='Party_Size', y='Survived', capsize=.25, color="#222343", errcolor="#bbca42", data=combined)
sns.despine(bottom=True)
We're now ready to impute our null Age values. The code below splits the combined dataset into a train and test set based on whether or not the Age feature is null:
AgeTrain = combined[combined['Age'].notnull()]
AgeTest = combined[combined['Age'].isnull()]
AgeTest.describe()
I only want to use features that I believe would be predictive of a passenger's age. Therefore I dropped the columns below (note: I decided to use the Party Size feature instead of the binned solo, small fam, large fam features):
dropcolumns = ["Name","PassengerId","Survived","Ticket","Age", "Solo", "SmallFam", "LargeFam"]
AgeTrainX = AgeTrain.drop(dropcolumns, axis=1)
AgeTrainY = AgeTrain["Age"]
AgeTestX = AgeTest.drop(dropcolumns, axis=1)
The code below imports and trains the K nearest neighbor regressor model:
from sklearn import preprocessing
AgeTrainX = preprocessing.scale(AgeTrainX)
AgeTestX = preprocessing.scale(AgeTestX)
from sklearn import neighbors
knnclass = neighbors.KNeighborsRegressor(n_neighbors=2)
knnclass.fit(AgeTrainX,AgeTrainY)
predage = knnclass.predict(AgeTrainX)
from sklearn import metrics
metrics.median_absolute_error(AgeTrainY,predage)
I first fit the model on AgeTrainX and AgeTrainY. This trains the model on all entries with a known Age value. When I then "predict" the ages of those same known entries using the fitted model, the median absolute error is about 6 years. Not too shabby. The code below looks at the first ten predictions compared to the actual values:
print predage[:10]
print np.array(AgeTrainY[:10])
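Keep in mind that the error above was measured on the same rows the model was fit on, so it is optimistic. A quick cross-validated check gives a more honest out-of-sample estimate; this is a sketch (cv_age is just a throwaway variable name):
#Out-of-sample check: cross-validated predictions for the age model
from sklearn.cross_validation import cross_val_predict
cv_age = cross_val_predict(neighbors.KNeighborsRegressor(n_neighbors=2), AgeTrainX, AgeTrainY)
print metrics.median_absolute_error(AgeTrainY, cv_age)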
Now that we have a fitted model, let's predict the ages of our test set:
imputedage = knnclass.predict(AgeTestX)
imputedage
AgeTest["Age"] = imputedage
AgeTest
We've filled in the missing values; now we have to combine the test data back in with the training data, sort by PassengerId, and then split the data back into the original training and test sets:
finalcombined = AgeTrain.append(AgeTest).sort_values(by='PassengerId',ascending=1)
finalcombined.isnull().sum()
FinalTrain = finalcombined[finalcombined['Survived'].notnull()]
FinalTest = finalcombined[finalcombined['Survived'].isnull()]
FinalTest.tail(5)
Training Random Forest Classifier¶
Alright, now the exciting part. We have cleaned up our dataset and are ready to train our first classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score, cross_val_predict
dropcolumns = ["Name","PassengerId","Survived","Ticket", "Party_Size"]
FinalTrainX = FinalTrain.drop(dropcolumns, axis=1)
FinalTrainY = FinalTrain["Survived"]
FinalTestX = FinalTest.drop(dropcolumns, axis=1)
rfc = RandomForestClassifier(n_estimators=1000)
print"The cross validation scores are: {0}".format(cross_val_score(rfc, FinalTrainX, FinalTrainY))
predicted = cross_val_predict(rfc, FinalTrainX, FinalTrainY)
print "Precision: {0}".format(metrics.precision_score(FinalTrainY, predicted))
print "Recall: {0}".format(metrics.recall_score(FinalTrainY, predicted))
print "The F1 score is: {0}".format(metrics.f1_score(FinalTrainY, predicted))
trained_rfc = rfc.fit(FinalTrainX,FinalTrainY)
importances = trained_rfc.feature_importances_
std = np.std([tree.feature_importances_ for tree in trained_rfc.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
print FinalTrainX.columns
print
# Print the feature ranking
print("Feature ranking:")
for f in range(FinalTrainX.shape[1]):
print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
Tsurvived = trained_rfc.predict(FinalTestX)
Tsurvived
finalresults = pd.DataFrame(FinalTest["PassengerId"])
finalresults["Survived"] = Tsurvived.astype(int)
finalresults.head(5)
#finalresults.to_csv("/Titanic Kaggle/PartySize.csv", index = False)
Scored .75120¶
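To sanity-check the predictions, the swarm plots below compare the model's predicted survival on the test set against the actual outcomes in the training set, broken out by Sex and Age: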
modelresults = FinalTest
modelresults["Survived"] = Tsurvived
sns.factorplot(x="Sex", y="Age", hue="Survived", kind="swarm", data=modelresults)
sns.despine(bottom=True)
sns.factorplot(x="Sex", y="Age", hue="Survived", kind="swarm", data=traindf, order=['female','male'])
sns.despine(bottom=True)
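Next, let's see whether a support vector machine can do any better. SVMs are sensitive to feature scale, so the features are standardized first, and then linear, RBF, polynomial, and LinearSVC models are compared on a held-out split of the training data: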
scaledTrainX = preprocessing.scale(FinalTrainX)
scaledTestX = preprocessing.scale(FinalTestX)
scaledTrainX
from sklearn import svm
from sklearn.cross_validation import train_test_split
C = 1 # SVM regularization parameter
X_train, X_test, y_train, y_test = train_test_split(scaledTrainX, FinalTrainY, test_size=0.4, random_state=0)
svc = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X_train, y_train)
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X_train, y_train)
lin_svc = svm.LinearSVC(C=C).fit(X_train, y_train)
print svc.score(X_test, y_test)
print rbf_svc.score(X_test, y_test)
print poly_svc.score(X_test, y_test)
print lin_svc.score(X_test, y_test)
print"The cross validation scores are: {0}".format(cross_val_score(lin_svc, scaledTrainX, FinalTrainY))
lsvc_predicted = cross_val_predict(lin_svc, scaledTrainX, FinalTrainY)
print "Precision: {0}".format(metrics.precision_score(FinalTrainY, lsvc_predicted))
print "Recall: {0}".format(metrics.recall_score(FinalTrainY, lsvc_predicted))
print "The F1 score is: {0}".format(metrics.f1_score(FinalTrainY, lsvc_predicted))
lin_svc = svm.LinearSVC(C=C).fit(scaledTrainX, FinalTrainY)
lin_svcpred = lin_svc.predict(scaledTestX)
lin_svcpred
svcresults = pd.DataFrame(FinalTest["PassengerId"])
svcresults["Survived"] = lin_svcpred.astype(int)
svcresults.head(5)
svcresults.to_csv("/Titanic Kaggle/svcResults.csv", index = False)
Scored .75598¶
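As a final experiment, let's hand the pipeline search over to TPOT. TPOT uses genetic programming to search over scikit-learn preprocessing and model combinations; the best pipeline it finds can be exported as plain Python, which is then re-created and re-fit below: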
from tpot import TPOTClassifier
pipeline_optimizer = TPOTClassifier(generations=50, population_size=25, num_cv_folds=5, random_state=42, verbosity=2)
pipeline_optimizer.fit(FinalTrainX, FinalTrainY)
print(pipeline_optimizer.score(FinalTrainX, FinalTrainY))
pipeline_optimizer.export('/Titanic Kaggle/tpot_Titanic_pipeline.py')
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from tpot.operators.preprocessors import ZeroCount
exported_pipeline = make_pipeline(
StandardScaler(),
ZeroCount(),
GradientBoostingClassifier(learning_rate=0.24, max_features=0.24, n_estimators=500)
)
exported_pipeline.fit(FinalTrainX, FinalTrainY)
Tpotresults = exported_pipeline.predict(FinalTestX)
finalTpotresults = pd.DataFrame(FinalTest["PassengerId"])
finalTpotresults["Survived"] = Tpotresults.astype(int)
finalTpotresults.head(5)
finalTpotresults.to_csv("/Titanic Kaggle/Tpot.csv", index = False)