Titanic: Machine Learning from Disaster¶

Kaggle Competition¶

Competition Description¶

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Data Ingestion¶

Let's start out by loading in the data. The Kaggle competition supplies both the training and test data in two .csv files. Download the data and simply point the Panda's read_csv to the file location. After loading in the data, I like to look at the first few lines to get an idea of the schema.

In [1]:

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns #visualizations
import scipy as scipy

trainpath = "/train.csv"
testpath = "/test.csv"

traindf = pd.read_csv(trainpath)
testdf = pd.read_csv(testpath)

In [2]:

traindf.head(5)

Out[2]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

As you can see, some of the feature names could be confusing if there wasn't any other documentation. Thankfully the Kaggle competition contained a data dictionary that explains what each feature represents.

FEATURE DESCRIPTIONS:¶

Feature	Description
survived	Survival (0 = No; 1 = Yes)
pclass	Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name	Name
sex	Sex
age	Age
sibsp	Number of Siblings/Spouses Aboard
parch	Number of Parents/Children Aboard
ticket	Ticket Number
fare	Passenger Fare
cabin	Cabin
embarked	Port of Embarkation(C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES: Pclass is a proxy for socio-economic status (SES) 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1) If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored) Parent: Mother or Father of Passenger Aboard Titanic Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.

Initial Exploration¶

After loading up a new dataset it is often helpful to do some initial exploration of the data to get an idea of how "clean" the data is and note any obvious patterns. One of the easiest ways to do this in Pandas is through the .describe method. The describe method generates various summary statistics, excluding any missing/null values.

In [3]:

traindf.describe()

Out[3]:

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

In [4]:

testdf.describe()

Out[4]:

	PassengerId	Pclass	Age	SibSp	Parch	Fare
count	418.000000	418.000000	332.000000	418.000000	418.000000	417.000000
mean	1100.500000	2.265550	30.272590	0.447368	0.392344	35.627188
std	120.810458	0.841838	14.181209	0.896760	0.981429	55.907576
min	892.000000	1.000000	0.170000	0.000000	0.000000	0.000000
25%	996.250000	1.000000	21.000000	0.000000	0.000000	7.895800
50%	1100.500000	3.000000	27.000000	0.000000	0.000000	14.454200
75%	1204.750000	3.000000	39.000000	1.000000	0.000000	31.500000
max	1309.000000	3.000000	76.000000	8.000000	9.000000	512.329200

By simply running the describe method we can learn quite a bit about our training and test sets. First, we can tell that our training set is about twice as large as our test set. Second, by looking at the "counts" we know that there are missing values that will need to be addressed in order to run any classifier algorithms. We can also get an idea of the average age, fare, and family size of the passengers.

Let's create some initial visualizations of our training set to see if any obvious patterns of survival emerge:

In [5]:

patagonia_colors = ["#222343", "#bbca42", "#7abbdb", "#eca935", "#ec7869", "#332212"]
current_palette = sns.color_palette(patagonia_colors)
sns.palplot(current_palette)
sns.set_palette(current_palette)
sns.set_context("talk")
sns.set_style("white")
sns.set_style("ticks")

Now that we have our plots formatted, let's get started by first comparing how many passengers in our training data survived.

In [6]:

sns.countplot(x="Survived", data=traindf, color="#222343")
sns.despine(bottom=True)

As expected there are more people who died than survived in the training data. Let's look a little deeper to see if there are any features that increase your chances of surviving. Even if all you know about the Titanic disaster was learned from the movie Titanic, you would know they tried to save women and children first. Let's take a look at the number of survivors by gender and age:

In [7]:

sns.barplot(x="Sex", y="Survived", capsize=.25, color="#222343", errcolor="#bbca42", data=traindf)
sns.despine(bottom=True)

The plot above shows the percentage of male and female passengers in the training set that survived. We can see that being female drastically increases a passenger's survival chances (~75% of females live vs. ~20% of males live). Let's see how age affects survival within the two sexs:

In [8]:

sns.swarmplot(x="Age", y="Sex", hue="Survived", data=traindf)
sns.despine(trim=True, left=True)

The swarmplot above is a useful way of looking at data when you want to look at the individual observations along with some representation of the underlying distribution. This plot is particularly useful for examining this dataset, because it allows us to quickly determine survival rate by both Age and Sex. For example, we can see that the chances of a male surviving are very low when compared to a female. However this does not apply to young males. It looks like being a child (no matter the Sex) drastically increased a passengers chances of survival. We can also see that there are more males than females in our dataset and that the distribution of ages is older for males than females. By looking at the proportion of green to blue dots it is easy to see why a simple gender model scores so well for this challenge!

In [9]:

#Lines on markers represent 95% ci
sns.pointplot(x="Pclass", y="Survived", hue="Sex", palette=["#222343","#ec7869"], order=[3,2,1], data=traindf)
sns.despine(bottom=True, right=False)

The slopegraph above shows the increase in average survival rate when examining Passenger Class. From the graph, it appears that being a female in either first or second class all but guaranteed your survival. The survival rate of females in third class is about half as high as females in first or second class. However, females in third class still have more than double the survival rate of males in thrid class. While there is some increase in the survival rate of males from second to third class, the likelihood of a male surviving is still below 50%.

Data Wrangling / Munging¶

We know from running the describe method earlier that there are some missing values in our dataset that need to be addressed. Let's look at those missing values a little closer:

In [10]:

print traindf.isnull().sum()
print "\n" 
print testdf.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [11]:

combined = traindf.append(testdf)

In [12]:

combined.describe()

Out[12]:

	Age	Fare	Parch	PassengerId	Pclass	SibSp	Survived
count	1046.000000	1308.000000	1309.000000	1309.000000	1309.000000	1309.000000	891.000000
mean	29.881138	33.295479	0.385027	655.000000	2.294882	0.498854	0.383838
std	14.413493	51.758668	0.865560	378.020061	0.837836	1.041658	0.486592
min	0.170000	0.000000	0.000000	1.000000	1.000000	0.000000	0.000000
25%	21.000000	7.895800	0.000000	328.000000	2.000000	0.000000	0.000000
50%	28.000000	14.454200	0.000000	655.000000	3.000000	0.000000	0.000000
75%	39.000000	31.275000	0.000000	982.000000	3.000000	1.000000	1.000000
max	80.000000	512.329200	9.000000	1309.000000	3.000000	8.000000	1.000000

In [13]:

combined.tail(5)

Out[13]:

	Age	Cabin	Embarked	Fare	Name	Parch	PassengerId	Pclass	Sex	SibSp	Survived	Ticket
413	NaN	NaN	S	8.0500	Spector, Mr. Woolf	0	1305	3	male	0	NaN	A.5. 3236
414	39.0	C105	C	108.9000	Oliva y Ocana, Dona. Fermina	0	1306	1	female	0	NaN	PC 17758
415	38.5	NaN	S	7.2500	Saether, Mr. Simon Sivertsen	0	1307	3	male	0	NaN	SOTON/O.Q. 3101262
416	NaN	NaN	S	8.0500	Ware, Mr. Frederick	0	1308	3	male	0	NaN	359309
417	NaN	NaN	C	22.3583	Peter, Master. Michael J	1	1309	3	male	1	NaN	2668

In [14]:

combined.isnull().sum()

Out[14]:

Age             263
Cabin          1014
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64

Because the training dataset is relatively small, I combined the training and test dataset into a new "combined" dataset will be used to interpolate the missing values. I believe this gives the best chance at an accurate result for filling in the missing values. This step may not be necessary for problems with a large training set.

From the table above, we can see that the Cabin feature has the highest number of missing values. Because this is a categorical variable, I decided to simply create a new category, "U" for unknown. I didn't see the importance of noting the individual room of each passenger, so in order to simplify the feature, I simply took the first letter of the Cabin. The first letter should indicate the deck of the ship where the passenger's room was located. The code below fills in "na" for values with "U" and then maps a labmda function that replaces each cabin number with the first letter of the entry. i.e. C105 becomes "C"

In [15]:

#Replaces value of missing cabin with U for unknown
combined.Cabin.fillna("U",inplace=True) 

# mapping each Cabin value with the cabin letter
combined['Cabin'] = combined['Cabin'].map(lambda c : c[0])

combined.tail(5)

Out[15]:

	Age	Cabin	Embarked	Fare	Name	Parch	PassengerId	Pclass	Sex	SibSp	Survived	Ticket
413	NaN	U	S	8.0500	Spector, Mr. Woolf	0	1305	3	male	0	NaN	A.5. 3236
414	39.0	C	C	108.9000	Oliva y Ocana, Dona. Fermina	0	1306	1	female	0	NaN	PC 17758
415	38.5	U	S	7.2500	Saether, Mr. Simon Sivertsen	0	1307	3	male	0	NaN	SOTON/O.Q. 3101262
416	NaN	U	S	8.0500	Ware, Mr. Frederick	0	1308	3	male	0	NaN	359309
417	NaN	U	C	22.3583	Peter, Master. Michael J	1	1309	3	male	1	NaN	2668

Now that we have our Cabin feature completed, let's see what the survival rate of passengers is broken down by cabin:

In [16]:

sns.barplot(x="Cabin",y="Survived", capsize=.25, color="#222343", errcolor="#bbca42", data = combined)
sns.despine(trim=True, bottom=True)

From the graph above it looks like the survival rates of known Cabins is about twice as high as the unknown ("U") cabin. Furthermore the survival rate of konwn cabins looks to be somewhere between .5 and .8. Because the biggest difference in survival rate appears to be between unknown and any known Cabin, I debated combining the features categories into a simple boolean of known cabin or unknown cabin. While I ultimately decided to leave the Cabins information broken out, combining the cabins could simplify the model without loosing much information.

Now that we have the Cabin feature finalized, let's look at Sex. The code below changes the "male", "female" strings into simple 1s and 0s which are more friendly to machine learning models.

In [17]:

combined = combined.replace(["male","female"],[1,0])
combined.tail(5)

Out[17]:

	Age	Cabin	Embarked	Fare	Name	Parch	PassengerId	Pclass	Sex	SibSp	Survived	Ticket
413	NaN	U	S	8.0500	Spector, Mr. Woolf	0	1305	3	1	0	NaN	A.5. 3236
414	39.0	C	C	108.9000	Oliva y Ocana, Dona. Fermina	0	1306	1	0	0	NaN	PC 17758
415	38.5	U	S	7.2500	Saether, Mr. Simon Sivertsen	0	1307	3	1	0	NaN	SOTON/O.Q. 3101262
416	NaN	U	S	8.0500	Ware, Mr. Frederick	0	1308	3	1	0	NaN	359309
417	NaN	U	C	22.3583	Peter, Master. Michael J	1	1309	3	1	1	NaN	2668

In [18]:

combined.isnull().sum()

Out[18]:

Age            263
Cabin            0
Embarked         2
Fare             1
Name             0
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       418
Ticket           0
dtype: int64

We're getting closer, let's look at the two missing Embarked values next.

In [19]:

sns.countplot(x="Embarked", data=combined, color="#222343")
sns.despine(trim=True, bottom=True)

Since we are only missing values in 2 out of the 1309 entries in the combined dataset for the Embarked feature, I chose to simply fill the missing value with the most common value. The graph above shows that the vast majority of passengers Embarked at Southampton "S" so let's fill the missing values with "S"

In [20]:

combined.Embarked = combined.Embarked.fillna("S")

In [21]:

combined.isnull().sum()

Out[21]:

Age            263
Cabin            0
Embarked         0
Fare             1
Name             0
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       418
Ticket           0
dtype: int64

So far we have cleaned up the values for categorical variables, but we still have to tackle the continuous variables of fare and age. Let's look at the one missing Fare value next. Because there is only one missing value, there isn't a need to get super fancy on filling in this value (since it will not have a large effect on our data overall). However, we know that fare is probably closely correlated to the demographics of the passenger and the passenger class of the ticket. Let's look at the passenger who does not have Fare information:

In [22]:

combined[combined['Fare'].isnull()]

Out[22]:

	Age	Cabin	Embarked	Fare	Name	Parch	PassengerId	Pclass	Sex	SibSp	Survived	Ticket
152	60.5	U	S	NaN	Storey, Mr. Thomas	0	1044	3	1	0	NaN	3701

We now know that the missing Fare belongs to Mr. Thomas Storey, a 60 year old male that bought a third class ticket in Southampton. Using this information we can see what the median ticket price is for similar passengers.

In [23]:

grouped = combined.groupby(['Sex','Pclass','Embarked'])
grouped.median()

Out[23]:

			Age	Fare	Parch	PassengerId	SibSp	Survived
Sex	Pclass	Embarked
0	1	C	38.00	83.15830	0.0	701.0	0.0	1.0
		Q	35.00	90.00000	0.0	858.0	1.0	1.0
		S	35.00	79.65000	0.0	731.0	1.0	1.0
	2	C	23.00	27.72080	0.0	867.0	1.0	1.0
		Q	30.00	12.35000	0.0	313.5	0.0	1.0
		S	28.00	23.00000	0.0	581.0	0.0	1.0
	3	C	15.00	14.45420	1.0	645.0	0.0	1.0
		Q	22.00	7.75000	0.0	669.5	0.0	1.0
		S	22.00	13.77500	0.0	568.0	1.0	0.0
1	1	C	39.00	62.66875	0.0	690.5	0.0	0.0
		Q	44.00	90.00000	0.0	246.0	2.0	0.0
		S	42.00	35.50000	0.0	624.0	0.0	0.0
	2	C	29.00	15.04580	0.0	818.0	0.0	0.0
		Q	59.00	12.35000	0.0	908.0	0.0	0.0
		S	29.00	13.00000	0.0	696.0	0.0	0.0
	3	C	24.25	7.22920	0.0	747.5	0.0	0.0
		Q	25.00	7.75000	0.0	704.0	0.0	0.0
		S	25.00	8.05000	0.0	638.5	0.0	0.0

By grouping by Sex, Pclass, and Embarked, we can see that the median ticket price paid by a third class male that departed from Southampton was $8.05. We'll use that value to fill in the missing fare:

In [24]:

combined.Fare = combined.Fare.fillna(8.05)
combined.isnull().sum()

Out[24]:

Age            263
Cabin            0
Embarked         0
Fare             0
Name             0
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       418
Ticket           0
dtype: int64

We now have filled in values for everything except Age. From the swarmplot earlier, we know that Age is an important variable in determining whether or not a passenger survives (most male children survive, most adult males parish). Therefore, we should try our best to fill in age using all the information available to us. To do this we can use K nearest neighbors to impute the missing age values. In order to implement a KNN regressor, we first have to change our categorical variables into dummy variables (since that is what the model expects as an input). The code below gets dummy values for Cabin, Embarked, and Pclass:

In [25]:

col = ["Cabin","Embarked","Pclass"]
combined = pd.get_dummies(combined,columns=col)

The final feature engineering I did was to combine the sibling/spouses (SibSp) aboard and the parent/children aboard (Parch) features into one "Party Size" feature. My thought was that families might have a greater chance of surviving than people traveling alone. Combining the two columns is pretty simple in Pandas as shown below:

In [26]:

combined["Party_Size"] = combined["SibSp"] + combined["Parch"] + 1
combined.head(5)

Out[26]:

	Age	Fare	Name	PassengerId	Sex	SibSp	Survived	Ticket	...	Cabin_U	Embarked_C	Embarked_S	Pclass_1	Pclass_3	Party_Size
0	22.0	7.2500	Braund, Mr. Owen Harris	1	1	1	0.0	A/5 21171	...	1.0	0.0	1.0	0.0	1.0	2
1	38.0	71.2833	Cumings, Mrs. John Bradley (Florence Briggs Th...	2	0	1	1.0	PC 17599	...	0.0	1.0	0.0	1.0	0.0	2
2	26.0	7.9250	Heikkinen, Miss. Laina	3	0	0	1.0	STON/O2. 3101282	...	1.0	0.0	1.0	0.0	1.0	1
3	35.0	53.1000	Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	0	1	1.0	113803	...	0.0	0.0	1.0	1.0	0.0	2
4	35.0	8.0500	Allen, Mr. William Henry	5	1	0	0.0	373450	...	1.0	0.0	1.0	0.0	1.0	1

5 rows × 25 columns

I then binned the party size variable into solo travelors (0-1], small parties (1-4], and large parties (4-12]:

In [27]:

party_bins = [0,1,4,12]
party_dummies = pd.get_dummies(pd.cut(combined["Party_Size"], bins=party_bins))
combined = pd.concat([combined, party_dummies], axis=1)
combined.head(5)

Out[27]:

	Age	Fare	Name	PassengerId	Sex	SibSp	Survived	Ticket	...	Embarked_C	Embarked_S	Pclass_1	Pclass_3	Party_Size	(0, 1]	(1, 4]
0	22.0	7.2500	Braund, Mr. Owen Harris	1	1	1	0.0	A/5 21171	...	0.0	1.0	0.0	1.0	2	0.0	1.0
1	38.0	71.2833	Cumings, Mrs. John Bradley (Florence Briggs Th...	2	0	1	1.0	PC 17599	...	1.0	0.0	1.0	0.0	2	0.0	1.0
2	26.0	7.9250	Heikkinen, Miss. Laina	3	0	0	1.0	STON/O2. 3101282	...	0.0	1.0	0.0	1.0	1	1.0	0.0
3	35.0	53.1000	Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	0	1	1.0	113803	...	0.0	1.0	1.0	0.0	2	0.0	1.0
4	35.0	8.0500	Allen, Mr. William Henry	5	1	0	0.0	373450	...	0.0	1.0	0.0	1.0	1	1.0	0.0

5 rows × 28 columns

In [28]:

combined.rename(columns={'(0, 1]':'Solo', '(1, 4]':'SmallFam', '(4, 12]':'LargeFam'}, inplace = True)
combined.columns

Out[28]:

Index([u'Age', u'Fare', u'Name', u'Parch', u'PassengerId', u'Sex', u'SibSp',
       u'Survived', u'Ticket', u'Cabin_A', u'Cabin_B', u'Cabin_C', u'Cabin_D',
       u'Cabin_E', u'Cabin_F', u'Cabin_G', u'Cabin_T', u'Cabin_U',
       u'Embarked_C', u'Embarked_Q', u'Embarked_S', u'Pclass_1', u'Pclass_2',
       u'Pclass_3', u'Party_Size', u'Solo', u'SmallFam', u'LargeFam'],
      dtype='object')

In [29]:

sns.barplot(x='Party_Size', y='Survived', capsize=.25, color="#222343", errcolor="#bbca42", data=combined)
sns.despine(bottom=True)

We're now ready to impute our null Age values. The code below splits the combined dataset into a train and test set based on whether or not the Age feature is null:

In [30]:

AgeTrain = combined[combined['Age'].notnull()]
AgeTest = combined[combined['Age'].isnull()]

In [31]:

AgeTest.describe()

Out[31]:

	Age	Fare	Parch	PassengerId	Sex	SibSp	Survived	Cabin_A	Cabin_B	Cabin_C	...	Embarked_C	Embarked_Q	Embarked_S	Pclass_1	Pclass_2	Pclass_3	Party_Size	Solo	SmallFam	LargeFam
count	0.0	263.000000	263.000000	263.000000	263.000000	263.000000	177.000000	263.000000	263.000000	263.000000	...	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	263.00000	263.000000
mean	NaN	19.823319	0.243346	653.558935	0.703422	0.482890	0.293785	0.011407	0.007605	0.030418	...	0.220532	0.277567	0.501901	0.148289	0.060837	0.790875	1.726236	0.760456	0.18251	0.057034
std	NaN	27.550667	0.949941	380.161812	0.457620	1.448443	0.456787	0.106394	0.087038	0.172063	...	0.415396	0.448652	0.500950	0.356064	0.239486	0.407460	2.017376	0.427619	0.38700	0.232350
min	NaN	0.000000	0.000000	6.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.00000	0.000000
25%	NaN	7.750000	0.000000	335.500000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	1.000000	1.000000	0.00000	0.000000
50%	NaN	8.050000	0.000000	630.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	1.000000	0.000000	0.000000	1.000000	1.000000	1.000000	0.00000	0.000000
75%	NaN	22.804150	0.000000	999.500000	1.000000	0.000000	1.000000	0.000000	0.000000	0.000000	...	0.000000	1.000000	1.000000	0.000000	0.000000	1.000000	1.000000	1.000000	0.00000	0.000000
max	NaN	227.525000	9.000000	1309.000000	1.000000	8.000000	1.000000	1.000000	1.000000	1.000000	...	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	11.000000	1.000000	1.00000	1.000000

8 rows × 26 columns

I only want to use features that I believe would be predictive of a passenger's age. Therefore I dropped the columns below (note: I decided to use the Party Size feature instead of the binned solo, small fam, large fam features):

In [32]:

dropcolumns = ["Name","PassengerId","Survived","Ticket","Age", "Solo", "SmallFam", "LargeFam"]
AgeTrainX = AgeTrain.drop(dropcolumns, axis=1)
AgeTrainY = AgeTrain["Age"]
AgeTestX = AgeTest.drop(dropcolumns, axis=1)

The code below imports and trains the K nearest neighbor regressor model:

In [33]:

from sklearn import preprocessing
AgeTrainX = preprocessing.scale(AgeTrainX)
AgeTestX = preprocessing.scale(AgeTestX)

In [34]:

from sklearn import neighbors
knnclass = neighbors.KNeighborsRegressor(n_neighbors=2)

knnclass.fit(AgeTrainX,AgeTrainY)

Out[34]:

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=2, p=2,
          weights='uniform')

In [35]:

predage = knnclass.predict(AgeTrainX)

In [36]:

from sklearn import metrics
metrics.median_absolute_error(AgeTrainY,predage)

Out[36]:

6.0

I first fit the model on AgeTrainX and AgeTrainY. This trains the model on all known entries with an Age Value. When I try to then "predict" the value of the known ages using the fitted model, there is a median error of 6 years. Not too shabby. The code below looks at the first ten predictions compared to the actual values:

In [37]:

print predage[:10]
print np.array(AgeTrainY[:10])

[ 19.5  36.5  26.5  26.5  27.5  50.5   4.   31.5  19.    2.5]
[ 22.  38.  26.  35.  35.  54.   2.  27.  14.   4.]

Now that we have a fitted model, let's predict the ages of our test set:

In [38]:

imputedage = knnclass.predict(AgeTestX)
imputedage

Out[38]:

array([ 29.75 ,  33.5  ,  27.5  ,  34.   ,  17.5  ,  19.   ,  31.5  ,
        17.5  ,  34.   ,  34.   ,  29.75 ,  30.75 ,  17.5  ,  22.5  ,
        51.   ,  55.5  ,  13.5  ,  19.   ,  29.75 ,  17.5  ,  29.75 ,
        29.75 ,  19.   ,  19.   ,  24.5  ,  29.75 ,  19.   ,  18.5  ,
        15.   ,  19.5  ,  29.75 ,  11.5  ,  30.5  ,  22.5  ,  17.5  ,
        13.   ,  32.   ,  29.5  ,  24.5  ,  19.   ,  17.5  ,  11.5  ,
        36.5  ,  19.   ,   5.5  ,  17.5  ,  14.75 ,  24.5  ,  19.5  ,
        36.   ,  19.   ,  17.5  ,  22.5  ,  17.5  ,  21.5  ,  29.5  ,
        55.5  ,  51.   ,  17.5  ,  22.5  ,  44.75 ,  29.75 ,  43.5  ,
        11.5  ,  20.   ,  42.   ,  19.   ,  24.5  ,  51.   ,  34.   ,
        17.5  ,  17.5  ,  30.75 ,  27.5  ,  17.5  ,  36.   ,  19.   ,
        19.5  ,   5.5  ,  19.   ,  19.5  ,  21.5  ,  17.5  ,  34.   ,
        19.5  ,  19.   ,  24.5  ,  29.75 ,  30.75 ,  29.75 ,  57.   ,
        19.   ,  29.75 ,  21.5  ,  19.5  ,  19.5  ,  29.5  ,  21.5  ,
         5.5  ,  30.75 ,  18.5  ,  35.5  ,  17.5  ,  22.5  ,  29.75 ,
        29.   ,  34.   ,  34.   ,  48.   ,  34.   ,  15.   ,  35.5  ,
        32.   ,  19.   ,  40.   ,  19.   ,  29.75 ,  17.5  ,  34.   ,
        17.5  ,  14.75 ,  18.5  ,  29.75 ,  31.5  ,  19.5  ,  34.   ,
        19.   ,  22.5  ,  19.5  ,  24.5  ,  19.   ,  19.   ,  43.5  ,
        30.75 ,  29.   ,  19.5  ,  19.   ,  17.5  ,  19.   ,  19.   ,
        35.   ,  21.5  ,  17.5  ,  29.   ,  17.5  ,  13.5  ,  51.   ,
        35.5  ,  17.5  ,  21.5  ,  19.   ,  19.   ,  54.   ,  35.5  ,
        58.5  ,  30.75 ,  34.   ,  22.   ,  19.   ,  13.5  ,  19.   ,
        13.   ,  58.5  ,  44.5  ,  19.5  ,  29.   ,  19.   ,  34.   ,
        29.75 ,  41.   ,  11.5  ,  22.   ,  34.   ,  13.   ,  24.75 ,
        19.   ,  16.585,  19.   ,  23.5  ,  22.5  ,  16.585,  17.5  ,
        29.   ,  54.   ,  19.   ,  32.   ,  30.75 ,  19.5  ,  29.75 ,
        19.   ,  43.5  ,  22.5  ,  17.5  ,  19.   ,  29.75 ,  19.   ,
        19.   ,  29.75 ,  17.5  ,  34.   ,  36.5  ,  19.   ,  20.   ,
        38.5  ,  15.   ,  50.5  ,  22.5  ,  34.   ,  17.5  ,  19.5  ,
        42.5  ,  19.5  ,  34.   ,  19.   ,  13.   ,  22.5  ,  17.5  ,
        13.5  ,  55.5  ,  19.5  ,  17.5  ,  29.75 ,  15.   ,  17.5  ,
        19.   ,  19.   ,  13.5  ,  14.75 ,  19.5  ,  19.   ,  19.   ,
        43.5  ,  19.5  ,  17.5  ,  19.   ,  24.5  ,  34.   ,  17.5  ,
        19.5  ,  25.   ,  29.75 ,  22.5  ,  34.   ,  22.5  ,  28.   ,
        17.5  ,  19.5  ,  34.   ,  34.   ,  41.5  ,  34.5  ,  19.   ,
        19.   ,  41.5  ,  22.5  ,  19.   ,  17.5  ,  33.5  ,  17.5  ,
        17.5  ,  29.75 ,  29.75 ,  13.5  ])

In [39]:

AgeTest["Age"] = imputedage

/Applications/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

In [40]:

AgeTest

Out[40]:

	Age	Fare	Name	Parch	PassengerId	Sex	SibSp	Survived	Ticket	Cabin_A	...	Embarked_C	Embarked_Q	Embarked_S	Pclass_1	Pclass_2	Pclass_3	Party_Size	Solo	SmallFam	LargeFam
5	29.75	8.4583	Moran, Mr. James	0	6	1	0	0.0	330877	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
17	33.50	13.0000	Williams, Mr. Charles Eugene	0	18	1	0	1.0	244373	0.0	...	0.0	0.0	1.0	0.0	1.0	0.0	1	1.0	0.0	0.0
19	27.50	7.2250	Masselmani, Mrs. Fatima	0	20	0	0	1.0	2649	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
26	34.00	7.2250	Emir, Mr. Farred Chehab	0	27	1	0	0.0	2631	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
28	17.50	7.8792	O'Dwyer, Miss. Ellen "Nellie"	0	29	0	0	1.0	330959	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
29	19.00	7.8958	Todoroff, Mr. Lalio	0	30	1	0	0.0	349216	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
31	31.50	146.5208	Spencer, Mrs. William Augustus (Marie Eugenie)	0	32	0	1	1.0	PC 17569	0.0	...	1.0	0.0	0.0	1.0	0.0	0.0	2	0.0	1.0	0.0
32	17.50	7.7500	Glynn, Miss. Mary Agatha	0	33	0	0	1.0	335677	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
36	34.00	7.2292	Mamee, Mr. Hanna	0	37	1	0	1.0	2677	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
42	34.00	7.8958	Kraeff, Mr. Theodor	0	43	1	0	0.0	349253	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
45	29.75	8.0500	Rogers, Mr. William John	0	46	1	0	0.0	S.C./A.4. 23567	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
46	30.75	15.5000	Lennon, Mr. Denis	0	47	1	1	0.0	370371	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	2	0.0	1.0	0.0
47	17.50	7.7500	O'Driscoll, Miss. Bridget	0	48	0	0	1.0	14311	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
48	22.50	21.6792	Samaan, Mr. Youssef	0	49	1	2	0.0	2662	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	3	0.0	1.0	0.0
55	51.00	35.5000	Woolner, Mr. Hugh	0	56	1	0	1.0	19947	0.0	...	0.0	0.0	1.0	1.0	0.0	0.0	1	1.0	0.0	0.0
64	55.50	27.7208	Stewart, Mr. Albert A	0	65	1	0	0.0	PC 17605	0.0	...	1.0	0.0	0.0	1.0	0.0	0.0	1	1.0	0.0	0.0
65	13.50	15.2458	Moubarek, Master. Gerios	1	66	1	1	1.0	2661	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	3	0.0	1.0	0.0
76	19.00	7.8958	Staneff, Mr. Ivan	0	77	1	0	0.0	349208	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
77	29.75	8.0500	Moutal, Mr. Rahamin Haim	0	78	1	0	0.0	374746	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
82	17.50	7.7875	McDermott, Miss. Brigdet Delia	0	83	0	0	1.0	330932	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
87	29.75	8.0500	Slocovski, Mr. Selman Francis	0	88	1	0	0.0	SOTON/OQ 392086	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
95	29.75	8.0500	Shorney, Mr. Charles Joseph	0	96	1	0	0.0	374910	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
101	19.00	7.8958	Petroff, Mr. Pastcho ("Pentcho")	0	102	1	0	0.0	349215	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
107	19.00	7.7750	Moss, Mr. Albert Johan	0	108	1	0	1.0	312991	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
109	24.50	24.1500	Moran, Miss. Bertha	0	110	0	1	1.0	371110	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	2	0.0	1.0	0.0
121	29.75	8.0500	Moore, Mr. Leonard Charles	0	122	1	0	0.0	A4. 54510	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
126	19.00	7.7500	McMahon, Mr. Martin	0	127	1	0	0.0	370372	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
128	18.50	22.3583	Peter, Miss. Anna	1	129	0	1	1.0	2668	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	3	0.0	1.0	0.0
140	15.00	15.2458	Boulos, Mrs. Joseph (Sultana)	2	141	0	0	0.0	2678	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	3	0.0	1.0	0.0
154	19.50	7.3125	Olsen, Mr. Ole Martin	0	155	1	0	0.0	Fa 265302	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
268	17.50	8.0500	Howard, Miss. May Elizabeth	0	1160	0	0	NaN	A. 2. 39186	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
271	19.00	7.7500	Fox, Mr. Patrick	0	1163	1	0	NaN	368573	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
273	24.50	15.5000	Lennon, Miss. Mary	0	1165	0	1	NaN	370371	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	2	0.0	1.0	0.0
274	34.00	7.2250	Saade, Mr. Jean Nassr	0	1166	1	0	NaN	2676	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
282	17.50	7.7500	Fleming, Miss. Honora	0	1174	0	0	NaN	364859	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
286	19.50	7.2500	Franklin, Mr. Charles (Charles Fardon)	0	1178	1	0	NaN	SOTON/O.Q. 3101314	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
288	25.00	7.2292	Mardirosian, Mr. Sarkis	0	1180	1	0	NaN	2655	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
289	29.75	8.0500	Ford, Mr. Arthur	0	1181	1	0	NaN	A/5 1478	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
290	22.50	39.6000	Rheims, Mr. George Alexander Lucien	0	1182	1	0	NaN	PC 17607	0.0	...	0.0	0.0	1.0	1.0	0.0	0.0	1	1.0	0.0	0.0
292	34.00	7.2292	Nasr, Mr. Mustafa	0	1184	1	0	NaN	2652	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
297	22.50	21.6792	Samaan, Mr. Hanna	0	1189	1	2	NaN	2662	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	3	0.0	1.0	0.0
301	28.00	15.0458	Malachard, Mr. Noel	0	1193	1	0	NaN	237735	0.0	...	1.0	0.0	0.0	0.0	1.0	0.0	1	1.0	0.0	0.0
304	17.50	7.7500	McCarthy, Miss. Catherine Katie""	0	1196	0	0	NaN	383123	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
312	19.50	7.5750	Sadowitz, Mr. Harry	0	1204	1	0	NaN	LP 1588	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
332	34.00	7.2250	Thomas, Mr. Tannous	0	1224	1	0	NaN	2684	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
339	34.00	7.2292	Betros, Master. Seman	0	1231	1	0	NaN	2622	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
342	41.50	69.5500	Sage, Mr. John George	9	1234	1	1	NaN	CA. 2343	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	11	0.0	0.0	1.0
344	34.50	14.5000	van Billiard, Master. James William	1	1236	1	1	NaN	A/5. 851	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	3	0.0	1.0	0.0
357	19.00	7.8792	Lockyer, Mr. Edward	0	1249	1	0	NaN	1222	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
358	19.00	7.7500	O'Keefe, Mr. Patrick	0	1250	1	0	NaN	368402	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
365	41.50	69.5500	Sage, Mrs. John (Annie Bullen)	9	1257	0	1	NaN	CA. 2343	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	11	0.0	0.0	1.0
366	22.50	14.4583	Caram, Mr. Joseph	0	1258	1	1	NaN	2689	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	2	0.0	1.0	0.0
380	19.00	7.7500	O'Connor, Mr. Patrick	0	1272	1	0	NaN	366713	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
382	17.50	14.5000	Risien, Mrs. Samuel (Emma)	0	1274	0	0	NaN	364498	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
384	33.50	12.8750	Wheeler, Mr. Edwin Frederick""	0	1276	1	0	NaN	SC/PARIS 2159	0.0	...	0.0	0.0	1.0	0.0	1.0	0.0	1	1.0	0.0	0.0
408	17.50	7.7208	Riordan, Miss. Johanna Hannah""	0	1300	0	0	NaN	334915	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
410	17.50	7.7500	Naughton, Miss. Hannah	0	1302	0	0	NaN	365237	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1	1.0	0.0	0.0
413	29.75	8.0500	Spector, Mr. Woolf	0	1305	1	0	NaN	A.5. 3236	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
416	29.75	8.0500	Ware, Mr. Frederick	0	1308	1	0	NaN	359309	0.0	...	0.0	0.0	1.0	0.0	0.0	1.0	1	1.0	0.0	0.0
417	13.50	22.3583	Peter, Master. Michael J	1	1309	1	1	NaN	2668	0.0	...	1.0	0.0	0.0	0.0	0.0	1.0	3	0.0	1.0	0.0

263 rows × 28 columns

We've filled in the missing values now we have to combine the test data back in with the training data, sort by passenger ID and then split the data back into the original training and test sets:

In [41]:

finalcombined = AgeTrain.append(AgeTest).sort_values(by='PassengerId',ascending=1)

In [42]:

finalcombined.isnull().sum()

Out[42]:

Age              0
Fare             0
Name             0
Parch            0
PassengerId      0
Sex              0
SibSp            0
Survived       418
Ticket           0
Cabin_A          0
Cabin_B          0
Cabin_C          0
Cabin_D          0
Cabin_E          0
Cabin_F          0
Cabin_G          0
Cabin_T          0
Cabin_U          0
Embarked_C       0
Embarked_Q       0
Embarked_S       0
Pclass_1         0
Pclass_2         0
Pclass_3         0
Party_Size       0
Solo             0
SmallFam         0
LargeFam         0
dtype: int64

In [43]:

FinalTrain = finalcombined[finalcombined['Survived'].notnull()]
FinalTest = finalcombined[finalcombined['Survived'].isnull()]
FinalTest.tail(5)

Out[43]:

	Age	Fare	Name	Parch	PassengerId	Sex	SibSp	Survived	Ticket	...	Embarked_C	Embarked_S	Pclass_1	Pclass_3	Party_Size	Solo	SmallFam
413	29.75	8.0500	Spector, Mr. Woolf	0	1305	1	0	NaN	A.5. 3236	...	0.0	1.0	0.0	1.0	1	1.0	0.0
414	39.00	108.9000	Oliva y Ocana, Dona. Fermina	0	1306	0	0	NaN	PC 17758	...	1.0	0.0	1.0	0.0	1	1.0	0.0
415	38.50	7.2500	Saether, Mr. Simon Sivertsen	0	1307	1	0	NaN	SOTON/O.Q. 3101262	...	0.0	1.0	0.0	1.0	1	1.0	0.0
416	29.75	8.0500	Ware, Mr. Frederick	0	1308	1	0	NaN	359309	...	0.0	1.0	0.0	1.0	1	1.0	0.0
417	13.50	22.3583	Peter, Master. Michael J	1	1309	1	1	NaN	2668	...	1.0	0.0	0.0	1.0	3	0.0	1.0

5 rows × 28 columns

Training Random Forest Classifier¶

Alright, now the exciting part. We have cleaned up our dataset and are ready to train our first classifier.

In [60]:

from sklearn.ensemble import RandomForestClassifier 
from sklearn.cross_validation import cross_val_score, cross_val_predict

dropcolumns = ["Name","PassengerId","Survived","Ticket", "Party_Size"]
FinalTrainX = FinalTrain.drop(dropcolumns, axis=1)
FinalTrainY = FinalTrain["Survived"]
FinalTestX = FinalTest.drop(dropcolumns, axis=1)

rfc = RandomForestClassifier(n_estimators=1000)
print"The cross validation scores are: {0}".format(cross_val_score(rfc, FinalTrainX, FinalTrainY))
predicted = cross_val_predict(rfc, FinalTrainX, FinalTrainY)
print "Precision: {0}".format(metrics.precision_score(FinalTrainY, predicted))
print "Recall: {0}".format(metrics.recall_score(FinalTrainY, predicted))
print "The F1 score is: {0}".format(metrics.f1_score(FinalTrainY, predicted))

trained_rfc = rfc.fit(FinalTrainX,FinalTrainY)

The cross validation scores are: [ 0.75757576  0.82828283  0.81481481]
Precision: 0.740853658537
Recall: 0.710526315789
The F1 score is: 0.725373134328

In [45]:

importances = trained_rfc.feature_importances_
std = np.std([trained_rfc.feature_importances_ for tree in trained_rfc.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

print FinalTrainX.columns
print 
# Print the feature ranking
print("Feature ranking:")

for f in range(FinalTrainX.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

Index([u'Age', u'Fare', u'Parch', u'Sex', u'SibSp', u'Cabin_A', u'Cabin_B',
       u'Cabin_C', u'Cabin_D', u'Cabin_E', u'Cabin_F', u'Cabin_G', u'Cabin_T',
       u'Cabin_U', u'Embarked_C', u'Embarked_Q', u'Embarked_S', u'Pclass_1',
       u'Pclass_2', u'Pclass_3', u'Solo', u'SmallFam', u'LargeFam'],
      dtype='object')

Feature ranking:
1. feature 0 (0.264983)
2. feature 1 (0.228558)
3. feature 3 (0.220372)
4. feature 19 (0.037564)
5. feature 13 (0.031229)
6. feature 2 (0.029462)
7. feature 4 (0.027172)
8. feature 21 (0.023452)
9. feature 17 (0.020682)
10. feature 16 (0.015897)
11. feature 18 (0.014287)
12. feature 14 (0.012664)
13. feature 22 (0.012559)
14. feature 20 (0.012379)
15. feature 9 (0.009731)
16. feature 15 (0.008369)
17. feature 6 (0.008101)
18. feature 7 (0.007124)
19. feature 8 (0.006985)
20. feature 5 (0.003449)
21. feature 10 (0.003236)
22. feature 11 (0.001214)
23. feature 12 (0.000531)

In [46]:

Tsurvived = trained_rfc.predict(FinalTestX)

In [47]:

Tsurvived

Out[47]:

array([ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,
        0.,  1.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,
        1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,
        1.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,
        1.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,
        1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,
        0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,
        0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,
        1.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,
        0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,
        1.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
        1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,
        0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,
        0.,  1.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,
        1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,  1.,
        0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,
        1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,
        0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,
        0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,
        0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,
        1.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,
        0.,  1.,  1.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,  0.,
        0.,  1.])

In [48]:

finalresults = pd.DataFrame(FinalTest["PassengerId"])
finalresults["Survived"] = Tsurvived.astype(int)
finalresults.head(5)

Out[48]:

	PassengerId	Survived
0	892	0
1	893	0
2	894	0
3	895	1
4	896	0

In [49]:

#finalresults.to_csv("/Titanic Kaggle/PartySize.csv", index = False)

Scored .75120¶

In [50]:

modelresults = FinalTest
modelresults["Survived"] = Tsurvived
sns.factorplot(x="Sex", y="Age", hue="Survived", kind="swarm", data=modelresults)
sns.despine(bottom=True)

/Applications/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

In [51]:

sns.factorplot(x="Sex", y="Age", hue="Survived", kind="swarm", data=traindf, order=['female','male'])
sns.despine(bottom=True)

In [55]:

scaledTrainX = preprocessing.scale(FinalTrainX)
scaledTestX = preprocessing.scale(FinalTestX)
scaledTrainX

Out[55]:

array([[-0.49729484, -0.50244517, -0.47367361, ..., -1.2316449 ,
         1.4322604 , -0.2734756 ],
       [ 0.65162818,  0.78684529, -0.47367361, ..., -1.2316449 ,
         1.4322604 , -0.2734756 ],
       [-0.21006408, -0.48885426, -0.47367361, ...,  0.81192233,
        -0.69819706, -0.2734756 ],
       ..., 
       [-0.88613347, -0.17626324,  2.00893337, ..., -1.2316449 ,
         1.4322604 , -0.2734756 ],
       [-0.21006408, -0.04438104, -0.47367361, ...,  0.81192233,
        -0.69819706, -0.2734756 ],
       [ 0.22078205, -0.49237783, -0.47367361, ...,  0.81192233,
        -0.69819706, -0.2734756 ]])

In [61]:

from sklearn import svm
from sklearn.cross_validation import train_test_split

C = 1  # SVM regularization parameter
X_train, X_test, y_train, y_test = train_test_split(scaledTrainX, FinalTrainY, test_size=0.4, random_state=0)

svc = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X_train, y_train)
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X_train, y_train)
lin_svc = svm.LinearSVC(C=C).fit(X_train, y_train)

print svc.score(X_test, y_test)
print rbf_svc.score(X_test, y_test)
print poly_svc.score(X_test, y_test)
print lin_svc.score(X_test, y_test)

print"The cross validation scores are: {0}".format(cross_val_score(lin_svc, scaledTrainX, FinalTrainY))
lsvc_predicted = cross_val_predict(lin_svc, scaledTrainX, FinalTrainY)
print "Precision: {0}".format(metrics.precision_score(FinalTrainY, lsvc_predicted))
print "Recall: {0}".format(metrics.recall_score(FinalTrainY, lsvc_predicted))
print "The F1 score is: {0}".format(metrics.f1_score(FinalTrainY, lsvc_predicted))

0.792717086835
0.770308123249
0.764705882353
0.803921568627
The cross validation scores are: [ 0.78787879  0.81481481  0.80808081]
Precision: 0.75786163522
Recall: 0.704678362573
The F1 score is: 0.730303030303

In [70]:

lin_svc = svm.LinearSVC(C=C).fit(scaledTrainX, FinalTrainY)
lin_svcpred = lin_svc.predict(scaledTestX)
lin_svcpred

Out[70]:

array([ 0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,
        0.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,
        0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,
        1.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  1.,  0.,  0.,  1.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,
        0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,
        0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,
        1.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,
        1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,
        1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,
        1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,
        0.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,
        0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  1.,
        0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,
        1.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  1.,
        0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,
        1.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,  1.,
        0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,
        0.,  0.,  1.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,
        0.,  0.])

In [71]:

svcresults = pd.DataFrame(FinalTest["PassengerId"])
svcresults["Survived"] = lin_svcpred.astype(int)
svcresults.head(5)

Out[71]:

	PassengerId	Survived
0	892	0
1	893	0
2	894	0
3	895	0
4	896	1

In [72]:

svcresults.to_csv("/Titanic Kaggle/svcResults.csv", index = False)

Scored .75598¶

In [73]:

from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=50, population_size=25, num_cv_folds=5, random_state=42, verbosity=2)

In [74]:

pipeline_optimizer.fit(FinalTrainX, FinalTrainY)
print(pipeline_optimizer.score(FinalTrainX, FinalTrainY))

GP Progress:   2%|▏         | 25/1275 [00:02<33:28,  1.61s/pipeline]

Generation 1 - Current best internal CV score: 0.799677082422

GP Progress:   4%|▍         | 48/1275 [00:00<32:49,  1.60s/pipeline]

Generation 2 - Current best internal CV score: 0.808901681353

GP Progress:   6%|▌         | 71/1275 [00:00<50:08,  2.50s/pipeline]

Generation 3 - Current best internal CV score: 0.808901681353

GP Progress:   8%|▊         | 97/1275 [00:00<1:13:03,  3.72s/pipeline]

Generation 4 - Current best internal CV score: 0.809810772262

GP Progress:   9%|▉         | 120/1275 [00:00<26:16,  1.37s/pipeline]

Generation 5 - Current best internal CV score: 0.809810772262

GP Progress:  12%|█▏        | 148/1275 [00:00<26:21,  1.40s/pipeline]

Generation 6 - Current best internal CV score: 0.809810772262

GP Progress:  14%|█▎        | 174/1275 [00:00<20:13,  1.10s/pipeline]

Generation 7 - Current best internal CV score: 0.815118250005

GP Progress:  16%|█▌        | 201/1275 [04:57<19:14,  1.07s/pipeline]

Generation 8 - Current best internal CV score: 0.815118250005

GP Progress:  18%|█▊        | 224/1275 [00:00<33:07,  1.89s/pipeline]

Generation 9 - Current best internal CV score: 0.815118250005

GP Progress:  19%|█▉        | 248/1275 [00:00<50:03,  2.92s/pipeline]

Generation 10 - Current best internal CV score: 0.815118250005

GP Progress:  21%|██▏       | 273/1275 [00:00<22:52,  1.37s/pipeline]

Generation 11 - Current best internal CV score: 0.815118250005

GP Progress:  23%|██▎       | 296/1275 [00:00<22:08,  1.36s/pipeline]

Generation 12 - Current best internal CV score: 0.815118250005

GP Progress:  25%|██▌       | 323/1275 [00:00<21:20,  1.34s/pipeline]

Generation 13 - Current best internal CV score: 0.815118250005

GP Progress:  27%|██▋       | 348/1275 [00:00<25:29,  1.65s/pipeline]

Generation 14 - Current best internal CV score: 0.815118250005

GP Progress:  29%|██▉       | 370/1275 [00:00<21:07,  1.40s/pipeline]

Generation 15 - Current best internal CV score: 0.815118250005

GP Progress:  31%|███       | 397/1275 [00:00<35:23,  2.42s/pipeline]

Generation 16 - Current best internal CV score: 0.815118250005

GP Progress:  33%|███▎      | 423/1275 [00:00<28:48,  2.03s/pipeline]

Generation 17 - Current best internal CV score: 0.815118250005

GP Progress:  35%|███▌      | 448/1275 [00:00<34:20,  2.49s/pipeline]

Generation 18 - Current best internal CV score: 0.815118250005

GP Progress:  37%|███▋      | 472/1275 [00:00<37:07,  2.77s/pipeline]

Generation 19 - Current best internal CV score: 0.815118250005

GP Progress:  39%|███▉      | 498/1275 [00:00<20:31,  1.58s/pipeline]

Generation 20 - Current best internal CV score: 0.815350163926

GP Progress:  41%|████      | 521/1275 [00:00<18:49,  1.50s/pipeline]

Generation 21 - Current best internal CV score: 0.815350163926

GP Progress:  43%|████▎     | 548/1275 [00:00<16:29,  1.36s/pipeline]

Generation 22 - Current best internal CV score: 0.815350163926

GP Progress:  45%|████▍     | 572/1275 [00:00<1:26:28,  7.38s/pipeline]

Generation 23 - Current best internal CV score: 0.815350163926

GP Progress:  47%|████▋     | 598/1275 [00:00<15:02,  1.33s/pipeline]

Generation 24 - Current best internal CV score: 0.815350163926

GP Progress:  49%|████▉     | 624/1275 [00:00<26:05,  2.40s/pipeline]

Generation 25 - Current best internal CV score: 0.815350163926

GP Progress:  51%|█████     | 649/1275 [00:00<16:41,  1.60s/pipeline]

Generation 26 - Current best internal CV score: 0.815350163926

GP Progress:  53%|█████▎    | 673/1275 [00:00<13:48,  1.38s/pipeline]

Generation 27 - Current best internal CV score: 0.815350163926

GP Progress:  55%|█████▍    | 697/1275 [00:00<13:21,  1.39s/pipeline]

Generation 28 - Current best internal CV score: 0.815350163926

GP Progress:  57%|█████▋    | 722/1275 [00:00<17:04,  1.85s/pipeline]

Generation 29 - Current best internal CV score: 0.815350163926

GP Progress:  59%|█████▊    | 749/1275 [00:00<11:53,  1.36s/pipeline]

Generation 30 - Current best internal CV score: 0.815350163926

GP Progress:  61%|██████    | 772/1275 [00:00<13:26,  1.60s/pipeline]

Generation 31 - Current best internal CV score: 0.815350163926

GP Progress:  63%|██████▎   | 798/1275 [00:00<12:27,  1.57s/pipeline]

Generation 32 - Current best internal CV score: 0.815350163926

GP Progress:  64%|██████▍   | 822/1275 [00:00<11:48,  1.56s/pipeline]

Generation 33 - Current best internal CV score: 0.815350163926

GP Progress:  66%|██████▋   | 847/1275 [00:00<28:53,  4.05s/pipeline]

Generation 34 - Current best internal CV score: 0.815350163926

GP Progress:  69%|██████▊   | 874/1275 [00:00<08:55,  1.34s/pipeline]

Generation 35 - Current best internal CV score: 0.815350163926

GP Progress:  71%|███████   | 899/1275 [00:00<08:14,  1.32s/pipeline]

Generation 36 - Current best internal CV score: 0.815350163926

GP Progress:  72%|███████▏  | 921/1275 [00:00<34:48,  5.90s/pipeline]

Generation 37 - Current best internal CV score: 0.815350163926

GP Progress:  74%|███████▍  | 948/1275 [00:00<07:24,  1.36s/pipeline]

Generation 38 - Current best internal CV score: 0.815350163926

GP Progress:  76%|███████▌  | 972/1275 [00:00<08:09,  1.62s/pipeline]

Generation 39 - Current best internal CV score: 0.815350163926

GP Progress:  78%|███████▊  | 1000/1275 [00:00<33:08,  7.23s/pipeline]

Generation 40 - Current best internal CV score: 0.815350163926

GP Progress:  80%|████████  | 1024/1275 [00:00<11:05,  2.65s/pipeline]

Generation 41 - Current best internal CV score: 0.815350163926

GP Progress:  82%|████████▏ | 1048/1275 [00:00<06:32,  1.73s/pipeline]

Generation 42 - Current best internal CV score: 0.815350163926

GP Progress:  84%|████████▍ | 1071/1275 [00:00<04:33,  1.34s/pipeline]

Generation 43 - Current best internal CV score: 0.815350163926

GP Progress:  86%|████████▌ | 1095/1275 [00:00<10:44,  3.58s/pipeline]

Generation 44 - Current best internal CV score: 0.815350163926

GP Progress:  88%|████████▊ | 1125/1275 [00:00<03:40,  1.47s/pipeline]

Generation 45 - Current best internal CV score: 0.815350163926

GP Progress:  90%|████████▉ | 1145/1275 [00:00<04:19,  2.00s/pipeline]

Generation 46 - Current best internal CV score: 0.815350163926

GP Progress:  92%|█████████▏| 1174/1275 [00:00<01:37,  1.04pipeline/s]

Generation 47 - Current best internal CV score: 0.815350163926

GP Progress:  94%|█████████▍| 1199/1275 [00:00<02:30,  1.98s/pipeline]

Generation 48 - Current best internal CV score: 0.815350163926

GP Progress:  96%|█████████▌| 1223/1275 [00:00<01:26,  1.66s/pipeline]

Generation 49 - Current best internal CV score: 0.815350163926

GP Progress:  98%|█████████▊| 1249/1275 [00:00<00:54,  2.10s/pipeline]

Generation 50 - Current best internal CV score: 0.815350163926

Best pipeline: GradientBoostingClassifier(ZeroCount(StandardScaler(input_matrix)), 0.23999999999999999, 0.69000000000000006)
0.959999041319

In [75]:

print(pipeline_optimizer.score(FinalTrainX, FinalTrainY))
pipeline_optimizer.export('/Titanic Kaggle/tpot_Titanic_pipeline.py')

0.959999041319

In [76]:

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from tpot.operators.preprocessors import ZeroCount

exported_pipeline = make_pipeline(
    StandardScaler(),
    ZeroCount(),
    GradientBoostingClassifier(learning_rate=0.24, max_features=0.24, n_estimators=500)
)

exported_pipeline.fit(FinalTrainX, FinalTrainY)
Tpotresults = exported_pipeline.predict(FinalTestX)

In [79]:

finalTpotresults = pd.DataFrame(FinalTest["PassengerId"])
finalTpotresults["Survived"] = Tpotresults.astype(int)
finalTpotresults.head(5)

Out[79]:

	PassengerId	Survived
0	892	0
1	893	0
2	894	0
3	895	0
4	896	1

In [80]:

finalresults.to_csv("/Titanic Kaggle/Tpot.csv", index = False)