Titanic notebook

The purpose of this notebook is to study the Titanic dataset and select relevant features in order to predict whether someone survived the shipwreck.

This notebook will use functions from the _titanic module to preprocess the data.

Module and data import
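A minimal loading sketch follows. The file name train.csv and the use of pandas are assumptions (the notebook's own loading cell is not shown here); a tiny inline sample of real Titanic rows keeps the snippet self-contained.

```python
import io

import pandas as pd

# In the notebook the data would come from disk, e.g.:
# df = pd.read_csv("train.csv")  # path is an assumption
# An inline sample makes this sketch runnable on its own:
sample_csv = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
"""
df = pd.read_csv(io.StringIO(sample_csv))

print(df.shape)               # (rows, columns)
print(df.columns.tolist())    # 11 features plus the Survived target
```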

First look at the data

As you can see above, the dataset has 11 features and the target feature "Survived". The 11 features are as follows: PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.

Creating Models

We can make a first prediction using three of the features: SibSp, Parch, Fare. Indeed, for the first two features, we can assume that the more family relations a passenger had on the ship, the more likely they were to survive. For the last feature, we can assume that the more expensive the ticket, the wealthier the passenger and the higher their probability of survival.
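The baseline described above can be sketched as a logistic regression on those three columns. The model choice and the synthetic stand-in data below are assumptions for illustration; the real notebook would fit on the actual DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (column names match the dataset).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "SibSp": rng.integers(0, 4, n),
    "Parch": rng.integers(0, 3, n),
    "Fare": rng.uniform(5, 100, n),
})
# Fabricated target for the demo: higher fares survive more often.
df["Survived"] = (df["Fare"] + rng.normal(0, 20, n) > 50).astype(int)

features = ["SibSp", "Parch", "Fare"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"baseline accuracy: {score:.2f}")
```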

This prediction is far from accurate, but it is a first model upon which we can add features.

In order to choose other features, one method is to use a correlation matrix to see which feature is correlated to survival.
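A sketch of that method, assuming a DataFrame named df: take the correlation of each numeric column with Survived and sort by absolute value. The small sample frame here is illustrative only.

```python
import pandas as pd

# Small numeric sample with the dataset's column names (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass":   [3, 1, 2, 3, 1, 3, 2, 1],
    "Age":      [22, 38, 26, 35, 28, 40, 19, 30],
    "Fare":     [7.3, 71.3, 13.0, 8.1, 52.0, 7.9, 10.5, 80.0],
})

# Correlation of every numeric feature with the target, sorted by magnitude.
corr = df.corr(numeric_only=True)["Survived"].drop("Survived")
print(corr.reindex(corr.abs().sort_values(ascending=False).index))
```

Note that Pclass correlates negatively with survival: a lower class number (first class) goes with a higher survival rate.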

Among the features not yet selected, the two with the highest absolute correlation value are Pclass and Age.

Let us study these two features and the impact they have on our current model.

As seen in the previous plot, the Pclass feature has a great impact on whether someone will survive. Indeed, passengers in first class are more likely to survive than passengers in third class.

Thus, we add Pclass to our model.

Adding the Pclass feature only slightly increases the score.

Firstly, we check whether there are missing (null) values.
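The check can be done with pandas' isnull, as sketched below on an illustrative frame with gaps in the Age column (the real notebook would run this on the full DataFrame).

```python
import numpy as np
import pandas as pd

# Illustrative frame with missing ages, mirroring the dataset.
df = pd.DataFrame({
    "Age":  [22.0, np.nan, 26.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

null_counts = df.isnull().sum()   # missing values per column
null_ratio = df.isnull().mean()   # fraction of missing values per column
print(null_counts)
print(null_ratio)
```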

Since approximately a fifth of the values are null, the first step to use the Age feature is to fill these values.

As seen in the correlation matrix, Age is most correlated with Pclass. Thus, we fill the missing Age values with the median age of each Pclass.
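A minimal sketch of that imputation, using a per-group median broadcast back onto the rows with groupby/transform:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [38.0, 40.0, np.nan, 22.0, 24.0, np.nan],
})

# Median age per class, aligned row-by-row, then used to fill the gaps.
df["Age"] = df["Age"].fillna(df.groupby("Pclass")["Age"].transform("median"))
print(df)
```

Here the missing first-class age becomes 39.0 (median of 38 and 40) and the missing third-class age becomes 23.0.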

We can now add the Age feature to the model.

Adding both features significantly improved the score. However, looking at the relative distribution of the Age feature, one can see that age categories would be more appropriate than using the raw age values.
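Binning ages into categories can be done with pandas' cut. The bin edges and labels below are hypothetical; the notebook's actual cut-offs are a modelling choice not shown in this extract.

```python
import pandas as pd

ages = pd.Series([4, 15, 22, 38, 52, 71], name="Age")

# Hypothetical bin edges; intervals are right-closed by default.
bins = [0, 12, 18, 35, 60, 120]
labels = ["child", "teen", "young_adult", "adult", "senior"]
age_cat = pd.cut(ages, bins=bins, labels=labels)
print(age_cat.tolist())
```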

Another supposition is that a passenger's sex had an impact on whether they survived: women would have been saved more frequently than men. Let us check this assumption.

Considering the disparity in survival rates between the two sexes, the feature will be added to the model.

However, the current Sex feature is not numerical. We must create a feature is_male whose value is 0 if the passenger is female and 1 if male.
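The encoding described above is a one-liner in pandas:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# 1 for male, 0 for female, as described above.
df["is_male"] = (df["Sex"] == "male").astype(int)
print(df)
```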

Among the remaining unused features, let us examine the Name feature and, more particularly, each passenger's title.

Looking at these titles, we can assume passengers with titles "Dr", "Master", or "the Countess" were more likely to be saved because they were deemed more important.
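One way to pull the titles out, assuming names follow the dataset's "Last, Title. First" pattern, is a regular-expression extract; this is a sketch, not necessarily the notebook's exact method.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Behr, Mr. Karl Howell",
])

# The title sits between the comma and the following period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```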

Comparing logistic regression to random forest results
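A fair comparison of the two model families uses the same cross-validation splits for both, as in the sketch below. The synthetic feature matrix stands in for the engineered Titanic features; the models and 5-fold setup are assumptions about the notebook's protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the engineered feature matrix (illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 300) > 0).astype(int)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    # Same data, same 5 folds, same metric for both models.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```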

After numerous tries, during which we dummified the selected features and added others, we were unable to achieve a higher score.

Conclusion

The study of this dataset allowed us to choose features in order to predict the survival of a passenger.

In this study, the following features were chosen: SibSp, Parch, Fare, Pclass, Age, Sex, Embarked, and the title extracted from Name.

Among these features, we dummified Sex and Embarked, and we created categories for Age and for the titles extracted from Name.
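Dummification (one-hot encoding) of the categorical columns can be done with pandas' get_dummies, sketched here on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex":      ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# One-hot encode ("dummify") the two categorical columns.
dummies = pd.get_dummies(df, columns=["Sex", "Embarked"])
print(dummies.columns.tolist())
```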

Our study led to a prediction score of 85%.