Name: Guillaume De Gani
The purpose of this notebook is to study fall detection using floor sensors and to select the right model to predict whether or not the subject fell, which in the future could help the elderly in their daily lives.
  | obs | raw_feat_X1 | raw_feat_X2 | raw_feat_X3 | raw_feat_X4 | raw_feat_X5 | raw_feat_X6 | raw_feat_X7 | raw_feat_X8 | raw_feat_X9 | ... | deriv_feat_X21 | deriv_feat_X22 | deriv_feat_X23 | deriv_feat_X24 | deriv_feat_X25 | deriv_feat_X26 | deriv_feat_X27 | deriv_feat_X28 | deriv_feat_X29 | FALL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0.249744 | -0.162770 | 0.223727 | 0.393904 | -0.154366 | 0.128968 | 1.090661 | 0.913849 | 0.505526 | ... | 0.121241 | 0.734862 | 0.179370 | 0.402461 | 0.638393 | 0.344236 | 0.823239 | -0.409350 | 1.425206 | 1 |
1 | 1 | 0.385843 | -0.660978 | -0.127798 | -0.205710 | -0.160936 | 0.111606 | 0.171391 | 2.889781 | 0.377333 | ... | -0.182778 | 0.357499 | -0.056181 | 0.840313 | 0.605672 | 0.655029 | 1.052671 | -0.177353 | 1.613721 | 1 |
2 | 2 | 3.344528 | -4.535931 | 0.165140 | -0.228745 | 3.203818 | 3.379462 | 1.089901 | 2.097552 | 0.877990 | ... | 0.425260 | 8.093449 | -0.684318 | 0.046744 | 3.440332 | 3.965586 | 2.916183 | 0.367674 | 3.952479 | 1 |
3 | 3 | 3.190676 | -2.884463 | -1.153080 | -0.698292 | 1.868221 | 2.493077 | 2.546198 | 3.817391 | 3.711000 | ... | -4.743065 | -0.774592 | -1.076903 | -0.818687 | 3.572430 | 3.409429 | 2.407953 | 1.233629 | 2.702845 | 1 |
4 | 4 | 2.338575 | -2.699941 | -0.069211 | -0.025849 | 1.420714 | 2.137326 | 1.097388 | 2.101987 | 1.200319 | ... | -2.766941 | 0.168817 | -1.116162 | -1.640847 | 1.527936 | 2.215856 | 2.353429 | 0.721413 | 2.933661 | 1 |
5 rows × 89 columns
As you can see above, the dataset has 87 features plus the target column FALL. The 87 features are separated into 3 categories, each going from X1 to X29:

raw
: The raw data extracted from the sensors

deriv
: The derivative of the data from the sensors

fft
: The energy of the signals

FALL
: The label of the data, which indicates whether or not the person fell

We start by initializing the data we need for the different models. To do so, the data is split into two sets:
X
: The feature matrix

Y
: The label vector

After splitting the data into the two sets X and Y, we can apply train_test_split, which is a quick and easy way to divide the data into a training set and a testing set so we can properly check the accuracy of our model. In this case we set aside 20% of the data to test it after fitting the various models.
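The split described above can be sketched as follows. Since the notebook's loading code isn't shown, the feature matrix here is stand-in random data with the dataset's shape (2821 observations, 87 features, ~7% falls); the variable names X and Y match the text, and everything else is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape and class balance as the sensor dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(2821, 87))              # feature matrix
Y = (rng.random(2821) < 0.0716).astype(int)  # label vector, ~7% falls

# Hold out 20% for testing; stratify keeps the fall/no-fall ratio in both sets.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Stratifying matters here: with only ~7% positives, a plain random split could leave the test set with very few falls.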
The data count is the following:

2619
: instances of label 0, representing a Not Fall

202
: instances of label 1, representing a Fall

Before starting the analysis it is important to note that the data is fairly imbalanced, i.e. only 7.16% of the data represents a fall. This means that using accuracy is unwise, since a model could easily reach 93% accuracy by labeling all data as 0. For this reason it is preferable to use the F1 score, which focuses on the positive class and avoids this issue.
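The accuracy pitfall above is easy to demonstrate: a degenerate "classifier" that always predicts 0 (no fall) scores ~93% accuracy on these class counts but an F1 of 0, since it never detects a single fall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 2619 + [1] * 202)  # the class counts given above
y_pred = np.zeros_like(y_true)             # always predict "no fall"

acc = accuracy_score(y_true, y_pred)  # ~0.93, despite being useless
f1 = f1_score(y_true, y_pred)         # 0.0: no fall is ever detected
print(acc, f1)
```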
In this section different models will be tested, and for each classifier a confusion matrix will be plotted. This gives important information, notably on the true positive rate (TPR) of the model, which is the most important metric when studying imbalanced data.
Here a list of the models that were tested and compared:
Random Forest
Support Vector Machine
Naive Bayesian
Decision Tree
K-Nearest Neighbors
AdaBoost
Gradient Boosting
Bagging
The first model tested is Random Forest, for which we plot the confusion matrix; this is helpful to see how accurate the model is by checking that the diagonal has high values.
The result for the first model gives an F1 score of 93.3% and a precision of 95%. These results are promising given the amount of data that was used for training.
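The Random Forest step can be sketched as below. The notebook plots the matrix with ConfusionMatrixDisplay; here we just compute it, on stand-in synthetic data with the same shape and imbalance as the sensor dataset (the data generator and variable names are assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Stand-in imbalanced dataset: ~93% "no fall", 87 features.
X, Y = make_classification(n_samples=2821, n_features=87,
                           weights=[0.93], random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=42)

clf = RandomForestClassifier(random_state=0).fit(X_train, Y_train)
pred = clf.predict(X_test)

# Rows = true labels, columns = predicted labels; a strong diagonal
# means the classifier is right on both classes.
cm = confusion_matrix(Y_test, pred)
rf_f1 = f1_score(Y_test, pred)
print(cm)
print(rf_f1)
```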
The next model tested was a support vector machine with three different kernels:

rbf
: Exponential kernel

poly
: Second degree polynomial kernel

linear
: Linear kernel

Once again all the models have fairly high accuracy and F1 score; however, this was done without cross validation, so the particular data selected might be the reason for these high values.
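The three-kernel comparison can be sketched as below, again on stand-in synthetic data; the "poly" kernel is restricted to degree 2 as described in the text.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in imbalanced dataset, same shape as the sensor data.
X, Y = make_classification(n_samples=2821, n_features=87,
                           weights=[0.93], random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=42)

scores = {}
for kernel in ("rbf", "poly", "linear"):
    clf = SVC(kernel=kernel, degree=2)  # degree only affects "poly"
    clf.fit(X_train, Y_train)
    scores[kernel] = f1_score(Y_test, clf.predict(X_test))
print(scores)
```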
The next plot shows the confusion matrices for another set of classifiers, once again as a quick sanity check to see if any of them give absurd results.
Every confusion matrix has a diagonal that is close to one, which indicates that these models are fairly accurate.
After testing the different models to check whether any of them were giving absurd results, it is important to compare them to see which model performs best. To do so, we need different ways to visualize the scoring parameters of the classifiers tested.
Receiver Operating Characteristic curve (ROC)
In this part the ROC curve is used to evaluate the ability of each binary classifier to find true positives. In the bottom left corner of the graphs it is possible to compare the different AUC (Area Under the Curve) values; the AUC is equivalent to the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
The closer it is to 1, the better the model is at classifying the data. In this case every model has an AUC above 0.90, which seems very promising.
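Computing one such curve and its AUC can be sketched as below, again on stand-in synthetic data with the dataset's shape and imbalance (the generator and names are assumptions, not the notebook's actual code).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Stand-in imbalanced dataset.
X, Y = make_classification(n_samples=2821, n_features=87,
                           weights=[0.93], random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=42)

clf = RandomForestClassifier(random_state=0).fit(X_train, Y_train)
proba = clf.predict_proba(X_test)[:, 1]  # score for the positive class

fpr, tpr, _ = roc_curve(Y_test, proba)   # points of the ROC curve
auc = roc_auc_score(Y_test, proba)       # P(positive ranked above negative)
print(round(auc, 3))
```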
K-Fold cross validation
Previously the models were evaluated without cross validation; to get better results, it is best practice to measure the different metrics by averaging over a K-fold split.
The following data is obtained using 10 folds.
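Averaging the F1 score over 10 stratified folds can be sketched as follows; the real notebook runs this for every classifier in the table, while this sketch uses one classifier on stand-in synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in imbalanced dataset, same shape as the sensor data.
X, Y = make_classification(n_samples=2821, n_features=87,
                           weights=[0.93], random_state=0)

# Stratified folds keep the ~7% fall rate in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                         X, Y, cv=cv, scoring="f1")
print(f"{scores.mean():.2f} ± {scores.std():.2f}")
```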
The table below compares the different scoring metrics for each model; it indicates that the best performing model regarding the F1 score is Random Forest.
However, the goal of this technology is to detect whether or not a person fell, and it is important to ask ourselves how dangerous it is to miss a fall. If the objective is to minimize this error, then Naive Bayes has the highest fall detection accuracy.
Classifier | Average F1 score | Accuracy of Fall detection |
---|---|---|
Random Forest | 0.93 ± 0.03 | 0.90 |
SVM using rbf | 0.93 ± 0.03 | 0.88 |
SVM using poly | 0.92 ± 0.04 | 0.86 |
SVM using linear | 0.92 ± 0.04 | 0.91 |
Naive Bayes | 0.78 ± 0.04 | 0.94 |
Decision Tree | 0.88 ± 0.06 | 0.88 |
Logistic Regression | 0.93 ± 0.03 | 0.90 |
KNN | 0.91 ± 0.04 | 0.87 |
Ada Boost | 0.90 ± 0.03 | 0.87 |
Gradient Boosting | 0.91 ± 0.03 | 0.89 |
Bagging | 0.92 ± 0.04 | 0.87 |
Another way to visualize this is with bar plots; before plotting the data we sort the classifiers by their mean F1 score.
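The sorting step can be sketched as below, using the mean F1 values from the table (classifier names lightly normalized); the bar plotting itself is omitted here.

```python
# Mean F1 scores from the cross-validation table above.
mean_f1 = {
    "Random Forest": 0.93, "SVM using rbf": 0.93, "SVM using poly": 0.92,
    "SVM using linear": 0.92, "Naive Bayes": 0.78, "Decision Tree": 0.88,
    "Logistic Regression": 0.93, "KNN": 0.91, "Ada Boost": 0.90,
    "Gradient Boosting": 0.91, "Bagging": 0.92,
}

# Sort classifier names from best to worst mean F1 before plotting.
ranked = sorted(mean_f1, key=mean_f1.get, reverse=True)
print(ranked)
```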
This plot confirms our previous assessment that overall Random Forest seems to perform better than the other models regarding the F1 score, even though most perform in a similar manner.
The data given was fairly imbalanced towards the first label, which represents the patient not falling. This made it challenging to find a model that performed well overall without missing too many falls, which, if applied in real life, could be problematic.
It seems that the best performing model overall is Random Forest, while for a high level of precision regarding fall detection, Naive Bayes gives good results.
Finally, a question can be asked about using transfer learning, since this data was obtained from fairly young test subjects (aged 25 to 45) while the system is meant to be applied to seniors, i.e. above 65. It is likely that their walking and falling patterns could differ slightly, for example in their speed. And for obvious reasons, gathering data by asking seniors to fall on purpose seems highly unethical.
Another question concerns the impact of walking aids like canes and walkers on the data received by the floor sensors. This could be an important factor to consider since in the US 16.4% of seniors use a cane and 11.6% use a walker.