After taking several Data Science and Machine Learning online courses (for more on this see my previous post), in January 2020 I decided to create my own projects with real-world data. I had already decided that I would be using a Random Forest Classifier (RFC) and was looking for a suitable data set for my project. I had signed up with Kaggle a couple of weeks earlier and did a "Classification data" search there. One of the data sets which caught my attention was "Toddler Autism dataset July 2018.csv" (https://www.kaggle.com/fabdelja/autism-screening-for-toddlers).
The data had a usability rating of 5.9. The description, posted below, appealed to me, and I decided to try it with an RFC.
Data set description from Kaggle:
- The data set was developed by Dr Fadi Fayez Thabtah (fadifayez.com) using a mobile app called ASDTests (ASDtests.com) to screen autism in toddlers. See the description file attached with the CSV data to know more about the variables and the class. This data can be used for descriptive and predictive analyses such as classification, clustering, regression, etc. You may use it to estimate the predictive power of machine learning techniques in detecting autistic traits.
Below I am not going to provide the complete code of my project; I will provide only snippets of code as necessary. The goal of my post is to draw attention to the data set itself and to emphasize the need for critical thinking (especially for beginners) when using unknown data sets.
Usually, in my projects I follow the steps outlined below:
- Brief comment section with project’s description
- Import libraries
- Read data
- Perform Exploratory Data Analysis (EDA)
- Data preprocessing
- Impute nulls if applicable
- Create dummy variables for categorical variables
- Select features, X, and a target, y
- Split the above in Train and Test sets
- Scale if necessary
- Create and apply model
- Examine results
- Depending on the results, apply cross-validation and/or hyperparameter tuning if necessary
For this particular data set, however, I wanted to obtain the results quickly and to take a look at them, so I decided to skip the EDA step.
Here is roughly how the project progressed:
1) Import libraries (code below)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.5)
2) Read data
data = pd.read_csv('Toddler Autism dataset July 2018.csv')
data.head()
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054 entries, 0 to 1053
Data columns (total 19 columns):
Case_No 1054 non-null int64
A1 1054 non-null int64
A2 1054 non-null int64
A3 1054 non-null int64
A4 1054 non-null int64
A5 1054 non-null int64
A6 1054 non-null int64
A7 1054 non-null int64
A8 1054 non-null int64
A9 1054 non-null int64
A10 1054 non-null int64
Age_Mons 1054 non-null int64
Qchat-10-Score 1054 non-null int64
Sex 1054 non-null object
Ethnicity 1054 non-null object
Jaundice 1054 non-null object
Family_mem_with_ASD 1054 non-null object
Who completed the test 1054 non-null object
Class/ASD Traits 1054 non-null object
dtypes: int64(13), object(6)
memory usage: 156.6+ KB
3) Data preprocessing
There were no nulls. A quick examination showed that the sum of columns A1 to A10 gives the value of the Qchat-10-Score column, so all A-columns were dropped. In addition, the columns "Case_No", "Ethnicity", and "Who completed the test" were also dropped.
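A minimal sketch of that check and the column drops (column names taken from the data.info() output above; my exact code may have differed slightly):
a_cols = ['A' + str(i) for i in range(1, 11)]
# confirm that A1..A10 sum to the Qchat-10-Score column
print((data[a_cols].sum(axis=1) == data['Qchat-10-Score']).all())  # expect True
# drop the redundant A-columns plus Case_No, Ethnicity and "Who completed the test"
data = data.drop(columns=a_cols + ['Case_No', 'Ethnicity', 'Who completed the test'])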
After that, I converted all categorical variables ('Sex', 'Jaundice', 'Family_mem_with_ASD', 'Class/ASD Traits') to dummy numeric variables and replaced the original categorical columns in the data with the new numerical ones.
Code:
dummy_sex = pd.get_dummies(data['Sex'], drop_first = True) # dropping one of the resulting columns to avoid collinearity
dummy_ja = pd.get_dummies(data['Jaundice'], drop_first = True)
dummy_fam = pd.get_dummies(data['Family_mem_with_ASD'], drop_first = True)
dummy_outcome = pd.get_dummies(data['Class/ASD Traits '], drop_first = True)
data = pd.concat([data.iloc[:, 0:2], dummy_sex, dummy_ja, dummy_fam, dummy_outcome], axis = 1)
4) Select features, X, and target, y
X = data.iloc[:, :-1].values # all data columns but last
y = data.iloc[:, -1].values # last data column Class/ASD Traits
5) Split in Train and Test sets
# no scaling necessary, so split in train/test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
6) Create model, fit, predict
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 300, n_jobs = -1, random_state = 0)
rfc.fit(X_train, y_train)
y_pred_1 = rfc.predict(X_test)
7) Compare predictions with y_test
from sklearn.metrics import confusion_matrix, classification_report
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred_1))
print('\n')
print('Classification Report:')
print(classification_report(y_test, y_pred_1))
Output:
This was probably the first time I had gotten 100% accuracy in any of my projects to date, and it raised a big red flag. I gave the results the benefit of the doubt as a lucky random occurrence and decided to run cross-validation.
Code:
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator = rfc, X = X, y = y, scoring = 'f1_macro', cv = 10)
print('All Accuracies:')
print(all_accuracies)
print('\n')
print('mean accuracy: ', round(np.mean(all_accuracies), 3))
print('std: ', round(np.std(all_accuracies), 3))
Output:
All Accuracies: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
mean accuracy: 1.0
std: 0.0
This was clearly not a normal outcome, with all 10 cross-validation folds providing 100% accuracy and zero variance. My only explanation at this point was that there was something wrong with the data. I decided to run one more check to see which of the variables had the highest importance score.
Code:
feature_imp = pd.Series(rfc.feature_importances_, index=data.iloc[:, :-1].columns).sort_values(ascending=False)
feature_imp
Output:
Qchat-10-Score 0.963038
Age_Mons 0.024048
m 0.007382
yes 0.003743
yes 0.001790
dtype: float64
The result revealed that the outcome is practically determined by the Qchat-10-Score alone. So, I decided to take a look at the Qchat-10-Score values for kids with and without autism and got the following:
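Here is a minimal sketch of how such a plot could be produced. It re-reads the raw CSV so the class labels are still text; the column names (including the trailing space in 'Class/ASD Traits ') come from the data set, while the plotting choices are my own reconstruction:
# re-read the original CSV so the class labels are still 'Yes'/'No'
raw = pd.read_csv('Toddler Autism dataset July 2018.csv')
# quick sanity check of the score range within each class
print(raw.groupby('Class/ASD Traits ')['Qchat-10-Score'].agg(['min', 'max']))
# scatter the scores by class
plt.figure(figsize=(8, 5))
sns.stripplot(x='Class/ASD Traits ', y='Qchat-10-Score', data=raw, jitter=True)
plt.title('Qchat-10-Score by class')
plt.show()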
Any toddler with a score of four and above had autism; all kids with a score below four did not. It was no wonder that the outcome was determined by this variable only and the model accuracy was 100%. With this type of data one would not even need a model; the outcome is quite obvious just by looking at the plot above.
That's when I decided to do some quick research on tests for autism in toddlers in the medical literature. I could not find tests using the Qchat-10-Score. However, my search returned the Qchat test with 25 items as a standard test (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484636/ for example). I assume that the Qchat-10 score is a subset of the Qchat-25 score. The research articles pointed out that the histogram of the Qchat-25 score in autistic kids follows a normal distribution. The distribution for non-autistic children is also normal, with the two distributions overlapping.
Example of Qchat-score distribution from above source:
Thus, I decided to plot the histograms of the Qchat-10-score from the data. Assuming that the Qchat-10 score is a subset of the general Qchat-25 score, it is reasonable to expect the distributions from this data set to also be close to normal.
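A sketch of how these histograms could be produced, reusing the re-read raw DataFrame from the previous snippet (the bin choice is mine, not from the original post):
# overlay the Qchat-10-Score histograms for the two classes
plt.figure(figsize=(8, 5))
for label in ['Yes', 'No']:
    subset = raw.loc[raw['Class/ASD Traits '] == label, 'Qchat-10-Score']
    plt.hist(subset, bins=range(0, 12), alpha=0.6, label=label)
plt.xlabel('Qchat-10-Score')
plt.ylabel('Number of toddlers')
plt.legend(title='ASD Traits')
plt.show()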
However, as already expected from the previous plot of the Qchat-10-score, neither histogram resembled a normal distribution, and there was no overlap between them. Thus, based on the results of my analysis, the only reasonable conclusion was that the data set "Toddler Autism dataset July 2018.csv" posted on Kaggle is not valid. If this were typical real-world data, it would be quite easy to detect autism in kids, and in reality it is not.
I did not contact the Kaggle team at the time, and I still haven't done that. There was only one discussion on Kaggle about this data set, with somebody asking which features should be used in the model and whether one should expect 100% accuracy. The author seemingly did not understand the question, because the answer he provided was inadequate. There was no mention of the possibility of the data being invalid. At that time, I was new to Kaggle and did not know how to approach the issue. I did not want to offend the contributor or, possibly, get into drawn-out back-and-forth arguments about the validity of the data and of my analyses, even though I was confident in them.
Three months later, having decided to start a blog, I chose to share my findings here to alert other Data Science enthusiasts to the possibility that data from various sources may be invalid.
In conclusion, I would like to make several points regarding the use of data from different sources:
- People contributing to Kaggle or to any other data repository should be responsible for making sure that the data they provide is accurate and valid.
- It would be great if there were some mechanism in place for vetting whether data is valid before it is posted. I know this is probably impossible to implement, so maybe the first point is what it comes down to.
- To anybody who is new to the field of Data Science and Machine Learning: do not assume that data is fine just because you downloaded it from Kaggle. Use critical thinking. If the results of your analysis look too good to be true or don't seem to make sense, dig deeper and do more research until you are satisfied with your answers. If you are still not satisfied and have doubts, ask more experienced people, post the question on a forum, …
Good luck!