After taking several Data Science and Machine Learning online courses (for more on this see my previous post), in January 2020 I decided to create my own projects with real-world data. I had already decided that I would be using a Random Forest Classifier (RFC) and was looking for a suitable data set for my project. I had signed up with Kaggle a couple of weeks earlier and did a "Classification data" search there. One of the data sets which caught my attention was "Toddler Autism dataset July 2018.csv" (https://www.kaggle.com/fabdelja/autism-screening-for-toddlers).
The data had a usability rating of 5.9. The description, posted below, appealed to me, and I decided to try it with an RFC.
Data set description from Kaggle:
- The data set was developed by Dr Fadi Fayez Thabtah (fadifayez.com) using a mobile app called ASDTests (ASDtests.com) to screen autism in toddlers. See the description file attached with the CSV data to know more about the variables and the class. This data can be used for descriptive and predictive analyses such as classification, clustering, regression, etc. You may use it to estimate the predictive power of machine learning techniques in detecting autistic traits.
Below I am not going to provide the complete code of my project; I will provide only snippets of code as necessary. The goal of my post is to draw attention to the data set itself and to emphasize the need for critical thinking (especially for beginners) when using unknown data sets.
Usually, in my projects I follow the steps outlined below:
- Brief comment section with project’s description
- Import libraries
- Read data
- Perform Exploratory Data Analysis (EDA)
- Data preprocessing
- Impute nulls if applicable
- Create dummy variables for categorical variables
- Select features, X, and a target, y
- Split the above in Train and Test sets
- Scale if necessary
- Create and apply model
- Examine results
- Depending on the results, apply cross-validation and/or hyperparameter tuning if necessary
For this particular data set, however, I wanted to obtain the results quickly and to take a look at them, so I decided to skip the EDA step.
Here is roughly how the project progressed:
1) Import libraries (code below)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.5)
2) Read data
data = pd.read_csv('Toddler Autism dataset July 2018.csv')
data.head()
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054 entries, 0 to 1053
Data columns (total 19 columns):
Case_No 1054 non-null int64
A1 1054 non-null int64
A2 1054 non-null int64
A3 1054 non-null int64
A4 1054 non-null int64
A5 1054 non-null int64
A6 1054 non-null int64
A7 1054 non-null int64
A8 1054 non-null int64
A9 1054 non-null int64
A10 1054 non-null int64
Age_Mons 1054 non-null int64
Qchat-10-Score 1054 non-null int64
Sex 1054 non-null object
Ethnicity 1054 non-null object
Jaundice 1054 non-null object
Family_mem_with_ASD 1054 non-null object
Who completed the test 1054 non-null object
Class/ASD Traits 1054 non-null object
dtypes: int64(13), object(6)
memory usage: 156.6+ KB
3) Data preprocessing
There were no nulls. A quick examination showed that the sum of columns A1 to A10 gives the value of the Qchat-10-Score column, so all A-columns were dropped. In addition, the columns "Case_No", "Ethnicity", and "Who completed the test" were also dropped.
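A minimal sketch of that check and the column drops (column names taken from the data.info() output above; my exact code may have differed slightly):
a_cols = ['A' + str(i) for i in range(1, 11)]
# confirm that A1..A10 sum to the Qchat-10-Score column
print((data[a_cols].sum(axis=1) == data['Qchat-10-Score']).all())  # expect True
# drop the redundant A-columns plus Case_No, Ethnicity and "Who completed the test"
data = data.drop(columns=a_cols + ['Case_No', 'Ethnicity', 'Who completed the test'])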
After that, I converted all categorical variables ('Sex', 'Jaundice', 'Family_mem_with_ASD', 'Class/ASD Traits') to dummy numeric variables and replaced the original categorical columns in the data with the new numerical ones.
Code:
dummy_sex = pd.get_dummies(data['Sex'], drop_first = True) # dropping one of the resulting columns to avoid collinearity
dummy_ja = pd.get_dummies(data['Jaundice'], drop_first = True)
dummy_fam = pd.get_dummies(data['Family_mem_with_ASD'], drop_first = True)
dummy_outcome = pd.get_dummies(data['Class/ASD Traits '], drop_first = True)
data = pd.concat([data.iloc[:, 0:2], dummy_sex, dummy_ja, dummy_fam, dummy_outcome], axis = 1)
4) Select features, X, and target, y
X = data.iloc[:, :-1].values # all data columns but last
y = data.iloc[:, -1].values # last data column Class/ASD Traits
5) Split in Train and Test sets
# no scaling necessary, so split in train/test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
6) Create model, fit, predict
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 300, n_jobs = -1, random_state = 0)
rfc.fit(X_train, y_train)
y_pred_1 = rfc.predict(X_test)
7) Compare predictions with y_test
from sklearn.metrics import confusion_matrix, classification_report
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred_1))
print('\n')
print('Classification Report:')
print(classification_report(y_test, y_pred_1))
Output:
This was probably the first time I had gotten 100% accuracy in any of my projects to date, and it raised a big red flag. I gave the results the benefit of the doubt as a lucky random occurrence and decided to run cross-validation.
Code:
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator = rfc, X = X, y = y, scoring = 'f1_macro', cv = 10)
print('All Accuracies:')
print(all_accuracies)
print('\n')
print('mean accuracy: ', round(np.mean(all_accuracies), 3))
print('std: ', round(np.std(all_accuracies), 3))
Output:
All Accuracies: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
mean accuracy: 1.0
std: 0.0
This was clearly not a normal outcome, with all 10 cross-validation folds providing 100% accuracy and zero variance. My only explanation at this point was that there was something wrong with the data. I decided to run one more check to see which of the variables had the highest importance score.
Code:
feature_imp = pd.Series(rfc.feature_importances_, index=data.iloc[:, :-1].columns).sort_values(ascending=False)
feature_imp
Output:
Qchat-10-Score 0.963038
Age_Mons 0.024048
m 0.007382
yes 0.003743
yes 0.001790
dtype: float64
The result revealed that the outcome is practically determined by the Qchat-10-Score alone. So, I decided to take a look at the Qchat-10-Score values for kids with and without autism and got the following:
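Here is a minimal sketch of how such a plot could be produced. It re-reads the raw CSV so the class labels are still text; the column names (including the trailing space in 'Class/ASD Traits ') come from the data set, while the plotting choices are my own reconstruction:
# re-read the original CSV so the class labels are still 'Yes'/'No'
raw = pd.read_csv('Toddler Autism dataset July 2018.csv')
# quick sanity check of the score range within each class
print(raw.groupby('Class/ASD Traits ')['Qchat-10-Score'].agg(['min', 'max']))
# scatter the scores by class
plt.figure(figsize=(8, 5))
sns.stripplot(x='Class/ASD Traits ', y='Qchat-10-Score', data=raw, jitter=True)
plt.title('Qchat-10-Score by class')
plt.show()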
Any toddler with a score of four and above had autism; all kids with a score below four did not. It was no wonder that the outcome was determined by this variable only and the model accuracy was 100%. With this type of data one would not even need a model; the outcome is quite obvious just by looking at the plot above.
That's when I decided to do some quick research on tests for autism in toddlers in the medical literature. I could not find tests using the Qchat-10-Score. However, my search returned the Qchat test with 25 items as a standard test (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484636/ for example). I assume that the Qchat-10 score is a subset of the Qchat-25 score. The research articles pointed out that the histogram of the Qchat-25 score in autistic kids follows a normal distribution. The distribution for non-autistic children is also normal, with the two distributions overlapping.
Example of Qchat-score distribution from above source:
Thus, I decided to plot the histograms of the Qchat-10-score from the data. Assuming that the Qchat-10 score is a subset of the general Qchat-25 score, it is reasonable to expect the distributions from this data set to also be close to normal.
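A sketch of how these histograms could be produced, reusing the re-read raw DataFrame from the previous snippet (the bin choice is mine, not from the original post):
# overlay the Qchat-10-Score histograms for the two classes
plt.figure(figsize=(8, 5))
for label in ['Yes', 'No']:
    subset = raw.loc[raw['Class/ASD Traits '] == label, 'Qchat-10-Score']
    plt.hist(subset, bins=range(0, 12), alpha=0.6, label=label)
plt.xlabel('Qchat-10-Score')
plt.ylabel('Number of toddlers')
plt.legend(title='ASD Traits')
plt.show()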
However, as already expected from the previous plot of the Qchat-10-score, neither histogram resembled a normal distribution, and there was no overlap between them. Thus, based on the results of my analysis, the only reasonable conclusion was that the data set "Toddler Autism dataset July 2018.csv" posted on Kaggle is not valid. If this were typical real-world data, it would be quite easy to detect autism in kids, and in reality it is not.
I did not contact the Kaggle team at the time, and I still haven't done that. There was only one discussion on Kaggle about this data set, with somebody asking which features should be used in the model and whether one should expect 100% accuracy. The author seemingly did not understand the question, because the answer he provided was inadequate. There was no mention of the possibility of the data being invalid. At that time, I was new to Kaggle and did not know how to approach the issue. I did not want to offend the contributor or, possibly, get into drawn-out back-and-forth arguments about the validity of the data and of my analyses, even though I was confident in them.
Three months later, having decided to start a blog, I chose to share my findings here to alert other Data Science enthusiasts to the possibility that data from various sources may be invalid.
In conclusion, I would like to make several points regarding the use of data from different sources:
- People contributing to Kaggle or to any other data repository should be responsible for making sure that the data they provide is accurate and valid.
- It would be great if there were some mechanism in place for vetting whether data is valid before it is posted. I know this is probably impossible to implement, so maybe the first point is what it comes down to.
- To anybody who is new to the field of Data Science and Machine Learning: do not assume that data is fine just because you downloaded it from Kaggle. Use critical thinking. If the results of your analysis look too good to be true or don't seem to make sense, dig deeper and do more research until you are satisfied with your answers. If you are still not satisfied and have doubts, ask more experienced people, post the question on a forum, …
Good luck!