Many times, when working on classification problems in Data Science, we blindly perform Random Sampling and split the dataset into train and test sets. For the uninitiated, random sampling is a technique in which every observation has an equal probability of being selected.
What could possibly go wrong with Random Sampling when creating models? Let's say you wanted to predict which customers might default on their credit card bills. If the whole dataset has a 65% male and 35% female distribution, but your test set ends up 55% female, your model validation will not be robust. You have to believe that the future male-to-female distribution will also stay close to 65:35, and validate your models accordingly.
So, the whole point of sampling carefully is that you want your test set to mimic your train set in terms of distribution, to make your model validation as robust as possible. In scenarios where you want your categorical features to have a similar distribution in both your train and test sets, you should opt for Stratified Sampling. Stratified comes from the word strata, which means layers or groups.
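The difference is easy to see on a small synthetic example. The sketch below (hypothetical data, not the credit card dataset) draws an imbalanced label series and compares the class proportion in a test set produced by a plain random split versus a stratified one:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: roughly 65% class 0, 35% class 1
rng = np.random.default_rng(42)
labels = pd.Series(rng.choice([0, 1], size=200, p=[0.65, 0.35]))

# Plain random split: the class proportion in the test set can drift
_, rand_test = train_test_split(labels, test_size=0.3, random_state=0)

# Stratified split: samples are drawn per class, so the proportion is preserved
_, strat_test = train_test_split(labels, test_size=0.3, random_state=0,
                                 stratify=labels)

print(f"full: {labels.mean():.3f}, random test: {rand_test.mean():.3f}, "
      f"stratified test: {strat_test.mean():.3f}")
```

The stratified test proportion matches the full-data proportion up to rounding, while the random one is allowed to wander.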
1. Dataset Download
We will utilize the UCI Machine Learning Repository, which hosts a dataset on credit card defaults for customers in Taiwan. This dataset is also available on Kaggle. Metadata about the dataset is available on the respective websites. To follow this post, it is recommended to download the dataset from Kaggle. Most of the features are self-explanatory.
2. Dataset Understanding
There is some demographic information about these customers, such as gender, marital status, education, and age. Depending on education, marital status, and income status (not present in the data), the credit limit may vary. Apart from that, you have the amount billed every month and the amount paid by the customer each month. There are also some derived features, PAY_0 through PAY_6.
For each month, these variables track whether the payment was made duly or delayed by 1 month, 2 months, and so on up to 9 months and over. For this post, we will sample the dataset using default.payment.next.month. This feature tells whether someone defaulted on their credit card payment or not.
3. Dataset Exploration
Let's import some libraries and load the dataset into Python using Pandas. We will also identify the target feature for which Stratified Sampling needs to be done.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
##Loading the dataset
df = pd.read_csv("../UCI_Credit_Card.csv")
target_col = 'default.payment.next.month'
Next, we will look at the distribution of the target column in the dataset.
df[target_col].value_counts()
print(f"Dataset has {sum(df[target_col]==1)/df.shape[0] *100:.2f}% Defaulters")
############Output##############
0 23364
1 6636
Name: default.payment.next.month, dtype: int64
Dataset has 22.12% Defaulters
###############################
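As an aside, pandas can report these proportions directly via `value_counts(normalize=True)`, without the manual percentage arithmetic. A minimal sketch, using an illustrative series with the same 0/1 counts as the real target column:

```python
import pandas as pd

# Illustrative stand-in with the same counts as default.payment.next.month
target = pd.Series([0] * 23364 + [1] * 6636)

# normalize=True returns proportions instead of raw counts
print(target.value_counts(normalize=True))
```

This prints 0.7788 for class 0 and 0.2212 for class 1, matching the 22.12% defaulter rate computed above.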
Let's now split the dataset into train and test sets. We will assume the test set to be 30% and the train set 70% of the dataset, and we will use Stratified Sampling. This will pick random samples from each of the two groups (0 and 1) within the default.payment.next.month feature.
test_size = 0.3
random_state = 1234
df_train,df_test = train_test_split(df, test_size=test_size, random_state = random_state, stratify = df[target_col])
print(f"{df_test.shape[0]/df.shape[0]*100:.2f}% is the test split and {df_train.shape[0]/df.shape[0]*100:.2f}% is the train split")
###OUTPUT##########
30.00% is the test split and 70.00% is the train split
###############
Notice that by specifying the random state, you can reproduce the train and test sets. By providing a seed value before splitting, you will always get the same train and test sets, even if you execute the train_test_split function a million times. It is always advisable to set a seed number/random state so that your results are reproducible.
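You can verify this yourself. The sketch below uses a small synthetic frame (a hypothetical stand-in, in case the CSV is not at hand) and shows that two calls with the same random_state select exactly the same rows:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the credit card frame: 22% positive class
df_demo = pd.DataFrame({
    "feature": np.arange(100),
    "target": [0] * 78 + [1] * 22,
})

# Same random_state => identical splits, no matter how often you rerun
tr1, te1 = train_test_split(df_demo, test_size=0.3, random_state=1234,
                            stratify=df_demo["target"])
tr2, te2 = train_test_split(df_demo, test_size=0.3, random_state=1234,
                            stratify=df_demo["target"])

print(te1.index.equals(te2.index))  # True: the same rows were selected
```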
We would also like to check whether the distribution of the target feature remains similar in the train and test sets. Let's see this in action.
print(f"Full Dataset has {sum(df[target_col]==1)/df.shape[0] *100:.2f}% Defaulters \nTrain Dataset has {sum(df_train[target_col]==1)/df_train.shape[0] *100:.2f}% Defaulters \nTest Dataset has {sum(df_test[target_col]==1)/df_test.shape[0] *100:.2f}% Defaulters")
#######Output#############################
Full Dataset has 22.12% Defaulters
Train Dataset has 22.12% Defaulters
Test Dataset has 22.12% Defaulters
##########################################
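To appreciate what stratification buys us, it helps to run the same comparison without `stratify`. The sketch below (hypothetical data with the same 22% defaulter rate as the real dataset) repeats the split over a few seeds; the plain random test proportion fluctuates, while the stratified one stays fixed:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with roughly the same 22% defaulter rate
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(
    {"target": rng.choice([0, 1], size=30000, p=[0.78, 0.22])}
)

for seed in range(3):
    # No stratify: proportions drift with the seed
    _, te = train_test_split(df_demo, test_size=0.3, random_state=seed)
    # With stratify: proportions pinned to the full-data rate
    _, te_s = train_test_split(df_demo, test_size=0.3, random_state=seed,
                               stratify=df_demo["target"])
    print(f"seed {seed}: random {te['target'].mean():.4f}, "
          f"stratified {te_s['target'].mean():.4f}")
```

The drift here is small because the dataset is large; on smaller datasets, or with rarer classes, an unstratified split can miss the minority class badly.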
We have a similar distribution of the target feature in the full, train, and test sets, so our model validation should now be robust to the class imbalance. This concludes our post on Stratified Sampling.