How to use Stratified Sampling with scikit-learn and pandas

Many times, when we are working on classification problems in Data Science, we blindly perform Random Sampling and split the dataset into train and test sets. For the uninitiated, random sampling is a technique in which every observation has an equal probability of being selected.

What could possibly go wrong with Random Sampling when creating models? Say you want to predict which customers might default on their credit card bills. If the whole dataset has a 65% male and 35% female distribution, but your test set is 55% female, your model validation will not be robust. Your validation is only meaningful if the test set reflects the 65:35 distribution you expect to see in the future.

So, the whole point of sampling is that you want your test set to mimic your train set in terms of distribution, to make your models as robust as possible. In scenarios where you want a categorical feature to have a similar distribution in both your train and test sets, you should opt for Stratified Sampling. Stratified comes from the word strata, which means classes or groups.
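Before moving to the real dataset, here is a minimal sketch of the problem, using a small synthetic frame (the label column and the 90:10 imbalance are made up for illustration): a plain random split can drift away from the true class proportions, while a stratified split preserves them by construction.

import pandas as pd
from sklearn.model_selection import train_test_split

## Synthetic, imbalanced labels: 90% zeros, 10% ones
toy = pd.DataFrame({'label': [0]*90 + [1]*10})

## Plain random split -- test-set class proportion can drift
_, rand_test = train_test_split(toy, test_size=0.2, random_state=7)

## Stratified split -- test-set class proportion is preserved
_, strat_test = train_test_split(toy, test_size=0.2, random_state=7,
                                 stratify=toy['label'])

print(rand_test['label'].mean())   # may deviate from 0.10
print(strat_test['label'].mean())  # exactly 0.10 (2 of 20 rows)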

1. Dataset Download

We will utilize the UCI machine learning repo, which has a dataset on credit card defaults for customers in Taiwan. This dataset is also available on Kaggle. Metadata about this dataset is available on the respective websites. To follow this post, it is recommended to download the dataset from Kaggle. Most of the features are self-explanatory.

2. Dataset Understanding

There is some demographic information about these customers, like gender, marital status, education and age. Based on your education, marital status, and income status (not present in the data), your credit limit may vary. Apart from that, you have information on the amount billed every month and the amount paid by the customer each month. There are some derived features like PAY_0, PAY_1, ..., PAY_6.

For each month, these variables track whether the payment was made duly or whether there was a delay of 1 month, 2 months, and so on up to 9 months and above. For this post, we will sample the dataset using default.payment.next.month. This feature tells whether someone defaulted on their credit card payment or not.
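If you want to verify these levels yourself, a quick value_counts() on one of the repayment-status columns does the trick. A minimal sketch, assuming the Kaggle CSV sits at the same path used later in this post and that the column is named PAY_0 as in the Kaggle version:

import pandas as pd

## Peek at the levels of one repayment-status column
df = pd.read_csv("../UCI_Credit_Card.csv")
print(df['PAY_0'].value_counts())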

3. Dataset Exploration

Let's import some libraries and load the dataset into Python using Pandas. We will also identify the target feature for which Stratified Sampling needs to be done.


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Load the dataset
df = pd.read_csv("../UCI_Credit_Card.csv")
## Target feature on which Stratified Sampling will be done
target_col = 'default.payment.next.month'

Next, we will look at distribution of target column in the dataset.


## Distribution of the target feature
print(df[target_col].value_counts())
print(f"Dataset has {sum(df[target_col]==1)/df.shape[0]*100:.2f}% Defaulters")

############Output##############
0    23364
1     6636
Name: default.payment.next.month, dtype: int64

Dataset has 22.12% Defaulters
###############################

Let's now split the dataset into train and test sets. We will keep 30% of the dataset as the test set and 70% as the train set, using Stratified Sampling. This picks random samples from each of the two groups (0 and 1) of the default.payment.next.month feature, in proportion to their sizes.


test_size = 0.3
random_state = 1234
## Stratify on the target so both splits preserve its class distribution
df_train, df_test = train_test_split(df, test_size=test_size,
                                     random_state=random_state,
                                     stratify=df[target_col])

print(f"{df_test.shape[0]/df.shape[0]*100:.2f}% is the test split and {df_train.shape[0]/df.shape[0]*100:.2f}% is the train split")

###OUTPUT##########
30.00% is the test split and 70.00% is the train split

###############

Notice that by specifying the random state, you can reproduce the train and test sets. By providing a seed value before splitting, you will get the same train and test sets no matter how many times you execute the train_test_split function. It is always advisable to set a seed number/random state so your results are reproducible.
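As a quick sanity check (reusing df, test_size, random_state and target_col from above), splitting twice with the same seed yields identical row selections:

## Two splits with the same random_state pick exactly the same rows
a_train, a_test = train_test_split(df, test_size=test_size,
                                   random_state=random_state,
                                   stratify=df[target_col])
b_train, b_test = train_test_split(df, test_size=test_size,
                                   random_state=random_state,
                                   stratify=df[target_col])
print(a_test.index.equals(b_test.index))  # True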

We would also like to check whether the distribution of the target feature remains similar in the train and test sets. Let's see this in action.


print(f"Full Dataset has {sum(df[target_col]==1)/df.shape[0] *100:.2f}% Defaulters \nTrain Dataset has {sum(df_train[target_col]==1)/df_train.shape[0] *100:.2f}% Defaulters \nTest Dataset has {sum(df_test[target_col]==1)/df_test.shape[0] *100:.2f}% Defaulters")

#######Output#############################
Full Dataset has 22.12% Defaulters 
Train Dataset has 22.12% Defaulters 
Test Dataset has 22.12% Defaulters
##########################################

We have a similar distribution of the target feature in the train and test sets, so validation on the test set will reflect the same class balance the model saw during training. This concludes our post on Stratified Sampling.
