How to Use Embeddings for Categorical Features in Tabular Data with PyTorch's nn.Embedding() - Part 1
In this post, we will talk about using embeddings for categorical features in PyTorch. The series is broken down into the following parts; this first post covers the first three.
- Dataset Download
- Data Understanding
- Data Preprocessing
- Embedding Creation
- Define Dataset and Dataloaders in PyTorch
- Neural Network definition in PyTorch
- The Training Loop
- Model Validation
The idea of using embeddings for categorical features was first proposed during a Kaggle contest, and a paper was later published on it (the second link under References). In NLP, word embeddings represent each word as a point in an n-dimensional vector space. In a similar way, we can represent each level of a categorical feature as a point in an n-dimensional vector space.
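To make the idea concrete, here is a minimal sketch of an embedding lookup with nn.Embedding. The feature with 4 levels, the embedding size of 3 and the batch of codes are all made-up assumptions for illustration; they are not taken from the credit card dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the randomly initialised weights reproducible

# Hypothetical categorical feature with 4 levels, already encoded as 0..3.
num_levels = 4
embedding_dim = 3  # the "n" in "n-dimensional"; chosen arbitrarily here

embedding = nn.Embedding(num_embeddings=num_levels, embedding_dim=embedding_dim)

# A batch of 5 rows, each holding the integer code of its level.
codes = torch.tensor([0, 2, 2, 1, 3], dtype=torch.long)

vectors = embedding(codes)  # looks up one learnable vector per row
print(vectors.shape)        # torch.Size([5, 3])
```

Rows that share a level (the two rows with code 2 above) get the same vector, and the vectors themselves are ordinary model weights that are updated during training.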
1. Dataset Download
We will use a dataset from the UCI Machine Learning Repository on credit card defaults among customers in Taiwan. The dataset is also available on Kaggle, and its metadata is documented on both websites. To follow this post, it is recommended to download the dataset from Kaggle. Most of the features are self-explanatory.
2. Data Understanding
Let's take a step back and understand the problem that we are trying to solve. As a bank, we are trying to figure out which of our customers are going to default on their credit card payments. When you swipe a credit card, the bank makes a payment to the merchant on your behalf. This amount is recovered from you at the end of the month or at the end of your billing cycle. So the bank has given you credit for a period of time.
If you default on your payments, it is the bank that bears the cost of your purchases. To make sure it receives timely payments, the bank charges a hefty penalty every time you default. Your credit score also takes a hit. So if you later want a home loan, a bad credit score may mean a higher rate of interest, or you may not be eligible at all. It is in everyone's interest that credit card bills are paid on time.
The dataset has some demographic information about these customers, such as gender, marital status, education and age. Your credit limit may vary based on your education, marital status and income status (income is not present in the data). Apart from that, there is information on the amount billed each month and the amount paid by the customer each month. There are also some derived features: PAY_0, PAY_2, ..., PAY_6 (the dataset has no PAY_1 column).
For each month, these variables track whether the payment was made duly or whether there was a delay of 1 month, 2 months, and so on up to 9 months or more. We will treat gender, marital status, education, PAY_0, PAY_2, PAY_3, PAY_4, PAY_5 and PAY_6 as categorical features for the purpose of building a model. All other features will be used as numerical features. ID is not used in the model. default.payment.next.month is the target: it tracks whether someone defaulted on the payment in the next month (most likely October 2005), with 1 indicating default and 0 indicating no default.
3. Data Preprocessing
First, let's import all necessary modules into Python.
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pickle
from torch.utils.data import Dataset, DataLoader
Next, we will load the dataset and separate the categorical, continuous and target columns as below.
##Loading the UCI dataset
df = pd.read_csv("../UCI_Credit_Card.csv")
cont_cols = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
cat_cols = ['SEX','EDUCATION','MARRIAGE','PAY_0','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6']
target_col = 'default.payment.next.month'
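Since we want to learn embeddings for the levels of each categorical column, it helps to know how many distinct levels each column has. Below is a minimal sketch on a made-up toy DataFrame; the values and the names toy and toy_cat_cols are illustrative assumptions, not the real data. The same call can be applied to df[cat_cols].

```python
import pandas as pd

# Toy stand-in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1, 2],
    "EDUCATION": [1, 2, 3, 2, 2],
    "MARRIAGE": [1, 2, 1, 1, 3],
})
toy_cat_cols = ["SEX", "EDUCATION", "MARRIAGE"]

# Number of distinct levels per categorical column, as plain Python ints.
level_counts = {col: int(n) for col, n in toy[toy_cat_cols].nunique().items()}
print(level_counts)  # {'SEX': 2, 'EDUCATION': 3, 'MARRIAGE': 3}
```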
Next, we want to split the data into a train set (70%) and a test set (30%) using stratified sampling on the target, so that both sets keep the same proportion of defaulters. We will use the train_test_split function from scikit-learn for this.
def data_split(df):
    test_size = 0.3
    random_state = 1234
    # Stratify on the target so train and test keep the same default rate
    df_train, df_test = train_test_split(df, test_size=test_size, random_state=random_state, stratify=df[target_col])
    return df_train, df_test

df_train, df_test = data_split(df)
We can make a quick check to see whether the train and test sets have the same proportion of defaulters.
for name, frame in [("Full", df), ("Train", df_train), ("Test", df_test)]:
    # Mean of a boolean column is the fraction of rows that are True
    print(f"{name} Dataset has {(frame[target_col] == 1).mean() * 100:.2f}% Defaulters")
####Output######################
Full Dataset has 22.12% Defaulters
Train Dataset has 22.12% Defaulters
Test Dataset has 22.12% Defaulters
When we split the dataset, we ideally want all levels in the categorical features to be available in the train set. It is fine if some of these levels are not available in the test set. The reason is, we want to learn a set of embeddings for each of these levels from the train set. Let's create a check to determine if any of the categorical features have any levels missing in train set.
def check_col(col_name):
    # Levels present in the full dataset
    ref = list(df[col_name].unique())
    # Levels present in the train set
    tar_col = list(df_train[col_name].unique())
    # Levels that are missing from the train set
    chk = [elem for elem in ref if elem not in tar_col]
    return chk

chk_cols = []
for col in cat_cols:
    temp_lst = check_col(col)
    if len(temp_lst) > 0:
        chk_cols.append(col)
print(chk_cols)
#####Output####
['PAY_5', 'PAY_6']
For each categorical feature, we get the unique list of levels from the full dataset and check whether each level also appears in the corresponding feature of the train set. Levels that do not appear are stored in the chk list, and every feature with at least one missing level is added to the chk_cols list. In our train dataset, PAY_5 and PAY_6 have missing levels, so for ease of explanation we will not use them for building our model. Our final list of categorical features will comprise 'SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3' and 'PAY_4'.
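Dropping the flagged features can be sketched in one line. The snippet below hard-codes cat_cols and chk_cols with the values shown earlier so that it runs on its own; the name final_cat_cols is an assumption introduced here for illustration.

```python
# Values copied from the earlier steps so the snippet is self-contained.
cat_cols = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2',
            'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
chk_cols = ['PAY_5', 'PAY_6']  # features with levels missing from the train set

# Keep only the categorical features whose levels all appear in the train set.
final_cat_cols = [col for col in cat_cols if col not in chk_cols]
print(final_cat_cols)
# ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4']
```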
In the next post, we will proceed to create embeddings out of these features and build a deep learning model.
References
- http://ethen8181.github.io/machine-learning/deep_learning/tabular/tabular.html
- https://arxiv.org/pdf/1604.06737.pdf