How to Adopt Embeddings for Categorical Features in Tabular Data Using PyTorch's nn.Embedding(): Part 2
In the previous post, we set up the context to utilize embeddings for categorical features. In this post, we will figure out how to create these embeddings and combine them with other continuous features to build a neural network model.
Dataset Download
We will use the credit card default dataset for customers in Taiwan from the UCI machine learning repository. This dataset is also available on Kaggle, and metadata about it is available on both websites. To follow this post, it is recommended to download the dataset from Kaggle. Most of the features are self-explanatory.
Embedding Creation
A few definitions first. Levels in a categorical feature are the unique values that the feature takes. For example, MARRIAGE has levels 0,1,2,3. Each level of a categorical feature is represented by a vector of numbers. So, if you stack up all the levels together and all the vectors together, you can imagine the levels as a column vector of shape (n,1) and the vectors as a corresponding matrix of shape (n,dim), where n is the number of levels and dim is the number of dimensions or columns of the embedding.
Levels in the column vector can be called keys. In PyTorch these keys take values from 0 to n-1. So all levels of a feature in the dataset need to be mapped to the range 0 to n-1 before we can utilize nn.Embedding().
In our data, the feature MARRIAGE has 4 levels 0,1,2,3. Each of the 4 levels can be represented by a vector of values. So MARRIAGE is represented by a 4x5 matrix, where 4 is the number of levels in the MARRIAGE feature and 5 is the number of dimensions that we have chosen. The values that we see in the matrix are assigned randomly at first. There are many techniques to initialize weights. During backpropagation, we will learn an optimized set of weights for this feature matrix. Also, notice that the keys of the lookup table go from 0 to n-1 where n is the number of levels in the categorical feature.
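To make the lookup table concrete, here is a minimal sketch of a 4x5 embedding for MARRIAGE. The variable names (marriage_emb, keys) and the seed are only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so that the random initial weights are repeatable

# 4 levels (keys 0..3), each represented by a 5-dimensional vector
marriage_emb = nn.Embedding(num_embeddings=4, embedding_dim=5)
print(marriage_emb.weight.shape)   # torch.Size([4, 5])

# Looking up keys returns the matching rows of the weight matrix
keys = torch.tensor([2, 0, 2])
vectors = marriage_emb(keys)
print(vectors.shape)               # torch.Size([3, 5])

# The weight matrix is a trainable parameter, so backpropagation can update it
print(marriage_emb.weight.requires_grad)
```

Passing a key outside 0 to n-1 (for example 4 here) raises an index error, which is why the mapping step matters.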
It was sheer coincidence that the levels of the MARRIAGE feature had values 0,1,2,3 in the dataset, which made its lookup table easy to understand. But have a look at the lookup table for the categorical feature SEX. The original levels in this feature are 1 and 2, but the lookup table has keys 0 and 1. Similarly, check out the lookup table for PAY_0. The lookup table has keys from 0 to 10 while the original levels in PAY_0 take values from -2 to 8.
Always remember that the lookup table will have keys from 0 to n-1 where n is the number of levels in the categorical feature. This means we need to somehow map the levels in the original dataset to 0 to n-1. Once we have this mapping, the model can learn to optimize weights for each of the feature matrices. As shown in Fig 1, the model is expected to learn a good set of weights for MARRIAGE with dimension 4x5, SEX with dimension 2x3, PAY_0 with dimension 11x4.
Let's see all of this in action now. We first need to map the original levels in the categorical features to lie between 0 and n-1.
import pandas as pd

df = pd.read_csv("../UCI_Credit_Card.csv")
cont_cols = ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
cat_cols = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4']
target_col = 'default.payment.next.month'

##Map every level of each categorical feature to a key between 0 and n-1
cat_code_dict = {}
for col in cat_cols:
    temp = df[col].astype('category')
    cat_code_dict[col] = {val: idx for idx, val in enumerate(temp.cat.categories)}

print(cat_code_dict['SEX'])
--->{1: 0, 2: 1}
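Once the dictionary exists, applying it to a column is a single map call. The sketch below uses a small made-up frame in place of the real PAY_0 column; the names toy and PAY_0_idx are hypothetical.

```python
import pandas as pd

# Toy stand-in for the PAY_0 column (values are made up for illustration)
toy = pd.DataFrame({"PAY_0": [-2, 0, 8, -1, 0, 2]})

# Categories come out sorted, so the smallest level gets key 0
levels = toy["PAY_0"].astype("category").cat.categories
code_dict = {val: idx for idx, val in enumerate(levels)}

# Replace each original level by its key
toy["PAY_0_idx"] = toy["PAY_0"].map(code_dict)
print(toy["PAY_0_idx"].tolist())   # [0, 2, 4, 1, 2, 3]
```

Note that a value missing from the dictionary would map to NaN, so the mapping should be built from data that contains every level you expect to see.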
Next, we figure out the number of levels in each categorical feature.
embedding_size_dict = {key: len(val) for key, val in cat_code_dict.items()}
print(embedding_size_dict)
--->
{'SEX': 2,
'EDUCATION': 7,
'MARRIAGE': 4,
'PAY_0': 11,
'PAY_2': 11,
'PAY_3': 11,
'PAY_4': 11}
Now, we have to figure out the embedding dimension for each of the categorical features. fast.ai has a discussion forum where this has been dealt with in detail. We will set the dimension to the minimum of 50 and half the number of levels in each feature (rounded down). You are free to experiment with other heuristics as well.
embedding_dim_dict= {key: min(50,val//2) for key,val in embedding_size_dict.items()}
print(embedding_dim_dict)
---> {'SEX': 1, 'EDUCATION': 3, 'MARRIAGE': 2, 'PAY_0': 5, 'PAY_2': 5, 'PAY_3': 5, 'PAY_4': 5}
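If you want to experiment with heuristics, it helps to wrap them in small functions. The sketch below restates the rule used in this post with a floor of 1 added (my addition, so a feature with a single level does not end up with a zero-width embedding), next to a power rule that I recall being used in the fastai library. The constants 1.6, 0.56 and 600 are an assumption from memory and should be checked against fastai itself. The function names are hypothetical.

```python
embedding_size_dict = {'SEX': 2, 'EDUCATION': 7, 'MARRIAGE': 4, 'PAY_0': 11}

def half_rule(n_levels, cap=50):
    # Heuristic from this post, plus a floor of 1 (assumption, not in the post)
    return max(1, min(cap, n_levels // 2))

def power_rule(n_levels, cap=600):
    # Assumed fastai-style rule of thumb; verify the constants before relying on it
    return min(cap, round(1.6 * n_levels ** 0.56))

half_dims = {key: half_rule(val) for key, val in embedding_size_dict.items()}
power_dims = {key: power_rule(val) for key, val in embedding_size_dict.items()}
print(half_dims)    # {'SEX': 1, 'EDUCATION': 3, 'MARRIAGE': 2, 'PAY_0': 5}
print(power_dims)
```

Neither rule is principled. Treat the embedding dimension as a hyperparameter and compare choices on the validation set.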
Next, we create an embedding matrix for each of the categorical features. As mentioned above, embedding_size_dict holds the number of levels of each feature and embedding_dim_dict holds the number of dimensions chosen for it. nn.Embedding() needs 2 important parameters.
- num_embeddings: The size of the lookup dictionary, which is simply the number of levels of the categorical feature.
- embedding_dim: The number of dimensions used to represent each level of the categorical feature.
We iterate through cat_cols and create an Embedding layer for each feature by passing the relevant num_embeddings and embedding_dim. These embedding layers are stored in a dictionary and wrapped in nn.ModuleDict, which registers them as submodules so that their weights are included in model.parameters(). While we iterate, we also add up the dimensions of all the categorical features and store the sum in total_embed_dim. All of this is done by the method below, which belongs to the model class.
    def _create_embedding_vectors(self):
        '''
        Create Embedding Layer for each of the categorical variable in dataset
        '''
        ##Get no of levels in each categorical variable and store in dictionary
        self.embedding_size_dict = {key: len(val) for key, val in self.cat_code_dict.items()}
        ##Determine dimension of embedding vector for each categorical variable
        self.embedding_dim_dict = {key: min(50, val // 2) for key, val in self.embedding_size_dict.items()}
        embeddings = {}
        self.total_embed_dim = 0
        for col in self.cat_cols:
            num_embeddings = self.embedding_size_dict[col]
            embedding_dim = self.embedding_dim_dict[col]
            embeddings[col] = nn.Embedding(num_embeddings, embedding_dim)
            self.total_embed_dim += embedding_dim
        return nn.ModuleDict(embeddings)
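The same logic can be tried outside the class. This is a minimal standalone sketch that hard-codes the level counts printed earlier instead of reading them from cat_code_dict.

```python
import torch.nn as nn

cat_cols = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4']
# Level counts copied from the printed embedding_size_dict
embedding_size_dict = {'SEX': 2, 'EDUCATION': 7, 'MARRIAGE': 4,
                       'PAY_0': 11, 'PAY_2': 11, 'PAY_3': 11, 'PAY_4': 11}
embedding_dim_dict = {key: min(50, val // 2) for key, val in embedding_size_dict.items()}

# One Embedding layer per categorical feature, registered through nn.ModuleDict
embeddings = nn.ModuleDict({
    col: nn.Embedding(embedding_size_dict[col], embedding_dim_dict[col])
    for col in cat_cols
})

# Width of the concatenated embedding output for one sample
total_embed_dim = sum(embedding_dim_dict[col] for col in cat_cols)
print(total_embed_dim)   # 26

# Number of trainable embedding weights across all features
n_params = sum(p.numel() for p in embeddings.parameters())
print(n_params)          # 251
```

With the 14 continuous features added, the first Linear layer of the network therefore receives 26 + 14 = 40 inputs.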
Data Preprocessing
We will define some helper methods to process our dataset and feed data to our model. We will also figure out how to combine categorical and continuous features to build a neural network. The same preprocessing is applied to the train, test and validation datasets: the scaler and the level mapping are fitted on the train set only and then reused for the other two. The classes defined below are fairly self-explanatory.
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from sklearn.preprocessing import StandardScaler
from collections import OrderedDict
import pickle
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from matplotlib import pyplot as plt
class Process_Dataset:
    def __init__(self, path, cat_cols, cont_cols, target_col):
        self.df = pd.read_csv(path)       ##Load the dataset
        self.cat_cols = cat_cols          ##Initialize all categorical features
        self.cont_cols = cont_cols        ##Initialize all continuous features
        self.target_col = target_col      ##Set the target feature
        self.data_split()                 ##Split the data into train, val and test sets
        self._preprocess()                ##Build the level to key mapping from the train set
        self.scaler = StandardScaler()
        self.df_train = self._process(self.df_train, 1)   ##flag=1 fits the scaler on the train set
        self.df_test = self._process(self.df_test)
        self.df_val = self._process(self.df_val)

    def data_split(self):
        '''
        Holds out 10% of the data as the test set, then splits the remaining 90%
        into 70% train and 30% val (about 63% train, 27% val, 10% test overall)
        '''
        test_size = 0.1
        val_size = 0.3
        random_state = 1234
        self.df_train, self.df_test = train_test_split(self.df, test_size=test_size, random_state=random_state, stratify=self.df[self.target_col])
        self.df_train, self.df_val = train_test_split(self.df_train, test_size=val_size, random_state=random_state, stratify=self.df_train[self.target_col])

    def _preprocess(self):
        '''
        Creates a mapping from 0 to n-1 for each level in every categorical feature
        '''
        self.cat_code_dict = {}
        for col in self.cat_cols:
            temp = self.df_train[col].astype('category')
            self.cat_code_dict[col] = {val: idx for idx, val in enumerate(temp.cat.categories)}

    def _process(self, _df, flag=0):
        '''
        We scale numerical variables using StandardScaler from scikit-learn
        and map the categorical levels to their keys
        '''
        _df = _df.copy()
        if flag:
            self.scaler.fit(_df[self.cont_cols])
        # numeric fields
        _df[self.cont_cols] = self.scaler.transform(_df[self.cont_cols])
        _df[self.cont_cols] = _df[self.cont_cols].astype(np.float32)
        # categorical fields
        # Note: a level that never occurs in the train set maps to NaN and makes the int64 cast fail
        for col in self.cat_cols:
            code_dict = self.cat_code_dict[col]
            _df[col] = _df[col].map(code_dict).astype(np.int64)
        # Target
        _df[self.target_col] = _df[self.target_col].astype(np.float32)
        return _df
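Because the second split is taken from what is left after the first, the overall proportions are not 60/30/10. A quick check on synthetic data (1000 made-up rows with a 22% positive rate, chosen only for illustration) shows what the two calls produce:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 1000 row ids and a binary target
rows = np.arange(1000)
labels = np.array([0] * 780 + [1] * 220)

# First hold out 10% as the test set
train_val, test, y_train_val, y_test = train_test_split(
    rows, labels, test_size=0.1, random_state=1234, stratify=labels)

# Then take 30% of the remaining 90% as the validation set
train, val, y_train, y_val = train_test_split(
    train_val, y_train_val, test_size=0.3, random_state=1234, stratify=y_train_val)

print(len(train), len(val), len(test))   # 630 270 100
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Stratifying on the target keeps the default rate roughly the same in all three sets.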
We have completed defining Process_Dataset. We now define a TabularDataset which inherits from the Dataset class in PyTorch. Next, we define PyTorch DataLoaders to feed data to the model. The goal is to finally get tensors of categorical, continuous and target features out of the DataLoader.
Define Dataset and Dataloaders in PyTorch
class TabularDataset(Dataset):
    def __init__(self, df, cat_cols, cont_cols, target_col):
        self.cat_cols = cat_cols
        self.cont_cols = cont_cols
        self.target_col = target_col
        self.df = df

    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, idx):
        cat_array = self.df[self.cat_cols].iloc[idx].values
        cont_array = self.df[self.cont_cols].iloc[idx].values
        target_array = self.df[self.target_col].iloc[idx]
        cat_array = torch.LongTensor(cat_array)
        cont_array = torch.FloatTensor(cont_array)
        return cont_array, cat_array, target_array
###First do all the preprocessing (Scaling and splitting dataset)####
dataset = Process_Dataset("../UCI_Credit_Card.csv", cat_cols, cont_cols, target_col)
##Create train, test and val instances of Dataset class##
dataset_train = TabularDataset(dataset.df_train, cat_cols, cont_cols, target_col)
dataset_test = TabularDataset(dataset.df_test, cat_cols, cont_cols, target_col)
dataset_val = TabularDataset(dataset.df_val, cat_cols, cont_cols, target_col)
##Create train, test and val dataloaders##
train_loader = DataLoader(dataset_train, batch_size=16, num_workers=2, drop_last=True)
test_loader = DataLoader(dataset_test, batch_size=16, num_workers=2, drop_last=True)
val_loader = DataLoader(dataset_val, batch_size=8, num_workers=2, drop_last=False)
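To see what comes out of a DataLoader, here is a minimal sketch on a made-up frame. ToyTabularDataset is a hypothetical simplified variant of the class above: it converts to tensors with explicit dtypes, and the toy column values are invented. num_workers=0 is used only so the example runs anywhere.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class ToyTabularDataset(Dataset):
    """Simplified stand-in for TabularDataset, for illustration only."""
    def __init__(self, df, cat_cols, cont_cols, target_col):
        self.df, self.cat_cols = df, cat_cols
        self.cont_cols, self.target_col = cont_cols, target_col

    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        cat = torch.tensor(row[self.cat_cols].to_numpy(dtype=np.int64))
        cont = torch.tensor(row[self.cont_cols].to_numpy(dtype=np.float32))
        target = torch.tensor(float(row[self.target_col]), dtype=torch.float32)
        return cont, cat, target

# Already "processed" toy data: categorical keys and scaled continuous values
toy_df = pd.DataFrame({
    "SEX": [0, 1, 1, 0, 1, 0],
    "MARRIAGE": [1, 2, 0, 3, 1, 2],
    "AGE": [0.5, -1.2, 0.3, 1.1, -0.4, 0.0],
    "LIMIT_BAL": [1.0, 0.2, -0.7, 0.9, -1.5, 0.1],
    "target": [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})
toy_set = ToyTabularDataset(toy_df, ["SEX", "MARRIAGE"], ["AGE", "LIMIT_BAL"], "target")
loader = DataLoader(toy_set, batch_size=4, num_workers=0)

cont_batch, cat_batch, target_batch = next(iter(loader))
print(cont_batch.shape, cont_batch.dtype)      # torch.Size([4, 2]) torch.float32
print(cat_batch.shape, cat_batch.dtype)        # torch.Size([4, 2]) torch.int64
print(target_batch.shape, target_batch.dtype)  # torch.Size([4]) torch.float32
```

The categorical batch must be an integer tensor because nn.Embedding uses its values as row indices.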
Neural Network definition in PyTorch
Our neural network stacks two Linear + ReLU blocks, followed by a final Linear layer that outputs n_classes raw scores (logits) used for the class prediction.
class EntityEmbeddingNN(nn.Module):
    def __init__(self, cat_code_dict, cat_cols, cont_cols, target_col, n_classes):
        super().__init__()
        self.cat_code_dict = cat_code_dict
        self.cat_cols = cat_cols        ##Initialize all categorical features
        self.cont_cols = cont_cols      ##Initialize all continuous features
        self.target_col = target_col    ##Set the target feature
        self.embeddings = self._create_embedding_vectors()
        ##Input width = all embedding dimensions + number of continuous features
        self.in_features = self.total_embed_dim + len(cont_cols)
        self.layers = nn.Sequential(
            nn.Linear(self.in_features, 64),
            nn.ReLU(),
            nn.Linear(64, 16),
            nn.ReLU(),
            nn.Linear(16, n_classes)
        )
How do we combine categorical and continuous features though? Observe the forward method below!
    def forward(self, cat_tensor, num_tensor):
        embedding_tensor_group = []
        for idx, col in enumerate(self.cat_cols):
            layer = self.embeddings[col]
            ##Column idx of cat_tensor holds the keys of feature col
            out = layer(cat_tensor[:, idx])
            embedding_tensor_group.append(out)
        ##Concatenate all embeddings, then append the continuous features
        embed_tensor = torch.cat(embedding_tensor_group, dim=1)
        out_tensor = torch.cat((embed_tensor, num_tensor), dim=1)
        out_tensor = self.layers(out_tensor)
        return out_tensor
In the forward method, we first iterate through our categorical features. For each feature, we select the corresponding PyTorch Embedding layer created during initialization. We pass the corresponding column of categorical data through the selected layer and append the result to the embedding_tensor_group list.
If you observe the image below, each categorical feature from the sample is first converted to a number from 0 to n-1 based on the mapping present in cat_code_dict. Next, we select the appropriate vector corresponding to the modified values of the categorical features. Finally stacking occurs along the columns and this can be concatenated with the continuous features. Backpropagation will take care of the final weights of the embedding layer.
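The shapes involved are easy to trace on a tiny example. This sketch assumes just two categorical features and four continuous ones, with made-up values:

```python
import torch
import torch.nn as nn

cat_cols = ["SEX", "MARRIAGE"]
embeddings = nn.ModuleDict({
    "SEX": nn.Embedding(2, 1),        # 2 levels -> 1 dimension
    "MARRIAGE": nn.Embedding(4, 2),   # 4 levels -> 2 dimensions
})

# Batch of 3 samples: one column of keys per categorical feature
cat_tensor = torch.tensor([[0, 3], [1, 0], [1, 2]])
# Batch of 3 samples with 4 (already scaled) continuous features
num_tensor = torch.randn(3, 4)

# Look up each feature's vectors, then stack them side by side
embedded = [embeddings[col](cat_tensor[:, idx]) for idx, col in enumerate(cat_cols)]
embed_tensor = torch.cat(embedded, dim=1)
print(embed_tensor.shape)   # torch.Size([3, 3])

# Append the continuous features along the column axis
combined = torch.cat((embed_tensor, num_tensor), dim=1)
print(combined.shape)       # torch.Size([3, 7])
```

Each row of combined is what the first Linear layer sees for one sample: the embedding vectors followed by the continuous values.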
Finally, the resultant matrix is passed through the Linear + ReLU layers to produce out_tensor. From here on this is a vanilla neural network, and training and validation work the same way as for any regular neural network. I have linked a Colab Notebook for this here in case anyone wants to try this.
The Training Loop
#https://averdones.github.io/reading-tabular-data-with-pytorch-and-training-a-multilayer-perceptron/
##http://ethen8181.github.io/machine-learning/deep_learning/tabular/tabular.html
cont_cols = ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
cat_cols = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4']
target_col = 'default.payment.next.month'
print(f"We will use {len(cat_cols)} categorical features")
print(f"We will use {len(cont_cols)} continuous features")

###First do all the preprocessing (Scaling and splitting dataset)####
dataset = Process_Dataset("/content/gdrive/MyDrive/Embedding/Data/UCI_Credit_Card.csv", cat_cols, cont_cols, target_col)
##Create train, test and val instances of Dataset class##
dataset_train = TabularDataset(dataset.df_train, cat_cols, cont_cols, target_col)
dataset_test = TabularDataset(dataset.df_test, cat_cols, cont_cols, target_col)
dataset_val = TabularDataset(dataset.df_val, cat_cols, cont_cols, target_col)
##Create train, test and val dataloaders##
train_loader = DataLoader(dataset_train, batch_size=8, num_workers=2, drop_last=True)
test_loader = DataLoader(dataset_test, batch_size=8, num_workers=2, drop_last=True)
val_loader = DataLoader(dataset_val, batch_size=8, num_workers=2, drop_last=False)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
##n_classes=1: a single logit per sample for binary classification
model = EntityEmbeddingNN(dataset.cat_code_dict, cat_cols, cont_cols, target_col, 1)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

train_loss_per_iter = []
train_loss_per_batch = []   ##Mean train loss of each epoch
test_loss_per_iter = []
test_loss_per_batch = []    ##Mean test loss of each epoch
n_epochs = 8
for epoch in range(n_epochs):
    model.train()   ##Switch back to training mode, since model.eval() is set at the end of every epoch
    running_loss = 0.0
    for idx, (cont_array, cat_array, target_array) in enumerate(train_loader):
        cont_array = cont_array.to(device)
        cat_array = cat_array.to(device)
        target_array = target_array.to(device)
        outputs = model(cat_array, cont_array)
        loss = F.binary_cross_entropy_with_logits(outputs.squeeze(1), target_array)
        # Zero the parameter gradients
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        train_loss_per_iter.append(loss.item())
    train_loss_per_batch.append(running_loss / (idx + 1))
    running_loss = 0.0

    ##Evaluate on the test set without tracking gradients
    model.eval()
    with torch.no_grad():
        for idx, (cont_array, cat_array, target_array) in enumerate(test_loader):
            cont_array = cont_array.to(device)
            cat_array = cat_array.to(device)
            target_array = target_array.to(device)
            outputs = model(cat_array, cont_array)
            loss = F.binary_cross_entropy_with_logits(outputs.squeeze(1), target_array)
            running_loss += loss.item()
            test_loss_per_iter.append(loss.item())
    test_loss_per_batch.append(running_loss / (idx + 1))
    running_loss = 0.0
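The model is created with n_classes=1 and has no sigmoid at the end, so it outputs raw logits. That is why the loop uses F.binary_cross_entropy_with_logits, which applies the sigmoid internally. A small sketch with made-up logits shows that it matches sigmoid followed by plain binary cross entropy:

```python
import torch
import torch.nn.functional as F

# Made-up model outputs of shape (batch, 1) and float targets of shape (batch,)
logits = torch.tensor([[1.2], [-0.7], [0.3]])
targets = torch.tensor([1.0, 0.0, 1.0])

# Loss computed directly from the logits, as in the training loop
loss_logits = F.binary_cross_entropy_with_logits(logits.squeeze(1), targets)

# Same loss computed in two steps: sigmoid first, then binary cross entropy
probs = torch.sigmoid(logits.squeeze(1))
loss_probs = F.binary_cross_entropy(probs, targets)

print(loss_logits.item(), loss_probs.item())
```

The squeeze(1) call matters: it turns the (batch, 1) output into shape (batch,), matching the target tensor.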
Model Validation
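The validation loop looks like this. predict returns the predicted probabilities and the true labels for the validation set, and compute_score turns them into metrics.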
def predict(model):
    y_pred = []
    y_actual = []
    model.eval()
    with torch.no_grad():
        for idx, (cont_array, cat_array, target_array) in enumerate(val_loader):
            y_actual.append(target_array.numpy())
            cont_array = cont_array.to(device)
            cat_array = cat_array.to(device)
            outputs = model(cat_array, cont_array)
            ##Sigmoid turns logits into probabilities; squeeze (batch, 1) to (batch,)
            y_prob = torch.sigmoid(outputs).squeeze(1).cpu().numpy()
            y_pred.append(y_prob)
    ##Join the per-batch arrays into one flat array each
    y_pred = np.concatenate(y_pred)
    y_actual = np.concatenate(y_actual)
    return y_pred, y_actual
def compute_score(y_true, y_pred, round_digits=3):
    log_loss = round(metrics.log_loss(y_true, y_pred), round_digits)
    auc = round(metrics.roc_auc_score(y_true, y_pred), round_digits)
    precision, recall, threshold = metrics.precision_recall_curve(y_true, y_pred)
    ##The last precision/recall pair has no matching threshold, so drop it to keep the arrays aligned
    precision, recall = precision[:-1], recall[:-1]
    f1 = 2 * (precision * recall) / (precision + recall)
    ##Ignore points where precision + recall is 0 (f1 is NaN there)
    mask = ~np.isnan(f1)
    f1 = f1[mask]
    precision = precision[mask]
    recall = recall[mask]
    threshold = threshold[mask]
    ##Report the metrics at the threshold with the best f1
    best_index = np.argmax(f1)
    threshold = round(threshold[best_index], round_digits)
    precision = round(precision[best_index], round_digits)
    recall = round(recall[best_index], round_digits)
    f1 = round(f1[best_index], round_digits)
    return {
        'auc': auc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'threshold': threshold,
        'log_loss': log_loss
    }
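The threshold search inside compute_score can be followed on a handful of made-up predictions. This is a sketch of the idea only, not the full function; the toy arrays y_true and y_prob are invented.

```python
import numpy as np
import sklearn.metrics as metrics

# Made-up labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

precision, recall, threshold = metrics.precision_recall_curve(y_true, y_prob)
# precision and recall carry one extra trailing point that has no threshold
print(len(precision), len(recall), len(threshold))

# Align the arrays with the thresholds, then pick the threshold with the best F1
precision, recall = precision[:-1], recall[:-1]
f1 = 2 * precision * recall / (precision + recall)
best_index = np.argmax(f1)
best_threshold = threshold[best_index]
print(best_threshold, f1[best_index])
```

Here predicting "default" for every probability of at least 0.35 catches all three positives at the cost of one false positive, which gives the best F1 of the available thresholds.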
References
- http://ethen8181.github.io/machine-learning/deep_learning/tabular/tabular.html
- https://averdones.github.io/reading-tabular-data-with-pytorch-and-training-a-multilayer-perceptron/
Hope this tutorial helps clear up a lot of doubts about using embedding layers for categorical features!