PyTorch is a promising Python library for deep learning. I have been using it for the past few weeks, and I am impressed by its ease of use and flexibility. In this blog post, I will walk through a feed-forward neural network for tabular data that uses embeddings for the categorical variables.
If you want to understand the underlying concepts of categorical feature embeddings, you should definitely check out this awesome post – An Introduction to Deep Learning for Tabular Data. I also did a deep dive into fastai’s tabular module to come up with this network.
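As a quick illustration of the idea (a toy sketch, separate from the model we build below): an nn.Embedding layer is just a trainable lookup table that maps each label-encoded category to a dense vector, and those vectors are learned along with the rest of the network.

```
>>> import torch
>>> import torch.nn as nn
>>> # A toy embedding for a categorical feature with 4 unique values,
>>> # each mapped to a 3-dimensional trainable vector.
>>> emb = nn.Embedding(num_embeddings=4, embedding_dim=3)
>>> codes = torch.tensor([0, 2, 2, 1])  # label-encoded categories
>>> emb(codes).shape
torch.Size([4, 3])
```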
PyTorch provides an excellent abstraction in the form of torch.utils.data.Dataset. Such dataset classes are handy because they let us treat the dataset as just another iterable object. We will create a class named TabularDataset that subclasses torch.utils.data.Dataset. Indexing an object of this class will return a list of three elements – the output value, a numpy array of the continuous features, and a numpy array of the categorical features of the sample.
```
import numpy as np
from torch.utils.data import Dataset, DataLoader


class TabularDataset(Dataset):
    def __init__(self, data, cat_cols=None, output_col=None):
        """
        Characterizes a Dataset for PyTorch

        Parameters
        ----------
        data: pandas data frame
            The data frame object for the input data. It must contain all the
            continuous, categorical and output columns to be used.

        cat_cols: list of strings
            The names of the categorical columns in the data. These columns
            will be passed through the embedding layers in the model. They
            must be label encoded beforehand.

        output_col: string
            The name of the output variable column in the data provided.
        """
        self.n = data.shape[0]

        if output_col:
            self.y = data[output_col].astype(np.float32).values.reshape(-1, 1)
        else:
            self.y = np.zeros((self.n, 1))

        self.cat_cols = cat_cols if cat_cols else []
        self.cont_cols = [col for col in data.columns
                          if col not in self.cat_cols + [output_col]]

        if self.cont_cols:
            self.cont_X = data[self.cont_cols].astype(np.float32).values
        else:
            self.cont_X = np.zeros((self.n, 1))

        if self.cat_cols:
            self.cat_X = data[self.cat_cols].astype(np.int64).values
        else:
            self.cat_X = np.zeros((self.n, 1))

    def __len__(self):
        """Denotes the total number of samples."""
        return self.n

    def __getitem__(self, idx):
        """Generates one sample of data."""
        return [self.y[idx], self.cont_X[idx], self.cat_X[idx]]
```
Let’s move on to model building. Our model will be a simple feed-forward neural network with two hidden layers, embedding layers for the categorical features and the necessary dropout and batch normalization layers.

The nn.Module class is the base class for all neural networks in PyTorch. Our model, FeedForwardNN, will subclass nn.Module. In the __init__ method of our class, we will initialize the various layers used in the model, and the forward method will define the computations performed in the network.
```
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForwardNN(nn.Module):
    def __init__(self, emb_dims, no_of_cont, lin_layer_sizes,
                 output_size, emb_dropout, lin_layer_dropouts):
        """
        Parameters
        ----------
        emb_dims: list of two-element tuples
            One tuple for each categorical feature. The first element of a
            tuple denotes the number of unique values of the categorical
            feature; the second denotes the embedding dimension to be used
            for that feature.

        no_of_cont: integer
            The number of continuous features in the data.

        lin_layer_sizes: list of integers
            The size of each linear layer. The length equals the total number
            of linear layers in the network.

        output_size: integer
            The size of the final output.

        emb_dropout: float
            The dropout to be used after the embedding layers.

        lin_layer_dropouts: list of floats
            The dropouts to be used after each linear layer.
        """
        super().__init__()

        # Embedding layers
        self.emb_layers = nn.ModuleList([nn.Embedding(x, y)
                                         for x, y in emb_dims])

        self.no_of_embs = sum([y for x, y in emb_dims])
        self.no_of_cont = no_of_cont

        # Linear layers
        first_lin_layer = nn.Linear(self.no_of_embs + self.no_of_cont,
                                    lin_layer_sizes[0])

        self.lin_layers = nn.ModuleList(
            [first_lin_layer] +
            [nn.Linear(lin_layer_sizes[i], lin_layer_sizes[i + 1])
             for i in range(len(lin_layer_sizes) - 1)])

        for lin_layer in self.lin_layers:
            nn.init.kaiming_normal_(lin_layer.weight.data)

        # Output layer
        self.output_layer = nn.Linear(lin_layer_sizes[-1], output_size)
        nn.init.kaiming_normal_(self.output_layer.weight.data)

        # Batch norm layers
        self.first_bn_layer = nn.BatchNorm1d(self.no_of_cont)
        self.bn_layers = nn.ModuleList([nn.BatchNorm1d(size)
                                        for size in lin_layer_sizes])

        # Dropout layers
        self.emb_dropout_layer = nn.Dropout(emb_dropout)
        self.dropout_layers = nn.ModuleList([nn.Dropout(p)
                                             for p in lin_layer_dropouts])

    def forward(self, cont_data, cat_data):
        if self.no_of_embs != 0:
            # Look up the embedding of each categorical feature and
            # concatenate them into a single vector per sample.
            x = [emb_layer(cat_data[:, i])
                 for i, emb_layer in enumerate(self.emb_layers)]
            x = torch.cat(x, 1)
            x = self.emb_dropout_layer(x)

        if self.no_of_cont != 0:
            normalized_cont_data = self.first_bn_layer(cont_data)

            if self.no_of_embs != 0:
                x = torch.cat([x, normalized_cont_data], 1)
            else:
                x = normalized_cont_data

        for lin_layer, dropout_layer, bn_layer in zip(
                self.lin_layers, self.dropout_layers, self.bn_layers):
            x = F.relu(lin_layer(x))
            x = bn_layer(x)
            x = dropout_layer(x)

        x = self.output_layer(x)
        return x
```
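Before wiring the model up to real data, it can help to sanity-check the forward pass with random tensors. The dimensions below are made up purely for illustration – two categorical features with cardinalities 10 and 6, and three continuous features:

```
>>> toy_model = FeedForwardNN(emb_dims=[(10, 5), (6, 3)], no_of_cont=3,
...                           lin_layer_sizes=[50, 100], output_size=1,
...                           emb_dropout=0.04,
...                           lin_layer_dropouts=[0.001, 0.01])
>>> cont = torch.randn(8, 3)           # batch of 8, 3 continuous features
>>> cat = torch.randint(0, 6, (8, 2))  # 2 label-encoded categorical features
>>> toy_model(cont, cat).shape
torch.Size([8, 1])
```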
After creating the network architecture, we have to run the training loop. For the purpose of demonstration, I am using the dataset from the Kaggle competition – House Prices: Advanced Regression Techniques.
```
>>> import pandas as pd
>>> import numpy as np

>>> # Using only a subset of the variables.
>>> data = pd.read_csv("train.csv",
...                    usecols=["SalePrice", "MSSubClass", "MSZoning",
...                             "LotFrontage", "LotArea", "Street",
...                             "YearBuilt", "LotShape", "1stFlrSF",
...                             "2ndFlrSF"]).dropna()
```
We need to instantiate an object of the TabularDataset class we created earlier. But before that, we need to label encode the categorical features. For this, we will use sklearn.preprocessing.LabelEncoder.
```
>>> categorical_features = ["MSSubClass", "MSZoning", "Street",
...                         "LotShape", "YearBuilt"]
>>> output_feature = "SalePrice"

>>> from sklearn.preprocessing import LabelEncoder
>>> label_encoders = {}
>>> for cat_col in categorical_features:
...     label_encoders[cat_col] = LabelEncoder()
...     data[cat_col] = label_encoders[cat_col].fit_transform(data[cat_col])
```
Let’s instantiate an object of the TabularDataset class.
```
>>> dataset = TabularDataset(data=data, cat_cols=categorical_features,
...                          output_col=output_feature)
```
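As a quick sanity check, indexing the dataset returns the three-element list described earlier. The shapes below simply follow from the columns we selected – 4 continuous and 5 categorical features:

```
>>> y, cont_x, cat_x = dataset[0]
>>> cont_x.shape, cat_x.shape
((4,), (5,))
```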
In order to run the training loop, we need to create a torch.utils.data.DataLoader object. It serves the following purposes –
- creates batches from the dataset
- shuffles the data
- loads the data in parallel
```
>>> batchsize = 64
>>> dataloader = DataLoader(dataset, batchsize, shuffle=True, num_workers=1)
```
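If you want to verify the plumbing, you can pull a single batch and inspect the tensor shapes – an optional check, assuming the dataset holds more than one full batch:

```
>>> y, cont_x, cat_x = next(iter(dataloader))
>>> y.shape, cont_x.shape, cat_x.shape
(torch.Size([64, 1]), torch.Size([64, 4]), torch.Size([64, 5]))
```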
Now that we have created the basic data structures to run the training loop, we need to instantiate a model object of the FeedForwardNN class created earlier. This class requires a list of tuples, where each tuple holds the number of unique values and the embedding dimension of a categorical variable.
```
>>> cat_dims = [int(data[col].nunique()) for col in categorical_features]
>>> cat_dims
[15, 5, 2, 4, 112]

>>> emb_dims = [(x, min(50, (x + 1) // 2)) for x in cat_dims]
>>> emb_dims
[(15, 8), (5, 3), (2, 1), (4, 2), (112, 50)]
```
The number of continuous features used is 4. The hidden layer dimensions are 50 and 100 for the first and second layers respectively. The embedding dropout used is 0.04, and the hidden layer dropouts are 0.001 and 0.01.
```
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> model = FeedForwardNN(emb_dims, no_of_cont=4, lin_layer_sizes=[50, 100],
...                       output_size=1, emb_dropout=0.04,
...                       lin_layer_dropouts=[0.001, 0.01]).to(device)
```
Finally, let’s run the training loop –
```
>>> no_of_epochs = 5
>>> criterion = nn.MSELoss()
>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
>>> for epoch in range(no_of_epochs):
...     for y, cont_x, cat_x in dataloader:
...         cat_x = cat_x.to(device)
...         cont_x = cont_x.to(device)
...         y = y.to(device)
...
...         # Forward pass
...         preds = model(cont_x, cat_x)
...         loss = criterion(preds, y)
...
...         # Backward pass and optimization
...         optimizer.zero_grad()
...         loss.backward()
...         optimizer.step()
```
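If you want to monitor progress, a small optional extension of the loop above (not part of the original code) is to accumulate the batch losses and print an average per epoch:

```
>>> for epoch in range(no_of_epochs):
...     running_loss, n_batches = 0.0, 0
...     for y, cont_x, cat_x in dataloader:
...         cat_x, cont_x, y = cat_x.to(device), cont_x.to(device), y.to(device)
...         preds = model(cont_x, cat_x)
...         loss = criterion(preds, y)
...         optimizer.zero_grad()
...         loss.backward()
...         optimizer.step()
...         # Track the running loss so we can report a per-epoch average.
...         running_loss += loss.item()
...         n_batches += 1
...     print(f"Epoch {epoch + 1}: avg. batch loss = {running_loss / n_batches:.2f}")
```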
Hope you found this useful. You can also find the entire code on my GitHub repository – pytorch-tabular. Thank You. 🙂
Thank you for the great post!
Just one thing to point out: I think

```
optimizer.zero_grad()
```

should be called before the forward pass, don’t you think so too?
The forward pass does not care about the gradients and it does not modify them. I think we only need to zero the grads before we call loss.backward(). This should work fine.
@Yashu, that makes sense! thanks for the answer!
I get the error NameError: name 'device' is not defined when defining the model. It seems like an in-place error. Any idea on how to fix?
Sorry about that, I think I missed this line –
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Hello, I have a problem in the training in the last for loop:
for y, cont_x, cat_x in dataloader
I don’t get any variables out of the loop, so it just does nothing.
If I use for y, cont_x, cat_x in dataloader.dataset: it says that 'numpy.ndarray' object has no attribute 'to'.
I would be very grateful if you could help me.
Thank you
Jonas