In this post, I will discuss a very common problem that we face when dealing with a machine learning task –
How do we handle categorical data, especially when the entire dataset is too large to fit in memory?
I will talk about how to represent categorical variables, the common problems we face while one hot encoding them, and then discuss the possible solutions. I will particularly focus on how to deal with categorical variables when the data does not fit in the machine's memory. Then, I will talk about a Python module that I have created that lets you do all this.
Before diving into the out-of-memory problem, let's first get a light introduction to categorical variables and one hot encoding.
What is a Categorical Variable? A categorical variable is a variable that can take a limited (usually fixed) number of values on the basis of some qualitative property. The number of people in a city is not categorical, because it can take any value in a huge range. But the sex of an individual can take only a fixed set of values, hence it is a categorical variable.
How to represent categorical variables?
Why do categorical variables need special attention at all? Why can't a variable like color, with red, blue or green as its possible values, be directly fed as input into a machine learning model? The answer is simple: machine learning models are built on mathematical equations, and equations cannot operate on raw strings.
The solution? We can convert our categories into numerical labels. So red becomes 1, blue becomes 2 and green becomes 3. Now the mathematical equations can handle these numbers. But do you see a problem with this approach? It implies that blue (2) is greater than red (1), or that green (3) is three times red. There is no such relational ordering between these values, so we cannot represent our variable like this.
Now what? What we do instead is we create a Boolean column for each category. Each column representing (with a 0 or 1) whether a particular value (color, in our case) is present or not. And only one of these columns can take on the value 1 for each sample. This is essentially known as one hot encoding.
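As a quick illustration, here is what one hot encoding a toy color column looks like in pandas (a minimal sketch, independent of any particular library):

```python
import pandas as pd

# A toy sample with a single categorical column.
df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# One 0/1 column per category; exactly one 1 per row.
one_hot = pd.get_dummies(df["color"]).astype(int)
print(one_hot)
```

Each row now has a 1 in exactly one of the three columns (`blue`, `green`, `red`) and 0 everywhere else.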
Common Problems with One Hot Encoding
We cannot simply convert our categorical variables into one hot encoded vectors because –
- Our test set may have some values previously unseen in the training set. For example, our training set had only red, blue and green colors. But what if some samples in our test set have the color brown? We cannot have three columns for the color variable in the training set and four in the test set.
- Our training set may have some values unseen in the test set.
- If our data is too large to fit in the memory, we can read it in chunks but we cannot guarantee that all values of a variable are present in every chunk.
Therefore we need a tool that can handle all these problems without much hassle.
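The first problem is easy to reproduce. If we naively encode each split on its own, the train and test encodings simply disagree (a small sketch using `pandas.get_dummies`):

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["red", "brown"]})  # "brown" was never seen in training

# Encoding each split independently yields incompatible column sets.
train_cols = sorted(pd.get_dummies(train["color"]).columns)
test_cols = sorted(pd.get_dummies(test["color"]).columns)
print(train_cols)  # ['blue', 'green', 'red']
print(test_cols)   # ['brown', 'red']
```

A model trained on three columns cannot consume a test matrix with a different two.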
The sklearn.preprocessing.CategoricalEncoder class comes close, but it has its own drawbacks. Let's see how we can use it before discussing the problems.
(Please note that as of December 2017, the sklearn.preprocessing.CategoricalEncoder class is not available in the latest stable release. In order to use it you will have to install the latest development version of scikit-learn.)
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.preprocessing import CategoricalEncoder

# This data can be found here -
# https://www.kaggle.com/c/titanic/data
>>> data = pd.read_csv("titanic.csv")
>>> sex = data["Sex"]
>>> sex.unique()
array(['male', 'female'], dtype=object)
>>> encoder = CategoricalEncoder()
>>> encoder.fit(sex.values.reshape(-1, 1))
>>> encoder.transform(sex.head().values.reshape(-1, 1))
<5x2 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
It can be preferred over –
- pandas.get_dummies – because get_dummies does not remember which categories it saw, so it cannot handle the train-test framework.
- sklearn.preprocessing.OneHotEncoder – because the CategoricalEncoder can deal directly with strings and we do not need to convert our variable values into integers first.
But, it does not work when –
- our train and test sets have different unique values for a variable.
- or the data is too large to fit in the memory of our machine and we want to train our model in batches.
The dummyPy library
I have created a Python module that solves all these problems. Let's see how we can use it.
First, you need to copy the dummyPy.py file into your working directory. You can find it here.
Edit (April 7, 2018): The dummyPy library is now available on PyPI. You can install it using pip.
pip install dummyPy
Then in your working script import the OneHotEncoder class.
from dummyPy import OneHotEncoder
For demonstration purposes, I have split our Titanic dataset so that the Embarked column takes the value "S" only in the test set.
import pandas as pd

data = pd.read_csv("titanic.csv",
                   usecols=["Pclass", "Sex", "Age", "Fare", "Embarked"])
train_data = data[data["Embarked"] != "S"]
test_data = data[data["Embarked"] == "S"]
Now, let’s create a OneHotEncoder object and fit it on our training and test set. We will have to supply a list of variables that are categorical in our dataset.
encoder = OneHotEncoder(["Pclass", "Sex", "Embarked"])
encoder.fit(train_data)
encoder.fit(test_data)

# Here Embarked contributes 4 columns, Pclass - 3 columns,
# Sex - 2 columns and 1 column each for Fare and Age.
# Feel free to verify this on your own.
encoder.transform(data).shape
Out: (891, 11)
Here, we are fitting our encoder object on the test data as well, which is unusual. But what you have to understand here is that this encoder object only needs to be fit on the input feature variables, not on the output labels.
We saw that our training set did not have all the possible values of our categorical variable. By fitting the encoder on both the train and test sets, we get all the necessary one hot encoded columns.
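If you prefer to stay in plain pandas, the same effect can be approximated by pinning the full category set (collected from both splits) on each frame before encoding. This is a sketch of the idea, not the dummyPy implementation:

```python
import pandas as pd

train = pd.DataFrame({"embarked": ["C", "Q"]})
test = pd.DataFrame({"embarked": ["S", "C"]})  # "S" is absent from train

# Collect the full level set from both splits, then pin it on each frame.
categories = sorted(set(train["embarked"]) | set(test["embarked"]))
for df in (train, test):
    df["embarked"] = pd.Categorical(df["embarked"], categories=categories)

# Both splits now encode to the same columns, in the same order.
train_hot = pd.get_dummies(train["embarked"]).astype(int)
test_hot = pd.get_dummies(test["embarked"]).astype(int)
print(list(train_hot.columns))  # ['C', 'Q', 'S']
```

Because the `Categorical` dtype carries the full level set, `get_dummies` emits a column even for levels that a particular split never contains.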
Now, let's see how we can use it on a dataset that is too large to fit in the machine's memory. For demonstration, I use the Titanic dataset with a chunk size of 10. This can be extended to a larger dataset with a suitable chunk size.
We will read the data in chunks. This can be done with the chunksize parameter of pandas' read_csv method.
data = pd.read_csv("titanic.csv",
                   usecols=["Pclass", "Sex", "Age", "Fare", "Embarked"],
                   chunksize=10)
Now, let’s fit our encoder on this.
encoder = OneHotEncoder(["Pclass", "Sex", "Embarked"])
encoder.fit(data)

sample_data = pd.read_csv("titanic.csv",
                          usecols=["Pclass", "Sex", "Age", "Fare", "Embarked"],
                          nrows=100)
encoder.transform(sample_data).shape
Out: (100, 11)
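Under the hood, fitting on chunks only requires one pass that accumulates the unique levels of each categorical column; the transform then works against that full, fixed level set. Here is a minimal sketch of the idea (not the actual dummyPy code), with two tiny in-memory DataFrames standing in for chunks:

```python
import pandas as pd

# Simulate chunked reading with an iterable of small DataFrames.
chunks = [
    pd.DataFrame({"sex": ["male"], "fare": [7.25]}),
    pd.DataFrame({"sex": ["female"], "fare": [71.3]}),
]

# Pass 1 (fit): accumulate the levels seen across all chunks.
levels = set()
for chunk in chunks:
    levels |= set(chunk["sex"].unique())

# Pass 2 (transform): encode any frame against the full level set.
def transform(df, levels):
    cat = pd.Categorical(df["sex"], categories=sorted(levels))
    return pd.concat([pd.get_dummies(cat).astype(int), df[["fare"]]], axis=1)

print(transform(chunks[0], levels))
```

No single chunk needs to contain every level; the accumulated set guarantees a consistent column layout for every transformed chunk.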
You can find the entire code for the OneHotEncoder class on my GitHub repository.
Further Reading
- What is one hot encoding and when is it used in data science? on Quora
- What is one hot encoding? Why and when do you have to use it?
- Why does one hot encoding improve machine learning performance? on Stack Overflow
Do you have any questions about one hot encoding your data?
Ask your questions in the comments and I will do my best to answer.
Hi Yashu, thanks for sharing your amazing work. This is really helpful. I have a question about fitting the encoder on the test set:
doesn’t this trigger information leaking?
What is the best way to approach this problem when fitting a model in a real world situation, where the test set is unseen?
Many thanks, and looking forward to your reply.
“doesn’t this trigger information leaking?”
Yes, this can cause information leakage. But we generally work under the assumption that we will not get new levels in our test (real-world) data. When we split our data into train and validation sets, some levels may end up in only one of the splits. This happens because we made the split manually, not because our general (real-world) data has this property. Hence, we can fit the encoder on both the train and validation data. (By fitting on the test data, I mean the manually created validation data.) In any case, we will train our model on both the train and validation data before predicting on the test (real-world) data, so this is convenient.
“What is the best way to approach this problem when fitting a model in a real world situation, where the test set is unseen?”
In a real-world scenario, this is out of our control anyway, since there might be new levels of a feature that we have never trained on. There is no option but to ignore those levels. The current version of the library will give you all zeros for such cases.
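That all-zero behaviour is easy to reproduce with a fixed category set (a sketch in plain pandas, assuming the encoder was fit on red/blue/green only):

```python
import pandas as pd

# Levels known at training time.
known = ["blue", "green", "red"]

# A real-world row with a level never seen during training.
new_data = pd.DataFrame({"color": ["brown"]})

# Unknown levels become NaN under the fixed Categorical, and
# get_dummies then encodes that row as all zeros.
cat = pd.Categorical(new_data["color"], categories=known)
encoded = pd.get_dummies(cat).astype(int)
print(encoded)
```

The unseen level contributes no information to the model, which is usually the safest default when a brand-new category appears at prediction time.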