Blog: Building a movie recommendation system using neural network embeddings

Data Science

IMDb alone lists millions of titles, and the viewing histories of streaming platforms easily reach Big Data scale, making movie recommendation a perfect testing ground for neural networks. Our data scientists ran a discovery phase to create a recommender that suggests movies based on the user’s preferences, so that the more movies you watch, the better the AI learns what you like.

In this post, we share how to create a movie recommendation system based on neural network embeddings.

Initial idea

Our team built the recommendation system on a simple idea: every streaming platform collects user ratings that we can use to understand an individual user’s preferences and suggest a movie they would most likely enjoy.

We needed a deep understanding of each user’s tastes, accounting for every subtle nuance that influences their choices. Our experts decided to leverage neural network embeddings, an approach that allows building recommendations on billions of customer data points. For instance, Amazon Music uses a similar algorithm to generate playlist suggestions.

What are neural network embeddings?

Neural network embeddings are learned, continuous vectors that can represent discrete entities such as users or movies. We use these vectors as input for a machine learning model, which lets us find nearest neighbors in the embedding space.

Given enough data, we can train an algorithm to understand relationships between entities and automatically learn the features that represent them. The advantage of embedding-based models is that they are easy to scale, so you can constantly add new entities to the system.

Let’s visualize this idea to keep it simple. We create a system that contains representations of users and recommendable items: in our case, movies.

Imagine a coordinate system in which both users and movies are vectors; in our model, each vector has 100 dimensions. To define a movie vector, we relied on the genre, general impression, cast, soundtrack, and so on. When a movie vector lands close to the vector of a user with similar preferences, they match, and the user gets a recommendation for a movie that should satisfy them. This prediction is based on a deep study of the users’ behavior and tastes. Moreover, the algorithm works so that movies one user enjoyed are highly likely to satisfy users with similar tastes.
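To make the "closeness" idea concrete, here is a minimal sketch with a hypothetical 3-dimensional space and made-up vectors (our real model uses 100 dimensions): the recommended movie is the one whose vector has the highest cosine similarity to the user’s vector.

```python
import numpy as np

# Hypothetical toy example: four movie vectors and one user vector in a
# 3-dimensional embedding space.
movie_vectors = np.array([
    [0.9, 0.1, 0.0],   # action-heavy
    [0.8, 0.2, 0.1],   # action with a bit of drama
    [0.0, 0.9, 0.1],   # drama
    [0.1, 0.0, 0.9],   # comedy
])
user_vector = np.array([0.85, 0.15, 0.05])  # a user who mostly watches action

# Cosine similarity between the user and every movie: the closer a movie
# vector is to the user vector, the better the match.
scores = movie_vectors @ user_vector / (
    np.linalg.norm(movie_vectors, axis=1) * np.linalg.norm(user_vector)
)
best = int(np.argmax(scores))  # index 0: the action-heavy movie wins
```

In the real system the vectors are learned from ratings rather than written by hand, but the matching step works the same way.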

Creating the neural network model

Every machine learning project starts with data preprocessing, which includes collecting, cleaning, and ordering the data. A high-quality dataset is crucial, so we always pay close attention to this step.

Our team started with dirty data collected from a streaming service, which we cleaned and ordered so that we could manage it effectively. After preprocessing, we had a dataset of 100,000 movie ratings covering 6,000 movies and 610 users.
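One preprocessing detail worth sketching: embedding layers expect contiguous 0-based indexes, while raw user and movie IDs from a service are usually sparse. Assuming a pandas DataFrame of cleaned ratings (the column names below mirror the training code later in this post, the ID values are made up), the re-indexing can look like this:

```python
import pandas as pd

# Hypothetical sample of the cleaned ratings table. We re-encode both ID
# columns so they run 0, 1, 2, ... as embedding layers require.
ratings = pd.DataFrame({
    "userId":  [15, 15, 42, 42, 42],
    "movieId": [1203, 77, 1203, 5, 77],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

ratings["userId"] = ratings["userId"].astype("category").cat.codes
ratings["movieId"] = ratings["movieId"].astype("category").cat.codes

num_users = ratings["userId"].nunique()   # 2 in this toy sample
num_items = ratings["movieId"].nunique()  # 3 in this toy sample
```

`num_users` and `num_items` then size the embedding layers of the model.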

Building a neural network

Based on the PyTorch framework, our team built a neural network whose inputs are entity indexes, such as the ID of a user or a movie.

The most challenging part was finding and implementing the optimal layer architecture for the neural network, which required writing and debugging code for several parts of the architecture.

Also, to achieve the best performance, we had to balance the parameters for our particular system: a certain number of layers and a certain embedding size.

Training the model

Neural networks are batch learners: they train on a set of samples at a time over many rounds. The challenging part of training the model is finding the right balance between the size of those sample sets and the number of rounds.

The Epoch parameter defines the number of training rounds, where each round passes over the entire training set. The Batch Size parameter defines how many samples the network processes before updating its weights. Another important parameter is the Learning Rate, the coefficient that controls how far each update moves the weights.
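To see how the three parameters interact, here is a toy illustration (pure NumPy, not our production code): fitting a single weight w in y = w * x with mini-batch gradient descent. Epochs control how many passes we make over the data, batch size controls how many samples feed each weight update, and the learning rate scales each update.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = 3.0 * x                    # ground truth: w = 3

epochs = 10      # rounds: full passes over the training data
batch_size = 50  # samples consumed per weight update
lr = 0.5         # learning rate: how far each update moves w

w = 0.0
for epoch in range(epochs):
    for start in range(0, len(x), batch_size):
        xb = x[start:start + batch_size]
        yb = y[start:start + batch_size]
        grad = 2 * np.mean((w * xb - yb) * xb)  # d(MSE)/dw on this batch
        w -= lr * grad                          # step scaled by the learning rate
```

With these settings w converges to 3; a learning rate that is too large makes the updates overshoot, and too few epochs leave the model undertrained. The same trade-offs apply to the full network below, just in millions of dimensions.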


The goal here is to find the optimal training mode for our neural network. That means the developers determine the amount of data per batch and the number of repetitions the system needs to assimilate the data well.

Once the network finished training, we extracted the embeddings.
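A sketch of what that extraction can look like in PyTorch. The layer below stands in for the trained model’s `item_emb` from the code that follows; the sizes match our dataset, but the weights here are random, untrained values, so the neighbours are only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the trained model's item embedding layer (in the real
# pipeline this would be `model.item_emb`).
item_emb = nn.Embedding(num_embeddings=6000, embedding_dim=100)

movie_vectors = item_emb.weight.data            # shape: (6000, 100)

# Nearest neighbours of movie index 0 by cosine similarity.
query = movie_vectors[0].unsqueeze(0)
sims = F.cosine_similarity(movie_vectors, query)
neighbours = sims.topk(6).indices[1:]           # top 5, skipping the movie itself
```

After real training, these neighbour lists are exactly what the recommender serves: movies whose vectors sit closest to the query in the embedding space.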

Below is the code for our model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=10):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        x = F.relu(torch.cat([U, V], dim=1))  # concatenate user and item embeddings
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.lin2(x)
        return x

def train_epocs(model, epochs=10, lr=0.01, wd=0.0, unsqueeze=False):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    model.train()
    for i in range(epochs):
        users = torch.LongTensor(df_train.userId.values) # .cuda()
        items = torch.LongTensor(df_train.movieId.values) #.cuda()
        ratings = torch.FloatTensor(df_train.rating.values) #.cuda()
        if unsqueeze:
            ratings = ratings.unsqueeze(1)
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
        optimizer.zero_grad()  # reset gradients from the previous round
        loss.backward()        # backpropagate the error
        optimizer.step()       # update the weights
        print(loss.item())
    test_loss(model, unsqueeze)

def test_loss(model, unsqueeze=False):
    model.eval()
    users = torch.LongTensor(df_val.userId.values) #.cuda()
    items = torch.LongTensor(df_val.movieId.values) #.cuda()
    ratings = torch.FloatTensor(df_val.rating.values) #.cuda()
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    with torch.no_grad():  # no gradients needed for validation
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
    print("test loss %.3f " % loss.item())


While working on our recommendation solution, we kept in mind that the deeper we understand our data, the more connections we see and the more insights we find. Our investigation showed that embeddings are the most effective approach to building a recommendation solution that accounts for details. Neural network embeddings deliver the most accurate recommendations we can achieve today.

Our team built a scalable and effective recommender that considers many parameters and is capable of finding very subtle connections. Here are the numbers to back this conclusion.

To assess the accuracy of recommendations, we use RMSE (root mean square error), a measure of prediction error where lower values indicate a better fit. When we first put the recommendation solution together, its RMSE was 10. After training the system, we reached an RMSE of 0.8, more than 12 times better than where we started.
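For reference, RMSE is simply the square root of the mean squared difference between predicted and actual ratings. A quick check on a made-up batch:

```python
import numpy as np

# RMSE on a hypothetical batch of predicted vs. actual ratings.
y_true = np.array([4.0, 3.5, 5.0, 2.0])
y_pred = np.array([3.6, 3.9, 4.4, 2.8])

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))  # about 0.57 here
```

On a 0.5–5 rating scale, an RMSE of 0.8 means a typical prediction lands within about one rating step of the truth.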

Embeddings vs rules-based recommendation systems

Compared to rules-based recommendation systems, solutions that use embeddings capture many more nuances of user behavior.

The catch is that rules-based systems are limited by their rules. Roughly speaking, if you like Joaquin Phoenix, the system will recommend movies he starred in; if you mostly watch thrillers, you will get thrillers in your recommendations. However, the same mechanism may never recommend pieces you would love if the rules do not specify the right criteria. For streaming platforms, that means losing potential subscribers who never even learn about great content.
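A toy illustration of the limitation, with a hypothetical four-movie catalogue: a rules-based system can only surface what its rules explicitly encode.

```python
# A toy rules-based recommender for contrast (hypothetical catalogue).
catalogue = [
    {"title": "Joker",  "cast": ["Joaquin Phoenix"], "genre": "thriller"},
    {"title": "Her",    "cast": ["Joaquin Phoenix"], "genre": "drama"},
    {"title": "Heat",   "cast": ["Al Pacino"],       "genre": "thriller"},
    {"title": "Amelie", "cast": ["Audrey Tautou"],   "genre": "comedy"},
]

def rules_based(liked_actor, liked_genre):
    # Surfaces only movies matching an explicit rule: cast or genre.
    return [m["title"] for m in catalogue
            if liked_actor in m["cast"] or m["genre"] == liked_genre]

# A Joaquin Phoenix / thriller fan never sees "Amelie" here, even if users
# with identical tastes loved it; an embedding model could surface that link.
recs = rules_based("Joaquin Phoenix", "thriller")
```

An embedding-based model has no such blind spot: if users with similar vectors loved a movie, its vector drifts toward theirs regardless of cast or genre labels.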

Get your best recommendation solution

Our data scientists are keen on improving business metrics and exploring new sides of machine learning and Big Data. DB Best utilizes cutting-edge approaches to help our customers take their business to a new level and always builds the most effective solutions.

In our posts, we demonstrate how drastically machine learning can improve your business metrics and explain how we build these solutions. If you want to know how to get the most out of Data Science on your project, feel free to contact us.
