# Health Checks for Machine Learning – A Guide to Model Retraining and Evaluation

## Cost of poor Machine Learning models

According to Netflix, a typical user on its site loses interest in 60-90 seconds, after reviewing 10-12 titles, perhaps 3 in detail. Netflix frames the recommendation problem as making sure that each user, on each screen, finds something interesting to watch and understands why it might be interesting. For Netflix, maintaining a high retention rate is extremely important because acquiring new customers to replace the ones who leave is expensive. According to them, the recommendation system saves them $1 billion annually.

Since they invest so much in their recommendations, how do they even measure their performance in production?

### Take-Rate

One obvious thing to observe is how many people watch the things Netflix recommends. This is called take-rate. It is defined as the fraction of recommendations offered that result in a play.

### Effective Catalog Size (ECS)

This is another metric designed to fine-tune the successful recommendations. Netflix provides recommendations on 2 main levels: first, top recommendations from the overall catalog; second, recommendations that are specific to a genre. For a particular genre with N recommendations, ECS measures how spread out the viewing is across the items in the catalog. If the majority of the viewing comes from a single video, then the ECS is close to 1. If the viewing is uniform across all the videos, then the ECS is close to N.

## Machine Learning in production is not static – Changes with environment

Let’s say you are an ML Engineer in a social media company. In the last couple of weeks, imagine the amount of content being posted on your website that just talks about Covid-19. Almost every user who usually talks about AI or Biology, or just randomly rants on the website, is now talking about Covid-19. And you know this is a spike. The trend isn’t going to last. But for now, your data distribution has changed considerably.

What should you expect from this? As an ML person, what should be your next step? Do you expect your Machine Learning model to work perfectly?
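To make "the data distribution has changed considerably" a little more concrete, here is a minimal sketch that compares the topic mix of two time windows using the total variation distance. The topic labels, the window contents, and the function names are all hypothetical and chosen only for illustration; this is one simple way to quantify a shift, not a method taken from any particular production system.

```python
from collections import Counter


def topic_shares(topics):
    """Return the fraction of posts that fall under each topic label."""
    if not topics:
        raise ValueError("need at least one post to compute topic shares")
    counts = Counter(topics)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


def total_variation_distance(reference, recent):
    """Half the L1 distance between two topic distributions.

    0.0 means the two windows have the same topic mix,
    1.0 means they share no topics at all.
    """
    p, q = topic_shares(reference), topic_shares(recent)
    all_topics = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in all_topics)


# Hypothetical topic labels for posts in two time windows (assumed data).
last_month = ["ai"] * 40 + ["biology"] * 30 + ["rant"] * 30
this_week = ["ai"] * 10 + ["biology"] * 10 + ["rant"] * 10 + ["covid"] * 70

shift = total_variation_distance(last_month, this_week)
print(f"topic shift: {shift:.2f}")  # prints "topic shift: 0.70"
```

A value this far from zero says that most of this week's traffic looks unlike what the model saw a month ago. Where to set an alert threshold is a judgment call that depends on the application.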
## Model Drift

As discussed above, your model is now being used on data whose distribution it is unfamiliar with. Let’s continue with the example of Covid-19. Not only does the amount of content on that topic increase, but the number of product searches relating to masks and sanitizers increases too. There can be many possible trends or outliers one can expect. Your Machine Learning model, if trained on static data, cannot account for these changes. It suffers from something called model drift; when the cause is a change in the input distribution, as it is here, this is also known as covariate shift.

The tests used to track a model’s performance can naturally help in detecting model drift. One can also set up change-detection tests that detect drift as a change in the statistics of the data-generating process.

It is possible to reduce the drift by providing some contextual information. In the case of Covid-19, this could be some information indicating that the text or tweet belongs to a topic that has been trending recently. This way the model can condition its prediction on such specific information, although drift won’t be eliminated completely.

## Solution – Retraining your model

### Questions to ask before retraining

So far we have established the idea of model drift. How do we solve it? We can retrain our model on the new data. So should we call `model.fit()` again and call it a day?

• How much data to take for retraining?
How much new data should we take for retraining? Should we use the older data? If yes, what should be a good mix? Domain knowledge and experience can help here. If you know that the data distribution changes frequently, you can take a larger proportion of the new data. Similarly, if you don’t have that many new examples but your model performance worsens, you can take all the new data and a sizeable chunk of old data to retrain.

• How frequently should you retrain?
How frequently should our retraining jobs run? If you receive new data periodically, you might want to schedule retraining jobs accordingly. For example, if you are predicting which applicants would get admitted to a school based on their personal information, it makes no sense to run training jobs every day, because you only get new data every semester or year.

• Should you retrain the entire model?
Should we train the entire model with the new data? If your model is huge, training the entire model is expensive and time-consuming. Maybe you should just train a few layers and freeze the rest of the network. For instance, XLNet, a very large deep learning model used for NLP tasks, reportedly cost $245,000 to train. Yes, that amount of money to train a Machine Learning model. Most likely you won’t use that amount of computation power, but training models on cloud GPUs/TPUs can be very expensive. If the model has a size in GBs, storing it can also incur a significant cost. Freezing a portion of the model is possible in deep learning. It is hard to implement in traditional machine learning algorithms, but training those isn’t computationally very expensive compared to deep learning models anyway.

• Should you directly deploy after retraining?
Should we trust our retrained model blindly and deploy it as soon as it is trained? There’s still a risk of the new model performing worse than the older model even after retraining on new data. It is often good practice to let the old model keep serving requests for some time after the retrained model is built. Meanwhile, the retrained model can generate shadow predictions: its predictions aren’t used directly, but are logged to check whether the new model is sane. Once you are satisfied, the older model can be replaced with the newer one. It is possible to automate the deployment of retrained models, but it is not advisable if you aren’t sure. The best way is to do manual deployments initially, until you are sure.
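The shadow-prediction idea from the last question can be sketched as follows. This is a minimal illustration under stated assumptions: the two models are stand-in functions, and `shadow_compare` and `agreement_rate` are hypothetical helper names, not part of any library. The point is only that the old model's output is what the user sees, while the retrained model's output is logged for later comparison.

```python
def shadow_compare(requests, live_model, shadow_model):
    """Serve every request with live_model and log shadow_model's prediction."""
    served, log = [], []
    for x in requests:
        live_pred = live_model(x)
        shadow_pred = shadow_model(x)
        served.append(live_pred)  # only the live prediction reaches the user
        log.append({"input": x, "live": live_pred, "shadow": shadow_pred})
    return served, log


def agreement_rate(log):
    """Fraction of logged requests on which both models gave the same answer."""
    if not log:
        raise ValueError("no shadow predictions have been logged yet")
    return sum(entry["live"] == entry["shadow"] for entry in log) / len(log)


# Hypothetical stand-ins for the deployed model and the retrained model.
def old_model(x):
    return x >= 5


def retrained_model(x):
    return x >= 4


served, log = shadow_compare(range(10), old_model, retrained_model)
disagreements = [entry["input"] for entry in log if entry["live"] != entry["shadow"]]

print(f"agreement: {agreement_rate(log):.0%}")  # prints "agreement: 90%"
print(f"inputs to inspect: {disagreements}")    # prints "inputs to inspect: [4]"
```

Agreement alone does not say which model is right. Once true labels arrive, the same log can be used to compare each model's error before switching over.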

There are many more questions one can ask depending on the application and the business.

## Setting up infrastructure for model retraining

As with most industry use cases of Machine Learning, the Machine Learning code is rarely the major part of the system. Far more of the concern and effort goes into the surrounding infrastructure code. Before we get into an example, let’s look at a few useful tools:

### Containers

Containers are isolated applications. They let you package an application’s code together with its dependencies, so that the same application can be built and run consistently across systems. They run in isolated environments and do not interfere with the rest of the system. They are also more resource-efficient than virtual machines.

### Kubernetes

It is a tool to manage containers. It helps scale and manage containerized applications.

In our case, if we wish to automate the model retraining process, we need to set up a training job on Kubernetes. A Kubernetes job is a controller that makes sure pods complete their work. Pods are the smallest deployable units in Kubernetes. Instead of running containers directly, Kubernetes runs pods, each of which contains one or more containers. The training job would finish the training and store the model somewhere on the cloud. We can then set up a separate inference job that picks up the stored model to make inferences.
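The split between a training job and an inference job can be sketched in plain Python. This is only an illustration of the hand-off through a stored model artifact: the function names, the JSON file format, and the temporary directory standing in for cloud storage are all assumptions, and nothing here talks to Kubernetes itself. In a real setup, each function would be the entry point of its own container.

```python
import json
import tempfile
from pathlib import Path


def training_job(xs, ys, model_path):
    """Fit y = slope * x + intercept by least squares and store the parameters."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    params = {"slope": slope, "intercept": mean_y - slope * mean_x}
    Path(model_path).write_text(json.dumps(params))


def inference_job(xs, model_path):
    """Load whatever model the training job stored and use it for predictions."""
    params = json.loads(Path(model_path).read_text())
    return [params["slope"] * x + params["intercept"] for x in xs]


# A temporary directory stands in for cloud storage (assumption for this sketch).
with tempfile.TemporaryDirectory() as storage:
    model_path = Path(storage) / "model.json"
    training_job([0, 1, 2, 3], [1, 3, 5, 7], model_path)
    predictions = inference_job([4, 5], model_path)

print(predictions)  # prints [9.0, 11.0]
```

Because the two jobs share nothing but the stored file, the training job can be rerun on a schedule without touching the code that serves predictions.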

The above system would be a pretty basic one. There is a potential for a lot more infrastructural development depending on the strategy. Let’s say you want to use a champion-challenger test to select the best model. You’d have a champion model currently in production and you’d have, say, 3 challenger models. All four of them are being evaluated. You decide how many requests would be distributed to each model randomly. Depending on the performance and statistical tests, you make a decision if one of the challenger models performs significantly better than the champion model. Very similar to A/B testing. In the above testing strategy, there would be additional infrastructure required – like setting up processes to distribute requests and logging results for every model, deciding which one is the best and deploying it automatically.
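The request-splitting and logging part of a champion-challenger setup might look roughly like this. It is a sketch under assumptions: the four predictors are toy functions, the traffic weights are arbitrary, and `run_experiment` is a hypothetical helper. It covers only routing and bookkeeping, not the statistical test that the final decision would need.

```python
import random


def run_experiment(requests, predictors, weights, seed=0):
    """Route each (input, label) pair to one randomly chosen model and log the outcome."""
    rng = random.Random(seed)
    names = list(predictors)
    traffic = [weights[name] for name in names]
    results = {name: {"requests": 0, "correct": 0} for name in names}
    for x, label in requests:
        name = rng.choices(names, weights=traffic, k=1)[0]
        results[name]["requests"] += 1
        results[name]["correct"] += int(predictors[name](x) == label)
    return results


def accuracy(stats):
    """Share of correct predictions, or NaN if the model received no traffic."""
    return stats["correct"] / stats["requests"] if stats["requests"] else float("nan")


# Toy models for the task "is x even?" (hypothetical, for illustration only).
predictors = {
    "champion": lambda x: x % 2 == 0 or x % 10 == 1,
    "challenger_a": lambda x: x % 2 == 0,
    "challenger_b": lambda x: x % 3 == 0,
    "challenger_c": lambda x: True,
}
# The champion keeps most of the traffic; each challenger gets a small slice.
weights = {"champion": 0.7, "challenger_a": 0.1, "challenger_b": 0.1, "challenger_c": 0.1}

requests = [(x, x % 2 == 0) for x in range(1000)]
results = run_experiment(requests, predictors, weights)

for name, stats in results.items():
    print(name, stats["requests"], round(accuracy(stats), 3))
```

Because each challenger sees only a small slice of traffic, its accuracy estimate is noisier than the champion's. This is why the text above calls for statistical tests before promoting a challenger.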

### Online Learning

Generally, Machine Learning models are trained offline in batches (on the new data) in the best possible ways by Data Scientists and are then deployed in production. In case of any drift or poor performance, models are retrained and updated. Even the model retraining pipeline can be automated. But what if the model was continuously learning? That is, it updates its parameters every single time it is used, which is close to ‘learning on the fly’. This helps the model pick up variations in the distribution as quickly as possible and reduces the drift in many cases.

For example, you build a model that takes news updates, weather reports, social media data to predict the amount of rainfall in a region. At the end of the day, you have the true measure of rainfall that region experienced. Your model then uses this particular day’s data to make an incremental improvement in the next predictions.
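The daily predict-then-update loop described above can be sketched with a tiny online linear model trained by stochastic gradient descent, one observation at a time. Everything concrete here is an assumption made for illustration: the single "humidity" feature, the synthetic relationship used to generate rainfall, the learning rate, and the class name `OnlineLinearRegressor`.

```python
import random


class OnlineLinearRegressor:
    """Linear model updated with one SGD step per observation."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, target):
        """Take one gradient step on the squared error of a single example."""
        error = self.predict(features) - target
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error
        return error


rng = random.Random(0)
model = OnlineLinearRegressor(n_features=1)

# Simulated stream of days: rainfall = 3 * humidity + 1 (a made-up relationship).
for _ in range(5000):
    humidity = rng.random()
    forecast = model.predict([humidity])    # prediction made in the morning
    observed = 3.0 * humidity + 1.0         # true rainfall known at day's end
    model.update([humidity], observed)      # incremental improvement for tomorrow

print(round(model.predict([0.5]), 2))
```

No batch of historical data is kept around. Each day's observation nudges the parameters and is then discarded, which is what lets the model follow a changing distribution.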

Online learning methods are found to be relatively faster than their batch equivalents. One thing that’s not obvious about online learning is its maintenance: if there are any unexpected changes in the upstream data processing pipelines, it is hard to manage the impact on the online algorithm. Previously, the data would get dumped into cloud storage and the training happened offline, not affecting the currently deployed model until the new one was ready. Now the upstream pipelines are more tightly coupled with the model predictions. In addition, it is hard to pick a test set, as we have no prior assumptions about the distribution. If we pick a test set to evaluate on, we are assuming that it is representative of the data we are operating on.
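One commonly used workaround for the missing test set, which the post itself does not describe, is "test-then-train" (prequential) evaluation: each incoming example is first used to score the model and only afterwards to update it. The sketch below assumes a trivial running-mean model and a made-up constant stream. The names `prequential_errors` and `RunningMean` are hypothetical.

```python
from collections import deque


def prequential_errors(stream, predict, update, window=50):
    """Score each example before the model learns from it.

    Returns the rolling mean absolute error after every example.
    """
    recent = deque(maxlen=window)
    rolling_mae = []
    for features, target in stream:
        recent.append(abs(predict(features) - target))  # evaluate first
        update(features, target)                        # then learn
        rolling_mae.append(sum(recent) / len(recent))
    return rolling_mae


class RunningMean:
    """Toy online model: predicts the mean of all targets seen so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def predict(self, _features):
        return self.total / self.count if self.count else 0.0

    def update(self, _features, target):
        self.total += target
        self.count += 1


model = RunningMean()
stream = [((), 10.0)] * 100  # made-up stream with a constant target
rolling = prequential_errors(stream, model.predict, model.update)

print(rolling[0], rolling[-1])  # prints "10.0 0.0"
```

The rolling window means the reported error reflects recent data only. A sudden rise in it is a sign that the stream has changed, or that something upstream has broken.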

## Conclusion

In this post, we saw how poor Machine Learning can cost a company money and reputation, why it is hard to measure performance of a live model and how we can do it effectively. We also looked at different evaluation strategies for specific examples like recommendation systems and chat bots. Finally, we understood how data drift makes ML dynamic and how we can solve it using retraining.

It is hard to build an ML system from scratch. Especially if you don’t have an in-house team of experienced Machine Learning, Cloud and DevOps engineers. That’s where we can help you! You can create awesome ML models for image classification, object detection, OCR (receipt and invoice automation) easily on our platform and that too with less data. Besides, deploying it is just as easy as a few lines of code. Make your free model today at nanonets.com
