In this tutorial, you will learn what a champion and challenger mechanism is in the context of machine learning, how it works, why it is important, and how to set it up and monitor it using Prevision.

To reproduce the results in this experiment, you can download the Iris dataset from the UCI Machine Learning Repository.


A common practice in a machine learning project is to train, validate, and test different models, then pick the one that best fits a specific need. This process can be tedious but is well worth the effort. Indeed, one first needs to define what “best performing model” means, as this can be assessed in different ways: performance on a set of specific metrics (e.g. accuracy), stability of the model, execution time, complexity, and interpretability/explainability of the model, to cite a few. However, in some scenarios (e.g. a retrained model), the performance of a model can be very close to that of another (such as a previous version of the model), or one may want a challenger model to validate the results of the selected one. Another important aspect, especially in production, is the monitoring of deployed models. Monitoring can, for instance, give insights as to when a model needs to be retrained. In this article, we describe a feature in Prevision called champion/challenger.

Importance of champion and challenger

When testing machine learning models, one usually chooses the best model based on a performance metric such as ROC AUC and deploys it into production. This model is referred to as the “champion”. However, after retraining, you may want to challenge this model with the newly trained one: this is the “challenger” model. For instance, when a new version of the dataset arrives, the machine learning models need to be retrained. Consequently, the champion model on a previous version of the dataset might no longer be the best performing one. The second model, referred to as the challenger in such a scenario, can be worth deploying as well, to help practitioners in their decision making. Similarly, when comparing several machine learning models, the performance of some models can be very close, i.e., the difference may not be statistically significant. To overcome this problem, a statistical comparison can be performed in order to choose the model to deploy (the one whose performance is statistically better than that of all other models). Alternatively, the champion and challenger mechanism can be used, for instance, to deploy the two best performing models. The challenger then gives an idea of the variance of the performance on the production datasets. For example, if the ROC AUC of the champion is 0.94 and that of the challenger is 0.90, the difference gives decision makers an insight into the variability of the performance between the models.
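The statistical comparison described above can be sketched with scikit-learn and SciPy: score two candidates on the same cross-validation folds, rank them into champion and challenger, and run a paired t-test on the fold scores. The two model choices and the 5-fold setup are illustrative assumptions, not Prevision's internals.

```python
# Sketch: pick a champion and a challenger from paired cross-validated
# scores, then test whether their difference is statistically significant.
# Models and fold count are illustrative assumptions.
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}
# Same cv object semantics for every model, so the fold scores are paired.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc_ovr")
          for name, m in models.items()}

ranked = sorted(scores, key=lambda n: scores[n].mean(), reverse=True)
champion, challenger = ranked[0], ranked[1]
# Paired t-test on the fold scores: a large p-value suggests the two
# models are statistically indistinguishable on this data.
_, p_value = stats.ttest_rel(scores[champion], scores[challenger])
print(champion, challenger, p_value)
```

A large p-value is precisely the "performances are very close" scenario in which deploying both models as champion and challenger is attractive.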

Depending on the complexity of the models to be trained, the champion can also be chosen for its interpretability (e.g. a logistic regression model) and the challenger can be a more complex, explainable model (e.g. a neural network). Indeed, in some applications, explaining a model’s predictions becomes more important than raw performance. We discussed the interpretability and explainability of AI models in detail in a blog post series by Mathurin Ache (Explainability vs Interpretability Part I, Part II).

In Prevision, users have the ability to combine the predictions of different models. This process, called blending, has been shown to increase performance compared to a single model. Hence, another use case for the champion and challenger mechanism is to choose a blended model as the champion and a single model as the challenger, or conversely. This setup can help determine, for instance, whether a single model remains preferable to a combined one as the production dataset evolves.
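A minimal form of blending is to average the predicted class probabilities of two models. The sketch below (plain scikit-learn, with an illustrative logistic regression / gradient boosting pair, not Prevision's blending implementation) compares the blend against each single model on log loss:

```python
# Sketch: blend two models by averaging their predicted probabilities,
# then compare the blend against each single model on log loss.
# The model pair and the 70/30 split are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

p_lr = lr.predict_proba(X_te)
p_gb = gb.predict_proba(X_te)
p_blend = (p_lr + p_gb) / 2  # uniform average of class probabilities

for name, p in [("logistic", p_lr), ("boosting", p_gb), ("blend", p_blend)]:
    print(name, round(log_loss(y_te, p), 4))
```

Because log loss is convex in the predicted probabilities, the uniform blend can never do worse than the average of the two single-model losses, which is one reason blending tends to be a safe champion candidate.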

Champion and challenger in Prevision

Prevision offers the ability to set up a champion and a challenger in all the types of experiments available in the platform (classification, regression, object detection, and so on). The setup of the champion and challenger takes place once the AI models are built in or imported into Prevision. From your trained models in the experiments section, select the model you would like to deploy as the champion, for instance the best performing model according to the chosen performance metric. See Figure 1, in which different models are trained to classify Iris flowers into three classes (0 – Setosa, 1 – Versicolour, 2 – Virginica). The performance metric used for this classification is the log loss.

Figure 1: List of trained AI models for the Iris flowers classification problem.
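Outside the platform, the same leaderboard idea can be reproduced with scikit-learn: rank a few candidate models by cross-validated log loss and take the best one as the champion. The candidate list below is an illustrative assumption, not the set of models Prevision trains.

```python
# Sketch: rank candidate models on Iris by cross-validated log loss,
# mirroring the leaderboard idea from Figure 1. Candidates are illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}
# "neg_log_loss" returns the negated loss (higher is better),
# so negate it back to get the log loss and sort ascending.
leaderboard = sorted(
    (-cross_val_score(m, X, y, cv=5, scoring="neg_log_loss").mean(), name)
    for name, m in candidates.items()
)
for loss, name in leaderboard:
    print(f"{name}: {loss:.4f}")
```

The first entry of the leaderboard would be the champion; the second is a natural challenger.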

Once the champion is chosen, click on the deploy button to create a deployment by filling in the fields shown in Figure 2. Please note that the experiment and model fields are prefilled!

Figure 2: Creating a deployment with the champion and challenger mechanism in Prevision.

Enable the challenger model option and select a model to be set as challenger (here a light gradient boosting machine model), then choose the access rights to your deployment – et voila!

Once the model is deployed, the status of the champion and challenger models (versions, scores, etc.) is reported in the main deployment interface (see Figure 3). Prevision also generates an application from which predictions can be requested from each deployed model (see Figure 4). It should be noted that, in the default web app of the deployed models, only the predictions of the main model (the champion) are displayed.

Figure 4: Deployed application for the Iris classification task, automatically generated by Prevision.

Champion and challenger monitoring

Along with the champion and challenger feature, Prevision offers monitoring tools for tracking the performance of models in production. These tools include prediction distribution and feature distribution monitoring.

  • Prediction distribution
    The prediction distribution monitoring feature provides visualizations of the predictions obtained from the main model (the champion) and from the challenger. In the Iris flowers classification task, we simulated an object to be classified from the generated application (Figure 4) and checked the prediction distributions of the champion and challenger models. The resulting figures are reported in Figure 5 and Figure 6, respectively.

Figure 6: Challenger model prediction distribution on Iris flowers classification

Like the champion model, the challenger model predicted the same class with high confidence for the simulated object. Consequently, the champion and challenger mechanism in our example corresponds to the scenario discussed in the previous section, where the performances of the models are close.
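The essence of prediction distribution monitoring can be approximated outside the platform by counting the classes each model predicts on the same incoming records. In this sketch the two models and the "production" sample are simulated for illustration; Prevision computes these distributions automatically.

```python
# Sketch: compare champion vs challenger prediction distributions by
# counting predicted classes on the same "production" records.
# Models and the production sample are simulated for illustration.
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out 30% of the data to stand in for production traffic.
X_tr, X_prod, y_tr, _ = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

champion = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
challenger = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

champ_dist = Counter(champion.predict(X_prod))
chall_dist = Counter(challenger.predict(X_prod))
print("champion:", dict(sorted(champ_dist.items())))
print("challenger:", dict(sorted(chall_dist.items())))
```

A large divergence between the two distributions would be an early warning that the models disagree on production data and that a closer look (or a retrain) is warranted.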

  • Feature distribution
    The feature distribution monitoring feature offers users the ability to visualize each feature in the train and production datasets (see Figure 7). It can be used to track any changes between the train and production datasets, i.e., the drift between the datasets.

Figure 7: Feature distribution in the train and production datasets on the Iris flowers classification.

The two preceding monitoring features can help determine when a model has become obsolete and needs to be retrained. In addition to these features, Prevision also offers users the ability to swap the champion and challenger models without any interruption of service.
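One common way to turn the train-vs-production feature comparison into an automatic drift check is a two-sample Kolmogorov-Smirnov test per feature. The KS test is our illustrative choice here, not necessarily what Prevision uses internally; the "production" data below is the train data with an artificial shift injected into one feature.

```python
# Sketch: flag drift by comparing each feature's train vs production
# distribution with a two-sample Kolmogorov-Smirnov test.
# "Production" is simulated as train data with a shift on one feature.
from scipy.stats import ks_2samp
from sklearn.datasets import load_iris

data = load_iris()
X_train = data.data
X_prod = X_train.copy()
X_prod[:, 0] += 1.0  # simulate drift on the first feature (sepal length)

for i, name in enumerate(data.feature_names):
    stat, p = ks_2samp(X_train[:, i], X_prod[:, i])
    flag = "DRIFT" if p < 0.05 else "ok"
    print(f"{name}: p={p:.4f} {flag}")
```

Only the shifted feature should be flagged; in practice the flag would trigger a retraining job or, at minimum, a closer look at that feature's pipeline.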


In this article, we presented the champion and challenger mechanism and discussed example scenarios in which it can be used. We illustrated through the Iris flower classification task how the mechanism can be set up, and presented two monitoring features offered by Prevision to visualize the prediction distribution of each class label and the drift between the train and production datasets. Finally, we discussed the ability to hot swap the champion and challenger models in production without any service interruption.

About the author

Abdoul Djiberou

Machine Learning Scientist