Stop being a loser: use Prevision for top rankings in Data Science competitions

In this post, I will show you how the automated machine learning capabilities of Prevision can boost your chances of moving up the leaderboard of a machine learning competition. Some would consider it cheating, but who says the winners did not use the same means to end up ahead of the competition 🙂 .

We chose for this post the most well-known machine learning competition platform worldwide: Kaggle. This platform offers many advantages: a large array of different kinds of machine learning competitions/datasets, discussion forums, a Jupyter notebook environment, and notebooks shared by other Kagglers.

The competition we chose is House Prices – Advanced Regression Techniques: the goal is to predict house prices from a large variety of house features.


First Model Iteration

We will use Prevision to quickly build a baseline model, and explore some analysis elements that we can use for data exploration and feature engineering ideas.


1-   Create a new project

Once you are connected to your instance (use the Try it now button for a free trial if you need access), click the button at the top right of the home page to create a new project. You can set the name of your project and add a short description (optional):


2-   Import your dataset:

To import the competition dataset, you have to download it beforehand from the Kaggle platform, then upload it to your project: click the Datasets tab on the left vertical bar, then click the Create Dataset button.

Then select the Import Dataset option and upload your dataset from your machine.

Import Dataset View

Choice of the metric:

Here we don’t have much choice: we have to respect the metric that will be used to evaluate the submissions.

The competition page mentions that:

“Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.”

So the metric that we will choose is the RMSLE – Root Mean Squared Logarithmic Error.
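As a reminder, the RMSLE is simply the RMSE computed on log-transformed values. A minimal NumPy sketch (the `rmsle` helper is just for illustration, not a platform function):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: RMSE on log(1 + y)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Example: predicting 200k for a house that sold at 180k
print(rmsle([180_000], [200_000]))
```

Because of the logarithm, the metric penalizes relative errors rather than absolute ones, so a 20k error on a cheap house costs more than on an expensive one.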


First iteration configuration:

We will select the Quick profile and the default model selection config. The settings that have to be fixed are:

  • the corresponding metric (RMSLE)
  • the dataset (imported in the previous section)
  • the target and the Id column

Feature engineering:

Please take into account that Prevision integrates a large array of built-in feature engineering steps that can be selected/unselected:

Categorical features:

  • Frequency encoding: modalities are converted to their respective frequencies in the dataset
  • Target encoding: modalities are replaced by the average of the target, grouped by modality
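The two categorical encodings above can be sketched by hand in pandas (the toy column names are hypothetical, not from the competition data):

```python
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "C", "B", "A"],
    "SalePrice":    [100, 120, 200, 300, 220, 110],
})

# Frequency encoding: each modality becomes its relative frequency
freq = df["Neighborhood"].value_counts(normalize=True)
df["Neighborhood_freq"] = df["Neighborhood"].map(freq)

# Target encoding: each modality becomes the mean target within that modality
target_mean = df.groupby("Neighborhood")["SalePrice"].mean()
df["Neighborhood_target"] = df["Neighborhood"].map(target_mean)

print(df)
```

Note that naive target encoding like this leaks the target; in practice (and in the platform, as noted later) it should be computed inside the cross-validation folds.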

Advanced features:

  • Polynomial features: features based on products of existing features are added as new features
  • PCA: the main principal components of a PCA are added as new features
  • K-means: K-means cluster numbers are added as new features
  • Row statistics: features based on row-by-row counts are added as new features (number of 0s, number of missing values, …)
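To make the last two transformations concrete, here is a hand-rolled sketch of row statistics and a simple polynomial (product) feature on a toy frame (Prevision computes these internally; this only illustrates the idea):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LotArea":   [8450, 9600, np.nan],
    "GrLivArea": [1710, 0,    1786],
})
num_cols = ["LotArea", "GrLivArea"]

# Row statistics: per-row counts added as new features
df["n_missing"] = df[num_cols].isna().sum(axis=1)
df["n_zeros"] = (df[num_cols] == 0).sum(axis=1)

# A simple polynomial feature: the product of two existing features
df["LotArea_x_GrLivArea"] = df["LotArea"] * df["GrLivArea"]

print(df)
```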


For more information check the official documentation here

Note: Some feature engineering steps have to be done manually in the context of a competition, such as

  • Feature extraction – add new features from existing ones (addition, subtraction, division, aggregation, …)
  • Advanced missing value imputation depending on the feature type & distribution
  • Discarding out-of-range data that are in the train set and not in the test set, in order to align the train and test distributions
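On this dataset, the manual feature extraction step could look like the following (the column names come from the competition data, but the specific combinations are just illustrative choices):

```python
import pandas as pd

train = pd.DataFrame({
    "1stFlrSF": [856, 1262], "2ndFlrSF": [854, 0],
    "FullBath": [2, 2], "HalfBath": [1, 0],
    "YrSold": [2008, 2007], "YearBuilt": [2003, 1976],
})

# Addition: total living surface from the two floors
train["TotalSF"] = train["1stFlrSF"] + train["2ndFlrSF"]

# Aggregation: total bathrooms, weighting half baths by 0.5
train["TotalBath"] = train["FullBath"] + 0.5 * train["HalfBath"]

# Subtraction: age of the house at sale time
train["Age"] = train["YrSold"] - train["YearBuilt"]

print(train[["TotalSF", "TotalBath", "Age"]])
```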


In the first experiment, I got a cross-validation performance of 0.134. You can then test a first submission:

  • upload the test dataset on your project workspace
  • Go to the “Predictions” tab and launch new predictions :
    • Select the best model
    • Select the test dataset
    • Launch the predictions

Then download the predictions and submit them on the Kaggle platform to get an idea of what your rank would be with this baseline model.

My first submission ranked 1248/4719, which is not too bad for a first try.


Apply Kagglers’ tricks using Prevision


1-   Use other types of models:

In the first iteration, I selected Linear Regression and XGBoost. You can create new “versions” of your experiment using other types of models: for example, in the second version I selected CatBoost and XGBoost, and CatBoost performed better. In the third version it was interesting to compare CatBoost with LightGBM, etc.


2-   Increase the Cross validation folding:

Another simple yet efficient technique to slightly increase model performance consists in increasing the number of cross-validation folds.

This can be done directly in the UI by changing the training profile from Quick (which uses 3 folds) to Normal (4 folds) or Advanced (5 folds):

Please take into account that some built-in feature engineering transformations, such as statistics-based encodings (frequency/target encoding) or PCA/k-means-based features, are computed within the cross-validation training loop.
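Outside the platform, the same idea maps directly onto scikit-learn's cross-validation utilities (a sketch on synthetic data; the 3/4/5 fold counts mirror the Quick/Normal/Advanced profiles):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Quick / Normal / Advanced profiles use 3 / 4 / 5 folds respectively
for n_folds in (3, 4, 5):
    scores = cross_val_score(Ridge(), X, y, cv=n_folds,
                             scoring="neg_root_mean_squared_error")
    print(f"{n_folds} folds: mean RMSE = {-scores.mean():.2f}")
```

More folds mean each model trains on a larger share of the data, which usually helps a little at the cost of longer training.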


3-   Drop the most mispredicted samples from the dataset:

Usually, the most badly predicted samples are most probably mislabeled. You can get a cleaner dataset by dropping them, which will increase model performance:

First, download the cross-validation predictions of your model, extract the top 5% most mispredicted samples, drop them from the training dataset, and re-launch the experiment.
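The same cleaning step can be sketched by hand on the downloaded cross-validation file (the column names and the synthetic data are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical cross-validation export: one row per training sample
cv = pd.DataFrame({
    "Id": range(1, 101),
    "SalePrice": np.linspace(50_000, 500_000, 100),
})
rng = np.random.default_rng(0)
cv["prediction"] = cv["SalePrice"] * rng.normal(1.0, 0.1, size=len(cv))

# Absolute error on the log scale, consistent with the RMSLE metric
cv["error"] = (np.log1p(cv["prediction"]) - np.log1p(cv["SalePrice"])).abs()

# Keep everything below the 95th error percentile, i.e. drop the 5% worst
threshold = cv["error"].quantile(0.95)
clean_ids = cv.loc[cv["error"] <= threshold, "Id"]
print(f"kept {len(clean_ids)} of {len(cv)} samples")
```

You would then filter the original training dataset to `clean_ids` and re-launch the experiment on it.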


4-   Blending:

Model blending is a type of model stacking that many Kagglers use to increase performance. The technique consists in:

  1. training a diverse set of models on the original features of your training dataset (level-1 models),
  2. training level-2 models on the cross-validation predictions of the level-1 models,
  3. obtaining the level-3 model by averaging the level-2 models.

⇒ It usually ends up with a better overall model than any single one.

3-Level Model Stacking in Prevision
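The three levels described above can be sketched with scikit-learn (a minimal hand-rolled version on synthetic data, not Prevision's actual implementation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Level 1: diverse models trained on the original features;
# their out-of-fold predictions become the level-2 training features
level1 = [Ridge(), RandomForestRegressor(n_estimators=50, random_state=0)]
oof = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in level1])

# Level 2: models trained on the level-1 cross-validation predictions
level2 = [Ridge(), Lasso()]
level2_preds = np.column_stack(
    [cross_val_predict(m, oof, y, cv=5) for m in level2]
)

# Level 3: simple average of the level-2 predictions
blend = level2_preds.mean(axis=1)
print("blend shape:", blend.shape)
```

Using out-of-fold predictions (rather than in-sample ones) at each level is what keeps the upper levels from overfitting to the lower ones.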

Remark: keep in mind that this type of model is rarely used in real business projects, because blended models are slow, which is impractical when the model runs in production in a latency-sensitive system, and above all they are not easily explainable (not ideal for non-specialists who want to understand the results the model provides).

This step can be very tedious when done manually. Fortunately, with Prevision all you have to do is switch on the “Blend” option in the model configuration:

Model settings in the Experiment Configuration tab

Execution Graph

The execution graph allows you to see the progress of the tasks in Prevision. It is also on this graph that you can find the 3-level models described above.

5-   Pseudo-labelling

A very common technique that Kagglers use to move up the leaderboard is called pseudo-labelling: it consists in adding confidently predicted test data to your training data. Check out this excellent post to get more information about pseudo-labelling.

  • Select your best model found by Prevision
  • Predict your test submission using the confidence option: this adds confidence interval columns
  • Add the confidently predicted test observations to the initial training data
  • Re-launch a new experiment on the combined data
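The steps above can be sketched by hand like this (the confidence column names and the 10% width threshold are illustrative assumptions, not platform defaults):

```python
import pandas as pd

# Hypothetical prediction export with confidence interval columns
test_preds = pd.DataFrame({
    "Id": range(1, 6),
    "SalePrice":   [120_000, 250_000, 90_000, 310_000, 180_000],
    "lower_bound": [115_000, 200_000, 88_000, 250_000, 170_000],
    "upper_bound": [125_000, 320_000, 92_000, 380_000, 195_000],
})

# Keep only "confident" rows: narrow interval relative to the prediction
width = (test_preds["upper_bound"] - test_preds["lower_bound"]) / test_preds["SalePrice"]
confident = test_preds[width < 0.10]

train = pd.DataFrame({"Id": [1001], "SalePrice": [150_000]})

# Pseudo-labels: treat confident predictions as real labels and retrain
augmented = pd.concat(
    [train, confident[["Id", "SalePrice"]]], ignore_index=True
)
print(f"added {len(confident)} pseudo-labelled rows")
```

The new experiment is then launched on `augmented` instead of the original training data.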


6- Apply all tricks and submit on Kaggle

Submission Score using Prevision

Top 3% on House Prices Competition

By combining the advice described above and the variety of features provided by Prevision, you can achieve a score of 0.11511, allowing you to rank 123rd out of 4731 (top 3%).

To repeat YouTuber Khabane Lame’s famous gesture:

Khabane Lame Image from Deep Dream Generator


I hope you enjoyed the post, and that you’ll try out Prevision to participate in data science competitions. It will save you a lot of time and automate many operations that can be very painful to implement manually! Sometimes it takes me a whole weekend to barely send two submissions; with the platform it is straightforward, and you can test different experiment configurations in just a few clicks.

Zeineb Ghrib

About the author

Zeineb Ghrib

Data Scientist