Stop being a loser: use Prevision.io for top rankings in Data Science competitions
In this post, I will show you how the automated machine learning capabilities of Prevision.io can boost your chances of moving up the leaderboard of a machine learning competition. Some would consider it cheating, but who is to say the winners did not use the same means to end up ahead of the competition 🙂 .
For this post, we chose the world's best-known machine learning competition platform: Kaggle. This platform offers many advantages: a large array of machine learning competitions and datasets, discussion forums, a Jupyter notebook environment, and notebooks shared by other Kagglers.
First Model Iteration
We will use Prevision.io to quickly build a baseline model, and explore some analysis elements that we can use for data exploration and feature engineering ideas.
1- Create a new project
Once you are connected to your Prevision.io instance (go to the Try it now button on www.prevision.io for a free trial if you need access), click on the button at the top right of the home page to create a new project. You can set the name of your project and add a small description (optional):
2- Import your dataset:
To import the competition dataset, first download it from the Kaggle platform, then upload it to your project: click on the Datasets tab in the left vertical bar, then click on the Create Dataset button.
Then select the Import Dataset option and upload your dataset from your machine.
Import Dataset View
Choice of the metric:
Here we don’t have much choice: we have to respect the metric that will be used to evaluate the submissions:
Here it is mentioned that:
“Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.”
So the metric we will choose is the RMSLE – Root Mean Squared Logarithmic Error.
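For intuition, the RMSLE is simply an RMSE computed on log-transformed values. A minimal NumPy sketch (not Prevision.io's internal implementation):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: RMSE on log-transformed values.
    log1p (log(1 + x)) is used so that a value of 0 does not produce -inf."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# A common trick that follows from this definition: optimizing plain RMSE
# on log1p(target) is equivalent to optimizing RMSLE on the original target.
```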
First iteration configuration:
We will select the Quick profile and the default model selection config. The settings that must be specified are:
- the corresponding metric (RMSLE)
- the dataset (imported in the previous section)
- the target and the Id column
Please note that Prevision integrates a large array of feature engineering techniques that can be selected/unselected:
- Frequency encoding: modalities are converted to their respective frequencies in the dataset
- Target encoding: modalities are replaced by the average of the target, grouped by modality
- Polynomial features: features based on products of existing features are added as new features
- PCA: main components of the PCA
- K-means: K-means cluster numbers are added as new features
- Row statistics: features based on row by row counts are added as new features (number of 0, number of missing values, …)
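To make the first two encodings concrete, here is a minimal pandas sketch of frequency and target encoding on an invented toy dataset (the column names are illustrative, not from the competition):

```python
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "B", "C"],
    "sale_price":   [100, 120, 200, 210, 190, 300],
})

# Frequency encoding: each modality is replaced by its frequency in the dataset.
freq = df["neighborhood"].value_counts(normalize=True)
df["neighborhood_freq"] = df["neighborhood"].map(freq)

# Target encoding: each modality is replaced by the mean of the target,
# grouped by modality.
target_mean = df.groupby("neighborhood")["sale_price"].mean()
df["neighborhood_te"] = df["neighborhood"].map(target_mean)
```

In practice (and in Prevision.io's built-in version), target encoding is computed inside cross-validation folds to avoid leaking the target into the features.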
For more information check the official documentation here
Note: Some feature engineering steps have to be done manually in the context of a competition, such as
- Feature extraction – adding new features from existing ones (addition, subtraction, division, aggregation, …)
- Advanced missing value imputation depending on the feature type & distribution
- Discarding out-of-range data that appear in the train set but not in the test set, in order to align the two data distributions
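The first two manual steps can be sketched in a few lines of pandas. The column names below are hypothetical House Prices-style features, chosen purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical columns; names are illustrative only.
train = pd.DataFrame({
    "first_floor_sf":  [850, 1200, 950],
    "second_floor_sf": [0, 800, np.nan],
    "year_built":      [1960, 2005, 1990],
})

# Feature extraction: combine existing columns into new ones.
train["total_sf"] = train["first_floor_sf"] + train["second_floor_sf"].fillna(0)
train["house_age"] = 2011 - train["year_built"]

# Type-aware missing-value imputation: a missing second floor most likely
# means there is none, so 0 is more sensible here than the column mean.
train["second_floor_sf"] = train["second_floor_sf"].fillna(0)
```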
In the first experiment, I got a cross-validation performance of 0.134. You can now test a first submission:
- upload the test dataset on your project workspace
- Go to the “Predictions” tab and launch new predictions:
- Select the best model
- Select the test dataset
- Launch the predictions
Then download the predictions and submit them on the Kaggle platform to get an idea of your rank with this baseline model.
My first submission got rank 1248/4719, which is not too bad for a first try.
Apply Kagglers’ tricks using Prevision.io:
1- Use other types of models:
In the first iteration, I selected Linear Regression and XGBoost. You can create new “versions” of your experiment using other types of models: for example, in the second version I selected CatBoost and XGBoost, and CatBoost was more performant. In the third version it was interesting to compare CatBoost with LightGBM, etc.
2- Increase the Cross validation folding:
Another simple yet efficient technique to slightly increase model performance consists in increasing the number of folds.
This can be done directly in Prevision.io UI by changing the training Profile from Quick (which uses 3 folds) to Normal (4 folds) or Advanced (5 folds):
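Why does this help? With more folds, each model trains on a larger share of the data. A tiny sketch of how fold splitting works (a simplified stand-in for what the platform does internally):

```python
import numpy as np

def kfold_indices(n_samples, n_folds):
    """Split sample indices into n_folds roughly equal validation folds."""
    return np.array_split(np.arange(n_samples), n_folds)

# With 3 folds each model trains on 2/3 of the data; with 5 folds, on 4/5,
# which usually gives a small but real performance boost.
folds_quick = kfold_indices(120, 3)     # Quick profile: 3 folds
folds_advanced = kfold_indices(120, 5)  # Advanced profile: 5 folds
```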
Please note that some built-in feature engineering transformations, such as statistics-based encodings (frequency/target encoding) or PCA/k-means-based features, are computed within the cross-validation training procedure.
3- Drop the most mispredicted samples from the dataset:
Usually, the most badly predicted samples are quite likely mislabeled. You can get a cleaner dataset by dropping them, which will increase model performance:
First, download the cross-validation file of your model, extract the top 5% most mispredicted samples, drop them from the training dataset, and re-launch the experiment.
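This filtering step can be sketched in pandas. The file format below (target plus out-of-fold prediction per row) is an assumption about what the downloaded cross-validation file contains, and the tiny example drops the worst 20% instead of 5% so the effect is visible:

```python
import pandas as pd

# Assumed format of the downloaded cross-validation file: the target and
# the out-of-fold prediction for every training row.
cv = pd.DataFrame({
    "sale_price": [100.0, 200.0, 300.0, 400.0, 500.0],
    "oof_pred":   [105.0, 195.0, 310.0, 390.0, 900.0],  # last row badly predicted
})

# Absolute error of each out-of-fold prediction.
cv["abs_error"] = (cv["sale_price"] - cv["oof_pred"]).abs()

# Keep the 80% best-predicted rows (the post suggests dropping the top 5%).
threshold = cv["abs_error"].quantile(0.80)
clean = cv[cv["abs_error"] <= threshold]
```

The `clean` dataframe would then be re-uploaded as a new training dataset.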
4- Blend your models:
Model blending is a type of model stacking that many Kagglers use to increase performance. The technique consists in:
- training diverse models on the original features of your training dataset (1-level models),
- training 2-level models on the cross-validation predictions of the 1-level models,
- obtaining a 3-level model by averaging the 2-level models.
⇒ It usually ends up with a more efficient overall model than any single one.
3-Level Model Stacking in Prevision.io
Remark: keep in mind that this type of model is rarely used in real business projects: stacked models are slow, which is impractical when the model runs in production in a latency-intolerant system, and above all they are not easily explainable (which is not ideal for non-specialists who want to understand the results the model provides).
This step can be very tedious when done manually. Fortunately, with Prevision all you have to do is switch on the “Blend” option in the model configuration:
Models settings in Experiment Configuration Tab
Prevision.io Execution Graph
The execution graph allows you to see the progress of the tasks in Prevision.io. It is also on this graph that you can find the 3-level models described above.
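To see why stacking works, here is a self-contained NumPy sketch of the first two levels on synthetic data. The two 1-level models are deliberately weak (each sees only a subset of features), and the 2-level model is fit on their out-of-fold predictions; everything here is illustrative, not Prevision.io's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a predict function."""
    Xb = np.c_[X, np.ones(len(X))]
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xn: np.c_[Xn, np.ones(len(Xn))] @ w

# 1-level models: diverse models trained on the original features.
# Here, two linear fits on different feature subsets, purely to illustrate.
oof = np.zeros((len(X), 2))
folds = np.array_split(np.arange(len(X)), 5)
for val in folds:
    trn = np.setdiff1d(np.arange(len(X)), val)
    m1 = fit_linear(X[trn][:, :2], y[trn])   # sees features 0 and 1
    m2 = fit_linear(X[trn][:, 1:], y[trn])   # sees features 1 and 2
    oof[val, 0] = m1(X[val][:, :2])
    oof[val, 1] = m2(X[val][:, 1:])

# 2-level model: trained on the cross-validation predictions of level 1.
level2 = fit_linear(oof, y)
blend_pred = level2(oof)
```

The blended predictions end up more accurate than either base model alone, because the 2-level model learns how to weight their complementary errors.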
5- Pseudo labelling:
A very common technique that Kagglers use to move up the leaderboard is called pseudo labelling: it consists in adding confidently predicted test data to your training data. Check out this excellent post to get more information about pseudo-labelling.
- Select your best model found by Prevision.io
- Predict on your test dataset using the confidence option: it adds confidence interval columns
- Add confident predicted test observations to the initial training data
- Re-launch a new experiment on the combined data
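The steps above can be sketched in pandas. The column names (`pred`, `lower`, `upper`) are an assumption about what the confidence-interval export looks like, and the "confident" filter here simply keeps rows with a narrow interval:

```python
import pandas as pd

# Assumed export format: prediction plus confidence interval bounds per test row.
test_preds = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0],
    "pred":    [110.0, 205.0, 290.0, 400.0],
    "lower":   [100.0, 120.0, 280.0, 150.0],
    "upper":   [120.0, 300.0, 300.0, 700.0],
})
train = pd.DataFrame({"feature": [0.5, 1.5], "sale_price": [95.0, 160.0]})

# Keep only confident rows: those with a narrow confidence interval.
width = test_preds["upper"] - test_preds["lower"]
confident = test_preds[width <= width.quantile(0.5)]

# Use their predictions as pseudo-labels and append them to the training set.
pseudo = confident.rename(columns={"pred": "sale_price"})[["feature", "sale_price"]]
augmented = pd.concat([train, pseudo], ignore_index=True)
```

The `augmented` dataframe would then be uploaded as the training dataset for the new experiment.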
6- Apply all tricks and submit on Kaggle
Submission Score using Prevision.io
Top 3% on House Prices Competition
By combining the advice described above and the variety of features offered by Prevision.io, you can achieve a score of 0.11511, ranking 123 out of 4731 (top 3%).
To repeat YouTuber Khabane Lame’s famous gesture:
Khabane Lame Image from Deep Dream Generator
I hope you enjoyed the post, and that you’ll try out Prevision.io for data science competitions. It will save you a lot of time and automate many operations that are painful to implement manually! Sometimes it takes me a whole weekend to barely send two submissions; with the Prevision.io platform it is straightforward, and you can test different experiment configurations with just a few clicks.