If you are here, you probably already have the R SDK installed in your environment, configured and ready to go. You should also have created a Prevision.io project containing 2 datasets. If not, please refer to the previous blog post in this series, available on our website.
Your project as of now, please note the number “2” near the data icon
First things first, we will create an experiment inside our project. An experiment can be seen as a container of models and versions built on datasets.
For the sake of the tutorial, we want to create a tabular regression experiment, based on Prevision.io AutoML that will later host our models. To do so, simply type the following:
experiment = create_experiment(project_id = project$`_id`, name = "day_ahead", provider = "prevision-auto-ml", data_type = "tabular", training_type = "regression")
Let’s describe this call:
- project_id, id of the project. Remember, if you have forgotten the project_id, you can still retrieve it thanks to the get_project_id_from_name() function that takes a project’s name and returns the corresponding id
- name, the name of your experiment
- provider, the experiment provider. Can be either “prevision-auto-ml” for AutoML modelling or can be set to “external” if you want to import your own, already built model
- data_type, type of data (“tabular”, “images” or “timeseries”)
- training_type, type of training you want to achieve (“regression”, “classification”, “multiclassification”, “clustering”, “object-detection” or “text-similarity”)
Also note that this project could have been done with the time series regression part of Prevision.io’s AutoML (I’ll let you try it), but the added complexity isn’t worth it for this example.
Okay, you have created a blank experiment that you can see in your Prevision.io project:
Blank experiment created from the R SDK
The next step is actually to create a new “experiment_version” that will, in fact, start the actual modelling process. For a first try, we want to create a very basic linear regression, without much optimization as a starting point. To do it, please type the following:
experiment_version = create_experiment_version(experiment_id = experiment$`_id`, experiment_description = "First iteration with lite LR only", dataset_id = pio_train$`_id`, target_column = "TARGET", holdout_dataset_id = pio_valid$`_id`, fold_column = "fold", lite_models = list("LR"))
Let’s describe this call:
- experiment_id, id of the experiment on which we create a new version
- experiment_description, text explaining what is being done (as a reminder)
- dataset_id, id of the dataset we want to model on
- target_column, name of the column to learn in the training dataset
- holdout_dataset_id, id of the holdout / validation dataset
- fold_column, name of the column containing fold index in the training dataset
- lite_models, list of model families we want to test in our experiment. Models listed here will use default hyperparameters (we will see later in this article how to optimise them)
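To make the role of fold_column more concrete, here is a minimal base R sketch of a cross validation that respects a precomputed fold column. It does not call the SDK and is not Prevision.io's implementation: the toy data, the 3 folds and the use of lm() are assumptions made purely for illustration. Only the column names TARGET, LAG_1 and fold come from this series.

```r
# Toy training set with a precomputed fold column (values are made up)
set.seed(42)
train <- data.frame(
  LAG_1 = runif(60, 40000, 60000),
  fold  = rep(1:3, each = 20)
)
train$TARGET <- 0.9 * train$LAG_1 + rnorm(60, sd = 1000)

# Train on all folds but one, predict the held-out fold, and repeat
cv_pred <- rep(NA_real_, nrow(train))
for (k in sort(unique(train$fold))) {
  held_out <- train$fold == k
  fit <- lm(TARGET ~ LAG_1, data = train[!held_out, ])
  cv_pred[held_out] <- predict(fit, newdata = train[held_out, ])
}

# Out-of-fold RMSE: every prediction comes from a model that never saw that row
cv_rmse <- sqrt(mean((train$TARGET - cv_pred)^2))
print(cv_rmse)
```

The point of supplying your own fold column is that you decide which rows are held out together (for instance whole periods of time), instead of leaving it to a random split.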
The parameters of this function can be quite complex since they cover every kind of experiment, but don’t worry: all functions from the R SDK package are fully documented. You can read the documentation either on the product’s documentation page or, better, directly through the ? call (here ?create_experiment_version), if you use RStudio for example.
As you can see, a first version of the experiment has been trained with a quite basic linear regression model.
Predictive performance of the first version of our experiment
We can see in the UI that the RMSE estimated in cross validation (which respects our folding strategy) is 4,925 and that the hold out score is 4,879, which is on par (keep in mind that the validation dataset is based on 2021 data, so this shows that the model is stable).
We can also get this level of analytics by typing (once the model is trained):
experiment_version_info = get_experiment_version_info(experiment_version$`_id`)
experiment_version_info$experiment_version_params$metric # Metric optimised
experiment_version_info$best_model$metric$value # Metric value estimated in CV
experiment_version_info$holdout_score # Hold out true score
That being said, this level of performance isn’t “that” great for a day ahead forecast. Indeed, if we compute the RMSE between TARGET and LAG_1 in the validation dataset, we get 3,972, which is better than the linear regression we just trained.
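For reference, this naive baseline is easy to compute yourself. The sketch below uses a tiny made-up data frame in place of the real validation set, so the number it prints is not the 3,972 quoted above; only the TARGET and LAG_1 column names come from this series.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy stand-in for the validation set (values are made up for illustration)
valid <- data.frame(
  TARGET = c(50000, 52000, 51000, 49000),
  LAG_1  = c(49000, 50000, 52000, 51000)
)

# Naive forecast: use the previous day's value (LAG_1) as the prediction
baseline_rmse <- rmse(valid$TARGET, valid$LAG_1)
print(baseline_rmse)
```

Any model worth deploying should beat this number on the same rows, which is why it makes a handy sanity check before looking at AutoML scores.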
Don’t worry, we will make a new version of the experiment, with way more models that will have their hyperparameters tuned and more feature engineering tested. To do so, we could try the following:
experiment_version_2 = create_experiment_version(experiment_id = experiment$`_id`, experiment_description = "Second iteration with lot of different algorithms and more feature engineering", dataset_id = pio_train$`_id`, target_column = "TARGET", holdout_dataset_id = pio_valid$`_id`, fold_column = "fold", normal_models = list("LR", "XGB", "CB", "RF"), features_engineering_selected_list = list("Date"))
As you can see, some options in this call have changed:
- normal_models, list of technologies that will be trained and optimized. The same list of models can be set in the lite_models or normal_models argument, but only the latter provides hyperparameter optimization. The notable change here is that gradient boosted trees and a random forest have been added to the experiment.
- features_engineering_selected_list, list of feature engineering steps that Prevision.io will try during the modelling phase. Here we have added “Date”, which will extract date information from the TS column (day, month, year, weekday, hour, minute…)
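To give an idea of what date feature extraction means in practice, here is a small base R sketch that derives the same kind of components from a timestamp column. It only illustrates the concept and is not the code Prevision.io runs internally; the timestamps and output column names are assumptions.

```r
# A few timestamps standing in for the TS column (illustrative values)
ts <- as.POSIXct(c("2021-01-04 08:30:00", "2021-07-14 23:00:00"), tz = "UTC")
lt <- as.POSIXlt(ts, tz = "UTC")

date_features <- data.frame(
  year    = lt$year + 1900,  # POSIXlt stores years since 1900
  month   = lt$mon + 1,      # POSIXlt months are 0-based
  day     = lt$mday,
  weekday = lt$wday,         # 0 = Sunday ... 6 = Saturday
  hour    = lt$hour,
  minute  = lt$min
)
print(date_features)
```

Components like weekday and hour let tree-based models pick up weekly and intraday patterns that a raw timestamp does not expose.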
Since we asked Prevision.io to train and optimize a couple of algorithms, this experiment will take more time to complete. However, please keep in mind that it’s not necessary to wait for the absolute end of it. As soon as a model that matches your criteria is available, you can stop the training of the experiment. If this is something you are interested in, just type stop_experiment_version(experiment_version_id = experiment_version_2$`_id`). For the sake of this tutorial, I have let it run until the end.
As you can see here, 9 models have been trained, with a much better score. We achieved around 1,300 RMSE in CV and ~ 1,500 on the validation dataset (to be fair, the small spread we see here is mostly caused by the COVID lockdown in France [who said business knowledge was useless?]).
Second iteration, way better performance 💪
This increase of performance is explained by 2 major changes we have made:
- Hyperparameter tuning & date feature extraction: the linear regression that scored ~ 4,900 on the previous run scores a ~ 2,500 RMSE here
- Adding new algorithms to the experiment. Here, CatBoost (“CB” for short) provides the highest performance, decreasing the error to ~ 1,500
The optimised linear regression of this run outperforms the first version
Algorithm family matters here: gradient boosted trees lead the pack by far
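To put these scores in perspective, the short calculation below expresses each one as a relative error reduction versus the naive LAG_1 baseline. The figures are the approximate hold out values quoted in this article, typed in by hand rather than fetched from the platform.

```r
# Approximate hold out RMSE values quoted in this article
rmse_scores <- c(
  lite_lr  = 4879,  # first version, lite linear regression
  naive    = 3972,  # TARGET vs LAG_1 baseline
  catboost = 1500   # second version, best model (approximate)
)

# Relative error reduction versus the naive baseline, in percent
# (a negative value means the model is worse than the baseline)
reduction_pct <- round(100 * (1 - rmse_scores / rmse_scores[["naive"]]), 1)
print(reduction_pct)
```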
Starting from here, you can iterate over and over by:
- Adding some more feature engineering to the experiment (param: features_engineering_selected_list)
- Giving more time to optimize hyperparameters (param: profile)
- Stacking models (param: with_blend), although I recommend testing it on more complex datasets
I’ll let you toy with the documentation if you want to go deeper.
If you want to analyse the results more precisely, typically those of the best model, you can do it either in the UI or with the R SDK. For the latter, you first need to retrieve the id of the model, “CB-3” in my case. You can do it in two ways:
- Either identify the resource within the URL. The model_id is located after the /model as seen on the following screenshot (cheap but effective way)
- Or by retrieving it directly from the R SDK typing
experiment_version_info_2 = get_experiment_version_info(experiment_version_2$`_id`)
experiment_version_info_2$best_model$`_id`
The string should hopefully match
Once done, feel free to toy with available functions. For instance, let us retrieve the feature importance of our model:
fi = get_model_feature_importance(model_id = experiment_version_info_2$best_model$`_id`)
Here is the content of fi as displayed in my local RStudio. We clearly see that daily seasonality matters, since LAG_1 accounts for ~ 35% of the available information.
Feature importance of our CB-3 model
Many functions can be used for analytics, such as:
- get_model_feature_importance (see above)
- get_model_hyperparameters, retrieve hyperparameters found by Prevision.io
- get_model_cv, retrieve the complete cross validation file. This is a must have for advanced custom analytics!
- get_model_infos, retrieve overall model information (training time, prediction duration, error estimation, stability estimation, …)
That being said, we now have a model that is “good enough”. We will see how we can deploy it in the next article, before starting to feed it with data coming from pipelines.