Hey folks, if you are here, you already have Prevision.io’s R SDK installed and are ready to go. If not, please check our first blog post about setting up your environment.
In this blog post, we are going to see how we can easily push local data into Prevision.io thanks to the R SDK. The first thing to consider is that data sets, just like any resources involved in a Machine Learning project, belong to – guess what – a project!
Authentication to Prevision.io’s instance
First, we will make sure that you have loaded the SDK and established the connection to your Prevision.io instance. To do so please type:
library(previsionio)
pio_url = "https://.prevision.io"
pio_tkn = ""
pio_init(pio_tkn, pio_url)
Replace the values of pio_url and pio_tkn with your own instance URL and token. If you’re not sure what this means, feel free to refresh your memory with the first blog post in this series.
Once done, we can create our project. Let’s name it “Electricity Forecast”. To do so type the following:
project = create_project(name = "Electricity Forecast", description = "R SDK Demo project")
We can verify that everything is fine by going to Prevision.io’s UI or by using the get_projects() function.
Fresh project created
If, by any chance, you want to share your project with someone on your Prevision.io instance, feel free to do it! We offer collaboration capabilities and rights management.
To share your project with a mate, simply type the following:
create_project_user(project_id = project$`_id`, user_mail = "[email protected]", user_role = "admin")
Make sure to enter your mate’s email address and pick the user_role from the following choices:
- admin, for complete access
- contributor, read & write access, but cannot demote an admin
- viewer, read-only access
As of now, the project is totally empty. We need to fill it with some data in order to move forward. To keep this tutorial simple, I have already prepared a training and a validation (holdout) dataset for you. Here they are:
The training dataset covers the electricity consumption of France at a 30-minute time step, starting on 2014-01-01 and ending on 2020-12-31. The validation (holdout) dataset starts on 2021-01-01 and ends on 2021-09-30.
Each dataset has 7 columns:
- TS, the time stamp
- TARGET, the actual consumption of electricity (in MW)
- PUBLIC_HOLIDAY, boolean, 1 if (French) public holiday, 0 otherwise
- TEMPERATURE, mean temperature across France in °C
- LAG_1, 1-day lag of the TARGET value
- LAG_7, 7-day lag of the TARGET value
- fold, technical identifier used for cross validation strategy, based on year of TS
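If you are curious how columns such as LAG_1, LAG_7 and fold can be derived, here is a minimal sketch in base R on a made-up series. It is not necessarily how the provided files were built: it assumes the rows are sorted by TS and regularly spaced (48 half-hour steps per day), and lag_rows() is a hypothetical helper written for this illustration.

```r
# Toy half-hourly series covering 10 days (UTC avoids daylight-saving gaps).
ts <- seq(
  from = as.POSIXct("2020-12-25 00:00:00", tz = "UTC"),
  by = "30 min",
  length.out = 48 * 10
)
toy <- data.frame(TS = ts, TARGET = seq_along(ts))  # made-up consumption values

# Hypothetical helper: shift a vector down by n rows, padding the start with NA.
lag_rows <- function(x, n) c(rep(NA, n), head(x, -n))

steps_per_day <- 48  # 24 hours * 2 half-hour steps (assumes a regular series)
toy$LAG_1 <- lag_rows(toy$TARGET, 1 * steps_per_day)
toy$LAG_7 <- lag_rows(toy$TARGET, 7 * steps_per_day)

# One fold per calendar year of the time stamp.
toy$fold <- as.integer(format(toy$TS, "%Y"))

# First rows for which both lags are available.
head(toy[!is.na(toy$LAG_7), ])
```

Shifting by a fixed number of rows only gives a true 1-day or 7-day lag when no time stamp is missing, so a join on TS minus one day would be a safer choice for series with gaps.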
Because this kind of use case is sensitive to temperature and also to special days, we have a good starting point here even if we could get more features in order to obtain a better model. The point of this tutorial is to keep things easy (even if the final app I’ll showcase is based on a slightly more complex model with more features involved).
So, what should you do with these datasets? Well, you can first load them into your R environment using your favorite library (I love data.table personally) and explore them. For instance, here is a very quick sample that I made with plotly, which displays the consumption over time on the holdout dataset:
library(data.table)
library(plotly)

train = fread("C:/Users/Florian/Desktop/elec_train.csv") # Make this match the filepath
valid = fread("C:/Users/Florian/Desktop/elec_valid.csv") # Make this match the filepath

plot_ly(valid, x = ~ TS) %>%
  add_trace(
    y = ~ TARGET,
    name = 'ACTUAL CONSUMPTION',
    type = 'scatter',
    mode = 'lines',
    line = list(color = '#19293C', width = 1),
    showlegend = TRUE
  ) %>%
  layout(
    xaxis = list(
      title = "Time",
      gridcolor = 'rgb(255,255,255)',
      zeroline = FALSE
    ),
    yaxis = list(
      title = "Consumption (MW)",
      gridcolor = 'rgb(255,255,255)',
      zeroline = FALSE
    ),
    legend = list(x = 0.92, y = 0.1),
    title = 'Electricity consumption (MW) in France on the validation data'
  )
Electricity consumption (MW) in France on the validation data
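While exploring, a quick sanity check can save you a failed upload later on. The sketch below is only an illustration in base R: check_elec_frame() is a hypothetical helper (not part of the SDK), the toy values are made up, and it assumes TS has been read as a date-time (POSIXct) column.

```r
# Hypothetical helper: TRUE when all expected columns are present and the
# time stamps are regularly spaced.
check_elec_frame <- function(df, step_minutes = 30) {
  expected <- c("TS", "TARGET", "PUBLIC_HOLIDAY", "TEMPERATURE",
                "LAG_1", "LAG_7", "fold")
  missing_cols <- setdiff(expected, names(df))
  if (length(missing_cols) > 0) {
    warning("Missing columns: ", paste(missing_cols, collapse = ", "))
    return(FALSE)
  }
  # Every gap between consecutive time stamps should equal the time step.
  gaps <- as.numeric(diff(df$TS), units = "mins")
  all(gaps == step_minutes)
}

# Made-up rows that mimic the layout described above.
toy <- data.frame(
  TS = seq(as.POSIXct("2021-01-01 00:00:00", tz = "UTC"),
           by = "30 min", length.out = 4),
  TARGET = c(60000, 59000, 58500, 58000),
  PUBLIC_HOLIDAY = 1,
  TEMPERATURE = c(4.1, 4.0, 3.9, 3.8),
  LAG_1 = NA_real_,
  LAG_7 = NA_real_,
  fold = 2021L
)

check_elec_frame(toy)         # regular half-hourly series
check_elec_frame(toy[-2, ])   # one half hour removed, so the step is irregular
```

On real data stored in local time, daylight-saving changes can produce gaps that are not a true problem, so treat a failed check as a prompt to look closer rather than as an error.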
I’ll let you play around with them for a bit, but if you want to leverage the Prevision.io platform, then you will need to import the data into the freshly created project. To do so, execute the following:
pio_train = create_dataset_from_dataframe(project_id = project$`_id`,
                                          dataset_name = "train",
                                          dataframe = train,
                                          zip = TRUE)
pio_valid = create_dataset_from_dataframe(project_id = project$`_id`,
                                          dataset_name = "valid",
                                          dataframe = valid,
                                          zip = FALSE)
This function will, as its name suggests, create a dataset in Prevision.io from an R data frame that is loaded in memory. It takes the following arguments:
- project_id, the id of the project the dataset should belong to. Keep in mind that if you have forgotten the project_id, you can retrieve it with the get_project_id_from_name() function, which takes a project’s name and returns the corresponding id
- dataset_name, the desired name of the dataset you are importing
- dataframe, the R data frame to import
- zip, boolean; if TRUE, the data frame is zipped before being sent
After a while, you will see in Prevision.io’s UI that your project has 2 fresh datasets. Please note that this step will take some time to complete because:
- The data frame is compressed locally before being uploaded to the server (this can be disabled by setting the zip argument to FALSE)
- The zip is then uploaded to the platform (the time depends on your connection, so zipping might be a good idea for big datasets, especially if you have a slow internet connection)
- Datasets in Prevision.io are parsed and automatically analysed, and statistics on them are pre-computed. This step is clearly the longest one, especially for big datasets
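To get a feel for how much compression can shrink what goes over the wire, here is a rough sketch in base R. It does not reproduce what the SDK does internally: it uses gzip through gzfile() as a stand-in for the zip step, and the toy data frame is made up.

```r
# Toy data frame with repetitive content, standing in for a real dataset.
toy <- data.frame(
  TS = format(seq(as.POSIXct("2021-01-01", tz = "UTC"),
                  by = "30 min", length.out = 5000)),
  PUBLIC_HOLIDAY = 0L,
  fold = 2021L
)

plain_path <- tempfile(fileext = ".csv")
gz_path <- tempfile(fileext = ".csv.gz")

# Uncompressed CSV.
write.csv(toy, plain_path, row.names = FALSE)

# Same content written through a gzip-compressed text connection.
con <- gzfile(gz_path, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# Compare the two file sizes (in bytes) and the compression ratio.
sizes <- c(plain = file.size(plain_path), gzip = file.size(gz_path))
print(sizes)
print(round(sizes[["gzip"]] / sizes[["plain"]], 2))
```

Time series files tend to compress well because consecutive rows look alike, but the gain on your own data will depend on its content.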
We could also have used the create_dataset_from_file() function to skip reading the dataset into the R environment, which, again, is convenient for large volumes of data.
Imported & parsed datasets into my own environment
If you want to see this list directly from your R environment, call the following function:
ds = get_datasets(project_id = project$`_id`)
One last thing to keep in mind: if you want to retrieve information about a specific dataset, you can use the get_dataset_info() function. It expects a dataset_id, which you can obtain from get_datasets() [see above] or, more easily, from the get_dataset_id_from_name() function.
Now that the datasets have been imported & parsed into Prevision.io, you can access their statistics directly from the UI or move on to the next blog post in this series, in which we will build Machine Learning models on them 🧐