In this blog post, we are going to give an overview of “pipelines”. First we will define them, then we will show you how to interact with them.

Pipeline overview

A pipeline is a collection of steps that together accomplish a task. Put simply, a pipeline could be, for example (from simplest to most complex):

  • Retrieving a dataset from an external data source (for example: a database) and storing it as a dataset on a regular basis
  • Retrieving a dataset, creating some features on it using functions provided by the platform, and storing the result
  • Retrieving a dataset, creating features using a mix of existing functions and custom ones, and sending this expanded dataset to a deployed experiment in order to get a prediction from it
  • Retrieving a dataset, applying arbitrary transformations to it, and retraining a new version of an existing experiment


A lot can be done thanks to the pipeline feature. The idea is that as long as you want to automate, replicate, and possibly schedule a group of tasks, a pipeline might be the technical answer.


Note that all my examples start with some sort of dataset retrieval because that’s what we do most of the time in real life, but it isn’t mandatory 😉


A pipeline revolves around 3 pillars:

  • Pipeline components: a low-level operation applied to a resource
  • Pipeline templates: template of multiple chained pipeline components
  • Pipeline run: an instance of a pipeline template that can be scheduled and executed (and monitored) multiple times

Pipeline components

A pipeline component is a low-level operation that can be applied to existing resources. Some of them are already built into the platform, but you are free to create your own custom components in your favorite language!

Version 11 and higher of the platform includes a library of pre-built pipeline components. They can be seen directly in the UI in Pipelines > Pipelines components > “Show Prevision components”

Example of some pipeline components already built into the platform


They can also be listed thanks to the following R function:

get_pipelines(project$`_id`, "component")
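As a quick sketch, assuming you have already connected to the platform and fetched your project with the SDK, you can store the result and inspect it with base R. The exact shape of the returned object may vary between SDK versions, so treat the inspection calls below as illustrative:

```r
# List all pipeline components of the current project.
# Assumes `project` was fetched beforehand through the SDK;
# its "_id" field is the project identifier.
components <- get_pipelines(project$`_id`, "component")

# Inspect what came back; depending on the SDK version this is
# typically a list (or data-frame-like object) of component descriptions.
str(components, max.level = 1)
length(components)  # number of components available in the project
```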


Most of them belong to one of the following categories:

  • Retrain of an experiment
  • Prediction of a deployed experiment
  • Dataset read / write operation
  • Dataset basic transformation (filter outliers, sample…)
  • Dataset feature augmentation (add weather feature, add special days information…)


Note that for the latter categories, some requirements apply; they are explained in the description of the component of your choice.

This component will add special days depending on a country code and a date column


Again, there is an equivalent R function that allows you to retrieve this information:

get_pipeline_info(pipeline_id, "component") # set the pipeline_id
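For instance, assuming you have picked a component id from the listing, you could retrieve and read its details like this. The id below is a hypothetical placeholder, and the structure of the returned object depends on your SDK version, so inspect it rather than relying on specific field names:

```r
# Retrieve detailed information about a single component.
# Replace the placeholder id with one of your own component ids.
component_id <- "xxxxxxxxxxxxxxxxxxxxxxxx"  # hypothetical placeholder
component_info <- get_pipeline_info(component_id, "component")

# The description usually states the component's requirements
# (e.g. a country code and a date column for the special-days component).
str(component_info, max.level = 1)
```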

Custom pipeline components

While the pre-built components are a must-have and will definitely simplify your data science journey, they may be insufficient for more advanced projects. For instance, you may want to do some advanced feature engineering, or use your own pre-built word embeddings based on an external corpus… To do so, you will have to create your own custom components, and we’ve got you covered here!


To accomplish this, we have a public GitHub repository with resources aimed at easing custom component creation.


The general idea is that you have to create a repository (GitHub and GitLab are the two types supported in the current version of the platform) and submit your component code in it alongside the YAML & Docker configuration files (please check the README of our public repository). Then, you can import your custom component from the UI (to date, importing resources connected to an external Git repository isn’t supported by the SDKs).

Importing a new custom component from UI


Once done (and this can take some time depending on the requirements you have), it will be listed and available through either the UI or the SDK:

pipeline_component = get_pipeline_info(pipeline_id = "6179618ffa2bb5001c2330e9",
                                       type = "component") # update pipeline_id to match your own

Pipeline templates

Pipeline templates are a succession of pipeline components chained together. The idea is that you can combine multiple operations, coming from built-in or custom components, in order to make a template that fully meets your needs.

As of today, templates are single-input and single-output, even if custom components are a little more flexible than that. Templates are created through the web UI (and I have to admit that it’s way more convenient than through the SDK). To do so, go to the Pipelines > Pipelines templates screen and create a new empty template:

New pipeline template button

Give a name plus an explicit description

And just like that, you have created a default empty template. Please make sure to locate the “+” icon in the middle of the screen and click on it:

Default empty template

List of nodes that you can add after clicking the “+” icon

Then you can add a node that fits your needs, including:

  • Import → import datasets, already present in your environment or coming from data sources
  • Export → export datasets to your environment or in external data sources
  • Prevision components → various built-in components (sample, data augmentation, outlier filtering, …)
  • Custom components → your own previously imported components
  • Predict → prediction on a deployed experiment (so make sure to actually deploy an experiment before using this 🙂 )
  • Retrain → retrain an experiment, this will automatically create a new version of it

Here is what I decided to do (which is fairly basic for this example):

  • Import a dataset already present in the environment
  • Launch a prediction on a deployed experiment
  • Export the results as a dataset directly into my environment

As you can see, the template is pretty simple and generic since we haven’t said which dataset to import or which experiment to predict on. That means that it can be used in multiple “pipeline runs” (in which we can instantiate our template).

Template importing a dataset, sending it into a deployed experiment and saving results

Of course, you are free to make this template more complex. Why not code your own custom component in the middle that retrieves real-time data or performs advanced feature engineering?

Pipeline runs

Since we have (at least) one pipeline template ready to go, we can now create a run on top of it.

A pipeline run is just an instance of a template, which is configured (= all nodes requiring parameters will be filled) and that can be scheduled on a regular basis or just launched manually, at your convenience.

Again, this is done using the UI:

Schedule a new run

Then fill in the requested information. The most important part is selecting the template that will be used each time the run is launched.

Input the template previously created in the run definition

You are almost done! All you need to do here is fill in the parameters of the nodes (if required).

Here, indicate that you want to load the “valid” dataset. Don’t forget to save changes 😉

Here, indicate the deployment of your model and hit the save button.

Finally, name the output “prediction_pipeline_sdk”

The last step consists of choosing the trigger method for your run:

  • Manual trigger whenever you want
  • Periodic trigger at a fixed time (hourly, daily, weekly, …)

Manual trigger

Periodic trigger

For the sake of this tutorial, a manual trigger is sufficient. Once created, it will look like this:

List (of 1 element) of scheduled run

By clicking on the name of the scheduled run, you can list every run done and access logs or even trigger a new run. This last step can be done thanks to the R SDK using the following command:

create_pipeline_trigger(pipeline_id) # make sure to replace pipeline_id
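Putting it together, a minimal R sketch would look like the following. The id is a hypothetical placeholder: replace it with the id of the run you created in the UI (visible on the scheduled run’s screen):

```r
# Trigger the scheduled run manually through the SDK.
run_id <- "xxxxxxxxxxxxxxxxxxxxxxxx"  # hypothetical placeholder: your run id
create_pipeline_trigger(run_id)

# The execution is asynchronous: check the run's logs in the UI
# to follow its progress and confirm the success status.
```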

Finally, we can see that the pipeline has done its job after some time. Indeed, it has a success status, and the deployed model has made 30,668 new predictions!

New prediction done, as seen from model deployment screen (tab prediction)

Output from the pipeline will also be visible directly in the “Data” tab of the platform, with the desired name (which is “prediction_pipeline_sdk”):

Generated dataset in our environment

To move forward, we could code our own custom component that gets data and makes a forecast every day or so, but this is well beyond the scope of this article and will be covered in a more detailed one 🧐

Now that we have a deployed experiment, fed with some data coming out of pipelines, the next step is to actually code a little R Shiny App that will be deployed and accessible through a custom URL 😎


About the author

Florian Laroumagne

Senior Data Scientist & Co-founder

An engineer in computer science and applied mathematics, Florian specialized in the field of Business Intelligence, then Data Science. He is certified by ENSAE and obtained 2nd place in the Best Data Scientist of France competition in 2018. Florian is the co-founder of a startup specializing in the automation of Machine Learning. He now leads several applied predictive modeling projects for various clients.