
Netflix Search Engine

This blog post is the second in a series dedicated to introducing Text Similarity experiments in Prevision.io. If you haven’t already read the first one, which defines the concepts of NLP and Text Similarity, the groups of tasks in this domain, and the methodology, you can find it here.

Below, we will show how the Prevision.io platform provides a complete solution to train a Text Similarity Experiment and how you can easily deploy a web interface using a trained model.

Introduction

This article walks through a text similarity project end to end, from design and training experiments to operational use.

We will use the Netflix Catalog dataset as the index base and user queries as a basis for improvement. You can find this dataset here. All Prevision.io public datasets are available here.

Prerequisites

To train a Text Similarity model with Prevision.io, you need an index file, containing your catalog/text database, and optionally a query file to improve the learning quality of the model.

The mandatory index file must contain at least 2 columns and at least 100 rows (there is no restriction on the maximum number of rows):

  • an identifier (example: PMID_17976362); each identifier must be unique,
  • the textual content of your article (example: “Sarcomas comprise a heterogeneous group of mesenchymal neoplasms. They can be grouped into 2 general categories, soft tissue sarcoma and primary bone sarcoma, which have different staging and treatment approaches”)

Your file must be in csv format with the comma as separator (if you use another separator, such as the semicolon, the platform will recognize it), and the first line must contain the column names. It is desirable to put your text between double quotes to separate the content from the rest, as in the example above, and to avoid reading errors. A minimal example of building such a file is shown after the remarks below.

Remarks:

  • If your file contains other columns, they will not be taken into account by the system.
  • You can name your columns freely; you will map them to the expected fields when you define the training parameters of the experiment.
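For illustration, here is a minimal sketch of how such an index file could be built with Python and pandas. The column names, file name, and catalog entries below are only examples, not values expected by the platform:

```python
import csv
import pandas as pd

# Hypothetical index: one unique id and one textual description per row.
index_df = pd.DataFrame(
    {
        "item_id": ["s1", "s2", "s3"],
        "description": [
            "A young chess prodigy battles addiction on her way to the world title.",
            "A documentary series exploring the world's most remarkable home designs.",
            "Teenagers in a small town uncover supernatural secrets and experiments.",
        ],
    }
)

# Comma separator, header row, and double quotes around text, as recommended above.
index_df.to_csv("index.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)
```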

The optional query file must contain at least 2 columns and at least 100 lines (there is no restriction on the maximum number of lines):

  • (mandatory) the query made by the user; it is advisable to put the text of the query between double quotes,
  • (mandatory) the identifier of the index entry corresponding to the query,
  • (optional) a unique identifier for the query.

If you do not provide a query file to train the textual similarity search algorithms, Prevision.io will generate one using data augmentation, i.e. replacement or deletion of words from the index file corpus (a minimal sketch of this idea is shown after the remarks below).

Remarks:

  • If your file contains other columns, they will not be taken into account by the system.
  • Several queries, with different searches, can point to the same article.
  • The more varied your queries are, the better the models will be able to specialize in responding to user queries.
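To make both the query file format and the word deletion/replacement idea concrete, here is a minimal, hypothetical sketch. It is not the augmentation code actually used by the platform, and the column names are only examples:

```python
import csv
import random

import pandas as pd

random.seed(0)

index_df = pd.read_csv("index.csv")  # the file built in the previous sketch

def make_query(description: str, drop_ratio: float = 0.4) -> str:
    """Very naive augmentation: randomly drop words from the description."""
    words = description.split()
    kept = [w for w in words if random.random() > drop_ratio]
    return " ".join(kept) if kept else description

queries_df = pd.DataFrame(
    {
        "query_id": [f"q{i}" for i in range(len(index_df))],        # optional unique query id
        "query": [make_query(d) for d in index_df["description"]],  # mandatory query text
        "matching_item_id": index_df["item_id"],                    # mandatory matching index id
    }
)

queries_df.to_csv("queries.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)
```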

Train a Text Similarity Experiment

Text Similarity parameters

Even though it is handled as a training type for the tabular data type, text similarity experiments are particular and need specific training options. Text similarity models allow retrieval of textual documents from a query. For example, from the query “Red shoes for girls”, your model should return a corresponding item.

Import Dataset View

In order to train a text similarity model, you must have a train set (Dataset dropdown menu) with:

  • a description column: a column with text that describes the items you want to query (Description column dropdown menu)
  • an id column: only columns with unique IDs can be selected (ID column)

To get a better evaluation, you should have a query dataset (Queries dropdown menu) with:

  • a textual column containing user queries, each of which should match some item description (Query column dropdown menu)
  • a column with the id of the item whose description should match the query (Matching ID column in the description dataset dropdown menu)

Your queries dataset can have its own ID column (ID column dropdown menu).

Note that the drop-downs to select columns only appear once you have selected a Dataset and/or a Queries dataset.

A dataset with items and their description

A dataset with user queries and the item id that should match

You then have to select a metric:

  • Accuracy at k: Is the real item corresponding to a query present in the search result, among the k most similar items returned? The value is a percentage calculated on a set of queries. As seen previously, if you did not provide a query file when launching your experiment, Prevision.io will generate one using data augmentation, replacement or deletion of words from the index file corpora. It will display the performance on this automatically generated query file.
  • Mean Reciprocal Rank (MRR) at k: Similar to accuracy at k. However the score for each query is divided by the rank of appearance of the corresponding item. Example: If for a query the corresponding item appears in third position in the returned list, then the score will be ⅓ . If it appears in second position the score will be ½, in first position the score will be 1, etc. https://en.wikipedia.org/wiki/Mean_reciprocal_rank
  • K results: the number of items similar to the query that the tool must return during a search. Value between 1 and 100.
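To make accuracy at k and MRR at k concrete, here is a small sketch that computes them from ranked search results. The data structures are hypothetical and only illustrate the definitions above:

```python
# results[i] is the ranked list of item ids returned for query i,
# truth[i] is the id of the item that actually matches query i.
results = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
truth = ["a", "f", "z"]
k = 3

def accuracy_at_k(results, truth, k):
    hits = sum(1 for res, t in zip(results, truth) if t in res[:k])
    return hits / len(truth)

def mrr_at_k(results, truth, k):
    total = 0.0
    for res, t in zip(results, truth):
        if t in res[:k]:
            total += 1.0 / (res[:k].index(t) + 1)  # 1 for rank 1, 1/2 for rank 2, ...
    return total / len(truth)

print(accuracy_at_k(results, truth, k))  # 2/3: the third query never finds its item
print(mrr_at_k(results, truth, k))       # (1 + 1/3 + 0) / 3
```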

Text similarities models options

Available algorithms for text similarity models

The text similarity module has its own modeling techniques and is composed of 2 kinds of models:

  • embedding models, to build a vector representation of your data
  • search models, to find the items in your indexed database that are closest to your queries

Embedding models (text/document vectorization)

Term Frequency – Inverse Document Frequency (TF-IDF): a model representing a text only according to the occurrence of its words. Words that are rare in the corpus of texts have a greater impact. Use it to focus on lexical similarity (more information at https://fr.wikipedia.org/wiki/TF-IDF).
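As an illustration of lexical similarity with TF-IDF, here is a sketch using scikit-learn (an assumption made for illustration; the platform has its own implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "Red leather shoes for girls",
    "Blue running shoes for men",
    "A documentary about deep sea creatures",
]

vectorizer = TfidfVectorizer()
index_vectors = vectorizer.fit_transform(descriptions)   # one sparse vector per item

query_vector = vectorizer.transform(["red shoes for girls"])
scores = cosine_similarity(query_vector, index_vectors)[0]
print(scores.argsort()[::-1])  # item indices, most lexically similar first
```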

Transformer: a model representing a text according to the meaning of its words. In particular, the same word will have a different representation depending on the other words surrounding it. Use it to focus on semantic similarity (more information at https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)). This transformer has been trained upstream on a large volume of data but has not been re-trained on the corpus in question.

Fine-tuned transformer: a transformer that has been pre-trained on a large volume of data and then re-trained (fine-tuned) on the text corpus in question.
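As a sketch of the transformer idea, here is an example using the sentence-transformers library and a public pre-trained model. Both are assumptions made for illustration; the platform's own models and fine-tuning procedure may differ:

```python
from sentence_transformers import SentenceTransformer, util

# A pre-trained model, not re-trained on our corpus (the "Transformer" case above).
model = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = [
    "Red leather shoes for girls",
    "Crimson footwear for children",
    "A documentary about deep sea creatures",
]

item_embeddings = model.encode(descriptions, convert_to_tensor=True)
query_embedding = model.encode("red shoes for girls", convert_to_tensor=True)

# Semantic similarity: "crimson footwear" scores high even without shared words.
scores = util.cos_sim(query_embedding, item_embeddings)
print(scores)
```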

Search models

Brute Force: Exhaustive search, i.e. each query is compared to the set of item descriptions. Remark: use with caution when working with a high number of samples.

Locality sensitive hashing (LSH): exhaustive search. Vectors are compressed to speed up distance calculations. Only available with Transformers embedders. https://fr.wikipedia.org/wiki/Locality_sensitive_hashing

Cluster Pruning: non-exhaustive search. Item descriptions are grouped into clusters according to their similarity. Each query is compared only to the item descriptions of the closest cluster. Only available with the TF-IDF embedder. https://nlp.stanford.edu/IR-book/html/htmledition/cluster-pruning-1.html

Hierarchical k-means (HKM): non-exhaustive search. The idea is the same as for the previous model, but the method used to group the items is different. Only available with Transformers embedders.

InVerted File and Optimized Product Quantization (IVF-OPQ): non-exhaustive search. The idea is the same as for the two previous models, but the method used to group the items is different. Vectors are also compressed to speed up distance calculations. Only available with Transformers embedders. https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf

Please note that in order to guarantee the performance of IVF-OPQ models, a minimum of 1000 unique IDs in the train dataset is required.
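To give a feel for the difference between exhaustive search and compressed, non-exhaustive search, here is a sketch using the faiss library. This library is an assumption made for illustration and is not necessarily what Prevision.io uses internally:

```python
import faiss
import numpy as np

d = 128                                   # embedding dimension
item_vectors = np.random.random((20000, d)).astype("float32")
query_vectors = np.random.random((5, d)).astype("float32")

# Brute force: every query is compared to every item vector.
flat = faiss.IndexFlatL2(d)
flat.add(item_vectors)
exact_dist, exact_ids = flat.search(query_vectors, 10)

# IVF-PQ: items are grouped into nlist clusters and vectors are compressed,
# so each query is only compared to a few clusters of quantized vectors.
nlist, m, nbits = 100, 16, 8
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
ivfpq.train(item_vectors)
ivfpq.add(item_vectors)
ivfpq.nprobe = 8                          # number of clusters visited per query
approx_dist, approx_ids = ivfpq.search(query_vectors, 10)
```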

Text similarity Preprocessing

Feature engineering for text similarities

Several preprocessing options are available:

  • Language: you can force the training dataset language to English or French, or let the platform determine it by itself between these two languages
  • Stop words treatment: you can choose whether the platform has to ignore or consider the stop words during training. As for the language, you can also let the system make its own decision by selecting “automatic”
  • Word stemming: stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (generally a written word form)
  • Ignore punctuation: by activating this option, the punctuation will not be considered during training
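Here is a minimal sketch of what these preprocessing steps amount to, using NLTK as an assumption; the platform applies its own implementation of each option:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)

text = "The kids were running happily, looking for red shoes!"

# Ignore punctuation
text = text.translate(str.maketrans("", "", string.punctuation))

# Stop words treatment (English forced here, instead of "automatic")
tokens = [t for t in text.lower().split() if t not in stopwords.words("english")]

# Word stemming: "running" -> "run", "looking" -> "look", ...
stemmer = SnowballStemmer("english")
print([stemmer.stem(t) for t in tokens])
```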

Watching my Text similarity experiments

Text similarity experiments are available in the same way as standard tabular data experiments, by clicking on them in the list of your experiments, yet they have some specificities.

The first one, when you select a model, is the model evaluation chart. It shows how performance evolves along the expected rank.

Text similarity models

The second difference is the Predictions tab, which is slightly different from the other experiment types:

Text similarity predictions

Trade-off between accuracy and unit predict time

Prevision.io provides the graph above, which summarizes the trade-off between the performance of the model (on the y-axis) and the time needed to compute a prediction (on the x-axis). To choose the model corresponding to your needs, answer these 2 questions:

  1. Are you going to query the model query by query or in batch?
  2. Is the response time an important factor?

In summary:

  • in the context of a unit prediction, if you need real-time, fast and sufficiently accurate answers, you should favor models with short response times,
  • in the context of batch prediction, in which time doesn’t matter much, you can choose the best-performing model.

Deploy and Use a Text Similarity Model

To deploy a text similarity model in Prevision.io, click on deployment in the left menu, then on the new deployment button.

Once on the Deploy a new experiment page, you must fill in the following information:

Deploy a new experiment View

  • Deployment Name: this name will be used as a subdomain url to access your text similarity model query service. For example, https://my-similarity-model.cloud.prevision.io/
  • Description (optional): although this field is optional, we advise you to list the parameters specific to the training of your model so that you can find your way around later. You can note the name of the training dataset / queries dataset, mention a date, and mention the learning parameters (metric, value of k, …).
  • Select an experiment: in the drop-down list, select the experiment you want to deploy. Tip: Names are ordered from most recent to oldest.
  • Select a version: you must specify which version of the experiment you want to deploy. The first version is often not the one you want to keep, because the data scientist usually needs to iterate to get a good model. Once again, it is important to record the information relating to the different versions so that you can find your way around later.
  • Select a model to deploy: in this drop-down list, all the models trained in the experiment you have selected will appear. Tip: the models are ordered by decreasing score according to the metric defined in the experiment.
  • Select a challenger (optional): you can select another version and another model of the same experiment that will be deployed as a challenger model. The challenger will be called each time your main model is called, and its responses will be recorded next to those of the main model in order to compare them and possibly switch to it. This approach is extremely useful when you are working on a new version or a retrained model before actually making the switch.
  • When deploying a new experiment you need to grant access:
    • Public: everybody can access your model without even needing to authenticate
    • Instance collaborators: everybody on the instance can access your model (note: everyone who can access xxx.prevision.io is considered to belong to the xxx instance)
    • Project collaborators: only your project’s collaborators can call your model

Once you click on the deploy button, the model you chose will be deployed. You can check its status in the list of deployments or on the deployment page. It takes between 2 and 5 minutes for the model to be deployed with all the components specific to its use, including its monitoring.

Deployment Model Management Page View

Some information on the management page of a deployed model:

  • On the first tab, General, there is information relating to the deployed model, the application url to query the model, and a link to documentation for developers. Below are listed the main and challenger models as well as the API information for a remote query. To get it, click on the generate new key button and retrieve your personal connection information (Client ID and Client secret).
  • On the second tab, Monitoring, you will find the usage history of your deployed model: how many times it was called, how many errors occurred, and the average response time.
  • On the third tab, named Predictions, you will find the history of batch predictions made.
  • The fourth tab, Versions, allows you to keep track of the different versions that you have deployed over time. With one click, you can change the model or model version in production. Tip: at any time, if you have a problem with the production model, you can revert to a previously deployed model. All that without any service interruption!
  • The last tab, Usage, traces the RAM and CPU consumption of your deployed model over time.

Let’s go back to the General tab and click on the app link https://netflix-search-engine.cloud.prevision.io/. You arrive at the query page of your textual similarity search model.

Netflix Search Engine Query View

To do a search, it’s very simple. You must:

  • fill in the search query (avoid empty words / stop words)
  • specify the number of results to return.

Prevision.io will put in bold the words corresponding to your query. The returned results are ranked in descending order of similarity score, from the most relevant to the least relevant, within the limit of the number of expected results. The similarity score is the proportion of words from your query found in the returned results. The absolute value of the similarity score is not necessarily meaningful in itself. For small texts, similarity scores can approach 1. For large corpora, such as articles, the “best score” may be limited to 0.1, which is quite normal.
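As a naive illustration of such a word overlap score (the exact score computed by the platform may differ):

```python
def overlap_score(query: str, result: str) -> float:
    """Proportion of the query's words that appear in the returned text."""
    query_words = set(query.lower().split())
    result_words = set(result.lower().split())
    return len(query_words & result_words) / len(query_words) if query_words else 0.0

print(overlap_score("space travel documentary", "A documentary about travel in deep space"))   # 1.0
print(overlap_score("space travel documentary", "A very long article mentioning space once"))  # ~0.33
```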

Conclusion

In this second article, we carried out a text similarity experiment in Prevision.io, from importing datasets and training one or several text similarity experiments to production and operational use.

 

For developers who wish to integrate the text similarity model into their application (website, mobile application, …), we provide a github repository https://github.com/previsionio/prevision-nlp-query-app with useful resources and examples of js, node, python, curl and swift scripts to query the model remotely and retrieve the texts and scores of the most similar articles.
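As a rough idea of what such a remote query looks like in Python, here is a hedged sketch. The token and prediction endpoint paths and the payload field names below are hypothetical placeholders; refer to the developer documentation linked in the General tab and to the repository above for the actual calls:

```python
import requests

BASE_URL = "https://netflix-search-engine.cloud.prevision.io"  # your deployment url
CLIENT_ID = "..."      # from the "generate new key" button on the General tab
CLIENT_SECRET = "..."

# Hypothetical OAuth2 client-credentials flow; the real token endpoint may differ.
token = requests.post(
    f"{BASE_URL}/oauth/token",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
    },
).json()["access_token"]

# Hypothetical prediction endpoint and payload: a query text and the number of results.
response = requests.post(
    f"{BASE_URL}/predict",
    headers={"Authorization": f"Bearer {token}"},
    json={"query": "french romantic comedy", "top_k": 5},
)
print(response.json())  # expected: the most similar items with their similarity scores
```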

 

I would like to warmly thank Florent Rambaud, researcher and developer at Prevision.io without whom this article and the text similarity feature of Prevision.io could not have seen the light of day.

Mathurin Aché

About the author

Mathurin Aché

Expert Data Science Advisory