Introduction to Text Similarity

This blog post series will consist of two posts: 

  • The first post, below, explains the concepts of NLP and text similarity, and the group of tasks in this domain.
  • The second post will show how the platform provides a complete solution to train a text similarity experiment and how you can easily deploy a web interface using a trained model.


Text similarity is used to discover the most similar texts. It can be used to find similar documents, for example when retrieving documents in a search engine, and it also powers document recommendations. In this first article, we will define what NLP is, explain how text similarity is a subset of NLP, and then explore text similarity use cases.

What is NLP?

NLP stands for Natural Language Processing. It is a field of Artificial Intelligence (AI) focused on the automatic processing of human language, aiming at processing and analyzing large amounts of natural language data. NLP works with unstructured data and requires representations of the documents that are relevant to the use case at hand.

Its fields of application are very broad: text classification, sentiment analysis, text summarization, translation, named entity recognition, information retrieval, paraphrase detection, textual entailment, question answering, etc.

NLP entered a new era with the evolution of computing techniques. Today, general-purpose computing on graphics processing units (GPUs) greatly eases the parallelization of calculations. This led to the explosion of deep learning models, which benefit from this acceleration. In NLP, it is exemplified by the emergence of Transformer models, which learn powerful representations of natural language and even outperform humans at some tasks such as question answering.

NLP Use Cases

In NLP, you can find many use cases:

  • Text classification: Classification of documents into a fixed number of predefined categories. This is the most frequent use case in NLP.
  • Sentiment analysis, also known as opinion mining or emotion AI, is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media content, and healthcare materials, for applications that range from marketing to customer service to clinical medicine.
  • Machine translation is the task of automatically converting one natural language into another, preserving the meaning of the input text, and producing fluent text in the output language.
  • Text summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.
  • Named entity recognition (NER) is a process where a sentence or a chunk of text is parsed through to find entities that can be put under categories like names, organizations, locations, quantities, monetary values, percentages, etc.
  • Information retrieval (IR) is defined as the process of accessing and retrieving the most appropriate information from text based on a particular query given by the user, with the help of context-based indexing or metadata.
  • Paraphrase detection aims to classify whether sentence pairs are paraphrases of each other or not.
  • Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.
  • Text similarity determines how ‘close’ two pieces of text are, both in surface closeness (lexical similarity) and in meaning (semantic similarity). We will detail the characteristics and specificities of text similarity.

What is text similarity in NLP?

Assessing the degree of similarity between texts is a specific task in NLP. The objective is to evaluate whether words, sentences, paragraphs or documents are similar to each other.  

Lexical similarity

Do we find the same words in the two documents? We want to compare how close two documents are on the surface, disregarding grammar and word order. In this case the two sentences “the cat ate the mouse” and “the mouse ate the cat food” are very similar, with 4 of the 5 unique words present in both, although their meanings are different. On the contrary, “Obama speaks to the media in Illinois” and “The President greets the press in Chicago” will be considered very different even though their meaning is the same. This type of similarity is required for product search engines or product redundancy removal.
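As a quick illustration, the surface overlap between the two example sentences above can be computed with plain Python sets (a minimal sketch of the idea, not a full lexical-similarity implementation):

```python
# Compare the surface overlap of the two example sentences.
a = set("the cat ate the mouse".split())
b = set("the mouse ate the cat food".split())

shared = a & b   # words present in both sentences
union = a | b    # all unique words across both sentences

print(sorted(shared))  # ['ate', 'cat', 'mouse', 'the']
print(f"{len(shared)} of {len(union)} unique words are shared")  # 4 of 5
```

Grammar and word order play no role here: only the word sets matter.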

Semantic similarity

Do the two documents have the same meaning? Answering this requires a deeper understanding of the texts, by considering the lexicon (including synonyms), the syntax, the word order and the context. It is essential for use cases such as FAQ support or paraphrase detection.

Text similarity – Use cases

Here is a quick overview of the use cases of text similarity.

  • FAQ support: when someone asks a question on the FAQ, find whether similar questions have already been answered. In this use case, we will favor semantic similarity.
  • Search engine: find in a catalog or database the most similar product(s) to a query. This approach mainly relies on lexical similarity, with word-level semantics for synonyms.
  • Clustering: group similar texts from a corpus. Semantic similarity is the approach to follow.
  • Law / Insurance: for legal documents, estimate the risks associated with a new contract given previously known documents. It is advisable to use semantic similarity for long documents with different contexts, and lexical similarity for short technical descriptions.
  • Redundancy removal: get rid of duplicate product or person listings in a database. Lexical similarity on short technical descriptions is preferred in this kind of context.
  • Paraphrase detection: recognizing whether two texts have the same meaning. Since this involves comparing sentences with each other, we will use semantic similarity.


Different Steps in Text Similarity

In order to perform text similarity using NLP techniques, these are the standard steps to follow:

  • Corpus analysis
  • Texts preprocessing
  • Texts embedding
  • Similarity metrics
  • Similarity search

Let’s detail each step of the process.

Corpus Analysis

One way to automatically detect whether a corpus calls for semantic or lexical similarity is to analyse its grammatical decomposition. This can be achieved by looking at the distribution of the part-of-speech tags (POS tags), the grammatical roles of the words present in the texts. A textual dataset for a semantic use case will contain natural language data, that is, full sentences. On the other hand, when working on lexical similarity, the corpus will only be made of short groups of words (e.g. nouns and adjectives), with no real sentences. Thus, if the documents do not contain many verbs or auxiliaries, the similarity will most likely rely on the lexicon rather than the semantics.
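A minimal sketch of this heuristic in Python: the `verb_ratio` helper is a hypothetical name introduced here, and the tags below are hand-written for illustration; in practice they would come from a POS tagger such as spaCy.

```python
from collections import Counter

def verb_ratio(tagged_tokens):
    """Share of verb/auxiliary tags in a POS-tagged document.
    A low ratio suggests keyword-like text (lexical similarity);
    a high ratio suggests full sentences (semantic similarity)."""
    tags = Counter(tag for _, tag in tagged_tokens)
    total = sum(tags.values())
    return (tags["VERB"] + tags["AUX"]) / total if total else 0.0

# Hand-tagged toy documents (universal POS tags, for illustration only).
sentence = [("the", "DET"), ("cat", "NOUN"), ("ate", "VERB"),
            ("the", "DET"), ("mouse", "NOUN")]
product = [("red", "ADJ"), ("cotton", "NOUN"), ("t-shirt", "NOUN")]

print(verb_ratio(sentence))  # 0.2 — natural language, semantic use case
print(verb_ratio(product))   # 0.0 — keyword-like text, lexical use case
```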

Texts Pre-Processing

In day-to-day practice, information is gathered from multiple sources: the web, documents, or transcriptions of audio. This information may contain various kinds of garbage values, noisy text, encoding artifacts, etc. It needs to be cleaned in order to perform further NLP tasks. The preprocessing phase typically includes removing non-ASCII characters, special characters, HTML tags and stop words, converting raw formats, and so on.
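A minimal cleaning function along these lines, using only the Python standard library (the stop-word list and the order of the steps are illustrative, not a canonical pipeline):

```python
import re

# Tiny illustrative stop-word list; real pipelines use larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and"}

def clean_text(raw):
    """Minimal cleaning: strip HTML tags, drop non-ASCII characters
    and punctuation, lowercase, and remove stop words."""
    text = re.sub(r"<[^>]+>", " ", raw)             # remove HTML tags
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII chars
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # keep letters only
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("<p>The café serves 10 kinds of coffee!</p>"))
# ['caf', 'serves', 'kinds', 'coffee']
```

Note how the accented character and the digits were discarded; which steps to apply always depends on the corpus and the use case.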

Texts embedding

When working with textual data, we might simply treat each text as a collection of words and compare two documents by their intersection of common words. However, more recent techniques assign continuous vectors to the document tokens, or directly to the documents, in order to obtain a convenient representation of the textual data for further tasks.
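The simple word-count representation can be sketched as follows (continuous embeddings such as word2vec or Transformer models would instead assign dense, learned vectors):

```python
def bow_vector(tokens, vocabulary):
    """Represent a document as a vector of word counts over a fixed vocabulary."""
    return [tokens.count(word) for word in vocabulary]

docs = [["the", "cat", "ate", "the", "mouse"],
        ["the", "mouse", "ate", "the", "cat", "food"]]

# Shared vocabulary built from the whole corpus.
vocab = sorted({w for doc in docs for w in doc})
vectors = [bow_vector(doc, vocab) for doc in docs]

print(vocab)    # ['ate', 'cat', 'food', 'mouse', 'the']
print(vectors)  # [[1, 1, 0, 1, 2], [1, 1, 1, 1, 2]]
```

Once every document is a vector, the metrics of the next section can be applied to any pair of documents.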

Similarity metrics

Different metrics are available to measure the similarity between two documents depending on how the texts were preprocessed.


Bag-of-words similarity indices


This metric category treats each document as the collection of its words. In NLP this is called the bag-of-words representation of a text. Given two sets of words (e.g. two sentences) A and B, several indices exist to assess their similarity with a score from 0 (dissimilar) to 1 (very similar):

Bag-of-words similarity metrics
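Common indices of this kind include the Jaccard, Sørensen–Dice and overlap coefficients; a minimal sketch (the exact set of indices shown above may differ):

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| — shared words over all unique words."""
    return len(a & b) / len(a | b)

def dice(a, b):
    """2·|A ∩ B| / (|A| + |B|) — Sørensen–Dice coefficient."""
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|) — overlap coefficient."""
    return len(a & b) / min(len(a), len(b))

a = set("the cat ate the mouse".split())
b = set("the mouse ate the cat food".split())

print(jaccard(a, b))  # 0.8
print(overlap(a, b))  # 1.0 — every word of the first sentence is in the second
```

All three indices return 0 for disjoint word sets and 1 for identical ones; they differ only in how they normalize the size of the intersection.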

Continuous vectors metrics



Many metrics are available to assess the similarity between two documents when they are represented by continuous vectors X and Y.

The following is a list of usual metrics:

  • Euclidean distance: the most famous distance metric. However, it becomes less reliable in high dimensions. For instance, the ratio between the nearest and farthest points, for a wide variety of data distributions, approaches 1 as the dimension increases, i.e. the points become uniformly distant from each other. Thus this distance becomes less informative for comparing the closeness of vectors.
  • Minkowski distance: the general formula of which the Euclidean distance is the special case p = 2. Fixing p < 2 can lead to better results in high dimensions.
  • Cosine similarity: it computes the cosine of the angle between the two vectors. It is a judgment of orientation and not of magnitude, and thus remains reliable in high dimensions.

Similarity Search

Once we have selected an embedding method and a similarity metric, we can focus on similarity search.

The search duration breaks down into two parts: 

  • computing the distances between a query text and a corpus of texts,
  • sorting these distances to extract the most similar documents.


A straightforward approach is brute force, which consists in comparing the query text with all the texts in the corpus and finally retaining the k most similar.
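A minimal sketch of this brute-force search (the helper names are illustrative; a real system would use vectorized libraries such as NumPy or Faiss):

```python
def brute_force_search(query_vec, corpus_vecs, similarity, k=2):
    """Score the query against every document, then keep the k most similar."""
    scored = [(similarity(query_vec, doc), i) for i, doc in enumerate(corpus_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Toy 2-d "embeddings"; the similarity here is a simple dot product.
dot = lambda x, y: sum(a * b for a, b in zip(x, y))
corpus = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

print(brute_force_search([1.0, 0.0], corpus, dot, k=2))  # [0, 2]
```

The cost grows linearly with the corpus size, which is exactly why the approximate methods below become necessary at scale.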


When the search volumes are substantial, from several hundred thousand documents, different algorithms can be used to speed up the search: cluster pruning, Locality Sensitive Hashing (LSH), Hierarchical k-means (HKM), Product Quantization (PQ). These methods will be detailed in the next article.



In this first article, we introduced natural language processing, text similarity and its subcategories, use cases based on text similarity, and we concluded by presenting the steps of a text similarity pipeline.


In the next article, we will see how to deal with a text similarity use case from end to end.

Mathurin Aché

About the author

Mathurin Aché

Expert Data Science Advisory