Project Overview:
This project, “Semantic Search System Using Transformers and Vector Database,” applies advanced AI techniques to improve information retrieval. Transformer models represent the text content as vectors, and Faiss, a library built for efficient similarity search, is used to find the most similar content. Finding meaning is what matters most, not just matching words.
The project covers the full pipeline: embedding text with a transformer model and building a vector database with Faiss, then applying the system to real text material, in this case movie plots. The result is a system that performs fast and precise semantic search. The applications are broad, from enhancing the search features of e-commerce sites to providing customized recommendations in a healthcare system.
Prerequisites
The following skills will let you dive into the project smoothly and build an effective semantic search system with the tools and technologies used here.
Fundamental theory and practice of the Python programming language, including basic use of libraries and functions.
Knowledge of the basics of machine learning methods, especially natural language processing, which will help in understanding the transformer architecture.
A broad idea of how transformers are applied to different NLP problems.
The use of Google Colab for running Jupyter Notebooks, where code is executed in the cloud.
An understanding of databases that store and retrieve high-dimensional vectors, especially libraries like Faiss.
Familiarity with Pandas for constructing data frames and carrying out simple data cleaning and processing.
Some knowledge of Faiss for similarity search and indexing on large datasets.
Access to a GPU-supported environment for faster processing.
Approach:
In this project, we first gather the text data and then preprocess it, which includes eliminating missing values and duplicates; this step improves the quality of the input. Next we pick a transformer model, such as a Sentence Transformer, that takes the text and outputs high-dimensional vectors. These vectors capture the semantic meaning of the text, which makes it possible to process queries based on context instead of keywords. Once the text has been converted into vectors, we index them with Faiss, which allows fast similarity searches: the system can quickly compare vectors and find the most relevant ones as you enter a query.
During query processing, the system converts the search query into a vector, searches the Faiss index, finds the top-k most similar results, and returns accurate, context-based results to the user. The whole system is then tested for speed and accuracy to make sure the results make sense. The model can also be fine-tuned if needed to improve performance in specific domains, such as movies, e-commerce, or healthcare. This approach lets us build a scalable, fast semantic search system that returns meaningful, personalized results across various industries.
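The sketch below summarizes this flow end to end. It is a minimal example, assuming the sentence-transformers and faiss packages and a small hypothetical list of plot texts; the corpus, model choice, and query are illustrative, not the project's final configuration.

```python
# Minimal end-to-end sketch (assumed packages: sentence-transformers, faiss-cpu).
import faiss
from sentence_transformers import SentenceTransformer

# Hypothetical corpus; in the project this would be the cleaned movie plots.
plots = [
    "A hacker discovers reality is a simulation and joins a rebellion.",
    "A young wizard attends a school of magic and faces a dark lord.",
    "Toys come to life whenever their owner leaves the room.",
]

model = SentenceTransformer("msmarco-distilbert-base-dot-prod-v3")
embeddings = model.encode(plots, convert_to_numpy=True).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])   # exact inner-product (dot-product) index
index.add(embeddings)

query = "a movie about living inside a computer simulation"
query_vec = model.encode([query], convert_to_numpy=True).astype("float32")
scores, ids = index.search(query_vec, 2)         # top-2 most similar plots
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.2f}  {plots[i]}")
```

The same pattern scales to the full movie-plot dataset; only the corpus and the value of k change.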
Workflow and methodologies:
The workflow and methodology for building the "Semantic Search System Using Transformers and Vector Database" are as follows:
Data Collection and Preprocessing:
Step 1: Collect a text dataset such as movie plot descriptions in a structured format (CSV).
Step 2: Clean the dataset by removing rows with missing data, dropping duplicates, and discarding any rows that do not contain useful information.
Step 3: Analyze the text length and structure of the data and organize it for efficient processing in the following steps.
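A minimal preprocessing sketch with Pandas, assuming a hypothetical movie_plots.csv file with Title and Plot columns (the file and column names are placeholders):

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual dataset.
df = pd.read_csv("movie_plots.csv")

# Drop rows with missing plots and remove duplicate plots.
df = df.dropna(subset=["Plot"]).drop_duplicates(subset=["Plot"]).reset_index(drop=True)

# Inspect text length to understand the structure before embedding.
df["word_count"] = df["Plot"].str.split().str.len()
print(df["word_count"].describe())
```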
Transformers and Text Embedding
Step 1: Select a pre-trained transformer model from the SentenceTransformer library, such as msmarco-distilbert-base-dot-prod-v3.
Step 2: Clean the text data, then convert it into high-dimensional vectors.
Step 3: Each movie plot is encoded into a fixed-size vector that captures its underlying meaning, which improves the quality of the search results.
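A sketch of the embedding step, assuming the cleaned data frame df from the preprocessing sketch and the msmarco-distilbert-base-dot-prod-v3 model named above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("msmarco-distilbert-base-dot-prod-v3")

# Encode every plot into a fixed-size dense vector (one row per movie).
plot_texts = df["Plot"].tolist()          # "Plot" column from the preprocessing step
embeddings = model.encode(
    plot_texts,
    convert_to_numpy=True,
    show_progress_bar=True,
).astype("float32")                       # Faiss expects float32
print(embeddings.shape)                   # (number_of_movies, embedding_dimension)
```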
Indexing with Faiss
Step 1: Create an index with the Faiss library for the vectorized data, because this project requires fast similarity search.
Step 2: Add the vectorized data to the Faiss index and keep a mapping from each vector back to its corresponding text entry.
Step 3: Save the index so it can be reused later, which speeds up searches over a large dataset.
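A minimal indexing sketch, assuming the float32 embeddings array produced in the embedding step; the index type and file name are illustrative:

```python
import faiss

dim = embeddings.shape[1]

# Flat inner-product index: exact search, matching the dot-product embedding model.
index = faiss.IndexFlatIP(dim)
index.add(embeddings)                     # row i of the index corresponds to df.iloc[i]
print(index.ntotal, "vectors indexed")

# Persist the index so later runs can skip re-indexing.
faiss.write_index(index, "movie_plots.index")
```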
Semantic Search
Step 1: When a user enters a search query, convert it into a vector with the same transformer model.
Step 2: Perform a similarity search on the Faiss index with the query vector and find the stored vectors closest to it.
Step 3: Vector proximity is used to retrieve the top k most similar results, giving the user relevant results based on the query's semantics.
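A search sketch that reuses the model, index, and df objects from the earlier steps; the query string, column names, and value of k are examples only:

```python
def semantic_search(query, k=5):
    # Encode the query with the same model used for the corpus.
    query_vec = model.encode([query], convert_to_numpy=True).astype("float32")
    scores, ids = index.search(query_vec, k)     # top-k nearest by inner product
    results = df.iloc[ids[0]].copy()             # map vector ids back to rows
    results["score"] = scores[0]
    return results[["Title", "score"]]

print(semantic_search("a heist that goes wrong in a big city", k=5))
```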
Evaluation and Optimization
Step 1: Assess the precision of the system by running a variety of queries.
Step 2: Adjust and retrain the transformer model if needed.
Step 3: Optimize the system for scalability.
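One simple way to sanity-check the system, assuming a small hand-written set of test queries paired with the titles you expect to appear in the top-k results (the queries and titles below are purely illustrative), and reusing the semantic_search helper sketched above:

```python
# Hypothetical test set: query -> a title we expect to see in the top-5 results.
test_queries = {
    "teenagers forced to fight in a televised death match": "The Hunger Games",
    "a giant shark terrorizes a small beach town": "Jaws",
}

hits = 0
for query, expected_title in test_queries.items():
    top_titles = semantic_search(query, k=5)["Title"].tolist()
    hits += expected_title in top_titles

print(f"top-5 hit rate: {hits / len(test_queries):.2f}")
```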
Methodology:
- Semantic Embedding: Represent a piece of text in the form of a meaningful vector using transformer models.
- Efficient Indexing: Store and retrieve high-dimensional vector embeddings using Faiss.
- Similarity Search: Use the inner product or cosine similarity between the semantic vectors to retrieve the nearest neighbors (see the sketch after this list).
- Customizable and Scalable: Fine-tune the model to smaller domains and expand it to larger datasets to boost its overall efficacy.
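If cosine similarity is preferred over the raw inner product, a common approach is to L2-normalize the embeddings so that the inner product equals the cosine similarity. The sketch below assumes the same embeddings array used earlier:

```python
import faiss

# L2-normalize in place; afterwards, inner product == cosine similarity.
faiss.normalize_L2(embeddings)

cosine_index = faiss.IndexFlatIP(embeddings.shape[1])
cosine_index.add(embeddings)

# Query vectors must be normalized the same way before searching.
```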
Data Collection and Preparation:
Data collection workflow:
Collect a reliable dataset for the project, for example movie plots from Wikipedia or another structured source.
Get the data in a structured format such as CSV or JSON, with fields like Title, Plot, Release Year, and Genre.
Make sure the dataset uses a consistent format, with each entry complete and relevant to the task.
Load the dataset and look at the number of rows, columns, and data types to get an overview.
Save it locally or on cloud storage (Google Drive) for quick access.
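A quick overview sketch with Pandas, again assuming a hypothetical movie_plots.csv; on Colab the file could also be read from a mounted Google Drive path:

```python
import pandas as pd

# Hypothetical path; on Colab this might be "/content/drive/MyDrive/movie_plots.csv".
df = pd.read_csv("movie_plots.csv")

print(df.shape)     # number of rows and columns
print(df.dtypes)    # data type of each field
print(df.head())    # first few entries as a quick sanity check
```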
Data Preparation workflow:
- Load the dataset into a data frame to make it easier to work with.
- Eliminate any rows with missing data to avoid issues during model processing, and remove duplicate rows to ensure each entry is unique.
- Optionally apply any further text cleaning, such as removing unwanted punctuation or characters.
- Determine a suitable sequence length for the model by calculating the word count of each plot.
- Convert the cleaned text data into high-dimensional vectors using a pre-trained transformer model.
- Make sure the vectorized data is in the right format (a NumPy float32 array) for indexing with Faiss.
- Store the cleaned dataset and the vectors so the search system can retrieve them later.
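A sketch of the storage step, assuming the cleaned df, the embeddings array, and the Faiss index built in the previous steps; the file names are placeholders:

```python
import faiss
import numpy as np
import pandas as pd

# Persist the cleaned dataset, the embeddings, and the Faiss index for reuse.
df.to_csv("movie_plots_clean.csv", index=False)
np.save("plot_embeddings.npy", embeddings)
faiss.write_index(index, "movie_plots.index")

# Later, the search system can reload everything without re-encoding.
df = pd.read_csv("movie_plots_clean.csv")
embeddings = np.load("plot_embeddings.npy")
index = faiss.read_index("movie_plots.index")
```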