Project Overview
This project builds an AI-powered document retrieval system using LlamaIndex, OpenAI GPT-4o, FAISS and metadata-based processing to enhance search accuracy. It begins with PDF processing and text chunking, ensuring structured document handling. The system then sets up FAISS as a vector store and utilizes OpenAI embeddings for efficient similarity-based search.
For improved relevance, the IngestionPipeline applies SentenceWindowNodeParser, capturing context windows around key sentences. A custom retrieval function ensures responses are enriched with meaningful context. Finally, a comparison between standard and context-enriched retrieval demonstrates the advantages of context-aware search, making the system highly effective for semantic search, knowledge management and AI-driven Q&A applications.
Prerequisites
Before running this project, ensure you have the following:
- Python 3.8+ (for LlamaIndex and FAISS)
- Google Colab or Local Machine (execution environment)
- OpenAI API Key (for GPT-4o and embeddings)
- FAISS (for storing and retrieving vectors)
- LlamaIndex & Dependencies (install via pip)
- PDF Documents (for processing and retrieval)
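The dependencies above can be installed with pip. A minimal sketch, assuming the post-0.10 LlamaIndex package layout (package names may differ for older releases; the API key value is a placeholder):

```shell
# Core LlamaIndex plus the FAISS vector-store and OpenAI embedding integrations
pip install llama-index llama-index-vector-stores-faiss llama-index-embeddings-openai

# FAISS itself (CPU build) and a PDF backend for SimpleDirectoryReader
pip install faiss-cpu pypdf

# The OpenAI key must be visible to the process, e.g.:
export OPENAI_API_KEY="sk-..."   # placeholder, substitute your own key
```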
Approach
This project follows a structured approach to building an AI-driven document retrieval system using LlamaIndex, OpenAI GPT-4o and FAISS. First, it installs and configures necessary dependencies, including FAISS for vector storage and OpenAI embeddings for semantic search. The system then loads PDF documents, processes them into sentence-level chunks and applies the SentenceWindowNodeParser to capture contextual information. These processed text chunks are converted into vector embeddings and stored in FAISS, enabling efficient similarity-based retrieval. When a user queries the system, the query engine searches for the most relevant document fragments and applies MetadataReplacementPostProcessor to ensure context-rich responses. This context-aware approach enhances the accuracy and relevance of retrieved information, making it ideal for semantic search, AI-driven Q&A and knowledge retrieval applications.
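The idea behind sentence-window chunking can be illustrated without the library: each sentence becomes its own retrievable chunk, while a metadata field stores the surrounding window of sentences. A plain-Python sketch of that idea (a conceptual analogue, not the LlamaIndex implementation itself):

```python
def build_sentence_windows(sentences, window_size=1):
    """For each sentence, keep the sentence as the chunk text and store the
    neighbouring sentences in a 'window' metadata field, mirroring what
    SentenceWindowNodeParser does conceptually."""
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sent,  # what gets embedded and matched at query time
            "metadata": {
                "window": " ".join(sentences[lo:hi]),  # context around the hit
                "original_text": sent,
            },
        })
    return nodes

sentences = [
    "FAISS stores vectors.",
    "It supports fast search.",
    "Embeddings come from OpenAI.",
]
nodes = build_sentence_windows(sentences, window_size=1)
print(nodes[1]["metadata"]["window"])
# -> FAISS stores vectors. It supports fast search. Embeddings come from OpenAI.
```

The small chunk keeps retrieval precise; the window metadata is what later gets swapped back in to give the LLM enough surrounding context.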
Workflow and Methodology
Workflow
- Set up LlamaIndex, OpenAI, FAISS and other required libraries.
- Use SimpleDirectoryReader to read PDF files.
- Apply SentenceSplitter and SentenceWindowNodeParser to structure data.
- Convert text chunks into vector embeddings using OpenAIEmbedding.
- Save embeddings in a FAISS vector store for efficient retrieval.
- Build a VectorStoreIndex from processed document nodes.
- Use the query engine to find the most relevant document chunks.
- Apply MetadataReplacementPostProcessor to ensure context-aware responses.
- Print the retrieved document snippets along with their metadata.
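The workflow above can be sketched end to end in LlamaIndex. This is an illustrative sketch, assuming the llama-index ≥ 0.10 package layout, faiss-cpu installed, PDFs in a local `data/` directory, and `OPENAI_API_KEY` set in the environment; module paths and the embedding dimension may differ by version and model:

```python
import faiss
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore

# 1. Load only PDF files from the data directory
documents = SimpleDirectoryReader("data", required_exts=[".pdf"]).load_data()

# 2. Sentence-level nodes with a context window stored in metadata
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = parser.get_nodes_from_documents(documents)

# 3. FAISS exact-L2 index (1536 dims suits text-embedding-ada-002; adjust per model)
faiss_index = faiss.IndexFlatL2(1536)
storage_context = StorageContext.from_defaults(
    vector_store=FaissVectorStore(faiss_index=faiss_index)
)

# 4. Build the index, embedding each node with OpenAI
index = VectorStoreIndex(
    nodes, storage_context=storage_context, embed_model=OpenAIEmbedding()
)

# 5. Query engine that swaps each hit's text for its stored context window
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
print(query_engine.query("What does the document say about retrieval?"))
```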
Methodology
- Data Collection: PDFs are loaded and extracted for processing.
- Text Chunking: Documents are segmented into sentence-level windows to retain context.
- Vectorization: Each text chunk is converted into a numerical embedding using OpenAI.
- Storage & Indexing: Embeddings are stored in FAISS, optimizing similarity search.
- Query Execution: User queries are matched against stored embeddings for retrieval.
- Context Enhancement: Retrieved results are expanded with surrounding sentences for better comprehension.
- Information Retrieval: The final response provides the most relevant and context-rich document fragment.
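At the heart of the storage and query steps is a nearest-neighbour lookup over embedding vectors. A tiny pure-Python analogue of the exact L2 search that FAISS's IndexFlatL2 performs (real OpenAI embeddings have ~1,536 dimensions; toy 3-D vectors stand in here):

```python
def l2_search(stored, query, top_k=1):
    """Brute-force nearest-neighbour search by squared L2 distance,
    the same metric FAISS's IndexFlatL2 uses (exact, no approximation)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ranked = sorted(range(len(stored)), key=lambda i: sq_dist(stored[i], query))
    return ranked[:top_k]

# Toy "embeddings" for three chunks; a real system stores OpenAI vectors
chunks = ["chunk about FAISS", "chunk about PDFs", "chunk about embeddings"]
vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]

hits = l2_search(vectors, query=[1.0, 0.05, 0.0], top_k=2)
print([chunks[i] for i in hits])
# -> ['chunk about FAISS', 'chunk about embeddings']
```

FAISS does the same ranking with optimized vectorized code (and optional approximate indexes), which is what makes the lookup fast at scale.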
Data Collection and Preparation
Data Collection
This project begins with data collection by reading PDF documents from a specified directory using SimpleDirectoryReader. The system loads only files with the .pdf extension, keeping ingestion focused on the intended documents. Once collected, the documents are converted into raw text for further processing.
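The extension filter is the detail worth noting here. Its behaviour is simple to sketch in plain Python (a stand-in for passing `required_exts=[".pdf"]` to SimpleDirectoryReader, demonstrated on a throwaway directory):

```python
import tempfile
from pathlib import Path

def list_pdfs(directory):
    """Return only .pdf files from a directory, mirroring the
    extension filtering the reader performs."""
    return sorted(p.name for p in Path(directory).iterdir()
                  if p.suffix.lower() == ".pdf")

# Demo on a temporary directory containing mixed file types
with tempfile.TemporaryDirectory() as d:
    for name in ["report.pdf", "notes.txt", "scan.PDF"]:
        (Path(d) / name).write_text("placeholder")
    print(list_pdfs(d))
# -> ['report.pdf', 'scan.PDF']
```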
Data Preparation Workflow
- Load PDF files using SimpleDirectoryReader.
- Extract text and convert it into a structured format.
- Segment text into smaller chunks with SentenceSplitter.
- Apply SentenceWindowNodeParser to retain context.
- Generate vector embeddings using OpenAIEmbedding.
- Store embeddings in FAISS for efficient retrieval.
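The final retrieval-time step, MetadataReplacementPostProcessor, simply swaps each retrieved node's text for the context window stored in its metadata. A plain-Python sketch of that replacement (a conceptual analogue, not the library class itself):

```python
def replace_with_window(retrieved_nodes, target_metadata_key="window"):
    """Swap each node's matched sentence for its stored context window,
    the behaviour MetadataReplacementPostProcessor provides; fall back
    to the original text when no window is stored."""
    out = []
    for node in retrieved_nodes:
        window = node["metadata"].get(target_metadata_key, node["text"])
        out.append({**node, "text": window})
    return out

retrieved = [{
    "text": "It supports fast search.",  # the sentence that matched the query
    "metadata": {"window": "FAISS stores vectors. It supports fast search. "
                           "Embeddings come from OpenAI."},
}]
enriched = replace_with_window(retrieved)
print(enriched[0]["text"])
# -> FAISS stores vectors. It supports fast search. Embeddings come from OpenAI.
```

This is why the sentence-window metadata from the preparation stage matters: precise sentence-level matching at query time, full surrounding context in the final response.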