Project Overview
The system processes PDFs with PyMuPDF, extracts their text, and splits it into manageable chunks with LangChain’s RecursiveCharacterTextSplitter. The chunks are embedded with Hugging Face’s MiniLM model and stored in FAISS for fast similarity search, while BM25 scoring ranks documents by keyword relevance. To sharpen retrieval further, DeepSeek-R1-Distill-Qwen-1.5B generates a hypothetical document that serves as a context-rich query. Retrieved documents are displayed with citations, making the system well suited to academic research, legal analysis, enterprise knowledge retrieval, and AI-driven Q&A solutions.
Prerequisites
- Google Colab or a local Python environment to run the code.
- Python 3.8+ installed.
- Libraries & Dependencies (see the install command after this list):
  - langchain, langchain-community, langchain-openai, langchain-cohere
  - sentence-transformers, faiss-cpu, PyMuPDF, rank-bm25
  - openai, transformers, torch, accelerate, pypdf
- Google Drive access (if running on Colab) for storing and retrieving PDFs.
- A pre-trained language model (e.g., DeepSeek-R1-Distill-Qwen-1.5B) for LLM-based generation.
- Basic understanding of FAISS, BM25 and text embeddings for fine-tuning retrieval.
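For reference, the dependency list above maps onto a single pip install (in Colab, prefix the command with `!`); pin versions as needed for your environment:

```bash
pip install langchain langchain-community langchain-openai langchain-cohere \
    sentence-transformers faiss-cpu PyMuPDF rank-bm25 \
    openai transformers torch accelerate pypdf
```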
Approach
The system starts by processing PDF documents with PyMuPDF (fitz) to extract the text. This text is then divided into manageable chunks using LangChain’s RecursiveCharacterTextSplitter. To maintain clean formatting, tab characters (\t) are replaced with spaces. The text chunks are transformed into vector embeddings using Hugging Face’s MiniLM model and stored in FAISS, which enables quick semantic similarity searches. Additionally, BM25 ranking is utilized to improve keyword-based retrieval, ensuring that documents are ranked according to both contextual meaning and exact keyword matches. A hybrid retrieval approach merges FAISS-based semantic search with BM25 scoring for better accuracy. To further enhance the results, an LLM-powered hypothetical document is generated using DeepSeek-R1-Distill-Qwen-1.5B, offering context-aware search improvements. Finally, the retrieved documents are presented with citations, ensuring transparency and traceability for research, enterprise search, and AI-driven Q&A applications.
Workflow and Methodology
Workflow:
Step 1: Data Ingestion
- Manually upload PDF files or load them from Google Drive in Colab.
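A minimal Colab sketch of both ingestion options; the Drive folder path and file name are illustrative:

```python
# Option A: upload PDFs through the Colab file picker.
from google.colab import files

uploaded = files.upload()              # returns {filename: file bytes}
pdf_paths = list(uploaded.keys())

# Option B: mount Google Drive and point at a folder (path is illustrative).
from google.colab import drive

drive.mount("/content/drive")
pdf_paths = ["/content/drive/MyDrive/pdfs/sample.pdf"]
```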
Step 2: Text Extraction & Cleaning
- Utilize PyMuPDF (fitz) to extract text from PDFs.
- Replace tab characters (\t) with spaces for better formatting.
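A sketch of the extraction step with PyMuPDF; the helper name and the page-level metadata (kept so Step 8 can cite page numbers) are our own choices:

```python
import fitz  # PyMuPDF

def extract_pages(pdf_path):
    """Extract text page by page, replacing tabs with spaces for clean formatting."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            text = page.get_text().replace("\t", " ")
            pages.append({"text": text, "source": pdf_path, "page": page_number})
    return pages

# Flatten all uploaded PDFs into one list of page records.
pages = [p for path in pdf_paths for p in extract_pages(path)]
```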
Step 3: Text Splitting & Chunking
- Apply LangChain’s RecursiveCharacterTextSplitter to divide text into manageable chunks while maintaining context.
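A sketch of the chunking step; chunk_size and chunk_overlap are illustrative values to tune for your documents:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

# Overlapping chunks preserve context across chunk boundaries.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

docs = [
    Document(page_content=p["text"], metadata={"source": p["source"], "page": p["page"]})
    for p in pages
]
chunks = splitter.split_documents(docs)
```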
Step 4: Embedding Generation & Vector Storage
- Convert text chunks into vector embeddings using Hugging Face’s MiniLM model.
- Store these embeddings in FAISS, enabling fast semantic similarity searches.
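A sketch of embedding and indexing; the section names MiniLM but not the exact checkpoint, so all-MiniLM-L6-v2 is an assumption:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# The exact MiniLM variant is an assumption; all-MiniLM-L6-v2 is a common default.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
```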
Step 5: Keyword-Based Retrieval with BM25
- Compute BM25 scores to rank documents based on keyword similarity.
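One way to compute BM25 scores with the listed dependencies is LangChain's BM25Retriever, which wraps the rank-bm25 package:

```python
from langchain_community.retrievers import BM25Retriever

# BM25Retriever (backed by rank-bm25) scores chunks purely on keyword overlap.
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5  # number of keyword-ranked chunks to return
```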
Step 6: Hybrid Retrieval (FAISS + BM25)
- Perform FAISS-based similarity search to retrieve semantically relevant documents.
- Refine search results using BM25 keyword ranking to ensure both semantic and keyword relevance.
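The section doesn't specify how the two rankings are merged; one common option, sketched here, is LangChain's EnsembleRetriever, which fuses the FAISS and BM25 rankings with configurable weights:

```python
from langchain.retrievers import EnsembleRetriever

faiss_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Weights are illustrative; shift them to favor semantic or keyword matches.
hybrid_retriever = EnsembleRetriever(
    retrievers=[faiss_retriever, bm25_retriever],
    weights=[0.5, 0.5],
)

results = hybrid_retriever.get_relevant_documents("What are the key findings?")
```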
Step 7: LLM-Powered Hypothetical Document Generation
- Use DeepSeek-R1-Distill-Qwen-1.5B to generate a hypothetical document based on the user’s query.
- Use this hypothetical document as an enriched query so retrieval returns more context-aware results.
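A sketch of the HyDE-style generation step with the transformers library; the prompt wording and generation settings are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def hypothetical_document(query):
    """Generate a short passage that plausibly answers the query (HyDE-style)."""
    prompt = f"Write a short passage that answers the question: {query}\n\nPassage:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=200)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Search with the hypothetical passage instead of the raw query.
query = "What are the key findings?"
results = hybrid_retriever.get_relevant_documents(hypothetical_document(query))
```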
Step 8: Retrieval with Citations
- Attach source citations to retrieved documents, including page numbers (if available).
- Display results in a structured format for easy interpretation and traceability.
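A small formatting helper (our own, for illustration) that surfaces the source and page metadata attached back in Step 2:

```python
def format_with_citations(docs):
    """Render each retrieved chunk with its source file and page number."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "n/a")
        snippet = doc.page_content[:200].strip()
        lines.append(f"[{i}] {snippet}... (source: {source}, page {page})")
    return "\n".join(lines)

print(format_with_citations(results))
```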
Methodology
- PDF Processing and Text Cleaning: Extract text from PDFs and clean it (e.g., replacing tab characters with spaces) before chunking it into smaller parts.
- Vector Embeddings: Encode the text chunks as dense vectors using the MiniLM embedding model.
- FAISS-Based Semantic Search: Store the embeddings in a FAISS index and retrieve them through similarity search.
- BM25 Keyword Matching: Rank documents by keyword relevance using BM25 scoring.
- Hybrid Retrieval: Combine FAISS similarity search with BM25 scoring to improve retrieval accuracy.
- Hypothetical Answer Generation: Use DeepSeek-R1-Distill-Qwen-1.5B to produce a context-rich hypothetical document that improves retrieval quality.
- Result Citations: Attach source references to retrieved documents for traceability and validation.
Data Collection and Preparation
Data Collection
- Users manually upload PDF files through Google Colab using files.upload().
- Alternatively, files can be loaded from Google Drive by mounting them in Colab.
Data Preparation Workflow
- Upload PDFs via Google Colab or Google Drive.
- Extract text using PyMuPDF (fitz).
- Clean text by replacing tab characters (\t) with spaces.
- Split text into chunks using LangChain’s RecursiveCharacterTextSplitter.
- Generate embeddings with Hugging Face’s MiniLM model.
- Store embeddings in FAISS for fast semantic search.
This process ensures clean, structured and retrievable data for the system.