Fusion Retrieval: Combining Vector Search and BM25 for Enhanced Document Retrieval
This project enhances document retrieval by combining semantic search (FAISS) and keyword-based ranking (BM25). It enables efficient search across PDF documents, using vector embeddings and language model-driven content generation for improved accuracy.
Project Outcomes
Requirements:
- Google Colab or a local Python environment to run the code.
- Python 3.8+ installed.
- Libraries & dependencies: langchain, langchain-community, langchain-openai, langchain-cohere, sentence-transformers, faiss-cpu, PyMuPDF, rank-bm25, openai, transformers, torch, accelerate, pypdf.
- Google Drive access (if running on Colab) for storing and retrieving PDFs.
- A pre-trained language model (e.g., DeepSeek-R1-Distill-Qwen-1.5B) for LLM-based generation.
- A basic understanding of FAISS, BM25, and text embeddings for fine-tuning retrieval.
Project Description
The system processes PDFs with PyMuPDF, extracting text and splitting it into manageable chunks using LangChain's RecursiveCharacterTextSplitter. The chunks are embedded with Hugging Face's MiniLM model and stored in a FAISS index for fast similarity search, while BM25 scoring ranks documents by keyword relevance; the two signals are fused to produce the final ranking. To further improve results, DeepSeek-R1-Distill-Qwen-1.5B generates a hypothetical answer document that refines search results with context-aware insights. Retrieved documents are displayed with citations, making the system well suited to academic research, legal analysis, enterprise knowledge retrieval, and AI-driven Q&A solutions.
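The fusion step at the heart of this pipeline can be sketched in pure Python. This is a minimal illustration, not the project's exact code: the names `fuse_scores` and `min_max_normalize` and the weight `alpha` are assumptions, and in the real system the dense scores come from FAISS similarity search while the sparse scores come from rank-bm25.

```python
# Pure-Python sketch of fusing dense (vector) and sparse (BM25) scores.
# The score dicts below are toy inputs standing in for FAISS and rank-bm25
# output; `alpha` is an illustrative weight, not fixed by the project.

def min_max_normalize(scores):
    """Scale scores to [0, 1] so dense and sparse scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores equal: every document gets the same weight
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse_scores(vector_scores, bm25_scores, alpha=0.5):
    """Combine normalized vector and BM25 scores into one ranking.

    alpha=1.0 trusts vector search alone; alpha=0.0 trusts BM25 alone.
    A document missing from one retriever contributes 0 from that side.
    """
    v = min_max_normalize(vector_scores)
    b = min_max_normalize(bm25_scores)
    docs = set(v) | set(b)
    fused = {doc: alpha * v.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
             for doc in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: doc2 is second-best on each list but wins after fusion.
vector_scores = {"doc1": 0.92, "doc2": 0.80, "doc3": 0.15}
bm25_scores = {"doc2": 7.1, "doc3": 6.8, "doc4": 1.2}
ranking = fuse_scores(vector_scores, bm25_scores, alpha=0.5)
print(ranking[0][0])  # → doc2
```

Normalizing before mixing matters because raw BM25 scores are unbounded while cosine similarities fall in a narrow range; without it, one retriever would silently dominate regardless of `alpha`.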
