Fusion Retrieval: Combining Vector Search and BM25 for Enhanced Document Retrieval

This project enhances document retrieval by combining semantic search (FAISS) and keyword-based ranking (BM25). It enables efficient search across PDF documents, using vector embeddings and language model-driven content generation for improved accuracy.

Project Outcomes

This AI-powered document retrieval system enhances search accuracy by combining FAISS (semantic search), BM25 (keyword-based ranking) and LLM-generated content. It improves legal research, academic studies, corporate knowledge management and AI-driven search engines.
  • Can be implemented in legal firms for case law searches.
  • Useful for academic research to find relevant papers.
  • Helps corporations in retrieving internal policies and reports.
  • Supports journalists in fact-checking and source verification.
  • Assists medical professionals in retrieving patient case studies.
  • Beneficial for HR departments to quickly access company policies.
  • Used by law firms to analyze contracts and legal disputes.
  • Helps PhD students and scholars in finding references for theses.
  • Can be expanded for news agencies to search archives.
  • Enhances financial research by retrieving market reports.

Requirements:

  • Google Colab or a local Python environment to run the code.
  • Python 3.8+ installed.
  • Libraries & Dependencies: langchain, langchain-community, langchain-openai, langchain-cohere, sentence-transformers, faiss-cpu, PyMuPDF, rank-bm25, openai, transformers, torch, accelerate, pypdf
  • Google Drive access (if running on Colab) for storing and retrieving PDFs.
  • A pre-trained language model (e.g., DeepSeek-R1-Distill-Qwen-1.5B) for LLM-based generation.
  • Basic understanding of FAISS, BM25 and text embeddings for tuning retrieval quality.

Project Description

The system extracts text from PDFs with PyMuPDF and splits it into manageable chunks with LangChain’s RecursiveCharacterTextSplitter. The chunks are embedded with a MiniLM model from Hugging Face and stored in a FAISS index for fast similarity search. BM25 scoring additionally ranks documents by keyword relevance. DeepSeek-R1-Distill-Qwen-1.5B generates a hypothetical document for the query, which adds context to refine the search results. Retrieved documents are displayed with citations, which makes the system suited to academic research, legal analysis, enterprise knowledge retrieval, and AI-driven Q&A solutions.
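
The description does not say how the vector and BM25 rankings are merged, so the following is only a minimal sketch of one common option: min-max normalize both score lists and take a weighted sum. It uses the standard library only. The BM25 function is a plain Okapi-style formula with a smoothed IDF and may differ in detail from what rank-bm25 computes. The names `bm25_scores`, `min_max`, `fuse` and the weight `alpha` are hypothetical, and the `vector_scores` values are made-up placeholders standing in for similarity scores that FAISS would return.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Return one Okapi-style BM25 score per chunk for the given query."""
    docs = [tokenize(chunk) for chunk in chunks]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of chunks that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    query_terms = tokenize(query)
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF; the +1 inside the log keeps it non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Longer-than-average chunks are penalized through b.
            length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / (tf + length_norm)
        scores.append(score)
    return scores


def min_max(values):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(vector_scores, keyword_scores, alpha=0.5):
    """Weighted sum of normalized scores; alpha is the weight on the vector side."""
    v = min_max(vector_scores)
    k = min_max(keyword_scores)
    return [alpha * vs + (1.0 - alpha) * ks for vs, ks in zip(v, k)]


chunks = [
    "The contract may be terminated with thirty days notice.",
    "Quarterly market report on bond yields.",
    "Termination of the lease requires written notice.",
]
# Placeholder similarity scores (assumed, higher = more similar), one per chunk.
vector_scores = [0.90, 0.10, 0.60]
keyword_scores = bm25_scores("contract termination notice", chunks)
fused = fuse(vector_scores, keyword_scores, alpha=0.5)
# Chunk indices ordered from most to least relevant under the fused score.
ranking = sorted(range(len(chunks)), key=lambda i: fused[i], reverse=True)
print(ranking)
```

Raising `alpha` favors semantic similarity, and lowering it favors exact keyword matches. Normalizing first matters because BM25 scores are unbounded while similarity scores usually sit in a fixed range.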

AI-driven document retrieval system using FAISS, BM25 and LLMs for fast, accurate search in legal, academic, corporate and research applications with citations.