[Solved] Filter langchain vector database using as_retriever search_kwargs parameter

Written by - Aionlinecourse

LangChain is a widely used framework for building applications on top of language models. One of its features is integration with vector databases for efficient information retrieval, which lets developers run searches that combine semantic similarity with other constraints. By calling the 'as_retriever' method with the 'search_kwargs' parameter, we can fine-tune our search queries and filter results more effectively.

Solution 1:

If we are using DataStax Astra DB/Cassandra as the vector database, the code looks like this:

import os

import cassio

# Connect to Astra DB; the token and database ID are read from environment variables
cassio.init(token=os.environ["ASTRA_DB_APPLICATION_TOKEN"], database_id=os.environ["ASTRA_DB_ID"])

from langchain.vectorstores.cassandra import Cassandra
table_name = 'vs_investment_kb'
keyspace = 'demo'

# embedding_generator is an embeddings model object created elsewhere (not shown here)
CassVectorStore = Cassandra(
    session=cassio.config.resolve_session(),
    keyspace=keyspace,
    table_name=table_name,
    embedding=embedding_generator,
)

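# Return the 4 most similar documents, restricted to those whose
# "source" metadata equals `file` (a value defined elsewhere)
retrieverSim = CassVectorStore.as_retriever(
    search_type='similarity',
    search_kwargs={
        'k': 4,
        'filter': {"source": file}
    },
)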

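# Create a "RetrievalQA" chain (llm and PROMPT are defined elsewhere)
from langchain.chains import RetrievalQA

chainSim = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retrieverSim,
    chain_type_kwargs={
        'prompt': PROMPT,
        # The retrieved documents fill the {summaries} variable of PROMPT
        'document_variable_name': 'summaries'
    }
)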
# Run it and print results
responseSim = chainSim.run(QUERY)
print(responseSim)

You can check the full example here: https://github.com/smatiolids/astra-agent-memory/blob/main/Explicando%20Retrieval%20Augmented%20Generation.ipynb
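
Conceptually, the search_kwargs above ask the store to consider only documents whose "source" metadata equals the given value and then return the 4 closest matches. The sketch below imitates that behavior in plain Python so it can run without a database. The filtered_top_k helper, the toy 2-d vectors, and the file names are hypothetical illustrations, not LangChain or Cassandra code, and a real store typically applies the filter inside its own query.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def filtered_top_k(query_vec, docs, k=4, metadata_filter=None):
    """Keep docs whose metadata matches every filter entry, then rank by similarity."""
    metadata_filter = metadata_filter or {}
    candidates = [
        d for d in docs
        if all(d["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    candidates.sort(key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return candidates[:k]


# Hypothetical documents with made-up embeddings and sources
docs = [
    {"text": "Bond basics", "vector": [1.0, 0.0], "metadata": {"source": "funds.pdf"}},
    {"text": "Stock basics", "vector": [0.9, 0.1], "metadata": {"source": "stocks.pdf"}},
    {"text": "Bond risks", "vector": [0.6, 0.8], "metadata": {"source": "funds.pdf"}},
]

results = filtered_top_k([1.0, 0.0], docs, k=4, metadata_filter={"source": "funds.pdf"})
print([d["text"] for d in results])
```

Even though "Stock basics" is very close to the query, it is excluded because its source does not match the filter.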

Solution 2:

db.as_retriever() -> VectorStoreRetriever

Calling as_retriever() on a vector store (db) returns a VectorStoreRetriever initialized from that store.

It accepts two arguments:

search_type (Optional[str]): Defines the type of search that the retriever should perform. It can be "similarity" (default), "mmr", or "similarity_score_threshold".

search_kwargs (Optional[Dict]): Keyword arguments to pass to the search function.

It can include things like:

k: the number of documents to return (Default: 4)
score_threshold: minimum relevance score a document must reach when search_type is "similarity_score_threshold"
fetch_k: the number of documents to pass to the MMR algorithm (Default: 20)
lambda_mult: diversity of the results returned by MMR; 1 for minimum diversity and 0 for maximum (Default: 0.5)
filter: filter by document metadata
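
To show how k, fetch_k, and lambda_mult interact, here is a minimal pure-Python sketch of maximal marginal relevance (MMR) selection. It is an illustration under simple assumptions (cosine similarity, toy 2-d vectors), not LangChain's own code; mmr_select and the sample vectors are hypothetical names introduced for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, vectors, k=4, fetch_k=20, lambda_mult=0.5):
    """Return indices of up to k vectors chosen by maximal marginal relevance."""
    # Step 1: shortlist the fetch_k vectors most similar to the query.
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    pool = ranked[:fetch_k]
    selected = []

    def score(i):
        # Reward similarity to the query, penalize similarity to earlier picks.
        relevance = cosine(query_vec, vectors[i])
        redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    # Step 2: repeatedly pick the candidate with the best trade-off.
    while pool and len(selected) < k:
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


vectors = [
    [1.0, 0.0],    # 0: closest to the query
    [0.99, 0.14],  # 1: near-duplicate of 0
    [0.6, 0.8],    # 2: less similar to the query, but different from 0
]
query = [1.0, 0.0]

print(mmr_select(query, vectors, k=2, lambda_mult=1.0))   # relevance only
print(mmr_select(query, vectors, k=2, lambda_mult=0.25))  # favors diversity
```

With lambda_mult=1.0 the near-duplicate is picked second, while with lambda_mult=0.25 the more distinct vector wins. A smaller fetch_k shrinks the shortlist that MMR can choose from.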

Examples:

# Retrieve more documents with higher diversity
# Useful if your dataset has many similar documents
db.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 6, 'lambda_mult': 0.25}
)

# Fetch more documents for the MMR algorithm to consider
# But only return the top 5
db.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 5, 'fetch_k': 50}
)

# Only retrieve documents that have a relevance score
# Above a certain threshold
db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={'score_threshold': 0.8}
)

# Only get the single most similar document from the dataset
db.as_retriever(search_kwargs={'k': 1})

# Use a filter to only retrieve documents whose metadata field
# 'Field_1' has the value 'S' (the exact filter syntax depends on the vector store)
db.as_retriever(
    search_kwargs={'filter': {'Field_1': 'S'}}
)
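
Of the examples above, the "similarity_score_threshold" search type is perhaps the least obvious. The sketch below shows the idea in plain Python: rank documents by relevance score, drop those below the threshold, and keep at most k. It assumes the scores are already normalized so that higher means more relevant; filter_by_score and the sample scores are hypothetical and only illustrate the concept.

```python
def filter_by_score(scored_docs, score_threshold, k=4):
    """Keep at most k (doc, score) pairs whose score meets the threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    kept = [(doc, score) for doc, score in ranked if score >= score_threshold]
    return kept[:k]


# Made-up relevance scores for three documents
scored = [("doc A", 0.91), ("doc B", 0.75), ("doc C", 0.83)]
print(filter_by_score(scored, score_threshold=0.8))
```

A practical consequence is that a high threshold can return fewer than k documents, or none at all.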

Together, the 'as_retriever' method and the 'search_kwargs' parameter let us filter a LangChain vector database so that retrieval is more precise and efficient. Restricting the search to documents that meet specific criteria improves the relevance of the results, and passing more relevant context to the language model helps it produce more accurate answers.

Thank you for reading the article.
