Document Search by Description

For documents without metadata, you can use LLM-generated descriptions to guide document selection. This lightweight approach works best with a small number of documents, since every description is passed to the LLM in a single prompt.

Example Pipeline

PageIndex Tree Generation

Upload all documents into PageIndex to get their doc_id and tree structure.
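A minimal sketch of this step, assuming two hypothetical callables, `submit` and `fetch_tree`, that wrap whichever PageIndex client you use. They are placeholders, not PageIndex SDK functions. The sketch only shows the per-document record the later steps need: the `doc_id`, a file name, and the tree.

```python
from pathlib import Path
from typing import Any, Callable


def build_catalog(
    paths: list[str],
    submit: Callable[[str], str],      # hypothetical: uploads a file, returns its doc_id
    fetch_tree: Callable[[str], Any],  # hypothetical: returns the tree for a doc_id
) -> list[dict]:
    """Upload each file and keep doc_id, doc_name, and tree together."""
    catalog = []
    for path in paths:
        doc_id = submit(path)
        catalog.append(
            {
                "doc_id": doc_id,
                "doc_name": Path(path).name,
                "tree": fetch_tree(doc_id),
            }
        )
    return catalog
```

Keeping the three fields in one record means the description and selection steps can work from the same list without further lookups.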

Description Generation

Generate a description for each document based on its PageIndex tree structure and node summaries.


prompt = f"""
You are given a table of contents structure of a document. 
Your task is to generate a one-sentence description for the document that makes it easy to distinguish from other documents.
    
Document tree structure: {PageIndex_Tree}
 
Directly return the description, do not include any other text.
"""

Search with LLM

Use an LLM to select relevant documents by comparing the user query against the generated descriptions.

Below is a sample prompt that selects documents based on their descriptions:


prompt = f""" 
You are given a list of documents with their IDs, file names, and descriptions. Your task is to select documents that may contain information relevant to answering the user query.
 
Query: {query}
 
Documents: [
    {{
        "doc_id": "xxx",
        "doc_name": "xxx",
        "doc_description": "xxx"
    }}
]
 
Response Format:
{{
    "thinking": "<Your reasoning for document selection>",
    "answer": <Python list of relevant doc_ids>, e.g. ['doc_id1', 'doc_id2']. Return [] if no documents are relevant.
}}
 
Return only the JSON structure, with no additional output.
"""

Retrieve with PageIndex

Use the PageIndex doc_id of the retrieved documents to perform further retrieval via the PageIndex retrieval API.
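As a rough sketch of this last step, the selected `doc_id` values can be looped over and passed to a retrieval call one document at a time. The `retrieve` parameter here is a hypothetical wrapper around the PageIndex retrieval API. Its `(doc_id, query)` signature and return value are assumptions made for illustration, so consult the API reference for the real call.

```python
from typing import Any, Callable


def retrieve_from_selected(
    query: str,
    doc_ids: list[str],
    retrieve: Callable[[str, str], Any],  # hypothetical wrapper: (doc_id, query) -> results
) -> dict[str, Any]:
    """Run per-document retrieval and key the results by doc_id."""
    results: dict[str, Any] = {}
    for doc_id in dict.fromkeys(doc_ids):  # drop duplicate ids, keep first-seen order
        try:
            results[doc_id] = retrieve(doc_id, query)
        except Exception as exc:  # one failing document should not abort the rest
            results[doc_id] = {"error": str(exc)}
    return results
```

Keying results by `doc_id` makes it easy to attribute each retrieved passage back to its source document when composing the final answer.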
