What is PageIndex?

Traditional vector-based RAG relies on semantic similarity. However, similarity ≠ relevance — and in retrieval, relevance is what truly matters. While vector-based RAG efficiently identifies broad thematic content, it often fails to retrieve the exact information required, particularly in specialized domains where many sections share similar language but differ in critical details.

Inspired by AlphaGo, we developed PageIndex, a reasoning-based RAG system that uses tree search to simulate how human experts navigate and extract knowledge from long documents. PageIndex first generates a hierarchical “table of contents” tree that preserves the original document’s logical flow and organizational structure. A retrieval module then navigates the tree to retrieve relevant content for downstream generation tasks. Each retrieved result carries its structural location within the tree, which improves retrieval accuracy and makes it easier to incorporate human preferences or expert knowledge.

🌲 PageIndex Tree Generation

PageIndex generates a hierarchical “table of contents” tree that preserves the original document’s logical flow and organizational structure. This LLM-optimized “table of contents” enables precise navigation and is ready for reasoning-based RAG.

No Vector DB Required: Tree structures are represented as lightweight JSON objects, eliminating the need for expensive vector databases.
No Chunking Required: Preserves the natural document structure, with no artificial text splitting, for better context retention.
Node Summary with Precise Page Referencing: Each node provides a summary and exact page references for precise information extraction.
Optimized for Long Documents: Tree generation is optimized for financial reports, legal documents, and technical manuals that exceed LLM context limits.

Here is an example output. See more example documents and generated trees.


...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
...
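Because the tree is plain JSON, it can be walked with a few lines of standard-library Python. The sketch below is illustrative only and is not part of PageIndex: the helper names `iter_nodes` and `nodes_covering_page` are hypothetical, and it assumes that `start_index` and `end_index` are inclusive page numbers, which the example suggests but does not state.

```python
import json
from typing import Iterator

# Sample tree copied from the example output above (summaries are truncated there too).
TREE_JSON = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
     "start_index": 22, "end_index": 28,
     "summary": "The Federal Reserve's monitoring ..."},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "start_index": 28, "end_index": 31,
     "summary": "In 2023, the Federal Reserve collaborated ..."}
  ]
}
"""


def iter_nodes(node: dict, depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, node) pairs in depth-first, document order."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from iter_nodes(child, depth + 1)


def nodes_covering_page(root: dict, page: int) -> list[str]:
    """Return the node_ids whose page range contains `page`.

    Assumption: start_index and end_index are inclusive page numbers.
    """
    return [
        n["node_id"]
        for _, n in iter_nodes(root)
        if n["start_index"] <= page <= n["end_index"]
    ]


tree = json.loads(TREE_JSON)

# Print an indented outline, one line per node.
for depth, n in iter_nodes(tree):
    print(f"{'  ' * depth}{n['node_id']} {n['title']} "
          f"(pages {n['start_index']}-{n['end_index']})")
```

Under the inclusive-range assumption, `nodes_covering_page(tree, 28)` returns both `"0007"` and `"0008"`, because the two sibling sections share page 28.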

🔎 PageIndex Retrieval

Once documents are transformed into hierarchical tree structures, the PageIndex retrieval module extracts relevant context from these trees. It leverages both LLM-based tree search and value-based tree search to perform efficient and accurate retrieval.

Specifically, given a query and a tree, the retrieval module performs a tree search and returns the most relevant nodes, with relevant paragraphs and the corresponding tree search trajectory. The retrieval process has the following properties:

No Top-K Selection Required: Tree search automatically identifies all relevant tree nodes without the need for manual parameter tuning.
Transparent Node Trajectories: Returns the complete search path through the tree structure, offering transparency and rich contextual information.
Exact Page References: Every retrieved node includes precise page numbers and locations from the original document for verifiable information retrieval.
LLM-Ready Output Format: Structured data output with relevant paragraphs and search trajectories, ready for downstream LLM processing.
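To make the idea of a tree search with trajectories concrete, here is a toy sketch. It is not the PageIndex retrieval algorithm, whose internals are not described here. The function names `tree_search` and `keyword_judge` are hypothetical, and the keyword-overlap judge is only a stand-in for whatever LLM or value model decides relevance in the real system. It does show two of the properties listed above: every node the judge accepts is returned (no top-k cutoff), and each hit carries the path taken from the root.

```python
import re
from typing import Callable

# Small tree in the same shape as the tree-generation example.
TREE = {
    "title": "Financial Stability",
    "node_id": "0006",
    "summary": "The Federal Reserve ...",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "summary": "The Federal Reserve's monitoring ..."},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008",
         "summary": "In 2023, the Federal Reserve collaborated ..."},
    ],
}


def subtree_text(node: dict) -> str:
    """Concatenate titles and summaries of a node and all its descendants."""
    parts = [node.get("title", ""), node.get("summary", "")]
    parts.extend(subtree_text(child) for child in node.get("nodes", []))
    return " ".join(parts).lower()


def keyword_judge(query: str, node: dict) -> bool:
    """Toy relevance judge (an assumption): any query word appears in the subtree."""
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    node_words = set(re.findall(r"[a-z]+", subtree_text(node)))
    return bool(query_words & node_words)


def tree_search(node: dict, query: str,
                judge: Callable[[str, dict], bool],
                path: tuple = ()) -> list[dict]:
    """Descend into every branch the judge accepts; prune the rest.

    Returns the deepest accepted nodes, each with its root-to-node trajectory.
    """
    if not judge(query, node):
        return []
    path = (*path, node["node_id"])
    hits = []
    for child in node.get("nodes", []):
        hits.extend(tree_search(child, query, judge, path))
    if hits:
        return hits
    # No accepted child: this node is the most specific match on this branch.
    return [{"node_id": node["node_id"], "title": node["title"],
             "trajectory": list(path)}]


results = tree_search(TREE, "monitoring vulnerabilities", keyword_judge)
print(results)
```

Swapping `keyword_judge` for a function that asks an LLM whether a node's title and summary look relevant to the query would turn this into a rough approximation of an LLM-guided search, without changing the traversal.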

Here is an example response from the PageIndex Retrieval API.


{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [
        {
          "physical_index": 10,
          "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
        }
      ]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [
        {
          "physical_index": 15,
          "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
        }
      ]
    }
  ]
}
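A response in this shape can be flattened into a cited context string before it is passed to a downstream LLM. The sketch below is one possible way to do that, not an official client: `collect_passages` and `build_context` are hypothetical helpers, and it assumes that `physical_index` is the page number in the original document. The passage text is copied verbatim from the example response, including its truncation.

```python
# Example response, copied from the sample above into a Python dict.
RESPONSE = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {
            "title": "March 2024 Summary",
            "node_id": "0005",
            "relevant_contents": [{
                "physical_index": 10,
                "relevant_content": "The labor market has gained averaging "
                                    "239,000 per month since June 2023...",
            }],
        },
        {
            "title": "June 2023 Summary",
            "node_id": "0006",
            "relevant_contents": [{
                "physical_index": 15,
                "relevant_content": "The labor market has remained very tight, "
                                    "with job gains averaging 314,000 per month during...",
            }],
        },
    ],
}


def collect_passages(node: dict, trail: tuple = ()) -> list[dict]:
    """Walk the response tree and collect every passage with its title path."""
    trail = (*trail, node["title"])
    passages = [
        {
            "path": " > ".join(trail),
            "page": item["physical_index"],  # assumed to be a page number
            "text": item["relevant_content"],
        }
        for item in node.get("relevant_contents", [])
    ]
    for child in node.get("nodes", []):
        passages.extend(collect_passages(child, trail))
    return passages


def build_context(response: dict) -> str:
    """Render one cited line per passage, ready to paste into a prompt."""
    return "\n".join(
        f"[{p['path']}, p. {p['page']}] {p['text']}"
        for p in collect_passages(response)
    )


print(build_context(RESPONSE))
```

Keeping the title path and page next to each passage lets the generated answer cite where in the document a claim came from.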

💬 Help & Community

🤝 Join our Discord
📨 Leave us a message