Introduction

What is PageIndex?

PageIndex is a reasoning-based retrieval system that simulates how human experts navigate and extract knowledge from documents. Rather than relying on vector-based semantic similarity search, it transforms documents into hierarchical tree structures and conducts structured tree searches to identify and retrieve the most relevant information.

PageIndex Workflow

Get started with our cookbook for a quick hands-on tutorial.

PageIndex Tools

PageIndex provides four tools:

  • PageIndex OCR: converts PDFs to markdown with global structure preserved, ready for tree generation.
  • PageIndex Tree Generation: generates hierarchical tree indexes for documents.
  • PageIndex Retrieval: performs retrieval via tree search.
  • PageIndex MCP: integrates PageIndex with your LLM agents.

📑 PageIndex OCR

Classic OCR systems analyze each page in isolation: they divide it into blocks, process each block independently, and return flat, fragmented output with structural errors and no document hierarchy. PageIndex OCR instead uses the context window of large vision-language models to treat the entire document as a cohesive, structured whole. It generates accurate page-level markdown and also preserves the hierarchical organization of content (titles, sections, subsections, bullet lists, tables, references) across page boundaries.

  • Accurate Page-level Markdown Content: PageIndex OCR can transform each page into LLM-ready markdown text.

  • Preserving Multi-page Structure: PageIndex OCR preserves the hierarchical structure of the whole document, significantly improving markdown rendering and document representation.

  • Fast Processing: PageIndex OCR handles long documents efficiently and scales to long context windows without compromising speed.

🌲 PageIndex Tree Generation

PageIndex generates a hierarchical “table of contents” tree that maintains the original document’s logical flow and organizational structure. This LLM-optimized “table of contents” enables precise navigation and is ready for reasoning-based RAG.

  • No Vector DB Required: Tree structures are represented as lightweight JSON objects, avoiding the overhead and complexity of vector databases.

  • No Chunking Required: Preserves natural document structure without artificial text splitting for better context retention.

  • Node Location and Summary: Each node carries its page number and a summary, so a search can navigate to the right part of the document precisely.

  • Optimized for Long Documents: Tree generation is optimized for financial reports, legal documents, and technical manuals that exceed LLM context limits.

Here is an example of the output format. See our cookbook for a practical example.

...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "page_index": 21,
  "text": "The Federal Reserve maintains financial stability through comprehensive...",
  "prefix_summary": "This section discusses...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "page_index": 22,
      "text": "The Federal Reserve's monitoring focuses on identifying...",
      "summary": "This section discusses..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "page_index": 28,
      "text": "In 2023, the Federal Reserve collaborated internationally...",
      "summary": "This section discusses..."
    }
  ]
}
...
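Because the tree is plain JSON, it can be traversed with ordinary recursion. The sketch below is an illustration rather than part of PageIndex itself: it assumes only the field names shown in the example above (`title`, `node_id`, `page_index`, `nodes`), and the helper names `walk` and `outline` are hypothetical.

```python
from typing import Iterator

# Fragment of the example tree above, reduced to the fields this sketch uses.
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "page_index": 21,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities",
         "node_id": "0007", "page_index": 22},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "page_index": 28},
    ],
}


def walk(node: dict, depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    # Leaf nodes in the example have no "nodes" key, so default to an empty list.
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


def outline(root: dict) -> list[str]:
    """Render the tree as an indented table of contents with page numbers."""
    return [
        f"{'  ' * depth}[{n['node_id']}] {n['title']} (p. {n['page_index']})"
        for depth, n in walk(root)
    ]


if __name__ == "__main__":
    print("\n".join(outline(tree)))
```

An outline like this is a convenient way to check a generated tree against the source document before running retrieval on it.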

🔎 PageIndex Retrieval

Once documents are transformed into hierarchical tree structures, the PageIndex retrieval module extracts relevant context from these trees. It leverages both LLM-based tree search and value-based tree search to perform efficient and accurate retrieval.

Specifically, given a query and a tree, the retrieval module performs a tree search and returns the most relevant nodes, along with relevant paragraphs and corresponding tree search trajectories. The retrieval process has the following properties:

  • No Top-K Selection Required: Tree search automatically identifies all relevant tree nodes, avoiding manual parameter tuning and arbitrary cutoffs in retrieval.

  • Transparent Node Trajectories: Returns the complete search path through the tree structure, offering transparency and rich contextual information.

  • Exact Page References: Every retrieved node includes precise page numbers and locations from the original document for verifiable information retrieval.

  • LLM-Ready Output Format: Structured data output with relevant paragraphs and search trajectories, ready for downstream LLM processing.
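To make the idea of a tree search with trajectories concrete, here is a minimal sketch. It is not PageIndex's retrieval algorithm: where PageIndex uses LLM-based and value-based search, this sketch substitutes a hypothetical keyword-overlap judge (`keyword_judge`), and the function name `tree_search` and the result fields are assumptions made for illustration. What it does show is how descending only into relevant branches returns every matching node without a top-k cutoff, together with the path taken to reach it.

```python
from typing import Callable

Node = dict


def keyword_judge(query: str, node: Node) -> bool:
    """Stand-in for an LLM relevance judgment (assumption, not PageIndex's method):
    a node is relevant if any longer query word appears in its title or summary."""
    text = f"{node.get('title', '')} {node.get('summary', '')}".lower()
    words = {w for w in query.lower().split() if len(w) > 3}
    return any(w in text for w in words)


def tree_search(query: str, root: Node,
                judge: Callable[[str, Node], bool] = keyword_judge) -> list[dict]:
    """Descend into relevant children; report the deepest relevant node on each
    branch along with the trajectory of node IDs that led to it."""
    results: list[dict] = []

    def visit(node: Node, path: list[str]) -> None:
        path = path + [node["node_id"]]
        relevant = [c for c in node.get("nodes", []) if judge(query, c)]
        if not relevant:
            # No child is a better match, so this node is the answer for the branch.
            results.append({"node_id": node["node_id"],
                            "title": node["title"],
                            "page_index": node["page_index"],
                            "trajectory": path})
            return
        for child in relevant:
            visit(child, path)

    if judge(query, root):
        visit(root, [])
    return results


# Example tree using the field names from the tree generation output.
tree = {
    "title": "Financial Stability", "node_id": "0006", "page_index": 21,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities",
         "node_id": "0007", "page_index": 22},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "page_index": 28},
    ],
}

hits = tree_search("monitoring financial vulnerabilities", tree)
```

Swapping `keyword_judge` for a function that asks an LLM whether a node's title and summary bear on the query keeps the same control flow while replacing string matching with reasoning.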

Here is an example response from the PageIndex Retrieval API.

{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [
        {
          "page_index": 10,
          "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
        }
      ]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [
        {
          "page_index": 15,
          "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
        }
      ]
    }
  ]
}
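A response shaped like the one above can be flattened into page-cited passages for a downstream LLM prompt. The following is a sketch that assumes only the fields visible in the example (`title`, `node_id`, `nodes`, `relevant_contents`, `page_index`, `relevant_content`); the helpers `collect_passages` and `to_context` are hypothetical names, not part of the PageIndex API.

```python
# The example response above, with the passage text copied as shown.
response = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {"title": "March 2024 Summary", "node_id": "0005",
         "relevant_contents": [{
             "page_index": 10,
             "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."}]},
        {"title": "June 2023 Summary", "node_id": "0006",
         "relevant_contents": [{
             "page_index": 15,
             "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."}]},
    ],
}


def collect_passages(node: dict, trail: tuple[str, ...] = ()) -> list[dict]:
    """Flatten the nested response into passages, each tagged with its title
    path through the tree and its page number."""
    trail = trail + (node["title"],)
    passages = [
        {"path": " > ".join(trail),
         "node_id": node["node_id"],
         "page_index": item["page_index"],
         "text": item["relevant_content"]}
        for item in node.get("relevant_contents", [])
    ]
    for child in node.get("nodes", []):
        passages.extend(collect_passages(child, trail))
    return passages


def to_context(passages: list[dict]) -> str:
    """Format passages as one citation-prefixed line each."""
    return "\n".join(
        f"[{p['path']}, p. {p['page_index']}] {p['text']}" for p in passages
    )


passages = collect_passages(response)
context = to_context(passages)
```

Keeping the title path and page number next to each passage lets the final answer cite where in the original document each statement came from.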