What is PageIndex?
PageIndex is a reasoning-based retrieval system that simulates how human experts navigate and extract knowledge from documents. Rather than relying on vector-based semantic similarity search, it transforms documents into hierarchical tree structures and conducts structured tree searches to identify and retrieve the most relevant information.

Get started with our cookbook for a quick hands-on tutorial.
PageIndex Tools
PageIndex has four tools:
- PageIndex OCR: converts PDFs to markdown with global structure preserved, ready for tree generation.
- PageIndex Tree Generation: generates hierarchical tree indexes for documents.
- PageIndex Retrieval: performs retrieval via tree search.
- PageIndex MCP: integrates PageIndex with your LLM agents.
📑 PageIndex OCR
Classic OCR systems analyze each page in isolation: they divide it into blocks, process each block independently, and return a flat, fragmented output with structural errors and no document hierarchy. PageIndex OCR leverages the context window of large vision-language models and treats the entire document as a cohesive, structured whole. It not only generates accurate page-level markdown, but also preserves the hierarchical organization of content (titles, sections, subsections, bullet lists, tables, references) across page boundaries.
- Accurate Page-level Markdown Content: PageIndex OCR can transform each page into LLM-ready markdown text.
- Preserving Multi-page Structure: PageIndex OCR preserves the hierarchical structure of the whole document, significantly improving markdown rendering and document representation.
- Fast Processing: PageIndex OCR handles long documents efficiently and scales to long context windows without compromising speed.
🌲 PageIndex Tree Generation
PageIndex generates a hierarchical “table of contents” tree that maintains the original document’s logical flow and organizational structure. This LLM-optimized “table of contents” enables precise navigation and is ready for reasoning-based RAG.
- No Vector DB Required: Tree structures are represented as lightweight JSON objects, avoiding the overhead and complexity of vector databases.
- No Chunking Required: Preserves natural document structure without artificial text splitting for better context retention.
- Node Location and Summary: Provides node page number and summary for precise information navigation.
- Optimized for Long Documents: Tree generation is optimized for financial reports, legal documents, and technical manuals that exceed LLM context limits.
Here is an example of the output format; see our cookbook for a practical example.
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "page_index": 21,
  "text": "The Federal Reserve maintains financial stability through comprehensive...",
  "prefix_summary": "This section discusses...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "page_index": 22,
      "text": "The Federal Reserve's monitoring focuses on identifying...",
      "summary": "This section discusses..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "page_index": 28,
      "text": "In 2023, the Federal Reserve collaborated internationally...",
      "summary": "This section discusses..."
    }
  ]
}
...
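Because the tree is plain JSON, it can be inspected with standard tooling. The following is a minimal sketch, not part of PageIndex itself: it loads a trimmed copy of the example node above (the `text` fields are omitted for brevity) and prints an indented outline. The helper names `iter_nodes` and `outline` are hypothetical, and the sketch assumes every node carries the `title`, `node_id`, and `page_index` fields shown in the example.

```python
import json
from typing import Iterator

# Trimmed copy of the example node above ("text" fields omitted for brevity).
TREE_JSON = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "page_index": 21,
  "prefix_summary": "This section discusses...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "page_index": 22,
      "summary": "This section discusses..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "page_index": 28,
      "summary": "This section discusses..."
    }
  ]
}
"""


def iter_nodes(node: dict, depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from iter_nodes(child, depth + 1)


def outline(tree: dict) -> list[str]:
    """Render the tree as an indented table of contents with page numbers."""
    return [
        f"{'  ' * depth}[{n['node_id']}] {n['title']} (p. {n['page_index']})"
        for depth, n in iter_nodes(tree)
    ]


tree = json.loads(TREE_JSON)
for line in outline(tree):
    print(line)
```

The same traversal can feed titles and summaries to an LLM prompt, since each node is small enough to serialize on its own.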
🔎 PageIndex Retrieval
Once documents are transformed into hierarchical tree structures, the PageIndex retrieval module extracts relevant context from these trees. It leverages both LLM-based tree search and value-based tree search to perform efficient and accurate retrieval.
Specifically, given a query and a tree, the retrieval module performs a tree search and returns the most relevant nodes, along with relevant paragraphs and corresponding tree search trajectories. The retrieval process has the following properties:
- No Top-K Selection Required: Tree search automatically identifies all relevant tree nodes, avoiding manual parameter tuning and arbitrary cutoffs in retrieval.
- Transparent Node Trajectories: Returns the complete search path through the tree structure, offering transparency and rich contextual information.
- Exact Page References: Every retrieved node includes precise page numbers and locations from the original document for verifiable information retrieval.
- LLM-Ready Output Format: Structured data output with relevant paragraphs and search trajectories, ready for downstream LLM processing.
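To make these properties concrete, here is a toy sketch of a tree search over the tree format shown earlier. It is not PageIndex's implementation: the function `tree_search` is hypothetical, it visits every node rather than pruning branches, and the `keyword_judge` function is a crude stand-in for the LLM-based or value-based relevance decision. It does illustrate two of the ideas above: every node the judge accepts is returned (no top-k cutoff), and each hit records the trajectory of node IDs from the root.

```python
from typing import Callable

# Small tree in the format produced by tree generation ("text" fields omitted).
TREE = {
    "title": "Financial Stability",
    "node_id": "0006",
    "page_index": 21,
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "page_index": 22,
            "summary": "This section discusses...",
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "page_index": 28,
            "summary": "This section discusses...",
        },
    ],
}


def keyword_judge(query: str, node: dict) -> bool:
    """Stand-in relevance judge: true if any query word appears in title/summary.

    A real system would ask an LLM (or a value function) instead.
    """
    text = f"{node.get('title', '')} {node.get('summary', '')}".lower()
    return any(word in text for word in query.lower().split())


def tree_search(
    node: dict,
    query: str,
    is_relevant: Callable[[str, dict], bool],
    path: tuple[str, ...] = (),
) -> list[dict]:
    """Depth-first search returning every relevant node with its trajectory."""
    path = path + (node["node_id"],)
    hits = []
    if is_relevant(query, node):
        hits.append(
            {
                "node_id": node["node_id"],
                "title": node["title"],
                "page_index": node["page_index"],
                "trajectory": list(path),  # node IDs from the root to this node
            }
        )
    for child in node.get("nodes", []):
        hits.extend(tree_search(child, query, is_relevant, path))
    return hits


hits = tree_search(TREE, "international cooperation", keyword_judge)
for hit in hits:
    print(hit["trajectory"], hit["title"], f"(p. {hit['page_index']})")
```

Passing the judge as a parameter keeps the traversal separate from the relevance decision, so the keyword matcher can be swapped for a model call without touching the search itself.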
Here is an example response from the PageIndex Retrieval API.
{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [{
        "page_index": 10,
        "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
      }]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [{
        "page_index": 15,
        "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
      }]
    }]
}
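A response in this shape can be flattened into page-cited context for a downstream LLM prompt. The sketch below is one possible way to do that, assuming the response has exactly the fields shown in the example above; the helpers `collect_passages` and `build_context` are hypothetical and not part of the PageIndex API.

```python
from typing import Iterator

# The example response above, as a Python dict.
RESPONSE = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {
            "title": "March 2024 Summary",
            "node_id": "0005",
            "relevant_contents": [
                {
                    "page_index": 10,
                    "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023...",
                }
            ],
        },
        {
            "title": "June 2023 Summary",
            "node_id": "0006",
            "relevant_contents": [
                {
                    "page_index": 15,
                    "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during...",
                }
            ],
        },
    ],
}


def collect_passages(node: dict, path: tuple[str, ...] = ()) -> Iterator[dict]:
    """Yield each relevant passage with its page number and title trajectory."""
    path = path + (node["title"],)
    for item in node.get("relevant_contents", []):
        yield {
            "trajectory": " > ".join(path),
            "page_index": item["page_index"],
            "text": item["relevant_content"],
        }
    for child in node.get("nodes", []):
        yield from collect_passages(child, path)


def build_context(response: dict) -> str:
    """Join passages into one citation-tagged block of prompt context."""
    return "\n".join(
        f"[p. {p['page_index']}] {p['trajectory']}: {p['text']}"
        for p in collect_passages(response)
    )


passages = list(collect_passages(RESPONSE))
print(build_context(RESPONSE))
```

Keeping the page number and trajectory next to each passage lets the final answer cite where in the original document a claim came from.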