API Reference

📑 PageIndex PDF Processing API Endpoints

Submit Document for Processing

Endpoint: POST https://api.pageindex.ai/doc/
Description: Uploads a PDF document for processing. The system runs both tree generation and OCR automatically. The call returns a document identifier (doc_id) immediately; use it in subsequent operations to check status and fetch results.

Request Body (multipart/form-data):

file (binary, required): PDF document.

Example:


import requests
 
api_key = "YOUR_API_KEY"
file_path = "./2023-annual-report.pdf"
 
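# Upload the PDF as multipart/form-data; the API key goes in the api_key header.
with open(file_path, "rb") as file:
    response = requests.post(
        "https://api.pageindex.ai/doc/",
        headers={"api_key": api_key},
        files={"file": file}
    )

# Keep the returned identifier for the status, retrieval, and delete calls.
doc_id = response.json()["doc_id"]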

Example Response:


{ "doc_id": "abc123def456" }

Get Processing Status & Results

Endpoint: GET https://api.pageindex.ai/doc/{doc_id}/
Description: Check processing status and (when complete) get the results for a submitted document.

Parameters (URL Path):

doc_id (string, required): Document ID.

Query Parameters:

type (string, optional): Result type. Use "tree" for tree structure or "ocr" for OCR results. If not specified, returns the default result type based on the original processing type.
format (string, optional): For OCR results, specify output format. Use "page" (default) for page-based results, "node" for node-based results, or "raw" for concatenated markdown.
summary (boolean, optional): For tree results, include a summary for each node in the response. Default is false.
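
The query parameters above combine into a single request URL. The following is a minimal sketch of a helper that assembles it. The name build_result_url is hypothetical (it is not part of any PageIndex client), the client-side checks only mirror the values listed above, and the "true"/"false" spelling for summary is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pageindex.ai/doc"

# Allowed values as listed above; the server remains the authority.
RESULT_TYPES = {"tree", "ocr"}
OCR_FORMATS = {"page", "node", "raw"}


def build_result_url(doc_id, result_type=None, ocr_format=None, summary=None):
    """Build the GET URL for a document's status and results (hypothetical helper)."""
    params = {}
    if result_type is not None:
        if result_type not in RESULT_TYPES:
            raise ValueError(f"unknown type: {result_type!r}")
        params["type"] = result_type
    if ocr_format is not None:
        if ocr_format not in OCR_FORMATS:
            raise ValueError(f"unknown format: {ocr_format!r}")
        params["format"] = ocr_format
    if summary is not None:
        # Assumption: booleans are sent as lowercase "true"/"false".
        params["summary"] = "true" if summary else "false"
    query = urlencode(params)
    return f"{BASE_URL}/{doc_id}/" + (f"?{query}" if query else "")
```

The returned string can be passed directly to requests.get together with the api_key header, as in the examples in this section.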

Example - Get OCR Results:


import requests
 
api_key = "YOUR_API_KEY"
doc_id = "abc123def456"
 
# Get OCR results in page format (the default)
response = requests.get(
    f"https://api.pageindex.ai/doc/{doc_id}/?type=ocr",
    headers={"api_key": api_key}
)
 
# Get OCR results in node format
response = requests.get(
    f"https://api.pageindex.ai/doc/{doc_id}/?type=ocr&format=node",
    headers={"api_key": api_key}
)
 
# Get OCR results in raw format (concatenated markdown)
response = requests.get(
    f"https://api.pageindex.ai/doc/{doc_id}/?type=ocr&format=raw",
    headers={"api_key": api_key}
)

Example - Get Tree Structure:


import requests
 
api_key = "YOUR_API_KEY"
doc_id = "abc123def456"
 
response = requests.get(
    f"https://api.pageindex.ai/doc/{doc_id}/?type=tree",
    headers={"api_key": api_key}
)

Example Response (Tree Processing):


{
  "doc_id": "abc123def456",
  "status": "processing",
  "retrieval_ready": false
}
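
Because a document can still be in the "processing" state when first queried, clients typically poll this endpoint. The sketch below shows one way to do that. The function name wait_for_document, the interval, and the timeout are all illustrative assumptions, not values recommended by the API, and the loop simply stops at the first status other than "processing" because no other status values are documented here.

```python
import time


def wait_for_document(fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll until the status is no longer "processing", then return the last response.

    fetch_status: zero-argument callable returning the parsed JSON response (a dict).
    interval and timeout are illustrative defaults, not API recommendations.
    """
    waited = 0.0
    while True:
        body = fetch_status()
        if body.get("status") != "processing":
            return body
        if waited >= timeout:
            raise TimeoutError(f"still processing after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

With requests, fetch_status could be a lambda that calls requests.get on the tree endpoint with the api_key header and returns .json(). Injecting the callable keeps the loop independent of the HTTP library and easy to test.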

Example Response (Tree Completed):


{
  "doc_id": "abc123def456",
  "status": "completed",
  "retrieval_ready": true,
  "result": [
    ...
    {
      "title": "Financial Stability",
      "node_id": "0006",
      "page_index": 21,
      "text": "The Federal Reserve maintains financial stability through comprehensive monitoring and regulatory oversight...",
      "nodes": [
        {
          "title": "Monitoring Financial Vulnerabilities",
          "node_id": "0007",
          "page_index": 22,
          "text": "The Federal Reserve's monitoring focuses on identifying and assessing potential risks..."
        },
        {
          "title": "Domestic and International Cooperation and Coordination",
          "node_id": "0008",
          "page_index": 28,
          "text": "In 2023, the Federal Reserve collaborated internationally with central banks and regulatory authorities..."
        }
      ]
    }
    ...
  ]
}
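
The "result" list is a nested tree: each node may carry its children under "nodes". As a rough illustration, the sketch below walks that structure depth-first and renders an indented outline. The helper names iter_nodes and outline are hypothetical, and the code assumes every node has the title, node_id, and page_index fields shown in the example response.

```python
def iter_nodes(nodes, depth=0):
    """Yield (depth, node) pairs in document order, parents before children."""
    for node in nodes:
        yield depth, node
        # Leaf nodes in the example response have no "nodes" key.
        yield from iter_nodes(node.get("nodes", []), depth + 1)


def outline(result):
    """Render the tree as indented 'node_id title (p. N)' lines."""
    return [
        f"{'  ' * depth}{node['node_id']} {node['title']} (p. {node['page_index']})"
        for depth, node in iter_nodes(result)
    ]
```

Calling outline(response.json()["result"]) on a completed tree response yields one line per node, indented by depth.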

Notes:

For tree generation: The "result" field contains the hierarchical tree structure.
For OCR processing: The format of the "result" field depends on the format parameter:
- "page" (default): List of page objects, each containing page_index, markdown, and images.
- "node": List of node objects, organized by document structure.
- "raw": Single string containing all markdown content concatenated together.
Fields of each page object:
- page_index is 1-based (the first page is 1).
- markdown contains the recognized text in markdown format.
- images is a list of base64-encoded images detected on that page; it may be empty.
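
To make the page format concrete, here is a small sketch that joins the per-page markdown and decodes the embedded images. The helper names are hypothetical, and it assumes each entry in images is a plain base64 string (no "data:...;base64," prefix); adjust if the API returns a different shape.

```python
import base64


def pages_to_markdown(pages):
    """Join page-format OCR results into one markdown string, in page order."""
    ordered = sorted(pages, key=lambda page: page["page_index"])
    return "\n\n".join(page["markdown"] for page in ordered)


def decode_images(pages):
    """Return {(page_index, position): image_bytes} for every image on every page.

    Assumption: each entry in "images" is a plain base64 string.
    """
    decoded = {}
    for page in pages:
        for position, encoded in enumerate(page.get("images", [])):
            decoded[(page["page_index"], position)] = base64.b64decode(encoded)
    return decoded
```

Both functions take the "result" list from a completed response requested with type=ocr and format=page.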

Delete a PageIndex Document

Endpoint: DELETE https://api.pageindex.ai/doc/{doc_id}/
Description: Permanently delete a PageIndex document and all its associated data.

Parameters (URL Path):

doc_id (string, required): Document ID.

Example:


import requests
 
api_key = "YOUR_API_KEY"
doc_id = "abc123def456"
 
response = requests.delete(
    f"https://api.pageindex.ai/doc/{doc_id}/",
    headers={"api_key": api_key}
)

🔍 PageIndex Retrieval API (Legacy)

⚠️ We are working on a new agentic retrieval API. See the agentic retrieval tutorial for a minimal preview.
The legacy retrieval API described here remains available for backward compatibility.

Retrieve from a PageIndex Document

Endpoint: POST https://api.pageindex.ai/retrieval/
Description: Submit a query to create a retrieval task for a specific PageIndex document. Returns a retrieval task ID (retrieval_id).

Before Retrieval

Before submitting a retrieval query, confirm that the document is ready by checking that the retrieval_ready field is true in the tree endpoint response:


import requests

api_key = "YOUR_API_KEY"
doc_id = "abc123def456"

# Check if document is ready for retrieval
tree_response = requests.get(
    f"https://api.pageindex.ai/doc/{doc_id}/?type=tree",
    headers={"api_key": api_key}
)
retrieval_ready = tree_response.json().get("retrieval_ready")

Parameters (in JSON body):

doc_id (string, required): The PageIndex document ID to retrieve from.
query (string, required): The user question or information need.
thinking (boolean, optional): If set to true, the model will first plan what information is required before performing retrieval, helping you gather more comprehensive and relevant information. The default is false.

Example:


import requests
 
api_key = "YOUR_API_KEY"
payload = {
    "doc_id": "abc123def456",
    "query": "What are the main sources of revenue?",
    "thinking": False
}
 
response = requests.post(
    "https://api.pageindex.ai/retrieval/",
    headers={"api_key": api_key},
    json=payload
)

Example Response:


{
  "retrieval_id": "xyz789ghi012"
}

Get Retrieval Status & Results

Endpoint: GET https://api.pageindex.ai/retrieval/{retrieval_id}/
Description: Get the status and, when ready, the result for a specific retrieval query.

Parameters (URL Path):

retrieval_id (string, required): Retrieval task ID returned by POST /retrieval/.

Example:


import requests
 
api_key = "YOUR_API_KEY"
retrieval_id = "xyz789ghi012"
 
response = requests.get(
    f"https://api.pageindex.ai/retrieval/{retrieval_id}/",
    headers={"api_key": api_key}
)

Example Response (Processing):


{
  "retrieval_id": "xyz789ghi012",
  "status": "processing"
}

Example Response (Completed):


{
  "retrieval_id": "xyz789ghi012",
  "doc_id": "abc123def456",
  "status": "completed",
  "query": "What are the recent trends in the labor market?",
  "retrieved_nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [
        {
          "page_index": 10,
          "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
        }
      ]
    }
  ]
}
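
A completed response nests the matched passages two levels deep (retrieved_nodes, then relevant_contents). As an illustration of consuming it, the sketch below flattens the response into citation-style lines, for example to pass as context to a downstream model. The function name format_retrieved and the output layout are assumptions made for this example.

```python
def format_retrieved(result):
    """Turn a completed retrieval response into citation-style context lines."""
    if result.get("status") != "completed":
        raise ValueError(f"retrieval not completed: {result.get('status')!r}")
    lines = []
    for node in result.get("retrieved_nodes", []):
        for item in node.get("relevant_contents", []):
            lines.append(
                f"[{node['title']}, p. {item['page_index']}] {item['relevant_content']}"
            )
    return lines
```

The function raises if it is called on a response that is still processing, so it pairs naturally with a polling step on this endpoint.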

📝 Markdown Processing API

Process markdown files to generate PageIndex hierarchical tree structures without requiring PDF conversion.

Convert Markdown to Tree Structure

Endpoint: POST https://api.pageindex.ai/markdown/
Description: Upload a markdown file and convert it directly to a hierarchical tree structure. This endpoint extracts the document structure based on markdown headers (#, ##, ###, etc.) and optionally applies tree thinning and generates summaries.

Request Body (multipart/form-data):

Required Parameters:

file (binary, required): Markdown document (.md or .markdown files).

Optional Parameters:

if_add_node_id (string, optional): Whether to add node IDs. Options: "yes" or "no". Default: "yes".
if_add_node_summary (string, optional): Whether to add node summaries. Options: "yes" or "no". Default: "yes".
if_add_node_text (string, optional): Whether to include node text content. Options: "yes" or "no". Default: "yes".
if_add_doc_description (string, optional): Whether to add document description. Options: "yes" or "no". Default: "no".

Example:


import requests
 
api_key = "YOUR_API_KEY"
 
with open("./README.md", "rb") as file:
    response = requests.post(
        "https://api.pageindex.ai/markdown/",
        headers={"api_key": api_key},
        files={"file": file}
    )
 
result = response.json()

Example Response:


{
  "success": true,
  "doc_name": "README",
  "structure": [
    {
      "title": "Getting Started",
      "node_id": "0000",
      "summary": "Introduction and setup guide for the API...",
      "line_num": 1,
      "nodes": [
        {
          "title": "Installation",
          "node_id": "0001",
          "summary": "Installation instructions using pip...",
          "line_num": 5
        },
        {
          "title": "Authentication",
          "node_id": "0002",
          "summary": "How to authenticate with the API...",
          "line_num": 10
        }
      ]
    }
  ]
}

Notes:

Tree thinning can be applied to merge small nodes with their children when their token count is below a threshold.
Node summaries and document descriptions are generated by an LLM.
The line_num field indicates the starting line number of each section in the original markdown file.
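
Because line_num marks where each section starts, the original file can be sliced back into per-section text on the client side. The sketch below illustrates the idea. The helper names are hypothetical, and it assumes line_num is 1-based (as the example response suggests) and that a section's own text runs up to the line before the next section starts.

```python
def section_starts(structure):
    """Return (line_num, title) for every node, sorted by starting line."""
    starts = []
    stack = list(structure)
    while stack:
        node = stack.pop()
        starts.append((node["line_num"], node["title"]))
        stack.extend(node.get("nodes", []))
    return sorted(starts)


def split_sections(markdown_text, structure):
    """Return (title, text) pairs, each covering a section up to the next heading.

    Assumption: line_num is 1-based.
    """
    lines = markdown_text.splitlines()
    starts = section_starts(structure)
    sections = []
    for i, (line_num, title) in enumerate(starts):
        end = starts[i + 1][0] - 1 if i + 1 < len(starts) else len(lines)
        sections.append((title, "\n".join(lines[line_num - 1:end])))
    return sections
```

Note that a parent section's text here excludes its children, since each child starts its own slice.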
