📑 PDF Processing: Upload, process, and manage PDF documents
💭 Chat API (beta): Conversational AI over your documents
📝 Markdown Processing: Convert markdown to tree structures
🔍 Retrieval (legacy): Query and retrieve from processed documents
All endpoints require an api_key header. You can get your API key from the Developer Dashboard.
📑 PageIndex PDF Processing API Endpoints
Submit Document for Processing
- Endpoint: POST https://api.pageindex.ai/doc/
- Description: Uploads a PDF document for processing. The system automatically processes both tree generation and OCR, and immediately returns a document identifier (doc_id) for subsequent operations.
Request Body (multipart/form-data):
- file (binary, required): PDF document.
- mode (string, optional): Processing mode. Set to "mcp" to make the document accessible via PageIndex MCP.
Example
import requests
api_key = "YOUR_API_KEY"
file_path = "./2023-annual-report.pdf"
with open(file_path, "rb") as file:
    response = requests.post(
        "https://api.pageindex.ai/doc/",
        headers={"api_key": api_key},
        files={"file": file}
    )
Example Response:
{ "doc_id": "pi-abc123def456" }
Get Processing Status & Results
- Endpoint: GET https://api.pageindex.ai/doc/{doc_id}/
- Description: Check processing status and (when complete) get the results for a submitted document.
Parameters (URL Path):
- doc_id (string, required): Document ID.
Query Parameters:
- type (string, optional): Result type. Use "tree" for tree structure or "ocr" for OCR results. If not specified, returns the default result type based on the original processing type.
- format (string, optional): For OCR results, specify output format. Use "page" (default) for page-based results, "node" for node-based results, or "raw" for concatenated markdown.
- summary (boolean, optional): For tree results, include a summary for each node in the response. Default is false.
Example - Get OCR Results:
import requests
api_key = "YOUR_API_KEY"
doc_id = "pi-abc123def456"
# Get OCR results (default, in page format)
response = requests.get(
    f"https://api.pageindex.ai/doc/{doc_id}/?type=ocr",
    headers={"api_key": api_key}
)
# Get OCR results in node format
response = requests.get(
    f"https://api.pageindex.ai/doc/{doc_id}/?type=ocr&format=node",
    headers={"api_key": api_key}
)
# Get OCR results in raw format (concatenated markdown)
response = requests.get(
    f"https://api.pageindex.ai/doc/{doc_id}/?type=ocr&format=raw",
    headers={"api_key": api_key}
)
Example - Get Tree Structure:
import requests
api_key = "YOUR_API_KEY"
doc_id = "pi-abc123def456"
response = requests.get(
    f"https://api.pageindex.ai/doc/{doc_id}/?type=tree",
    headers={"api_key": api_key}
)
Example Response (Tree Processing):
{
  "doc_id": "pi-abc123def456",
  "status": "processing",
  "retrieval_ready": false
}
Example Response (Tree Completed):
{
  "doc_id": "pi-abc123def456",
  "status": "completed",
  "retrieval_ready": true,
  "result": [
    ...
    {
      "title": "Financial Stability",
      "node_id": "0006",
      "page_index": 21,
      "text": "The Federal Reserve maintains financial stability through comprehensive monitoring and regulatory oversight...",
      "nodes": [
        {
          "title": "Monitoring Financial Vulnerabilities",
          "node_id": "0007",
          "page_index": 22,
          "text": "The Federal Reserve's monitoring focuses on identifying and assessing potential risks..."
        },
        {
          "title": "Domestic and International Cooperation and Coordination",
          "node_id": "0008",
          "page_index": 28,
          "text": "In 2023, the Federal Reserve collaborated internationally with central banks and regulatory authorities..."
        }
      ]
    }
    ...
  ]
}
Notes:
- For tree generation: The "result" field contains the hierarchical tree structure.
- For OCR processing: The "result" field format depends on the format parameter:
  - "page" (default): List of page objects, each containing page_index, markdown, and images
  - "node": List of node objects, organized by document structure
  - "raw": Single string containing all markdown content concatenated together
- page_index is 1-based (the first page is 1).
- markdown contains the recognized text in markdown format.
- images is a list of base64-encoded images detected on that page; it may be empty.
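Because the status endpoint reports "processing" until results are ready, a client usually has to poll it. The following is a minimal sketch of such a loop, not an official client: the function name wait_until_completed, its interval and timeout defaults, and the injected fetch_status callable are assumptions made for illustration. Only the "completed" status value is taken from the responses above; any other status is simply treated as "not ready yet".

```python
import time


def wait_until_completed(fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll until the document status is "completed" and return the final body.

    fetch_status: zero-argument callable returning the parsed JSON body of
                  GET /doc/{doc_id}/ as a dict (hypothetical helper, injected
                  so the loop can be exercised without network access).
    interval:     seconds to wait between polls (assumed default).
    timeout:      maximum total seconds to wait (assumed default).
    """
    deadline = time.monotonic() + timeout
    while True:
        body = fetch_status()
        if body.get("status") == "completed":
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError("document was not completed before the timeout")
        sleep(interval)
```

With requests, fetch_status could be a small lambda that calls requests.get on https://api.pageindex.ai/doc/{doc_id}/?type=tree with the api_key header and returns .json(). Passing the callable in keeps the polling logic independent of the HTTP library.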
Get Document Metadata
- Endpoint: GET https://api.pageindex.ai/doc/{doc_id}/metadata
- Description: Retrieve document metadata including processing status, page count, and creation time.
Parameters (URL Path):
- doc_id (string, required): Document ID.
Example:
import requests
api_key = "YOUR_API_KEY"
doc_id = "pi-abc123def456"
response = requests.get(
    f"https://api.pageindex.ai/doc/{doc_id}/metadata",
    headers={"api_key": api_key}
)
Example Response:
{
  "id": "pi-abc123def456",
  "name": "research_paper.pdf",
  "description": "Machine Learning Research Paper",
  "status": "completed",
  "createdAt": "2024-01-15T10:30:00.000Z",
  "pageNum": 42
}
List Documents
- Endpoint: GET https://api.pageindex.ai/docs
- Description: Retrieve a paginated list of all documents, ordered by creation date (newest first).
Query Parameters:
- limit (int, optional): Maximum number of documents to return (1-100). Default: 50.
- offset (int, optional): Number of documents to skip for pagination. Default: 0.
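The limit and offset parameters, together with the total field returned in the response, are enough to page through every document. The sketch below shows one way to do that. The name iter_documents and the injected fetch_page callable are hypothetical, and the loop assumes the response shape documented for this endpoint (documents, total).

```python
def iter_documents(fetch_page, page_size=50):
    """Yield every document by walking limit/offset pages.

    fetch_page(limit, offset) is a hypothetical callable that returns the
    parsed JSON body of GET /docs as a dict.
    """
    if not 1 <= page_size <= 100:
        raise ValueError("page_size must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        documents = page.get("documents", [])
        yield from documents
        offset += len(documents)
        total = page.get("total")
        # Stop on an empty page, or once "total" documents have been seen.
        if not documents or (total is not None and offset >= total):
            break
```

With requests, fetch_page could wrap requests.get("https://api.pageindex.ai/docs", headers={"api_key": api_key}, params={"limit": limit, "offset": offset}).json(). Writing it as a generator means callers can stop early without fetching the remaining pages.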
Example:
import requests
api_key = "YOUR_API_KEY"
response = requests.get(
    "https://api.pageindex.ai/docs",
    headers={"api_key": api_key},
    params={"limit": 10, "offset": 0}
)
Example Response:
{
  "documents": [
    {
      "id": "pi-abc123def456",
      "name": "research_paper.pdf",
      "description": "Machine Learning Research Paper",
      "status": "completed",
      "createdAt": "2024-01-15T10:30:00.000Z",
      "pageNum": 42
    }
  ],
  "total": 25,
  "limit": 10,
  "offset": 0
}
Delete a PageIndex Document
- Endpoint: DELETE https://api.pageindex.ai/doc/{doc_id}/
- Description: Permanently delete a PageIndex document and all its associated data.
Parameters (URL Path):
- doc_id (string, required): Document ID.
Example:
import requests
api_key = "YOUR_API_KEY"
doc_id = "pi-abc123def456"
response = requests.delete(
    f"https://api.pageindex.ai/doc/{doc_id}/",
    headers={"api_key": api_key}
)
💭 PageIndex Chat API (beta)
Overview
The PageIndex Chat API (beta) provides conversational AI with integrated access to your PageIndex documents.
Endpoint: POST https://api.pageindex.ai/chat/completions
Authentication
Include your PageIndex API key in the request header:
api_key: YOUR_PAGEINDEX_API_KEY
Request
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| messages | Array | Yes | - | Conversation messages |
| stream | Boolean | No | false | Enable streaming |
| doc_id | String or Array | No | null | The ID(s) of document(s) to select |
| temperature | Float | No | null | Sampling temperature (0.0 to 1.0). Lower is more deterministic. |
| enable_citations | Boolean | No | false | Enable inline citations in responses (e.g., <doc=file.pdf;page=1>) |
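When enable_citations is on, the response text carries inline markers that a client may want to separate from the prose. The sketch below is one possible way to do so. The regular expression is an assumption inferred only from the single example <doc=file.pdf;page=1> in the table, and extract_citations is a hypothetical helper name, so the pattern may need adjusting against real responses.

```python
import re

# Assumed marker format, inferred from the example <doc=file.pdf;page=1>.
CITATION_RE = re.compile(r"<doc=([^;>]+);page=(\d+)>")


def extract_citations(text):
    """Return (clean_text, citations), where citations is a list of (file, page)."""
    citations = [(m.group(1), int(m.group(2))) for m in CITATION_RE.finditer(text)]
    clean_text = CITATION_RE.sub("", text)
    return clean_text, citations
```

For example, calling extract_citations on a response's message content yields the text without markers plus a list such as [("file.pdf", 1)] that can be rendered as footnotes.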
Example Request Body
{
  "messages": [
    {
      "role": "user",
      "content": "What are the key findings of the first paper?"
    }
  ],
  "stream": false
}
Example Request with Document ID
When you include a doc_id, your query is scoped to that specific document. You can pass a single document ID as a string, or multiple IDs as an array.
Single Document ID:
{
  "doc_id": "pi-123456",
  "messages": [
    {
      "role": "user",
      "content": "What are the key findings of this document?"
    }
  ],
  "stream": false
}
Multiple Document IDs:
{
  "doc_id": ["pi-123456", "pi-789012"],
  "messages": [
    {
      "role": "user",
      "content": "Compare these documents"
    }
  ],
  "stream": false
}
Response
Non-Streaming Response
{
  "id": "chat_completion_id",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The key findings are..."
      },
      "finish_reason": "end_turn"
    }
  ],
  "usage": {
    "prompt_tokens": 1234,
    "completion_tokens": 567,
    "total_tokens": 1801
  }
}
Streaming Response
Server-Sent Events (SSE) format:
data: {"choices":[{"delta":{"content":"The"}}]}
data: {"choices":[{"delta":{"content":" key"}}]}
data: {"choices":[{"delta":{"content":" findings"}}]}
...
data: [DONE]
Python Examples
Non-Streaming
import requests
response = requests.post(
    "https://api.pageindex.ai/chat/completions",
    headers={
        "api_key": "your-pageindex-api-key",
        "Content-Type": "application/json"
    },
    json={
        "messages": [
            {"role": "user", "content": "What is the first paper about?"}
        ]
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
Streaming with Document ID
import requests
import json
response = requests.post(
    "https://api.pageindex.ai/chat/completions",
    headers={
        "api_key": "your-pageindex-api-key",
        "Content-Type": "application/json"
    },
    json={
        "doc_id": "pi-123456",
        "messages": [
            {"role": "user", "content": "What are the key findings of this document?"}
        ],
        "stream": True,
    },
    stream=True
)
# Each SSE line looks like "data: {...}"; the stream ends with "data: [DONE]"
for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            data = line[6:]
            if data == '[DONE]':
                break
            chunk = json.loads(data)
            content = chunk.get("choices", [{}])[0].get("delta", {}).get("content", "")
            if content:
                print(content, end='', flush=True)
Accessing Intermediate Data in Streaming
Track which documents and tools are being accessed in real time:
import requests
import json
response = requests.post(
    "https://api.pageindex.ai/chat/completions",
    headers={
        "api_key": "your-pageindex-api-key",
        "Content-Type": "application/json"
    },
    json={
        "messages": [
            {"role": "user", "content": "What is the first paper about?"}
        ],
        "stream": True
    },
    stream=True
)
for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            data = line[6:]
            if data == '[DONE]':
                break
            chunk = json.loads(data)
            # Get intermediate metadata
            metadata = chunk.get("block_metadata", {})
            if metadata:
                block_type = metadata.get("type")
                block_index = metadata.get("block_index")
                # Tool call started
                if block_type == "mcp_tool_use_start":
                    tool_name = metadata.get("tool_name")
                    server_name = metadata.get("server_name")
                    print(f"\n[Block #{block_index}: Calling {tool_name}]\n")
                # Tool result received
                elif block_type == "mcp_tool_result_start":
                    print(f"\n[Block #{block_index}: Tool result received]\n")
            # Get content
            content = chunk.get("choices", [{}])[0].get("delta", {}).get("content", "")
            if content:
                print(content, end='', flush=True)
Available block types:
- text_block_start / text_stop: Text content
- mcp_tool_use_start / mcp_tool_use_stop: PageIndex tool being called
- mcp_tool_result_start / mcp_tool_result_stop: PageIndex tool results received
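The streaming loops above mix transport and parsing. As a sketch of how the parsing could be factored out and tested offline, the helpers below turn an iterable of SSE lines into parsed chunks. The names iter_sse_chunks and delta_text are hypothetical, and the code assumes only the "data: ..." framing and the [DONE] terminator shown in the streaming response example.

```python
import json


def iter_sse_chunks(lines):
    """Yield one parsed JSON object per "data: ..." line until [DONE]."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data: "):
            continue  # skip blank separator lines and anything unrecognized
        data = line[len("data: "):]
        if data == "[DONE]":
            return
        yield json.loads(data)


def delta_text(chunk):
    """Return the text delta of a chunk, or "" if it carries none."""
    choices = chunk.get("choices") or [{}]
    return choices[0].get("delta", {}).get("content", "")
```

In a live call, response.iter_lines() from requests can be passed straight to iter_sse_chunks, and each chunk's block_metadata can be inspected alongside delta_text(chunk).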
Error Responses
401 Unauthorized
{"detail": "API key not found in request headers"}
500 Internal Server Error
{"detail": "Error message"}
📝 Markdown Processing API
Convert markdown files to PageIndex tree structures without PDF conversion.
Convert Markdown to Tree Structure
- Endpoint: POST https://api.pageindex.ai/markdown/
- Description: Upload a markdown file and convert it directly to a hierarchical tree structure. This endpoint extracts the document structure based on markdown headers (#, ##, ###, etc.) and optionally applies tree thinning and generates summaries.
Request Body (multipart/form-data):
Required Parameters:
- file (binary, required): Markdown document (.md or .markdown files).
Optional Parameters:
- if_add_node_id (string, optional): Whether to add node IDs. Options: "yes" or "no". Default: "yes".
- if_add_node_summary (string, optional): Whether to add node summaries. Options: "yes" or "no". Default: "yes".
- if_add_node_text (string, optional): Whether to include node text content. Options: "yes" or "no". Default: "yes".
- if_add_doc_description (string, optional): Whether to add document description. Options: "yes" or "no". Default: "no".
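Since the optional parameters take the strings "yes" and "no" rather than booleans, a small mapping helper can reduce mistakes. This is only an illustrative sketch: markdown_options and its keyword names are hypothetical, while the field names and defaults are the ones listed above.

```python
def markdown_options(node_id=True, node_summary=True, node_text=True, doc_description=False):
    """Map Python booleans to the "yes"/"no" form fields listed above."""
    flags = {
        "if_add_node_id": node_id,
        "if_add_node_summary": node_summary,
        "if_add_node_text": node_text,
        "if_add_doc_description": doc_description,
    }
    return {name: "yes" if enabled else "no" for name, enabled in flags.items()}
```

With requests, the resulting dict would presumably be passed as data= alongside files={"file": file}, so that the options travel as ordinary fields of the same multipart/form-data body.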
Example:
import requests
api_key = "YOUR_API_KEY"
with open("./README.md", "rb") as file:
    response = requests.post(
        "https://api.pageindex.ai/markdown/",
        headers={"api_key": api_key},
        files={"file": file}
    )
result = response.json()
Example Response:
{
  "success": true,
  "doc_name": "README",
  "structure": [
    {
      "title": "Getting Started",
      "node_id": "0000",
      "summary": "Introduction and setup guide for the API...",
      "line_num": 1,
      "nodes": [
        {
          "title": "Installation",
          "node_id": "0001",
          "summary": "Installation instructions using pip...",
          "line_num": 5
        },
        {
          "title": "Authentication",
          "node_id": "0002",
          "summary": "How to authenticate with the API...",
          "line_num": 10
        }
      ]
    }
  ]
}
Notes:
- Tree thinning can be applied to merge small nodes with their children when their token count is below the threshold.
- Node summaries and document descriptions are generated using the specified LLM.
- The line_num field indicates the starting line number of each section in the original markdown file.
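The structure field is a recursive list in which each node may carry its own nodes list. A short recursive walk, sketched below, turns it into an indented outline. The function name outline and the output format are illustrative choices, and the field names are those of the example response.

```python
def outline(nodes, depth=0):
    """Return indented outline lines for a list of tree nodes."""
    lines = []
    for node in nodes:
        indent = "  " * depth
        node_id = node.get("node_id", "?")  # absent if if_add_node_id was "no"
        lines.append(f"{indent}[{node_id}] {node.get('title', '')} (line {node.get('line_num')})")
        # Recurse into child nodes, one level deeper.
        lines.extend(outline(node.get("nodes", []), depth + 1))
    return lines
```

Calling print("\n".join(outline(result["structure"]))) on the parsed response prints one line per section, indented by nesting depth.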
🔍 PageIndex Retrieval API (Legacy)
The retrieval API is legacy and remains available for backward compatibility.
For most use cases, we recommend using the Chat API instead. We are also working on a new agentic retrieval API; see the agentic retrieval notebook for a minimal preview.
Retrieve from a PageIndex Document
- Endpoint: POST https://api.pageindex.ai/retrieval/
- Description: Submit a query to create a retrieval task for a specific PageIndex document. It returns a retrieval task ID.
Before Retrieval
Before submitting a retrieval query, you should check if the document is ready for retrieval by checking the retrieval_ready field in the tree endpoint response:
import requests
api_key = "YOUR_API_KEY"
doc_id = "pi-abc123def456"
# Check if document is ready for retrieval
tree_response = requests.get(
    f"https://api.pageindex.ai/doc/{doc_id}/?type=tree",
    headers={"api_key": api_key}
)
retrieval_ready = tree_response.json().get("retrieval_ready")
Parameters (in JSON body):
- doc_id (string, required): The PageIndex document ID to retrieve from.
- query (string, required): The user question or information need.
- thinking (boolean, optional): If set to true, the model first plans what information is required before performing retrieval, which helps gather more comprehensive and relevant information. Default is false.
Example:
import requests
api_key = "YOUR_API_KEY"
payload = {
    "doc_id": "pi-abc123def456",
    "query": "What are the main sources of revenue?",
    "thinking": False
}
response = requests.post(
    "https://api.pageindex.ai/retrieval/",
    headers={"api_key": api_key},
    json=payload
)
Example Response:
{
  "retrieval_id": "xyz789ghi012"
}
Get Retrieval Status & Results
- Endpoint: GET https://api.pageindex.ai/retrieval/{retrieval_id}/
- Description: Get the status and, when ready, the result for a specific retrieval query.
Parameters (URL Path):
- retrieval_id (string, required)
Example:
import requests
api_key = "YOUR_API_KEY"
retrieval_id = "xyz789ghi012"
response = requests.get(
    f"https://api.pageindex.ai/retrieval/{retrieval_id}/",
    headers={"api_key": api_key}
)
Example Response (Processing):
{
  "retrieval_id": "xyz789ghi012",
  "status": "processing"
}
Example Response (Completed):
{
  "retrieval_id": "xyz789ghi012",
  "doc_id": "pi-abc123def456",
  "status": "completed",
  "query": "What are the recent trends in the labor market?",
  "retrieved_nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [
        {
          "page_index": 10,
          "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
        }
      ]
    }
  ]
}
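A completed retrieval nests the matched passages two levels deep, under retrieved_nodes and then relevant_contents. As a sketch of how a client might flatten that into rows for display or prompt building, consider the helper below. The name flatten_retrieval and the output keys are hypothetical, and the input field names are taken from the completed example response.

```python
def flatten_retrieval(result):
    """Return one row per relevant passage in a completed retrieval result."""
    rows = []
    for node in result.get("retrieved_nodes", []):
        for item in node.get("relevant_contents", []):
            rows.append({
                "title": node.get("title"),
                "node_id": node.get("node_id"),
                "page_index": item.get("page_index"),
                "text": item.get("relevant_content"),
            })
    return rows
```

A result that is still processing has no retrieved_nodes key, so the helper returns an empty list in that case rather than raising.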