🚀 Quickstart: PageIndex SDK

To get started, get your 🔑 API key from the Dashboard.

The SDK consists of three main components:

  • PageIndex OCR: Convert PDFs to structure-preserving Markdown.
  • PageIndex Tree Generation: Generate a PageIndex tree index of the document.
  • PageIndex Retrieval: Ask a query to retrieve relevant content from a document.

Below is a brief introduction to using the Python SDK, including example code for common operations.

Install the SDK

pip install pageindex

Initialize the Client

from pageindex import PageIndexClient

pi_client = PageIndexClient(api_key="YOUR_API_KEY")
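Hard-coding the key works for a quick test, but a common alternative is to read it from an environment variable so it stays out of source control. The sketch below assumes a variable named `PAGEINDEX_API_KEY`; both that name and the `load_api_key` helper are illustrative and not part of the SDK.

```python
import os


def load_api_key(env_var: str = "PAGEINDEX_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical convention, not something the SDK defines.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return api_key


# Usage with the client shown above (requires `pip install pageindex`):
# from pageindex import PageIndexClient
# pi_client = PageIndexClient(api_key=load_api_key())
```

Failing early with a clear message when the variable is missing tends to be easier to debug than an authentication error on the first request.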

Submit a document

result = pi_client.submit_document("./2023-annual-report.pdf")
doc_id = result["doc_id"]

📑 PageIndex OCR

Check OCR processing status and get the OCR result.

ocr_result = pi_client.get_ocr(doc_id)
if ocr_result.get("status") == "completed":
    print("OCR Result:", ocr_result.get("result"))
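Because processing is asynchronous, a single status check may run before the result is ready. One way to handle this is a small polling loop, sketched below. The `wait_until_completed` helper, its default interval, and its timeout are assumptions made for illustration, not part of the SDK. It treats any status other than `"completed"` as still in progress, so a failed job would only surface as a timeout.

```python
import time
from typing import Callable


def wait_until_completed(fetch: Callable[[], dict], interval: float = 2.0,
                         timeout: float = 300.0) -> dict:
    """Call fetch() repeatedly until its result reports status "completed".

    Raises TimeoutError if the status is still not "completed" once
    `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result.get("status") == "completed":
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("document was not processed before the timeout")
        time.sleep(interval)


# Usage with the client and doc_id from the snippets above:
# ocr_result = wait_until_completed(lambda: pi_client.get_ocr(doc_id))
# print("OCR Result:", ocr_result.get("result"))
```

Passing the status call as a function keeps the helper independent of the client, so the same loop could wrap `pi_client.get_tree(doc_id)` as well.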

🌲 PageIndex Tree Generation

Check tree generation status and get the tree structure.

tree_result = pi_client.get_tree(doc_id)
if tree_result.get("status") == "completed":
    print("PageIndex Tree Structure:", tree_result.get("result"))

🔎 PageIndex Retrieval

Currently, only single-document retrieval is supported. Multi-document retrieval is coming soon. See also the Doc Search page for document search examples.

Submit a retrieval query. Retrieval requires that PageIndex tree generation has completed for the document.

if pi_client.is_retrieval_ready(doc_id):
    retrieval = pi_client.submit_query(doc_id, "What are the main risk factors?")
    retrieval_id = retrieval["retrieval_id"]

Check retrieval status and get the retrieval result.

retrieval_result = pi_client.get_retrieval(retrieval_id)
if retrieval_result.get("status") == "completed":
    print("Retrieved Content:", retrieval_result.get("retrieved_nodes"))
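The retrieval steps above can be combined into one function that waits until the document is ready, submits the query, and then polls for the result. This is a minimal sketch: `ask_document`, its polling interval, and its attempt limit are hypothetical. It calls only the client methods shown in this quickstart and takes the client as a parameter.

```python
import time


def ask_document(client, doc_id: str, query: str,
                 interval: float = 2.0, max_attempts: int = 150):
    """Query a document once it is retrieval-ready and return the retrieved nodes.

    `client` is expected to be a PageIndexClient. Returns None if either
    wait runs out of attempts.
    """
    # Wait for tree generation to finish, since retrieval depends on it.
    for _ in range(max_attempts):
        if client.is_retrieval_ready(doc_id):
            break
        time.sleep(interval)
    else:
        return None

    retrieval_id = client.submit_query(doc_id, query)["retrieval_id"]

    # Poll the retrieval until it reports "completed".
    for _ in range(max_attempts):
        retrieval_result = client.get_retrieval(retrieval_id)
        if retrieval_result.get("status") == "completed":
            return retrieval_result.get("retrieved_nodes")
        time.sleep(interval)
    return None


# Usage with the client and doc_id from the snippets above:
# nodes = ask_document(pi_client, doc_id, "What are the main risk factors?")
# print("Retrieved Content:", nodes)
```

Bounding both loops by an attempt count keeps the call from blocking indefinitely if processing stalls.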

👉 See the full SDK Reference for optional parameters and more examples.
