🚀 Quickstart: PageIndex SDK
To get started, obtain your 🔑 API key from the Dashboard.
The SDK consists of three main components:
- PageIndex OCR: Transform PDFs into structure-preserving Markdown.
- PageIndex Tree Generation: Generate a PageIndex tree index of the document.
- PageIndex Retrieval: Submit a query to retrieve relevant content from a document.
Below is a brief introduction to using the Python SDK, including example code for common operations.
Install the SDK
pip install pageindex
Initialize the Client
from pageindex import PageIndexClient
pi_client = PageIndexClient(api_key="YOUR_API_KEY")
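To keep the key out of source control, it can be read from the environment instead of being hard-coded. The following is a minimal sketch, not part of the SDK: the helper `load_api_key` and the variable name `PAGEINDEX_API_KEY` are hypothetical names chosen for illustration.

```python
import os


def load_api_key(env_var="PAGEINDEX_API_KEY"):
    """Return the API key stored in the given environment variable.

    Both this helper and the default variable name are illustrative
    assumptions; the SDK itself only needs the key string.
    """
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (requires the pageindex package and a real key):
#   from pageindex import PageIndexClient
#   pi_client = PageIndexClient(api_key=load_api_key())
```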
Submit a document
result = pi_client.submit_document("./2023-annual-report.pdf")
doc_id = result["doc_id"]
📑 PageIndex OCR
Check OCR processing status and get the OCR result.
ocr_result = pi_client.get_ocr(doc_id)
if ocr_result.get("status") == "completed":
    print("OCR Result:", ocr_result.get("result"))
🌲 PageIndex Tree Generation
Check tree generation status and get the tree structure.
tree_result = pi_client.get_tree(doc_id)
if tree_result.get("status") == "completed":
    print("PageIndex Tree Structure:", tree_result.get("result"))
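The OCR and tree snippets above check the status once, so a document that is still processing prints nothing. One way to handle this is to poll until the status becomes "completed". The sketch below is an assumption-laden helper, not an SDK function: `wait_until_completed`, its `interval` and `timeout` parameters, and their defaults are hypothetical. It relies only on the `status` field shown in the examples above.

```python
import time


def wait_until_completed(fetch, interval=2.0, timeout=300.0):
    """Call fetch() repeatedly until its result reports status "completed".

    fetch is any zero-argument callable returning a dict with a "status"
    key, for example: lambda: pi_client.get_tree(doc_id).
    Raises TimeoutError if the status is not "completed" within timeout
    seconds. This helper is hypothetical and not part of the SDK.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result.get("status") == "completed":
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("Processing did not complete in time.")
        time.sleep(interval)


# Usage with the client from this quickstart:
#   ocr_result = wait_until_completed(lambda: pi_client.get_ocr(doc_id))
#   tree_result = wait_until_completed(lambda: pi_client.get_tree(doc_id))
```

Passing a callable keeps the helper independent of any particular endpoint, so the same loop serves both the OCR and tree checks.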
🔎 PageIndex Retrieval
Currently, only single-document retrieval is supported. Multi-document retrieval is coming soon. See also the Doc Search page for document search examples.
Submit a retrieval query. Retrieval requires that PageIndex tree generation has completed for the document.
if pi_client.is_retrieval_ready(doc_id):
    retrieval = pi_client.submit_query(doc_id, "What are the main risk factors?")
    retrieval_id = retrieval["retrieval_id"]
Check retrieval status and get the retrieval result.
retrieval_result = pi_client.get_retrieval(retrieval_id)
if retrieval_result.get("status") == "completed":
    print("Retrieved Content:", retrieval_result.get("retrieved_nodes"))
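The two retrieval steps above can be combined into a single function that also covers the case where the document is not yet ready, in which case no `retrieval_id` exists. This is a rough sketch under stated assumptions: the function `retrieve` and its `interval` and `max_attempts` parameters are hypothetical, and it uses only the client methods and response fields that appear in this quickstart.

```python
import time


def retrieve(client, doc_id, query, interval=2.0, max_attempts=30):
    """Submit a query and return the retrieved nodes, or None if not ready.

    client is expected to provide is_retrieval_ready, submit_query, and
    get_retrieval as used in this quickstart. The wrapper itself is a
    hypothetical convenience, not part of the SDK.
    """
    if not client.is_retrieval_ready(doc_id):
        # Tree generation has not completed, so no query can be submitted.
        return None

    retrieval_id = client.submit_query(doc_id, query)["retrieval_id"]

    # Check the retrieval status a bounded number of times.
    for _ in range(max_attempts):
        result = client.get_retrieval(retrieval_id)
        if result.get("status") == "completed":
            return result.get("retrieved_nodes")
        time.sleep(interval)

    raise TimeoutError(f"Retrieval {retrieval_id} did not complete.")


# Usage with the client from this quickstart:
#   nodes = retrieve(pi_client, doc_id, "What are the main risk factors?")
```

Returning `None` for a document that is not ready lets the caller decide whether to wait and retry or to report the state to the user.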
👉 See the full SDK Reference for optional parameters and more examples.