PageIndex

Ideas Behind PageIndex

A selection of posts from the PageIndex blog: technical deep-dives, research, insights, and more.


The PageIndex framework, which uses LLMs to reason over a document’s tree index for retrieval instead of relying on static vector similarity.

Scaling PageIndex’s vectorless retrieval to millions of documents by unifying document trees with file system hierarchies and query trees.

Why vector retrieval can’t condition on full context — a fundamental limitation of vector RAG systems.

Why technical manuals break conventional RAG, and how PageIndex’s reasoning-based retrieval solves their unique challenges.

The same bet behind both PageIndex and Claude Code — skip the vector DB and let the LLM itself drive retrieval.

Rethinking the OCR pipeline from an information-theoretic view, and when a direct vision-based approach wins.

The first long-context OCR model, preserving the global structure and section hierarchy of documents.

State-of-the-art 98.7% accuracy on FinanceBench, powered by PageIndex’s vectorless, reasoning-based RAG engine.