pText is a pure Python library to read, write and manipulate PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries and primitives. It can extract and edit metadata, text and images, and add annotations.
Seems like it would be useful for a large-scale indexing effort.
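
To make the "nested lists, dictionaries and primitives" idea concrete, here is a minimal sketch of why that shape is convenient for indexing. It does not use pText's API: the `document` dict is a hand-written, hypothetical stand-in (key names loosely borrowed from the PDF specification), and `walk` is an illustrative helper, not something the library provides.

```python
from typing import Any, Iterator, Tuple

# Hypothetical stand-in for a parsed document; NOT produced by pText.
# Key names loosely follow the PDF specification (Info, Root, Pages, Kids).
document = {
    "Info": {"Title": "Annual Report", "Author": "J. Doe"},
    "Root": {
        "Type": "Catalog",
        "Pages": {
            "Type": "Pages",
            "Count": 2,
            "Kids": [
                {"Type": "Page", "MediaBox": [0, 0, 612, 792]},
                {"Type": "Page", "MediaBox": [0, 0, 612, 792]},
            ],
        },
    },
}


def walk(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every primitive in a nested dict/list structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from walk(value, f"{path}[{index}]")
    else:
        # Primitive (string, number, ...): this is what an index would store.
        yield path, node


# Flatten the tree into path -> value pairs, ready to feed into an index.
flat = dict(walk(document))
print(flat["/Info/Title"])
print(flat["/Root/Pages/Kids[1]/MediaBox[2]"])
```

Because everything is plain dicts and lists, one generic traversal covers metadata and page attributes alike. Real PDFs can contain back-references (for example a page pointing to its parent), so a walker over an actual parsed document would likely need a guard against revisiting nodes.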