Kreuzberg is a Python library for text extraction from documents. It provides a unified interface for extracting text from PDFs, images, office documents, and more, with both async and sync APIs. It aims to work out of the box, without complex configuration. Processing is local and lightweight: no external API calls or cloud dependencies are required, and no GPU is needed. It supports a broad range of document, image, and text formats, with OCR available through Tesseract, EasyOCR, and PaddleOCR.
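As a rough illustration of the async and sync APIs mentioned above, the sketch below wraps both styles. It assumes the `kreuzberg` package is installed and that it exposes `extract_file` (async) and `extract_file_sync`, each returning a result with a `content` attribute; check the library's own documentation for the exact names in your version. The wrapper names `extract_text_sync` and `extract_text_async` are hypothetical.

```python
def extract_text_sync(path):
    """Blocking extraction: return the text content of the file at `path`."""
    # Imported lazily so this module loads even where kreuzberg is absent.
    # Assumption: kreuzberg exposes extract_file_sync with a `.content` result.
    from kreuzberg import extract_file_sync

    return extract_file_sync(path).content


async def extract_text_async(path):
    """Non-blocking extraction, for use inside an event loop."""
    # Assumption: kreuzberg exposes an awaitable extract_file.
    from kreuzberg import extract_file

    result = await extract_file(path)
    return result.content


# Possible usage (the file name is a placeholder):
#   text = extract_text_sync("report.pdf")
#   text = asyncio.run(extract_text_async("report.pdf"))
```

The sync variant suits scripts and notebooks; the async variant is likely the better fit when many files are extracted concurrently.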
Utility code that can extract an AppleDouble file's contents and extract the individual resources from its resource fork segment.
This is useful if you have stored Mac files on FAT32-formatted floppy disks or in Mac OS X ZIP archives and want to (further) extract the data from them on a non-Mac operating system.
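To make the two extraction steps concrete, here is a minimal sketch in Python. It is not this utility's own code. It follows the published AppleDouble layout (magic number `0x00051607`, a 26-byte header, then 12-byte entry descriptors, with entry ID 2 holding the resource fork) and the classic resource fork layout (16-byte header, resource map, type list, reference list). The function names `parse_appledouble` and `parse_resource_fork` are hypothetical.

```python
import struct

APPLEDOUBLE_MAGIC = 0x00051607
RESOURCE_FORK_ID = 2  # AppleDouble entry ID of the resource fork


def parse_appledouble(blob):
    """Return {entry_id: bytes} for every entry in an AppleDouble file."""
    if len(blob) < 26:
        raise ValueError("too short to be an AppleDouble file")
    # Header: magic (4), version (4), 16 filler bytes, entry count (2).
    magic, _version, count = struct.unpack_from(">II16xH", blob, 0)
    if magic != APPLEDOUBLE_MAGIC:
        raise ValueError("not an AppleDouble file")
    entries = {}
    for i in range(count):
        # Each descriptor: entry ID, offset from file start, length.
        entry_id, offset, length = struct.unpack_from(">III", blob, 26 + 12 * i)
        if offset + length > len(blob):
            raise ValueError("entry extends past end of file")
        entries[entry_id] = blob[offset:offset + length]
    return entries


def parse_resource_fork(fork):
    """Return a list of (type, id, data) tuples from a resource fork."""
    data_off, map_off, _data_len, _map_len = struct.unpack_from(">IIII", fork, 0)
    # The type list and name list offsets sit 24 bytes into the resource map.
    type_list_off, _name_list_off = struct.unpack_from(">HH", fork, map_off + 24)
    type_list = map_off + type_list_off
    (raw_count,) = struct.unpack_from(">H", fork, type_list)
    num_types = (raw_count + 1) & 0xFFFF  # stored as count - 1; 0xFFFF means none
    resources = []
    for t in range(num_types):
        res_type, count_minus_1, ref_off = struct.unpack_from(
            ">4sHH", fork, type_list + 2 + 8 * t
        )
        for r in range(count_minus_1 + 1):
            # Reference entry: ID, name offset, 1 attribute byte + 3-byte data offset.
            res_id, _name_off, attrs_and_off = struct.unpack_from(
                ">hhI", fork, type_list + ref_off + 12 * r
            )
            start = data_off + (attrs_and_off & 0xFFFFFF)
            # Each resource is stored as a 4-byte length followed by its bytes.
            (length,) = struct.unpack_from(">I", fork, start)
            resources.append((res_type, res_id, fork[start + 4:start + 4 + length]))
    return resources
```

A caller would typically read the `._name` sidecar file (or the `__MACOSX/` entry from a ZIP archive) as bytes, pass it to `parse_appledouble`, and hand the entry with ID 2 to `parse_resource_fork`. Resource names and compressed resources are ignored in this sketch.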
Textricator is a tool for extracting text from PDFs and generating structured data (CSV or JSON). It can even work on OCR'ed documents. Describe what the document's contents look like in a YAML file, and it will extract the data using those fields. It can also be used as a Java library.