Kreuzberg is a Python library for text extraction from documents. It provides a unified interface for extracting text from PDFs, images, office documents, and more, with both async and sync APIs. It tries to just work, without complex configuration. No external API calls or cloud dependencies are required, and processing is lightweight, with no GPU requirement. Comprehensive support for document, image, and text formats. OCR support via Tesseract, EasyOCR, and PaddleOCR.
A collaborative note-taking, wiki and documentation platform that scales. Built with Django and React. Open source alternative to Notion or Outline. Works offline; write locally and it'll re-sync when you come back. Tries to concentrate on clean documents, not lots of formatting. Optimized for multi-user collaboration. Granular access controls. Can export to the usual document formats.
Hard requirements: Kubernetes, Postgres, memcached, an S3-compatible bucket for storage, and an OIDC provider for authentication. Heavy enough that I'd call it enterprisey.
An end-to-end encrypted collaborative office suite. Multiple users can work on the same document at the same time. Everything is encrypted end-to-end, including on disk. There's a full suite of applications: a rich text editor, spreadsheet, IDE, kanban, presentation slide editor, whiteboard, and buildable forms. Use theirs or stand up your own instance.
AMB stands for "Ancient Machine Book". It is an extremely lightweight file format meant to store any kind of hypertext documentation that may be comfortably viewed even on the most ancient PCs: technical manuals, books, etc. Think of it as a retro equivalent of a *.CHM help file.
This web page holds the format specification, as well as reference tools to work with the format: AMB (the reader) and AMBPACK (archive packer/unpacker). All tools are published under the terms of the MIT license.
A dead simple way of OCR-ing a document for AI ingestion. Documents are meant to be a visual representation after all, with weird layouts, tables, charts, etc. Vision models just make sense! Uses gpt-4o-mini to look at the page images and figure out what's in them, but so far it doesn't seem to support self-hosted models.
The general logic:
Doclytics is a straightforward Rust-based tool that integrates with the paperless-ngx API to fetch and update document metadata. It primarily leverages a local language model, served through ollama, to extract and generate metadata for documents stored in a Paperless document library. The tool uses reqwest for making HTTP requests and serde_json for handling JSON data when talking to the Paperless API.
By interfacing directly with ollama, Doclytics automates the extraction of specified metadata from documents, utilizing the local LLM's capabilities to analyze document content and produce the required metadata in a JSON format. This metadata is then used to update the respective documents in the Paperless library, aiming to improve document organization and retrievability without overly complex processes or configurations.
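The loop described above (send document text to the local model, get JSON back, update the document) can be sketched roughly as follows. This is a Python illustration, not Doclytics' own Rust code. It assumes Ollama's `/api/generate` endpoint and paperless-ngx's `/api/documents/{id}/` endpoint with token authentication; the host addresses, the model name, the prompt wording and the `title`/`correspondent` keys are all hypothetical choices made for the example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local Ollama address
PAPERLESS_URL = "http://localhost:8000"             # hypothetical paperless-ngx address


def build_ollama_request(content: str, model: str = "llama2") -> dict:
    """Build the JSON body for a non-streaming Ollama generate call."""
    prompt = (
        "Extract a title and a correspondent from the document below. "
        "Answer with a JSON object with the keys 'title' and 'correspondent'.\n\n"
        + content
    )
    # "format": "json" asks Ollama to constrain the reply to valid JSON.
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_metadata(ollama_reply: dict) -> dict:
    """Pull the generated JSON object out of the reply; return {} if unusable."""
    try:
        data = json.loads(ollama_reply.get("response", ""))
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}


def build_patch_request(doc_id: int, metadata: dict, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a PATCH that updates the document's title."""
    body = json.dumps({"title": metadata["title"]}).encode()
    return urllib.request.Request(
        f"{PAPERLESS_URL}/api/documents/{doc_id}/",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
    )
```

Only the title is written back here because paperless-ngx stores correspondents as references to separate records, so a real tool would have to look up or create the correspondent first. Sending the prepared request is a matter of passing it to `urllib.request.urlopen`.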
A Python LLM chat app using Django Async and LLAMA2 that allows you to chat with multiple PDF documents. Components are chosen so everything can be self-hosted.
The project uses LLAMA2 hosted via Replicate; however, you can self-host your own LLAMA2 instance.
Formerly The Bell System Practices (BSP) Archive.
An indexed collection of 36832 telecom and related documents totaling more than 2 million pages.
How to use cups-virtual-printer to print stuff directly to paperless-ngx to save some hassle.
Your AI second brain. A copilot to search and chat (using RAG) with your knowledge base (PDF, Markdown, org). Use powerful, online (e.g. gpt4) or private, offline (e.g. mistral) LLMs. Self-host locally or have it always accessible in the cloud. Access from Obsidian, Emacs, desktop app, web or WhatsApp.
Khoj is an AI application to search and chat with your notes and documents. It is open source, self-hostable and accessible on desktop, Emacs, Obsidian, web and WhatsApp. It works with PDF, Markdown, org-mode, Notion files and GitHub repositories. It can paint, search the internet and understand speech.
The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. Its modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient at transforming unstructured data into structured outputs.
There is also an API built around this module.
Defcon 1-29. Video, audio, papers, pictures (lots of pictures), filler material, music and programs.
1.8 TB in size. Good luck.
Library of Alexandria (LoA for short) is a project that aims to collect and archive documents from the internet.
In our modern age, new text documents are born in the blink of an eye, then (often just as quickly) disappear from the internet. We find it a noble task to save these documents for future generations.
This project aims to support this noble goal in a scalable way. We want to make archiving streamlined and easy to do even at a huge (terabyte/petabyte) scale. This way we hope that more and more people can start their own collections, helping the archiving effort.
A collection of links to threat models for various pieces of software and protocols.
Our simple tool allows anyone to generate a public records request with all the necessary legal boilerplate, all for free. Use your FOIA Machine account to track the progress of your requests, all from one place. Access an extensive database of jurisdictions and government agencies to find out where, and how, to send your request.
Powered by MuckRock.
A flatbed document and book scanner. Will also scan 3D objects that fit under the camera. Minimum of 13MP image resolution (4160 x 3120), can handle up to A3 size documents. Maximum document thickness: 10mm. The scanner camera's height above the document is adjustable. As fast as one second per scan. Portable: can be folded up for transportation. Can detect when you turn the page or change the document, look for the new page, and automatically take the next image. ABBYY OCR functionality built in. Scans to Word documents, PDF, Excel spreadsheets, or TIFF image files. Software for Windows (back to XP) and OS X.
Shows up as a UVC device under Linux, so any image or video capture software that is UVC-enabled can do the work for you.
Populus-Viewer is a tool for decentralized social annotation, built on pdfjs, wavesurfer.js and the Matrix protocol. You can use it to read PDFs, listen to audio, or watch videos, and have rich discussions in the margins with your friends, classmates, or scholarly collaborators.
Each uploaded file is attached to a Matrix space, and each annotation to the file becomes a room within that space. Populus-Viewer has been tested with Synapse and Dendrite, but should be compatible with any spec-compliant Matrix server.
Changelog:
April 18, 2022: Twenty-eight documents relating to the 2020 presidential election, Donald Trump, and the Jan. 6 Capitol riot were published.
Acrossword is a small async wrapper around the SentenceBERT library. It has a convenient object-oriented API with two main purposes:
semantic search
zero-shot text classification
It's useful if you want to avoid larger, bloated libraries with capabilities you don't need, and it comes with zero fuss.
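Both purposes come down to comparing embeddings by similarity. The sketch below is not Acrossword's API: it is a self-contained toy in which a bag-of-words count vector (the hypothetical `embed` function) stands in for a SentenceBERT embedding, so it only matches texts that share words. It is meant to show the shape of the two operations, nothing more.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Semantic search: rank documents by similarity to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def classify(text: str, labels: list[str]) -> str:
    """Zero-shot classification: pick the label most similar to the text."""
    t = embed(text)
    return max(labels, key=lambda label: cosine(t, embed(label)))
```

With a real sentence-embedding model in place of `embed`, the same two functions would also match texts and labels that are related in meaning without sharing any words.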