Defcon 1-29. Video, audio, papers, pictures (lots of pictures), filler material, music and programs.
1.8 TB in size. Good luck.
Library of Alexandria (LoA for short) is a project that aims to collect and archive documents from the internet.
In our modern age, new text documents are born in the blink of an eye, then (often just as quickly) disappear from the internet. We find it a noble task to save these documents for future generations.
This project aims to support this noble goal in a scalable way. We want to make archiving streamlined and easy, even at a huge (terabyte / petabyte) scale. This way, we hope that more and more people can start their own collections and help the archiving effort.
A collection of links to threat models for various pieces of software and protocols.
Our simple tool allows anyone to generate a public records request with all the necessary legal boilerplate, all for free. Use your FOIA Machine account to track the progress of your requests, all from one place. Access an extensive database of jurisdictions and government agencies to find out where, and how, to send your request.
Powered by MuckRock.
A flatbed document and book scanner. Will also scan 3D objects that fit under the camera. Minimum image resolution of 13MP (4160 x 3120); can handle documents up to A3 size. Maximum document thickness: 10mm. The height of the scanner's camera above the document is adjustable. As fast as one second per scan. Portable: can be folded up for transportation. Can detect when you turn the page or change the document, look for the new page, and automatically take the next image. ABBYY OCR functionality built in. Scans to Word documents, PDF, Excel spreadsheets, or TIFF image files. Software for Windows (back to XP) and OS X.
Shows up as a UVC device under Linux (archived), so any image or video capture software that is UVC enabled can do the work for you.
Populus-Viewer is a tool for decentralized social annotation, built on pdfjs, wavesurfer.js and the Matrix protocol. You can use it to read PDFs, listen to audio, or watch videos, and have rich discussions in the margins, with your friends, classmates, or scholarly collaborators.
Each uploaded file is attached to a matrix space, and each annotation to the file becomes a room within that space. Populus-Viewer has been tested with synapse and dendrite, but should be compatible with any spec-compliant matrix server.
Changelog:
April 18, 2022: Twenty-eight documents relating to the 2020 presidential election, Donald Trump, and the Jan. 6 Capitol riot were published.
Acrossword is a small async wrapper around the SentenceBERT library. It has a convenient object-oriented API with two main purposes:
semantic search
zero-shot text classification
It's useful if you want to avoid larger bloated libraries with capabilities you don't need, and comes with zero fuss.
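For a sense of what semantic search boils down to, here is a minimal sketch in plain Python. It does not use Acrossword's API, which isn't shown here. The `cosine` and `search` helpers and the toy three-dimensional vectors are made up for illustration; they stand in for the sentence embeddings a SentenceBERT model would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def search(query_vec, corpus, top_k=3):
    """Rank (text, vector) pairs by similarity to query_vec, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical); a real model would produce much longer vectors.
CORPUS = [
    ("court filing about a zoning dispute", [0.9, 0.1, 0.0]),
    ("recipe for sourdough bread", [0.0, 0.2, 0.9]),
    ("appeal of a zoning decision", [0.8, 0.3, 0.1]),
]

if __name__ == "__main__":
    for score, text in search([1.0, 0.2, 0.0], CORPUS, top_k=2):
        print(f"{score:.3f}  {text}")
```

Zero-shot classification can be framed the same way: treat short descriptions of each label as the corpus and pick the label whose vector is closest to the text's.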
Has JSON and XML APIs: https://pacer.uscourts.gov/file-case/developer-resources
Needs an account.
Recoll is a desktop full-text search tool. Finds documents based on their contents as well as their file names. Can search most document formats, even if they're compressed (even Maildir/ and mailboxes). You may need external applications for text extraction. Based on Xapian. Primarily desktop but it could be run server-side. Indices are backwards-compatible.
Source code: https://framagit.org/medoc92/recoll
Flies on solid state storage!
Can be plugged into Searx: https://searx.github.io/searx/admin/engines/recoll.html
Fast, typo-tolerant search engine for building delightful search experiences. Has an API and a number of protocol modules for different languages. Written in C and C++.
Designed for people who don't want to fuck with Elasticsearch, they just want a document search engine. Lightweight, powerful, scalable. Tries to have smart defaults. Single executable. Uses far less memory than the usual Java-based search systems do. Tries to be flexible so you can build the search engine you need.
Looks like you define a JSON document with the stuff you want to be able to search and throw it over to the engine. Means you'll need to write some front-end tooling to extract the data you want to index, which might not be that big a deal. It could just be some shell scripts.
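That front-end tooling can be very small. A minimal sketch, assuming the engine accepts one JSON object per line (JSON Lines). The field names (`id`, `path`, `title`, `body`) and the `build_document` / `to_jsonl` helpers are hypothetical, and actually sending the output to the engine is left out because its import endpoint isn't covered in these notes.

```python
import json
from pathlib import Path


def build_document(path, root):
    """Turn one text file into a flat dict of searchable fields."""
    path = Path(path)
    text = path.read_text(encoding="utf-8", errors="replace")
    lines = text.splitlines()
    return {
        # Path relative to the indexed root, used as a stable identifier.
        "id": path.relative_to(root).as_posix(),
        "path": str(path),
        # First line of the file, or the file name stem if the file is empty.
        "title": lines[0].strip() if lines else path.stem,
        "body": text,
    }


def to_jsonl(root, pattern="*.txt"):
    """Serialize every matching file under root as one JSON object per line."""
    root = Path(root)
    docs = (build_document(p, root) for p in sorted(root.rglob(pattern)))
    return "\n".join(json.dumps(doc, ensure_ascii=False) for doc in docs)


if __name__ == "__main__":
    import sys

    # Usage: python make_docs.py /path/to/text/files > documents.jsonl
    print(to_jsonl(sys.argv[1] if len(sys.argv) > 1 else "."))
```

The script name `make_docs.py` is a placeholder too. A shell script would do the same job; Python just makes the JSON escaping harder to get wrong.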
A formal schema for representing a résumé or CV as a JSON document so that it's machine-readable.
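To give a feel for what a machine-readable résumé might look like, here is a rough sketch. The section and field names (`basics`, `work`, `skills` and everything inside them) are assumptions for illustration and have not been checked against the actual schema; the person and employer are placeholders, and `missing_sections` is a made-up helper, not part of any official tooling.

```python
import json

# Hypothetical section names; check the real schema before relying on them.
REQUIRED_SECTIONS = ("basics", "work", "skills")

# Placeholder data, not a real person or employer.
RESUME = {
    "basics": {"name": "Jane Placeholder", "email": "jane@example.com"},
    "work": [
        {"name": "Example Corp", "position": "Archivist", "startDate": "2019-04-01"},
    ],
    "skills": [{"name": "Cataloguing", "keywords": ["metadata", "OCR"]}],
}


def missing_sections(resume):
    """Return the required top-level sections that are absent, in order."""
    return [name for name in REQUIRED_SECTIONS if name not in resume]


if __name__ == "__main__":
    print(json.dumps(RESUME, indent=2))
    print("missing:", missing_sections(RESUME))
```

A real validator would check the document against the published schema rather than a hand-written list of sections.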
A fast, multi-threaded application that takes apart files, indexes them, and shoves them into Elasticsearch. Tries to be portable. Relies upon Elasticsearch, unfortunately. Indices can be transported elsewhere (say you've indexed offline storage media) and loaded into the engine.
A curated list of amazingly awesome open source intelligence tools and resources. Open-source intelligence (OSINT) is intelligence collected from publicly available sources. In the intelligence community (IC), the term "open" refers to overt, publicly available sources (as opposed to covert or clandestine sources).
Aleph is a tool for indexing large amounts of both documents (PDF, Word, HTML) and structured (CSV, XLS, SQL) data for easy browsing and search. It is built with investigative reporting as a primary use case. Aleph allows cross-referencing mentions of well-known entities (such as people and companies) against watchlists, e.g. from prior research or public datasets. Web-based search. Processing includes optical character recognition, language and encoding detection, and named entity extraction. Can load structured entity graph data from databases and CSV files, which allows navigation of complex datasets like company registries, sanctions lists, or procurement data.
A personal file system indexing and search application. Part of the GNOME desktop. Indexes file contents, metadata, and location to better help you find things. Also allows you to do your own tagging of stuff it keeps track of. Uses D-Bus for IPC and SPARQL for search. Uses multiple ontologies for different kinds of files (including multimedia content).
EBMUD recommends customers take some simple steps to be prepared for an emergency. Downloadable documents at this page.
Textricator is a tool for extracting text from PDFs and generating structured data (CSV or JSON). It can even work on OCR'ed documents. Describe what the document's contents look like with a YAML file and it'll extract the data using those fields. Can also be used as a Java library.
Free software for running your own search engine, explorer for discovery in large document collections, media monitoring, text analytics, document analysis, and text mining platform. Based on Apache Solr or Elasticsearch (open-source enterprise search), with open standards for Linked Data, Semantic Web, and Linked Open Data integration.
Usage tutorial here: https://www.opensemanticsearch.org/doc/tutorial
GitHub: https://github.com/opensemanticsearch
Of course it has an API: https://www.opensemanticsearch.org/doc/admin/rest-api