Recoll is a desktop full-text search tool. Finds documents based on their contents as well as their file names. Can search most document formats, even if they're compressed (even Maildir/ and mailboxes). You may need external applications for text extraction. Based on Xapian. Primarily desktop but it could be run server-side. Indices are backwards-compatible.
Source code: https://framagit.org/medoc92/recoll
Flies on solid state storage!
Can be plugged into Searx: https://searx.github.io/searx/admin/engines/recoll.html
Fast, typo-tolerant search engine for building delightful search experiences. Has an API and a number of protocol modules for different languages. Written in C and C++.
Designed for people who don't want to fuck with Elasticsearch, they just want a document search engine. Lightweight, powerful, scalable. Tries to have smart defaults. Single executable. Uses far less memory than the usual Java-based search systems do. Tries to be flexible so you can build the search engine you need.
Looks like you define a JSON document with the stuff you want to be able to search and throw it over to the engine. Means you'll need to write some front-end tooling to extract the data you want to index, which might not be that big a deal. It could just be some shell scripts.
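A minimal sketch of what that front-end tooling might look like, in Python rather than shell. It walks a directory of text files and writes one JSON document per line, ready to be handed to whatever import mechanism the engine offers. The field names (`id`, `title`, `body`), the `*.txt` pattern, and the JSON Lines output are assumptions made for illustration, not the engine's actual schema or API.

```python
import json
from pathlib import Path


def build_document(path: Path, max_chars: int = 10_000) -> dict:
    """Turn one text file into a flat, JSON-serializable record."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return {
        # Hypothetical field names; rename to match the engine's schema.
        "id": str(path),
        "title": path.name,
        "body": text[:max_chars],
    }


def export_documents(root: Path, out_file: Path, pattern: str = "*.txt") -> int:
    """Write one JSON document per line for every file under root matching pattern."""
    count = 0
    with out_file.open("w", encoding="utf-8") as out:
        for path in sorted(root.rglob(pattern)):
            if path.is_file():
                out.write(json.dumps(build_document(path)) + "\n")
                count += 1
    return count
```

Calling `export_documents(Path("notes"), Path("notes.jsonl"))` would produce a file that can be posted to the engine in a separate step, which keeps the extraction logic independent of any one search back-end.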
A formal schema for representing a résumé or CV as a JSON document so that it's machine readable.
A fast, multi-threaded application that takes apart files, indexes them, and shoves them into Elasticsearch. Tries to be portable. Relies upon Elasticsearch, unfortunately. Indices can be transported elsewhere (say you've indexed offline storage media) and loaded into the engine.
A curated list of amazingly awesome open source intelligence tools and resources. Open-source intelligence (OSINT) is intelligence collected from publicly available sources. In the intelligence community (IC), the term "open" refers to overt, publicly available sources (as opposed to covert or clandestine sources).
Aleph is a tool for indexing large amounts of both documents (PDF, Word, HTML) and structured (CSV, XLS, SQL) data for easy browsing and search. It is built with investigative reporting as a primary use case. Aleph allows cross-referencing mentions of well-known entities (such as people and companies) against watchlists, e.g. from prior research or public datasets. Web-based search. Processing includes optical character recognition, language and encoding detection, and named entity extraction. Can load structured entity graph data from databases and CSV files, which allows navigation of complex datasets like company registries, sanctions lists, or procurement data.
A personal file system indexing and search application. Part of the GNOME desktop. Indexes file contents, metadata, and location to better help you find things. Also allows you to do your own tagging of stuff it keeps track of. Uses D-Bus for IPC and SPARQL for search. Uses multiple ontologies for different kinds of files (including multimedia content).
EBMUD recommends customers take some simple steps to be prepared for an emergency. Downloadable documents at this page.
Textricator is a tool for extracting text from PDFs and generating structured data (CSV or JSON). It can even work on OCR'ed documents. Describe what the document's contents look like with a YAML file and it'll extract the data using those fields. Can also be used as a Java library.
A free utility which extracts text from damaged Microsoft Word files, which can then be saved into a new file.
How to extract and install the new Microsoft Office fonts without needing Office 2007 or Windows.
An excellent article from the Internet Storm Center about carving executables out of other sorts of files (like .rtf documents) for the purposes of identification and reverse engineering.
F/OSS software written in Java to split, merge, extract pages or images, and alter the contents of .PDF files in other ways. Works pretty well on Linux, haven't tried Windows yet.
CozyCloud is an open source cloud application set that is personal, i.e., it's meant for one person to use for productivity and information organization. You set it up and it's yours, no one can take it (or your data) away from you. Comes out of the box with a webmail client which unifies all of your e-mail accounts, a notes and document editor and manager, a to-do tracker, a bookmark manager, a financial account manager, and an RSS feed reader. Includes a framework for developing new apps, which are appearing all the time. Built on top of node.js.
MyMemory is an online translation system that uses both machine translation and human-contributed translations, probably with some form of machine learning on the back-end. Users can upload files with their own translations to improve the service's accuracy. These documents, called memories, can either be public or private. They are also working to make translations more readily searchable. They have a REST API that we can use.
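A rough sketch of calling that REST API from Python using only the standard library. The endpoint URL, the `q` and `langpair` query parameters, and the `responseData.translatedText` response field are written from memory of the public API and should be treated as assumptions to check against the current MyMemory documentation; the function names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the MyMemory API documentation.
API_URL = "https://api.mymemory.translated.net/get"


def build_request_url(text: str, source: str, target: str) -> str:
    """Build the GET URL for one translation request (e.g. source='en', target='it')."""
    query = urlencode({"q": text, "langpair": f"{source}|{target}"})
    return f"{API_URL}?{query}"


def extract_translation(payload: dict) -> str:
    """Pull the translated string out of a decoded response (assumed response shape)."""
    return payload["responseData"]["translatedText"]


def translate(text: str, source: str, target: str, timeout: float = 10.0) -> str:
    """Perform the HTTP request and return the translated text."""
    with urlopen(build_request_url(text, source, target), timeout=timeout) as resp:
        return extract_translation(json.load(resp))
```

Keeping URL construction and response parsing in separate functions means both can be tested offline; only `translate()` touches the network.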
A FOSS platform for analysing and visualizing large sets of documents. Designed with investigative journalists in mind but has other uses. Has a plugin system. Can be run as a server or locally (https://github.com/overview/overview-local). Designed to run as a Docker container (bluh) but running it as-is shouldn't be difficult. Written in Scala (wtf?). Seems to use PostgreSQL as its back-end.