A Python LLM chat app using Django async views and LLAMA2 that lets you chat with multiple PDF documents. Components are chosen so everything can be self-hosted.
The project uses LLAMA2 hosted via Replicate; however, you can point it at your own self-hosted LLAMA2 instance instead.
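A minimal sketch of what the Replicate side of such an app might look like: an async Django view that wraps the blocking Replicate client so it can be awaited. The model slug, prompt handling, and view wiring here are assumptions for illustration, not taken from the project itself.

```python
# Hedged sketch: async Django view calling a Replicate-hosted LLAMA2 model.
# Assumes REPLICATE_API_TOKEN is set in the environment; the model slug
# and request shape are assumptions, not the project's actual code.
import replicate
from asgiref.sync import sync_to_async
from django.http import JsonResponse

@sync_to_async
def _run_llama(prompt: str) -> str:
    # replicate.run() blocks, so wrap it for use from an async view.
    chunks = replicate.run(
        "meta/llama-2-7b-chat",        # assumed model slug
        input={"prompt": prompt},
    )
    return "".join(chunks)             # the client streams the reply in chunks

async def chat(request):
    reply = await _run_llama(request.GET.get("q", ""))
    return JsonResponse({"reply": reply})
```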
Recoll is a desktop full-text search tool. It finds documents based on their contents as well as their file names, and can search most document formats, even compressed ones, including mail stores (Maildir and mbox). You may need external applications for text extraction. Based on Xapian. Primarily a desktop tool, but it can also be run server-side. Indices are backwards-compatible.
Source code: https://framagit.org/medoc92/recoll
Flies on solid state storage!
Can be plugged into Searx: https://searx.github.io/searx/admin/engines/recoll.html
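For server-side use, Recoll also ships a Python module for querying an existing index. A rough sketch follows; the call shapes are based on the Recoll manual, but attribute names can vary between versions, so treat it as an approximation.

```python
# Hedged sketch: query an existing Recoll index via its Python module.
# Assumes the index was already built (e.g. by recollindex).
from recoll import recoll

db = recoll.connect()                      # default config dir: ~/.recoll
query = db.query()
query.execute("kernel scheduler")          # plain full-text query
print(query.rowcount, "results")
for doc in query:
    print(doc.url, doc.title)              # field names may differ by version
```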
A search engine made for an honors class. Now made for everyone.
Manticore Search is an open-source search engine, born in 2017 as a continuation of the famous Sphinx Search engine. We took all the best from it, significantly improved its functionality, fixed hundreds of bugs, rewrote it almost completely internally, and left it all open source. That's what makes Manticore Search a modern, fast, lightweight, full-featured search engine.
We designed Manticore to provide multi-functional, relevant search with high performance, low resource consumption, and, importantly, easy integration. It doesn't matter what environment you're in, whether Windows, Linux, macOS, or Docker: you can always run Manticore Search and connect to it from different programming languages, over HTTP with JSON, or even with a MySQL client.
Github: https://github.com/manticoresoftware/manticoresearch
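Since Manticore speaks the MySQL wire protocol (port 9306 by default), one way to query it from Python is an ordinary MySQL client. A hedged sketch; the table and column names below are hypothetical.

```python
# Hedged sketch: query Manticore over its MySQL protocol with pymysql.
# The "articles" table and its columns are made up for illustration.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9306, user="", password="")
with conn.cursor() as cur:
    # MATCH() is Manticore's full-text search operator.
    cur.execute("SELECT id, title FROM articles WHERE MATCH(%s) LIMIT 10",
                ("full-text query",))
    for row in cur.fetchall():
        print(row)
conn.close()
```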
Annoy (Approximate Nearest Neighbors) is a C++ library with Python bindings for finding points in space that are close to a given query point. It builds large, read-only, file-based data structures that are mmapped into memory, so many processes can share the same data: indexes are static files that can be passed around and shared across processes. Annoy also decouples creating indexes from loading them, so an index can be mapped into memory quickly. Every user/item can be represented as a vector in f-dimensional space, and the library helps you search for similar users/items. Its original use case was many millions of tracks in a high-dimensional space, where memory usage is a prime concern.
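A minimal sketch of that workflow using Annoy's Python bindings: build an index, save it to a static file, then mmap it back through a second handle. The vectors here are random placeholders.

```python
# Hedged sketch of Annoy's Python API: build, save, and mmap an index.
import random
from annoy import AnnoyIndex

f = 40                                   # dimensionality of the vectors
index = AnnoyIndex(f, "angular")
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)                          # 10 trees; more trees, better recall
index.save("items.ann")                  # static file, shareable across processes

reader = AnnoyIndex(f, "angular")
reader.load("items.ann")                 # mmapped, so loading is near-instant
print(reader.get_nns_by_item(0, 5))      # 5 nearest neighbours of item 0
```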
Fast, typo-tolerant search engine for building delightful search experiences. Has an API and a number of protocol modules for different languages. Written in C and C++.
Designed for people who don't want to fuck with Elasticsearch; they just want a document search engine. Lightweight, powerful, scalable. Tries to have smart defaults. Ships as a single executable and uses far less memory than the usual Java-based search systems. Tries to be flexible so you can build the search engine you need.
Looks like you define a JSON document containing the fields you want to be able to search and throw it over to the engine. That means you'll need to write some front-end tooling to extract the data you want to index, which might not be that big a deal; it could just be some shell scripts.
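A sketch of that workflow in Python rather than shell: pull the fields you care about into a JSON document and POST it to the engine. The endpoint, port, and schema here are assumptions for illustration, not this engine's documented API.

```python
# Hypothetical sketch: extract fields, wrap them in JSON, hand them to the
# engine over HTTP. Endpoint and document schema are assumed, not documented.
import json
import urllib.request

doc = {
    "id": "report-2023-01",
    "title": "Quarterly report",
    "body": open("report.txt").read(),
}
req = urllib.request.Request(
    "http://localhost:8080/index/documents",   # assumed endpoint
    data=json.dumps(doc).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```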
A collection of awesome web crawling, scraping, and spidering projects in different languages.
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
Aleph is a tool for indexing large amounts of both unstructured documents (PDF, Word, HTML) and structured data (CSV, XLS, SQL) for easy browsing and search. It was built with investigative reporting as the primary use case. Aleph allows cross-referencing mentions of well-known entities (such as people and companies) against watchlists, e.g. from prior research or public datasets. Search is web-based. Processing includes optical character recognition, language and encoding detection, and named-entity extraction. It can load structured entity graph data from databases and CSV files, which allows navigation of complex datasets like company registries, sanctions lists, or procurement data.
A personal file system indexing and search application, part of the GNOME desktop. Indexes file contents, metadata, and location to help you find things. Also lets you apply your own tags to the items it keeps track of. Uses D-Bus for IPC and SPARQL for search. Uses multiple ontologies for different kinds of files (including multimedia content).
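For a sense of the SPARQL-over-D-Bus side, here is a hedged sketch using Tracker 3's GObject-introspected Python bindings; exact call signatures vary between Tracker versions, so treat it as an approximation.

```python
# Hedged sketch: ask GNOME Tracker for file URLs over D-Bus using SPARQL.
# Service name and ontology terms follow Tracker 3; details differ by version.
import gi
gi.require_version("Tracker", "3.0")
from gi.repository import Tracker

conn = Tracker.SparqlConnection.bus_new(
    "org.freedesktop.Tracker3.Miner.Files", None, None)
cursor = conn.query(
    "SELECT ?url WHERE { ?f a nfo:FileDataObject ; nie:url ?url } LIMIT 5",
    None)
while cursor.next(None):
    print(cursor.get_string(0)[0])   # get_string() returns (value, length)
cursor.close()
conn.close()
```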
Full-text indexing and search in Go. Index any data object in your software: text strings, numbers, dates, JSON. Extensible. Aims to have smart defaults. Supports multiple query types and multiple natural languages. Aims to be easy to use: three lines of code to index, another three to search.
Free software for running your own search engine: an explorer for discovery in large document collections, plus media monitoring, text analytics, document analysis, and a text mining platform. Based on the Apache Solr or Elasticsearch open-source enterprise search engines, and on open standards for Linked Data, Semantic Web, and Linked Open Data integration.
Usage tutorial here: https://www.opensemanticsearch.org/doc/tutorial
Github: https://github.com/opensemanticsearch
Of course it has an API: https://www.opensemanticsearch.org/doc/admin/rest-api
Mairix is software for indexing and searching your e-mail. It can handle Maildir, MH, and mbox mail stores. Indexing is incremental, so it can run on every new e-mail. Search results go into virtual mail folders in your mail client. Extremely fast.
Specific documentation for telling YaCy to start crawling a site: the mode in which to index it, where to start, how deeply to go, et cetera. (Tags: indexing, exocortex, spider)