Sonic is a fast, lightweight and schema-less search backend. It ingests search texts and identifier tuples that can then be queried in a matter of microseconds.
Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. It is capable of normalizing natural language search queries, auto-completing a search query and providing the most relevant results for a query. Sonic is an identifier index, rather than a document index; when queried, it returns IDs that can then be used to refer to the matched documents in an external database.
Sonic was designed with close attention to performance and code cleanliness. It aims to be crash-free and super-fast while putting minimal strain on server resources (our measurements have shown that Sonic, when under load, responds to search queries in the μs range, eats ~30MB RAM and has a low CPU footprint).
Available in Arch as extra/sonic.
Configuration docs: https://github.com/valeriansaliou/sonic/blob/master/CONFIGURATION.md
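To make the "identifier index, rather than a document index" idea concrete, here is a minimal in-memory sketch. It is not Sonic's implementation or API: the class name `IdentifierIndex` and its `push`, `query` and `suggest` methods are hypothetical names chosen for illustration, and the normalization is deliberately naive (lowercasing and splitting on non-alphanumerics).

```python
import re
from collections import defaultdict


class IdentifierIndex:
    """Toy identifier index (hypothetical, not Sonic's API).

    It stores only terms and object IDs, never the documents themselves.
    """

    def __init__(self):
        self._terms = defaultdict(set)  # normalized term -> set of object IDs

    @staticmethod
    def _normalize(text):
        # Naive normalization: lowercase, split on anything non-alphanumeric.
        return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

    def push(self, object_id, text):
        """Ingest a (text, identifier) pair."""
        for term in self._normalize(text):
            self._terms[term].add(object_id)

    def query(self, text):
        """Return the sorted IDs of objects matching every query term."""
        terms = self._normalize(text)
        if not terms:
            return []
        # .get() avoids creating empty entries for unknown terms.
        matches = [self._terms.get(t, set()) for t in terms]
        return sorted(set.intersection(*matches))

    def suggest(self, prefix):
        """Return indexed terms that start with the given prefix."""
        prefix = prefix.lower()
        return sorted(t for t in self._terms if t.startswith(prefix))


index = IdentifierIndex()
index.push("doc:1", "Archiving floppy disks")
index.push("doc:2", "Floppy drive control over USB")

print(index.query("floppy"))      # ['doc:1', 'doc:2']
print(index.query("floppy usb"))  # ['doc:2']
print(index.suggest("arch"))      # ['archiving']
```

The returned IDs would then be looked up in whatever external database holds the actual documents.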
Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications. Used by just about anything that uses FTR.
Software and documentation for getting media signals (FM RF) out of VHS VCRs, processing them, and writing them out to lossless digital formats for archival.
grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses a fork of wpull for crawling. It gives you a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more; the ability to add ignore patterns while a crawl is already running; an extensively tested default (global) ignore set as well as additional, optional ignore sets for forums, reddit, etc.; and duplicate page detection, where links are not followed on pages whose content duplicates an already-seen page.
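The duplicate page detection described above can be sketched roughly as follows. This is an illustration of the general idea (hash each page body and skip link extraction when the hash has been seen before), not grab-site's or wpull's actual code; the class name `DuplicatePageFilter` and its method are hypothetical.

```python
import hashlib


class DuplicatePageFilter:
    """Remembers page bodies by hash (hypothetical helper, not grab-site code)."""

    def __init__(self):
        self._seen = set()  # SHA-256 hex digests of bodies seen so far

    def should_follow_links(self, body: bytes) -> bool:
        """Return True the first time a body is seen, False for exact repeats."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


pages = DuplicatePageFilter()
print(pages.should_follow_links(b"<html>calendar, page 1</html>"))  # True
print(pages.should_follow_links(b"<html>calendar, page 1</html>"))  # False
print(pages.should_follow_links(b"<html>calendar, page 2</html>"))  # True
```

Exact hashing only catches byte-identical pages; a real crawler may need something more tolerant when pages differ by a timestamp or session token.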
Library of Alexandria (LoA in short) is a project that aims to collect and archive documents from the internet.
In our modern age, new text documents are born in the blink of an eye and then (often just as quickly) disappear from the internet. We find it a noble task to save these documents for future generations.
This project aims to support this noble goal in a scalable way. We want to make archival streamlined and easy to do even at a huge (terabyte / petabyte) scale. This way, we hope that more and more people can start their own collections and help the archiving effort.
Yamanote is a bookmarklet-based bookmarking web app. It’s a web application so you need to run it on a computer, or get a friend to run it for you. When you decide you want to bookmark a page on the web, you click on a Yamanote bookmarklet in your browser’s bookmarks bar (works great on desktop, and in Safari on iOS) to tell the Yamanote server about it. Any text you’ve selected will be added as a “comment” to the bookmark by Yamanote. This is fun because as you read, you can select interesting snippets and keep clicking the bookmarklet to build a personalized list of excerpts. You can add additional commentary to the bookmark in Yamanote, either by editing one of the excerpts made from the bookmarklet or by adding an entirely new comment with its own timestamp. Also, the first time you bookmark a URL, your browser will snapshot the entire webpage and send it to the Yamanote server as an archive (in technical terms, it’ll serialize the DOM). This is great for (1) paywalled content you had to log in to read, (2) Twitter, which makes it hard for things like Pinboard to archive, etc. The server will download any images (and optionally videos) in your bookmarked sites. You can browse Yamanote’s snapshot of the URL (it might look weird because we block custom JavaScript in the mirror, and lots of sites these days look weird with just HTML and CSS; shocking, I know). Nobody except you can see your bookmarks, comments, or archives.
A flatbed document and book scanner. Will also scan 3D objects that fit under the camera. Minimum of 13MP image resolution (4160 x 3120), can handle up to A3 size documents. Maximum document thickness: 10mm. The height of the scanner camera above the document is adjustable. As fast as one second per scan. Portable: can be folded up for transportation. Can detect when you turn the page or change the document, look for the new page, and automatically take the next image. ABBYY OCR functionality built in. Scans to Word documents, PDF, Excel spreadsheets, or TIFF image files. Software for Windows (back to XP) and OS X.
Shows up as a UVC device under Linux (archived), so any image or video capture software that is UVC enabled can do the work for you.
Keir Fraser’s Greaseweazle is a project for versatile floppy drive control over USB. By extracting the raw flux transitions from a drive, any diskette format can be captured and analyzed - PC, Amiga, Amstrad, PDP-11, many older electronic musical instruments, and industrial equipment. The Greaseweazle also supports writing to floppy disks. The design is fully open and comes with no license encumbrance.
A companion code library, Disk-Utilities, converts between flux images and multiple, standardized floppy disk image file formats. These can then be used in hardware floppy emulators, like the Gotek or FlashFloppy, or as disk images in hundreds of pure software emulators.
Very inexpensive! US$30!
The FluxEngine is a very cheap USB floppy disk interface capable of reading and writing exotic non-PC floppy disk formats. It allows you to use a conventional PC drive to accept Amiga disks, CLV Macintosh disks, bizarre 128-sector CP/M disks, and other weird and bizarre formats. (Although not all of these are supported yet. I could really use samples.)
The hardware consists of a single, commodity part with a floppy drive connector soldered onto it. No ordering custom boards, no fiddly surface mount assembly, and no fuss: nineteen simple solder joints and you’re done. You can make one for $15 (plus shipping).
Github: https://github.com/davidgiven/fluxengine
I might even have the board it requires in my drawers someplace. It looks suspiciously familiar.
As I haven't found a good source on archiving your personal collection of Atari software on floppy disk, I documented my own progress, so others might benefit from it.
I started looking for methods to copy my floppies to a PC so that when my 1050(s) break down, I still have some of my source code, letters, games, etc. The only hardware I have is fairly recent: Apple machines, PC (Intel) 'antiques' (albeit almost 20 years younger than my Ataris), laptops from Y2K or a little more recent, and several 'embedded' devices in the form of Arduinos and Raspberry Pis. So I started this journey by looking into the various methods available to hook up one of these devices to my Atari and 1050 setup so I could start archiving.
Paperless-ngx is a document management system that transforms your physical documents into a searchable online archive so you can keep, well, less paper. Paperless-ngx was forked from paperless-ng to continue the great work and to distribute the responsibility of supporting and advancing the project among a team of people.
Paperless-ngx is a webapp that indexes your scanned documents, lets you search them easily, and stores metadata alongside them. Paperless-ngx does not control your scanner; it only helps you deal with what your scanner produces.
Store archived documents with an embedded OCR text layer, while keeping originals available.
RadioWitness is a P25 public safety radio archive with a web application and support for cryptographically authenticated mirrors through Dat Protocol. Running this software requires two or more RTLSDR radios and one or more local P25 "Phase 1" public safety radio networks.
It looks like reading through the documentation alone will help in building a trunk tracker.
Chupacabra enables users to archive and discuss web content free of surveillance and commercial influence. It can be used for personal research, micro-blogging, or discussing dank memes. Chupa posts are standalone archives of web content (a single HTML file with images embedded and scripts removed) and a corresponding Matrix message pointing to the mxc:// URI where the archive can be fetched. Posts can be discussed in real time in the channel in which they were shared. Behind the scenes, all post discussion is composed of replies to the post's Matrix message.
Perma.cc helps scholars, journals, courts, and others create permanent records of the web sources they cite. Perma.cc is simple, free to use, and is built and supported by libraries. Free to use at 10 records per month, unless you're an academic. Designed specifically so that it can be used for citations. Has an API. Open source. Written in Python, uses Django. Stores things as WARC files.
Zopfli is a data compression algorithm from researchers at Google which seems to produce smaller output than most existing compression algorithms out there. The trade-off is that it can take much longer to finish compressing files, but storage space and transmission time are saved. It is recommended for compress-once-distribute-many-times use cases.
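The time-versus-size trade-off can be illustrated with Python's standard `zlib` module as a stand-in. This is not Zopfli itself (which needs a separate binding or binary); it only compares zlib's fastest and most thorough compression levels on the same input. The helper name `compare_levels` and the sample text are made up for the example.

```python
import zlib


def compare_levels(data: bytes) -> dict:
    """Compress data at zlib levels 1 and 9 and report the resulting sizes.

    Stand-in illustration only: zlib's levels, not Zopfli.
    """
    sizes = {"original": len(data)}
    for level in (1, 9):
        packed = zlib.compress(data, level)
        # Either way the output must decompress back to the same bytes.
        if zlib.decompress(packed) != data:
            raise ValueError("round trip failed")
        sizes[f"level_{level}"] = len(packed)
    return sizes


sample = b"Give the crawler a URL and it will write WARC files.\n" * 500
sizes = compare_levels(sample)
print(sizes)
```

The point of "compress once, distribute many times" is that the extra compression time is paid once, while every later download and every stored copy benefits from the smaller size.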
A howto for activists that describes how to capture and archive video footage. Includes archival of metadata, keeping files intact, raw and edited video concerns, organization, storage concerns, cataloging, sharing, and preservation. Treats it in a verifiable, library-like manner. Can be downloaded, too.
witness.org's library of reference and training materials for activists, instructors, and allies. Covers video production, recording LEO actions, archival, how to work with survivors, camera specifics and trainings, data science, covering protests, collecting evidence, crimes, and field guides.
A set of tools written in Python for downloading and archiving wiki sites. Tries to find and use the API if it's enabled on the site; a path to the API can be specified if need be.
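The "find the API, or take an explicit path" behaviour might look something like the sketch below. It assumes a MediaWiki-style `api.php` endpoint, which the description above does not state; the function name `candidate_api_urls` and the list of guessed locations are hypothetical and are not taken from the tools themselves. No network requests are made; the function only builds the URLs a tool could then probe.

```python
from urllib.parse import urljoin

# Assumed common locations for a MediaWiki-style API endpoint (a guess).
GUESSED_API_PATHS = ("api.php", "w/api.php", "wiki/api.php")


def candidate_api_urls(site_url, api_path=None):
    """Return API URLs to try, honouring an explicit path if one is given."""
    # Ensure the base ends with a slash so relative paths append to it.
    base = site_url if site_url.endswith("/") else site_url + "/"
    if api_path:
        return [urljoin(base, api_path)]
    return [urljoin(base, path) for path in GUESSED_API_PATHS]


print(candidate_api_urls("https://wiki.example.org"))
print(candidate_api_urls("https://wiki.example.org", api_path="/mw/api.php"))
```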
An online archival service for academics, like archive.fo. Requires an account. Has a REST API. Written in Python: https://github.com/harvard-lil/perma