A data hoarder’s dream come true: bundle any web page into a single HTML file. You can finally replace those gazillion open tabs with a gazillion .html files stored somewhere on your precious little drive.
Unlike the conventional “Save page as”, monolith not only saves the target document, it embeds CSS, image, and JavaScript assets all at once, producing a single HTML5 document that is a joy to store and share.
Compared to saving websites with wget -mpk, this tool embeds all assets as data URLs, so browsers render the saved page exactly as it appeared online, even when no network connection is available.
In the Arch package repos.
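For reference, the basic invocations look roughly like this sketch (the URL and output filename are placeholders; monolith's -o flag writes the bundled document, while wget's -mpk mirrors the page into a directory tree and rewrites its links):

    # bundle a page and everything it references into one file
    monolith https://example.com -o example.html

    # the wget approach: mirror the page, fetch page requisites, convert links
    wget -mpk https://example.com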
This repository, created by Webz.io, provides free datasets of publicly available news articles identified as originating from fake news websites. We release new datasets weekly, each containing approximately 1,000 articles sourced from these flagged sites. These datasets provide valuable resources for researchers, analysts, and journalists studying misinformation and disinformation trends.
Updated weekly. Sites in the datasets are verified through multiple sources. Each record includes metadata such as sentiment analysis, categories, publication dates, and source trust level. Covers politics, health, finance, and other key domains where misinformation is prevalent.
Each dataset is a .zip archive containing multiple JSON documents.
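A minimal sketch of poking at one of these archives in Python; the archive name below is a placeholder and the JSON schema varies by release, so this just lists each document's top-level keys:

    import json
    import zipfile

    # iterate over the JSON documents inside a downloaded dataset archive
    # (the filename is a placeholder, not the real release name)
    with zipfile.ZipFile("fake_news_dataset.zip") as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue
            with archive.open(name) as fh:
                doc = json.load(fh)
            # avoid assuming a schema: just report what metadata each document carries
            keys = sorted(doc.keys()) if isinstance(doc, dict) else type(doc).__name__
            print(name, keys)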
The mission is simple: to preserve and share LGBTQIA+ history. Originally a small project to organize historical resources, it has grown into a free, searchable digital archive accessible to educators, researchers, and anyone interested in queer history. The initiative is dedicated to making LGBTQIA+ history available to all and aims to expand into a comprehensive portal for LGBTQIA+ education. Partnerships with like-minded organizations are welcomed to preserve history through shared projects and initiatives.
Old'aVista is a search engine focused on personal websites that used to be hosted on services like Geocities, Angelfire, AOL, Xoom and so on. It is in no way meant to compete with any of the famous search engines, as it's focused on finding historic personal websites. The data was acquired by scraping pages from the Internet Archive: I basically used a Node application I built, seeded with some starting links, saving every link I found in a queue and the text from each page in the index. Old'aVista's design is based on the 1999 version of the defunct AltaVista search engine, and the name of the website is a wordplay on AltaVista. My original idea was to get old search engines and make them functional again, but I decided to make this website its own thing while maintaining the nostalgia factor.
Bypass Paywalls web browser extension for Firefox and its derivatives. This extension is not available in the Firefox Add-ons directory.
The source code should contain the resources I need to add more sites to Antigone.
This site was built to share and revel in each other’s personal sites. Witness these in wonderment and awe. Immaculate. Stunning. How did they do that? Yes, you should definitely get around to redesigning yours soon.
They're useless. They're fun. They waste time.
Curated list of personal blogs on any topic, by mataroa.blog.
Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications. Used by just about anything built on FTR (FiveFilters' Full-Text RSS).
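For a sense of the format: each rule file is named after the site's domain (e.g. example.com.txt) and holds plain directive: XPath lines. The XPaths below are invented for illustration; the repository's own configs are the authoritative reference.

    title: //h1[@class='headline']
    body: //div[@class='article-body']
    strip: //div[@class='related-stories']
    test_url: http://example.com/some-article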
The Lurker's Guide to Babylon 5 has been checked into GitHub.
A curated list of assets available on the Internet related to Geocaching.
Geocaching is an outdoor recreational activity, in which participants use a Global Positioning System receiver or mobile device and other navigational techniques to hide and seek containers, called geocaches or caches, at specific locations marked by coordinates all over the world.
All of the paywall removers in one place. Simply enter the URL of the article and click the archive buttons to remove any paywall.
Use Python to map a website's external-facing links, then apply D3 to visualize those outbound connections as a network graph.
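The Python half might look roughly like this sketch, using requests and BeautifulSoup as stand-ins for whatever the post actually uses; the resulting domains would then be serialized into the nodes-and-links JSON that D3 expects:

    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def external_links(url):
        """Return the set of outbound domains linked from a single page."""
        page = requests.get(url, timeout=10)
        soup = BeautifulSoup(page.text, "html.parser")
        own_host = urlparse(url).netloc
        found = set()
        for anchor in soup.find_all("a", href=True):
            # resolve relative hrefs, then keep only links pointing off-site
            target = urlparse(urljoin(url, anchor["href"]))
            if target.scheme in ("http", "https") and target.netloc and target.netloc != own_host:
                found.add(target.netloc)
        return found

    print(external_links("https://example.com"))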
You are viewing a humanly curated list of fine personal & independent blogs that are updated regularly. No algorithms ever!
Webcrawlers/bots often identify themselves in the user agent string. Well, it turns out that, up until now, a huge majority of my bandwidth usage has come from bots scraping my site thousands of times a day.
A robots.txt file can advertise that you don't want bots to crawl your site. But it's completely voluntary—a bot may happily ignore it and scrape your site anyway. And I'm fine with webcrawlers indexing my site, so that it might be more discoverable. It's the bandwidth hogs that I want to block.
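For illustration, a robots.txt along these lines asks one named heavy crawler to stay away while leaving everyone else free to index; the bot name is made up, and as noted above, compliance is entirely voluntary:

    User-agent: ExampleHeavyBot
    Disallow: /

    User-agent: *
    Disallow: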
A personal website that bills itself as unix_surrealism.
A blog post detailing OpenAI's IP ranges and suggestions for blocking them.
Ghostarchive is a free-to-use archiving website designed to be fast and easy to use.
The process is simple: all one has to do is enter the link of the page they would like archived, and Ghostarchive will store a snapshot of the website as it appeared at the time of archival. The snapshot will include any images and framed content. For some websites, videos are also saved.
There are two main archival replay systems. One is based on Webrecorder technology, which can execute scripts in a sandbox, allowing for "high-fidelity" snapshot replay; this Webrecorder replay relies on Service Workers, however, which require JavaScript and an up-to-date browser, and you also need to be able to connect to the HTTPS version of the site.
For readers who choose to browse the web with JavaScript disabled, use the HTTP version of the site, or use a browser that does not support Service Workers, an additional archival replay system is available and can be accessed from any archived page. This "noscript" system does not rely on JavaScript or any fancy web technologies. For the vast majority of sites both replay systems work, but some sites will only work with Webrecorder, and others only with the "noscript" replay system.
Ghostarchive is in the process of adding support for the MementoWeb API. In the meantime, making direct web queries and scraping the result should suffice.
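Something along these lines could work for that in Python; note that the search endpoint and the result-link pattern are assumptions rather than a documented API, so inspect the live site before relying on either:

    import requests
    from bs4 import BeautifulSoup

    def find_snapshots(target_url):
        # ASSUMPTION: a /search?term= endpoint and /archive/ result links;
        # verify both against the live site, since neither is a documented API
        resp = requests.get("https://ghostarchive.org/search",
                            params={"term": target_url}, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)
                if "/archive/" in a["href"]]

    print(find_snapshots("https://example.com"))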
A site where somebody aggregates links to personal and weird little sites, like the ones we all remember from the early days of the web.