This software can run as a back-end for a web server (like nginx), trickling out Markov-chain-generated output. The intended use is tarpitting "AI" bots while feeding them useless data, slowly.
Source strings fed to the chain have been generated by the Postmodernism Generator.
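The mechanism is simple to sketch: build a word-level Markov chain from seed text, then emit generated words one at a time with a delay, so a bot's connection stays open while it receives nothing of value. A minimal stdlib-only sketch (not this project's actual code; the seed text and delay are placeholders):

```python
import random
import time
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed following it."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def trickle(chain, n_words, delay=0.0):
    """Yield one Markov-generated word every `delay` seconds."""
    word = random.choice(list(chain))
    for _ in range(n_words):
        yield word
        time.sleep(delay)  # this pause is what makes it a tarpit
        followers = chain.get(word)
        word = random.choice(followers) if followers else random.choice(list(chain))

# Placeholder seed text standing in for Postmodernism Generator output.
seed = ("the discourse of the subject is the discourse of the other "
        "and the other is the subject of the discourse")
random.seed(0)
output = " ".join(trickle(build_chain(seed), 20, delay=0.0))
print(output)
```

In production the delay would be set high (seconds per word) and the generator wired to the web server's streaming response.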
A web scraping and browser automation library for Python for building reliable crawlers. Extracts data for AI, LLMs, RAG, or GPTs, and downloads HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP, in both headful and headless modes, with proxy rotation.
Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.
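For flavor, the basic loop such libraries automate (dequeue a URL, fetch it, extract links, store results, enqueue unseen links) looks roughly like the stdlib-only sketch below. This illustrates the general workflow, not Crawlee's actual API, and the page contents are faked in a dict to avoid real HTTP:

```python
import json
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Stand-in for HTTP fetching; a real crawler would use urllib or an HTTP client.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B again</a>',
    "/b": "no links here",
}

def crawl(start):
    """Breadth-first crawl: visit each URL once, recording its outgoing links."""
    queue, seen, results = [start], set(), {}
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        parser = LinkExtractor()
        parser.feed(PAGES.get(url, ""))
        results[url] = parser.links
        queue.extend(parser.links)
    return results

results = crawl("/")
print(json.dumps(results, indent=2))  # persist in a machine-readable format
```

What a library adds on top of this skeleton is the hard part: retries, concurrency, proxy rotation, browser rendering, and bot-protection evasion.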
A company that does web scraping for you: automatic retries, lots of proxies, geolocation, CAPTCHA bypass (eh?), and JavaScript support. Has a library of scrapers for different online services.
The free tier includes only 1000 API calls; features are spread across multiple paid tiers.
A webapp for gathering data on stocks you might want to purchase. Builds a performance history for analysis and supports research code for arbitrary queries. Seems to require MongoDB for its back-end.
Docker webshit, but can be run outside of that context.
A Python module for scraping listing data from Craigslist.
A collection of awesome web crawling, scraping, and spidering projects in different languages.
A PHP application that scrapes some sites and generates RSS feeds for them.
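The output side of such a tool (turning scraped items into a feed) is straightforward. A rough illustration in Python rather than PHP, with made-up item data, producing a minimal RSS 2.0 document via the stdlib:

```python
import xml.etree.ElementTree as ET

def build_rss(title, link, items):
    """Build a minimal RSS 2.0 document from (title, url) item tuples."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    for item_title, item_url in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_url
    return ET.tostring(rss, encoding="unicode")

# Hypothetical items a scraper might have extracted from a site without a feed.
feed = build_rss("Example site", "https://example.com",
                 [("First post", "https://example.com/1"),
                  ("Second post", "https://example.com/2")])
print(feed)
```

A real generator would also fill in per-item `description` and `pubDate` elements scraped from the page.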