A web scraping and browser automation library for Python for building reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP, in both headful and headless modes, with proxy rotation.
Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.
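A minimal sketch of what a Crawlee for Python crawler looks like, based on its BeautifulSoup crawler. Import paths and class names have shifted between Crawlee versions, so treat this as illustrative rather than canonical:

```python
# Illustrative sketch of a Crawlee for Python crawler. Exact import paths and
# class names vary between Crawlee versions - check the current docs.
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Cap the crawl so the example stays small.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Scrape the page title and persist it to Crawlee's default dataset.
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })
        # Follow links found on the page, within the crawl limits.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```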
Web crawlers and bots often identify themselves in the user agent string. It turns out that, up until now, a huge majority of my bandwidth usage has come from bots scraping my site thousands of times a day.
A robots.txt file can advertise that you don't want bots to crawl your site. But it's completely voluntary—a bot may happily ignore it and scrape your site anyway. And I'm fine with webcrawlers indexing my site, so that it might be more discoverable. It's the bandwidth hogs that I want to block.
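To make the "voluntary" point concrete, here is a small sketch of how a polite crawler checks robots.txt before fetching, using Python's standard urllib.robotparser (the domain and user agent are placeholders). A misbehaving bot simply skips this step, which is why robots.txt alone won't save bandwidth:

```python
# Sketch: how a *polite* crawler honours robots.txt. Nothing forces a bot to
# do this, which is why robots.txt can't stop bandwidth hogs on its own.
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"        # placeholder domain
USER_AGENT = "ExampleCrawler/1.0"   # hypothetical bot name

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = f"{SITE}/expensive/archive/page"
if rp.can_fetch(USER_AGENT, url):
    print("robots.txt allows this fetch")
else:
    print("robots.txt disallows this fetch; a well-behaved bot stops here")
```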
grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses a fork of wpull for crawling. It gives you a dashboard of all your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more; the ability to add ignore patterns while a crawl is already running; an extensively tested default (global) ignore set as well as additional optional ignore sets for forums, Reddit, etc.; and duplicate page detection, so links are not followed on pages whose content duplicates an already-seen page.
Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on usability and speed. At the moment it is little more than an idea together with a proof of concept implementation of the web front-end and search technology on a small index. Our vision is a community working to provide top quality search particularly for hackers, funded purely by donations.
We now have a distributed crawler that runs on our volunteers' machines! If you have Firefox you can help out by installing our extension. This will crawl the web in the background, retrieving one page a second. It does not use or access any of your personal data. Instead it crawls the web at random, using the top scoring sites on Hacker News as seed pages. After extracting a summary of each page, it batches these up and sends the data to a central server to be stored and indexed.
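The flow described above (fetch a page, extract a summary, batch results, send them to a central server) roughly looks like the sketch below. The endpoint, payload shape, and helper names are hypothetical; this is not Mwmbl's actual extension or client code, just an illustration of the described loop:

```python
# Hypothetical sketch of the crawl -> summarise -> batch -> submit loop.
# The endpoint, payload format, and names are made up for illustration.
import requests
from bs4 import BeautifulSoup

BATCH_ENDPOINT = "https://example.org/api/v1/batches"  # placeholder URL
SEED_URLS = ["https://news.ycombinator.com/"]          # seeds drawn from top HN sites


def summarise(url: str) -> dict:
    """Fetch one page and extract a short summary of it."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    paragraph = soup.find("p")
    excerpt = paragraph.get_text(" ", strip=True)[:300] if paragraph else ""
    return {"url": url, "title": title, "extract": excerpt}


def submit(batch: list[dict]) -> None:
    """Send a batch of page summaries to the central server for indexing."""
    requests.post(BATCH_ENDPOINT, json={"items": batch}, timeout=10)


if __name__ == "__main__":
    batch = [summarise(url) for url in SEED_URLS]
    submit(batch)
```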
Seems to require Postgres.
If you try installing it with Poetry on a newer version of Python, the install will fail (the project specifically requires <= v3.11), but if you pick apart poetry.lock and do things manually you might have better luck.
Create an ai.txt file for your website to set permissions for text and data mining. Use the toggles to allow or block your content from being used to train AI models. By default, all content is opted out. Selecting "allow" for a content type tells data miners that they may use content of that media type on your website.
There is no guarantee that anybody will ever obey this, but it can't hurt to try.
Zimit is a scraper that allows you to create a ZIM file from any website.
While we would like to support as many websites as possible, making an offline archive of any website with a versatile tool obviously has some limitations.
Most capabilities and known limitations are documented in warc2zim README. There are also some limitations in Browsertrix Crawler (used to fetch the website) and wombat (used to properly replay dynamic web requests), but these are not (yet?) clearly documented.
Zimit runs a fully automated browser-based crawl of a website property and produces a ZIM of the crawled content. Zimit runs in a Docker container.
The system runs a website crawl with Browsertrix Crawler, which produces WARC files, and then converts the crawled WARC files into a single ZIM using warc2zim. After the crawl is done, warc2zim writes the ZIM to the /output directory, which should be mounted as a volume so the ZIM isn't lost when the container stops.
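As a rough sketch of what an invocation looks like (the image name and flags follow older published Zimit examples and may have changed in current releases, so confirm with `zimit --help` inside the image), with /output mounted so the resulting ZIM survives the container:

```python
# Rough sketch of driving Zimit's Docker image from Python. The image name and
# zimit flags are assumptions based on older examples; verify against the
# current Zimit documentation before relying on them.
import subprocess

subprocess.run(
    [
        "docker", "run",
        "-v", "/tmp/zimit-output:/output",  # mount /output so the ZIM survives the container
        "ghcr.io/openzim/zimit", "zimit",
        "--url", "https://example.com",     # site to crawl (placeholder)
        "--name", "example",                # base name of the resulting ZIM
    ],
    check=True,
)
```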
A collection of awesome web crawl, scraping, and spidering projects in different languages.
The homepage of a distributed search engine project. The project involves downloading and running a cross-platform spider (available for Windows, Linux, FreeBSD, MacOSX, and pretty much any OS which can run Mono) that will then crawl the web and upload what it finds to the project. This can use lots of bandwidth so consider carefully before joining in.
A search engine for Tor hidden services. The problem is that it runs on the public Net, so if you don't want your services known you'll have to take additional measures. It also exposes your activity to the public Net, so don't expect much privacy.