We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone. The Common Crawl corpus contains petabytes of data collected since 2008, including raw web page data, extracted metadata, and plain-text extracts. The dataset lives on Amazon S3 as part of the Amazon Web Services’ Open Data Sponsorships program, and the files can be downloaded entirely free over HTTP(S) or S3. Our goal is to democratize the data so everyone, not just big companies, can do high-quality research and analysis.
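As a minimal sketch of the HTTP(S) access path: the snippet below fetches the gzipped list of WARC file paths for one crawl and downloads the first archive. The base URL, crawl identifier, and path layout are assumptions based on Common Crawl's published conventions; check commoncrawl.org for the current crawl listings.

```python
# Minimal sketch: download one Common Crawl WARC file over plain HTTPS.
# BASE, CRAWL, and the warc.paths.gz layout are assumptions; consult the
# Common Crawl site for the crawls that actually exist.
import gzip
import urllib.request

BASE = "https://data.commoncrawl.org"   # public HTTP(S) endpoint (assumed)
CRAWL = "CC-MAIN-2023-50"               # example crawl identifier (assumed)

# 1. Fetch the gzipped list of WARC file paths for this crawl.
paths_url = f"{BASE}/crawl-data/{CRAWL}/warc.paths.gz"
with urllib.request.urlopen(paths_url) as resp:
    warc_paths = gzip.decompress(resp.read()).decode().splitlines()

# 2. Download the first WARC file (each is roughly 1 GB compressed).
first_warc = warc_paths[0]
print("Downloading", first_warc)
urllib.request.urlretrieve(f"{BASE}/{first_warc}", "sample.warc.gz")
```

The same paths work against S3 directly (e.g. with the AWS CLI) for bulk access.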
A multithreaded hyperlink checker that crawls a site and looks for 404s. Unfortunately, it is no longer maintained and is written in Python 2, but it is still useful.
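For a sense of the technique, here is a small Python 3 sketch of the same idea: fetch a page, extract its links, and check them for 404s in parallel with a thread pool. It checks a single page rather than crawling the whole site, and the entry URL is just an example, not part of the original tool.

```python
# Simplified single-page link checker (the original tool crawls a full site).
import urllib.error
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def check(url):
    """Return (url, status) where status is an HTTP code or an error string."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=10) as resp:
            return url, resp.status
    except urllib.error.HTTPError as e:
        return url, e.code            # 404s and other HTTP errors land here
    except Exception as e:
        return url, str(e)


def check_page(page_url, workers=8):
    html = urllib.request.urlopen(page_url, timeout=10).read().decode("utf-8", "replace")
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page URL and keep only http(s) targets.
    urls = {urllib.parse.urljoin(page_url, href) for href in parser.links}
    urls = {u for u in urls if u.startswith("http")}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, status in pool.map(check, urls):
            if status == 404:
                print("BROKEN:", url)


if __name__ == "__main__":
    check_page("https://example.com/")   # example entry point
```

A full crawler would additionally queue the links it discovers and track visited URLs so it stays within the target site.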