This is an open list of web crawlers associated with AI companies and the training of LLMs to block. We encourage you to contribute to and implement this list on your own site.
You can subscribe to updates via the releases feed: https://github.com/ai-robots-txt/ai.robots.txt/releases.atom
If you just want to pull a robots.txt file: https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.txt
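Once the file has been pulled, it may be useful to see which crawlers it shuts out entirely. The following is a minimal sketch rather than anything shipped with the project: the function name blocked_user_agents and the SAMPLE text are made up for illustration, and it assumes the file uses the ordinary "User-agent" / "Disallow" group layout. The sample stands in for the downloaded file so the example runs without network access.

```python
def blocked_user_agents(robots_txt: str) -> list[str]:
    """Return the user agents that a robots.txt text bars from the whole site."""
    blocked: list[str] = []
    group: list[str] = []   # user agents named by the group currently being read
    in_rules = False        # True once the current group has reached a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:    # a rule line closed the previous group
                group, in_rules = [], False
            group.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.extend(a for a in group if a not in blocked)
    return blocked


# Illustrative sample only; not the actual contents of the published file.
SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

print(blocked_user_agents(SAMPLE))  # ['GPTBot', 'CCBot']
```

To run it against the real list, read the downloaded robots.txt from disk and pass its text in place of SAMPLE.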
uBlacklist lists for the 16 companies that dominate search results. These lists were inspired by the HouseFresh article "How Google is killing independent sites like ours" and Detailed.com's "How 16 Companies are Dominating the World’s Google Search Results".
A collection of awesome resources for running your own federated social media website.
A huge blocklist of sites that contain AI generated content, for the purposes of cleaning image search engines (Google Search, DuckDuckGo, and Bing) with uBlock Origin or uBlacklist. There is also a Pi-Hole compatible list in the repo.
list.txt can probably be processed and used to build a blocking database for search bots.
uBlacklist is a browser extension which prevents blacklisted sites from appearing in Google, Bing or DDG search results. This is a directory of blocklists that can be used with it.
A selection of blocklists for uBlacklist. Includes AI generated content sites, website clones, specific problem sites (like Pinterest), spam and SEO sites. Each appears to have its own Git repo.
Each list appears to have wildcarded URLs in it (e.g., *://algebra.com/*), which might or might not be useful in other contexts.
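For those other contexts, the wildcarded entries can be reduced to bare hostnames. This is a rough sketch, assuming the rules are simple match patterns like the algebra.com example above; the names HOST_RULE and hosts_from_rules are hypothetical, example.com is a placeholder, and anything that is not such a pattern (comments, regular-expression rules) is skipped rather than interpreted.

```python
import re

# Matches simple match patterns such as *://algebra.com/* or *://*.example.com/*
# and captures the hostname without any leading "*." wildcard.
HOST_RULE = re.compile(r"^(?:\*|https?)://(?:\*\.)?([^/*\s]+)/")


def hosts_from_rules(text: str) -> set[str]:
    """Collect the hostnames named by wildcarded URL rules, one rule per line."""
    hosts = set()
    for line in text.splitlines():
        match = HOST_RULE.match(line.strip())
        if match:   # lines that are not simple match patterns are ignored
            hosts.add(match.group(1).lower())
    return hosts


# Illustrative input; only the first entry comes from the lists described above.
RULES = """\
*://algebra.com/*
*://*.example.com/*
# a comment line
/regex-style-rule/
"""

print(sorted(hosts_from_rules(RULES)))  # ['algebra.com', 'example.com']
```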
This folder contains scripts to generate the blocklist files. See the blocklists folder for blocklist files.
This is a curated place to find server blocklists for your own use. An algorithm combines multiple Trusted Source blocklists together and gives you a great deal of choice on which blocklist you want to use, along with transparency into how these are derived.
I'm sharing this with others who want to start their Mastodon instance with a sensible list of domains that should be defederated. Download any of the lists on this page you like, depending on your desired amount of blocking.
The blocklist that chaos.social maintains. Updated regularly.
We have #fediblock to let others know of bad people or instances, but some people find it hard to scroll through a ton of posts and verify the validity of the post. Sometimes an instance isn't bad, but has stuff you may not want to see or is inappropriate for your instance, such as porn.
By "bad instance(s)," I mean instances that are generally bigoted, pedophilic, and/or host genuinely harmful content.
A list of instances in the Fediverse that are blocked and the reasons for it. Updated recently.
This is an open project to maintain a list of domain names that serve YouTube ads. The original project only produced a Pi-hole blocklist, but this new version automatically generates multiple list formats, including Pi-Hole compatible lists.
A blocklist of QAnon, conspiracy, fake news, and Nazi websites, provided in formats for multiple applications, including web browser adblockers, DNS servers, and even /etc/hosts. It looks like the lists (which are substantially identical in content) could be used to compile a database of known-bad domains. IPv4 and IPv6 supported.
A curated blocklist of known fake news sites, suitable for use with adblockers or other countermeasures. Still updated fairly frequently.
The list itself, suitable for adding to a Pi-Hole or adblocking addon: https://raw.githubusercontent.com/StevenBlack/hosts/master/extensions/fakenews/hosts
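A hosts-format list like this one can also be boiled down to a plain set of domains for use outside a Pi-Hole. The sketch below assumes the usual hosts layout of a sink address followed by one or more names; the function name domains_from_hosts, the SAMPLE text, and its .example domains are invented for illustration and are not taken from the list.

```python
# Addresses that hosts-style blocklists commonly point blocked names at.
SINK_ADDRESSES = {"0.0.0.0", "127.0.0.1", "::", "::1"}
# Housekeeping names that appear in hosts files but are not blocklist entries.
LOCAL_NAMES = {"localhost", "localhost.localdomain", "local", "broadcasthost", "0.0.0.0"}


def domains_from_hosts(text: str) -> set[str]:
    """Extract blocked domain names from hosts-format text."""
    domains = set()
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()   # strip comments, split on whitespace
        if len(fields) < 2 or fields[0] not in SINK_ADDRESSES:
            continue
        for name in fields[1:]:
            name = name.lower()
            if name not in LOCAL_NAMES:
                domains.add(name)
    return domains


# Illustrative sample; the domains are placeholders.
SAMPLE = """\
# sample hosts file
127.0.0.1 localhost
0.0.0.0 fake-news.example   # inline comment
0.0.0.0 another.example
"""

print(sorted(domains_from_hosts(SAMPLE)))  # ['another.example', 'fake-news.example']
```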
iblocklist.com makes available many lists of IP addresses in several formats that can be dropped into firewalls or applications to prevent connection attempts from those hosts. Among the lists are known spammers, spyware servers, open proxies, advertising services, governments, and anonymizing services. Useful for perimeter security.
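Where a firewall is not an option, an application can check addresses against such a list itself. This is only a sketch under a stated assumption: it expects the list to have been obtained in a form with one CIDR network per line, which may not match the format you download. The helper names load_networks and is_blocked are hypothetical, and the sample uses reserved documentation ranges rather than real list entries.

```python
import ipaddress


def load_networks(text: str) -> list:
    """Parse one CIDR network per line, ignoring blank lines and # comments."""
    networks = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            # strict=False tolerates entries whose host bits are set
            networks.append(ipaddress.ip_network(line, strict=False))
    return networks


def is_blocked(address: str, networks: list) -> bool:
    """Return True if the address falls inside any listed network."""
    ip = ipaddress.ip_address(address)
    return any(ip.version == net.version and ip in net for net in networks)


# Illustrative sample built from documentation-only address ranges.
SAMPLE = """\
# assumed format: one network per line
192.0.2.0/24
198.51.100.0/24
"""

networks = load_networks(SAMPLE)
print(is_blocked("192.0.2.77", networks))   # True
print(is_blocked("203.0.113.5", networks))  # False
```

A linear scan like this is fine for small lists; very large lists would probably call for a sorted or tree-based lookup instead.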