A corpus of over 520 million words which consists of a massive cross-section of the english language between 1990 and 2015. This corpus is used for NLP study, AI training, and lingustic analysis. There's an online service, you can download various forms of it, and you can add to it if you have access.
Free datasets made available by Amazon. Stuff like an atlas of the Galactic Plane, NASA NEX data, the Human Microbiome Project, the Enron emails, Freebase, and the Marvel Universe's socialgraph. The Google Books Ngrams corpus is in here, also, alongside the Westbury Lab USENET corpus.
A massive corpus of annotated and tagged ngrams for use in machine learning and NLP. Free to download. It'll take time to grab it all because it's so finely split up. Creative Commons licensed (BY-NC-SA v3). Relationships between parts of speech and words are broken out, also.
A public domain collection of corpora for training AI ML bots. Consists of many YAML files containing key/value data on many different subjects. Each category contains multiple documents about different related subjects. You won't be able to drop these into your code randomly, you'll need to write a fairly simple parser tuned to the document's schema. There are several libraries in different programming languages for efficiently using one or more of these files in your own project.
Mozilla's open source speech recognition project. They're asking people to contribute samples of themselves speaking sentences on the screen to grow their corpus.
Documentation for the Chatterbot corpus file format.
Much more up to date archive of reddit comments, updated daily as well as monthly. No rss feed but I should be able to create one someplace.
3750 links, including 200 private