Free datasets made available by Amazon, including an atlas of the Galactic Plane, NASA NEX data, the Human Microbiome Project, the Enron emails, Freebase, and the Marvel Universe social graph. The Google Books Ngrams corpus and the Westbury Lab USENET corpus are in here, too.
A massive corpus of annotated and tagged ngrams for use in machine learning and NLP. It's free to download, though grabbing all of it takes a while because it is split into so many small pieces. Licensed under Creative Commons BY-NC-SA v3. Relationships between words and parts of speech are broken out as well.