EleutherAI is a grassroots AI research group that aims to democratize and open-source AI research. Multiple projects and usable training corpora, plus a F/OSS model called GPT-Neo.
Several spinoff projects to investigate.
A public domain collection of corpora for training AI/ML bots. Consists of many YAML files containing key/value data on many different subjects. Each category contains multiple documents about different related subjects. You won't be able to drop these into your code as-is; you'll need to write a fairly simple parser tuned to each document's schema. There are several libraries in different programming languages for efficiently using one or more of these files in your own project.
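A rough sketch of what such a schema-tuned parser might look like, in Python with only the standard library. The schema here is an assumption: it pretends a document is nothing but flat `key: value` lines, and the function name `parse_flat_pairs` and the sample fields (`subject`, `question`, `answer`) are made up for illustration, not taken from the collection. Files with nesting or lists would need a real YAML library instead.

```python
from typing import Dict, Iterable


def parse_flat_pairs(lines: Iterable[str]) -> Dict[str, str]:
    """Parse flat `key: value` lines into a dict (hypothetical schema)."""
    pairs: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments, and YAML document markers.
        if not line or line.startswith("#") or line == "---":
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key/value line: {raw!r}")
        # Drop surrounding quotes that YAML allows around scalars.
        pairs[key.strip()] = value.strip().strip("\"'")
    return pairs


# Hypothetical document, not copied from the real collection.
sample = """\
# example entry
---
subject: greetings
question: "How are you?"
answer: I am fine.
"""

record = parse_flat_pairs(sample.splitlines())
print(record["question"])
```

Raising on any line that doesn't fit the assumed shape is deliberate: if a file turns out to use a different schema, it fails loudly instead of silently producing partial data.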