This list is inspired by awesome public datasets, but covers real-time datasets and sources, which are normally accessed via HTTP or WebSockets.
The list is separated into Free and Paid and broken into subsections based on loose categories.
HaveIBeenTrained uses clip retrieval to search the Laion-5B and Laion-400M image datasets. These are currently the largest public text-to-image datasets, and they are used to train models such as Stable Diffusion and Imagen, among many others.
When it's time to train a generative AI system, organizations like Stability use the links in those datasets to download the images and present them to the model along with their captions.
With HaveIBeenTrained, artists can search these databases for links to their work and flag them for removal. We partner with Laion, who built these datasets, to remove those links. This helps ensure that future models will not be trained with work that has been opted out.
Useful resources for using IPFS and building things on top of it.
A repository for monitoring attack vectors mentioned in the billion-dollar disinformation campaign to reelect the president in 2020. Includes some Python code for analyzing the data.
A topic-centric list of high-quality open datasets in public domains. By everyone, for everyone!
This website generates random JSON documents, suitable for use as test data or learning how to write and interface with various APIs.
Free datasets made available by Amazon. Stuff like an atlas of the Galactic Plane, NASA NEX data, the Human Microbiome Project, the Enron emails, Freebase, and the Marvel Universe's social graph. The Google Books Ngrams corpus is also in here, alongside the Westbury Lab USENET corpus.
An open-source tool for the visualization of extremely large datasets, like Twitter maps or email databases.
The datarefuge website. Probably as official as it's going to get. Has some useful definitions, at least. Also has a bunch of rescued datasets available for download.
A wiki of resources for people writing bots: actual bots to interact with, tools, tutorials, code, and datasets.
A social network and archive of public and open data for research and study. Encourages people to upload their own datasets for others to use. I use GitHub to authenticate.
Vast collections of data suitable for training and teaching AI/ML software.