18 results tagged datasets
Webhose/fake-news-dataset https://github.com/Webhose/fake-news-dataset
Wed 26 Feb 2025 03:04:27 PM PST archive.org

This repository, created by Webz.io, provides free datasets of publicly available news articles identified as originating from fake news websites. We release new datasets weekly, each containing approximately 1,000 articles sourced from these flagged sites. These datasets provide valuable resources for researchers, analysts, and journalists studying misinformation and disinformation trends.

Updated weekly. Sites in the datasets are verified through multiple sources. Metadata, including sentiment analysis, categories, publication dates, and source trust level is included. Covers politics, health, finance, and other key domains where misinformation is prevalent.

Each dataset is a .zip archive containing multiple JSON documents.
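
A minimal Python sketch of reading such an archive, assuming only what the description says (a .zip holding several JSON documents). The function name is hypothetical, and nothing here is taken from the repository's own tooling or schema.

```python
import json
import zipfile


def load_json_documents(zip_source):
    """Return (member_name, parsed_json) pairs for every .json member of a zip.

    zip_source may be a file path or a binary file-like object.
    """
    documents = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".json"):
                continue  # skip directories and non-JSON members
            with archive.open(name) as handle:
                documents.append((name, json.load(handle)))
    return documents
```

Calling load_json_documents("dataset.zip") on a downloaded archive (the file name is a placeholder) returns the parsed documents in archive order.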

datasets fakenews websites campaigns operations politics finance news json
SciOp: Collecting at-risk data in torrent rss feeds https://sciop.net/
Sun 23 Feb 2025 09:23:42 PM PST archive.org

SciOp is part of Safeguarding Research & Culture (SRC).
Using RSS feeds full of vetted torrents, we can ensure that our cultural, intellectual and scientific heritage exists in multiple copies, in multiple places, and that no single entity or group of entities can make it all disappear.

Has a page of datasets it's protecting and a large number of RSS feeds tracking the health of other datasets (i.e., whether they have disappeared).

Git: https://codeberg.org/Safeguarding/sciop
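
A rough Python sketch of pulling torrent links out of an RSS feed. It assumes a plain RSS 2.0 layout in which each item carries its torrent in an enclosure element; that layout, the sample feed, and the function name are all assumptions for illustration and have not been checked against SciOp's actual feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; real feeds may use different elements.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example torrent feed</title>
    <item>
      <title>example-dataset-2025</title>
      <enclosure url="https://example.org/example-dataset-2025.torrent"
                 type="application/x-bittorrent" length="12345"/>
    </item>
    <item>
      <title>item-without-enclosure</title>
    </item>
  </channel>
</rss>
"""


def list_torrents(feed_xml):
    """Return (title, torrent_url) pairs for items that have an enclosure."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # nothing to download for this item
        title = item.findtext("title", default="")
        results.append((title, enclosure.get("url")))
    return results
```

The returned URLs could then be handed to any torrent client to keep another copy seeded.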

python archival rss opendata uspol censorship datasets torrents
publicdata/us-national-archives-and-publications https://git.lsit.ucsb.edu/publicdata/us-national-archives-and-publications
Sun 23 Feb 2025 09:22:24 PM PST archive.org

Public data dumps from US National Archives and US Government Publishing Office (https://www.archives.gov/). Metadata is stored in each data.json file.

The datasets themselves are stored with git-lfs, so be sure you have it installed before you clone the repo.
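
A small Python sketch of that pre-clone check, assuming the git-lfs extension is installed as an executable named git-lfs on the PATH (its usual name). The function name is hypothetical, and the check only looks for the binary; it does not verify that LFS is configured for your git.

```python
import shutil


def git_lfs_available():
    """Return True if a git-lfs executable can be found on the PATH."""
    return shutil.which("git-lfs") is not None


if __name__ == "__main__":
    if git_lfs_available():
        print("git-lfs found; cloning should fetch the dataset files.")
    else:
        print("git-lfs not found; install it before cloning.")
```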

opendata archive uspol censorship datasets
Data Hoarding https://datahoarding.org/
Fri 07 Feb 2025 04:09:43 PM PST archive.org

DataHoarding.org is an index of resources and archives related to data hoarding, web archival and self-hosting. It was inspired by the recent purge of online information by government agencies, corporations and others, and aims to provide easier access to tools and information. The goal is not only to hoard data, but to parse and index it as well.

Among the mirrors they have:

  • FEMA
  • kernel.org
  • Kiwix
  • NOAA
  • textfiles.com
  • YouTube

archives opendata government censorship datasets politics directories
Github: MassMove https://github.com/MassMove
Fri 02 Aug 2024 12:57:53 PM PDT archive.org

MassMove is a group effort to sway public opinion towards the interests of the masses. There are multiple repositories in multiple languages.

politics tools advertising analytics fakenews datasets
Our World in Data https://ourworldindata.org/
Thu 29 Feb 2024 07:57:38 PM PST archive.org

To make progress against the pressing problems the world faces, we need to be informed by the best research and data. Our World in Data makes this knowledge accessible and understandable, to empower those working to build a better world.

opendata datasets society medicine disease finance energy education statistics
Aleph: Find public records and leaks. https://search.burojansen.nl/
Tue 09 Jan 2024 09:55:32 PM PST archive.org

Aleph is a powerful tool for people who follow the money. It helps investigators to securely access and search large amounts of data - no matter whether they are a government database or a leaked email archive.

Requires a (free?) account?

Consider adding to Searx?

searchengine people money companies organizations datasets leaks investigation littlesister ironmonger
bytewax/awesome-public-real-time-datasets https://github.com/bytewax/awesome-public-real-time-datasets
Tue 25 Jul 2023 02:36:01 PM PDT archive.org

This list is inspired by awesome public datasets, but for real-time datasets and sources. Normally accessed via HTTP or Websockets.

The list is separated into Free and Paid and broken into subsections based on loose categories.

awesome datasets directory finance scheduling data rest api
Have I Been Trained? https://haveibeentrained.com/
Fri 09 Dec 2022 02:03:27 PM PST archive.org

HaveIBeenTrained uses clip retrieval to search the Laion-5B and Laion-400M image datasets. These are currently the largest public text-to-image datasets, and they are used to train models like Stable Diffusion and Imagen, among many others.

When it's time to train a generative AI system, organizations like Stability use those datasets to download the images from their links and present them to the model with their captions.

With HaveIBeenTrained, artists can search these databases for links to their work and flag them for removal. We partner with Laion, who built these datasets, to remove those links. This helps ensure that future models will not be trained with work that has been opted out.

ai ml images search datasets optout
GitHub - ipfs/awesome-ipfs https://github.com/ipfs/awesome-ipfs
Sat 24 Oct 2020 06:16:14 PM PDT archive.org

Useful resources for using IPFS and building things on top of it.

awesome list ipfs applications articles datasets archives tools
MassMove/AttackVectors https://github.com/MassMove/AttackVectors
Tue 10 Mar 2020 01:41:22 PM PDT archive.org

A repository for monitoring attack vectors mentioned in the billion-dollar disinformation campaign to reelect the president in 2020. Includes some Python code for analyzing the data.

lists fakenews directory research domains datasets metrics socialnetworks socialengineering exocortex edison
awesomedata/awesome-public-datasets https://github.com/awesomedata/awesome-public-datasets
Sat 31 Mar 2018 08:15:02 PM PDT archive.org

A topic-centric list of high-quality open datasets in public domains. By everyone, for everyone!

awesome github data datasets research
JSON Generator – Tool for generating random data http://www.json-generator.com/
Tue 20 Mar 2018 12:20:21 AM PDT archive.org

This website generates random JSON documents, suitable for use as test data or learning how to write and interface with various APIs.
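
For comparison, a standard-library Python sketch of the same idea: producing random JSON documents to use as test data. This is not how the website works; the field names and function names are made up for illustration.

```python
import json
import random
import string


def random_record(rng):
    """Build one record with hypothetical fields and random values."""
    return {
        "id": rng.randint(1, 10_000),
        "name": "".join(rng.choice(string.ascii_lowercase) for _ in range(8)),
        "active": rng.choice([True, False]),
        "score": round(rng.uniform(0, 100), 2),
    }


def random_json(count, seed=None):
    """Return a JSON array of `count` random records as a string.

    Passing a seed makes the output reproducible across runs.
    """
    rng = random.Random(seed)
    return json.dumps([random_record(rng) for _ in range(count)], indent=2)
```

Seeding the generator is useful when a test needs the same "random" fixture every time.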

development datasets generator javascript utilities random json testing online
AWS Public Datasets https://aws.amazon.com/datasets/
Tue 20 Mar 2018 12:10:32 AM PDT archive.org

Free datasets made available by Amazon. Stuff like an atlas of the Galactic Plane, NASA NEX data, the Human Microbiome Project, the Enron emails, Freebase, and the Marvel Universe's social graph. The Google Books Ngrams corpus is also here, alongside the Westbury Lab USENET corpus.

information microbiome datasets ai ml ngrams freebase socialgraph free exocortex google usenet nasa corpus marvel data enron
Mecodify | MeCoDEM http://www.mecodem.eu/mecodify/
Tue 20 Mar 2018 12:01:55 AM PDT archive.org

An open source tool for the visualization of extremely large datasets, like Twitter maps or email databases.

tables datasets maps tools exocortex graphs visualization networkanalysis databases
Welcome - Data Refuge https://www.datarefuge.org/
Mon 19 Mar 2018 11:50:03 PM PDT archive.org

The datarefuge website. Probably as official as it's going to get. Has some useful definitions, at least. Also has a bunch of rescued datasets available for download.

website datasets opendata download definitions datarefuge data
Home | data.world https://data.world/
Mon 19 Mar 2018 10:39:00 PM PDT archive.org

Social network and archive of public and open data for research and study. Encourages people to upload their own datasets for others to use. I use GitHub to authenticate.

datasets data public socialnetworks open archives
UCI Machine Learning Repository https://archive.ics.uci.edu/ml/index.php
Mon 19 Mar 2018 10:38:03 PM PDT archive.org

Vast collections of data suitable for training and teaching AI/ML software.

datasets ai ml free research download data archives