8 results tagged corpus
LOCO: the 88-million word language of conspiracy corpus https://osf.io/snpcg/
Wed 16 Mar 2022 03:31:17 PM PDT archive.org
dataset corpus conspiracies linguistics nlp
Corpus of Contemporary American English (COCA) http://corpus.byu.edu/coca/
Tue 20 Mar 2018 12:10:35 AM PDT archive.org

A corpus of over 520 million words drawn from a massive cross-section of the English language between 1990 and 2015. This corpus is used for NLP study, AI training, and linguistic analysis. There's an online service, you can download various forms of it, and you can add to it if you have access.

nlp lingustic language ai online construct exocortex service english download corpus dixieflatline betafork
AWS Public Datasets https://aws.amazon.com/datasets/
Tue 20 Mar 2018 12:10:32 AM PDT archive.org

Free datasets made available by Amazon. Stuff like an atlas of the Galactic Plane, NASA NEX data, the Human Microbiome Project, the Enron emails, Freebase, and the Marvel Universe's social graph. The Google Books Ngrams corpus is in here too, alongside the Westbury Lab USENET corpus.

information microbiome datasets ai ml ngrams freebase socialgraph free exocortex google usenet nasa corpus marvel data enron
Google Books Ngram Datasets https://storage.googleapis.com/books/syntactic-ngrams/index.html
Tue 20 Mar 2018 12:10:26 AM PDT archive.org

A massive corpus of annotated and tagged ngrams for use in machine learning and NLP. Free to download. It'll take time to grab it all because it's split into so many small files. Creative Commons licensed (BY-NC-SA v3). Relationships between parts of speech and words are also broken out.

nlp machinelearning ml ngrams download creativecommons corpus
dariusk/corpora: A collection of small corpuses of interesting data for the creation of bots and similar stuff. https://github.com/dariusk/corpora
Mon 19 Mar 2018 10:43:57 PM PDT archive.org

A public domain collection of corpora for training AI/ML bots. Consists of many JSON files containing key/value data on many different subjects. Each category contains multiple documents about different related subjects. You won't be able to drop these into your code as-is; you'll need to write a fairly simple parser tuned to each document's schema. There are several libraries in different programming languages for efficiently using one or more of these files in your own project.

nlp publicdomain nlu ai ml corpora languages yaml keyvalue pd bots foss data corpus schema text
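
The "fairly simple parser tuned to the document's schema" mentioned for dariusk/corpora could look something like the sketch below. It assumes a file is JSON with a top-level object holding a description and one list of items. The sample document, the key name "colors", and the helper load_items are all hypothetical and only stand in for a real file, whose keys will differ.

```python
import json

# Hypothetical sample standing in for one file from the collection.
# Real files use their own key names, so check each file's schema first.
SAMPLE = """
{
  "description": "A short list of example colors",
  "colors": ["red", "green", "blue"]
}
"""


def load_items(text, key):
    """Return the list stored under `key`, failing loudly if the schema differs."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        raise ValueError("expected a top-level JSON object")
    items = doc.get(key)
    if not isinstance(items, list):
        raise ValueError(f"expected a list under {key!r}")
    return items


if __name__ == "__main__":
    print(load_items(SAMPLE, "colors"))
```

Passing the key in explicitly keeps the per-file schema knowledge in one place, so a bot can reuse the same helper across files by changing only the key.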
Common Voice https://voice.mozilla.org/
Mon 19 Mar 2018 10:34:05 PM PDT archive.org

Mozilla's open source speech recognition project. They're asking people to contribute recordings of themselves reading on-screen sentences aloud to grow their corpus.

stt foss samples speechtotext corpus data speechrecognition
Data Format — chatterbot-corpus 1.0.0 documentation http://chatterbot-corpus.readthedocs.io/en/latest/data.html
Mon 19 Mar 2018 10:25:22 PM PDT archive.org

Documentation for the Chatterbot corpus file format.

reference format documentation exocortex yaml betafork corpus
Directory Contents http://files.pushshift.io/reddit/comments/
Mon 19 Mar 2018 10:12:03 PM PDT archive.org

A much more up-to-date archive of Reddit comments, updated daily as well as monthly. No RSS feed, but I should be able to create one someplace.

nlp reddit comments corpus data archives