Sonic is a fast, lightweight and schema-less search backend. It ingests search texts and identifier tuples that can then be queried in a matter of microseconds.
Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. It is capable of normalizing natural language search queries, auto-completing a search query and providing the most relevant results for a query. Sonic is an identifier index, rather than a document index; when queried, it returns IDs that can then be used to refer to the matched documents in an external database.
Sonic was designed with close attention to performance and code cleanliness. It aims to be crash-free and super-fast while putting minimal strain on server resources (the project's measurements show that Sonic, when under load, responds to search queries in the μs range, uses ~30MB of RAM and has a low CPU footprint).
Available in Arch as extra/sonic.
Configuration docs: https://github.com/valeriansaliou/sonic/blob/master/CONFIGURATION.md
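The "identifier index rather than document index" idea can be sketched in a few lines of Python. This is a toy illustration only, not Sonic's implementation or its protocol: the class name, the method names and the "article:N" IDs are all made up for the example. It shows that only IDs are stored and returned, so the caller still has to fetch the matched documents from its own database.

```python
from collections import defaultdict


class TinyIdentifierIndex:
    """Toy in-memory identifier index (hypothetical; not Sonic's API)."""

    def __init__(self):
        # term -> set of object IDs whose text contained that term
        self._postings = defaultdict(set)

    def push(self, object_id, text):
        """Index the words of `text` under `object_id`; the text is not kept."""
        for term in text.lower().split():
            self._postings[term].add(object_id)

    def query(self, terms):
        """Return the sorted IDs of objects that contain every query term."""
        words = terms.lower().split()
        if not words:
            return []
        matches = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            matches &= self._postings.get(word, set())
        return sorted(matches)


index = TinyIdentifierIndex()
index.push("article:1", "Fast lightweight search backend")
index.push("article:2", "Heavy full-featured search cluster")

# Only IDs come back; the documents live in an external database.
hits = index.query("search backend")
```

A real backend adds normalization, typo tolerance and suggestions on top of this, but the contract is the same: text goes in with an ID, and IDs come out.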
This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in vector space such that similar texts are close together and can be found efficiently using cosine similarity. We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases. Further, this framework makes it easy to fine-tune custom embedding models to achieve maximal performance on your specific task. CUDA enabled.
Seems to lend itself to research coding. The real winner here is that you can generate embeddings and vectors for arbitrary text, which would make it ideal for writing a utility that could do only this without a lot of heavy lifting.
Comes with pre-trained models for over 100 languages. Has documentation and examples for building your own models.
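To make the "similar texts are close in vector space" point concrete, here is a minimal sketch of cosine-similarity lookup in plain Python. The three-dimensional vectors are invented for illustration and are not output from any real model; in practice the framework would produce vectors with hundreds of dimensions, and the helper names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up "embeddings"; a real model would generate these from the text.
embeddings = {
    "a cat sits on the mat": [0.9, 0.1, 0.0],
    "a kitten rests on the rug": [0.8, 0.2, 0.1],
    "stock prices fell today": [0.0, 0.1, 0.9],
}


def most_similar(query_vec, corpus):
    """Return the corpus text whose vector is closest to `query_vec`."""
    return max(corpus, key=lambda text: cosine_similarity(query_vec, corpus[text]))


query_vec = [1.0, 0.0, 0.0]  # hypothetical embedding of a cat-related query
best = most_similar(query_vec, embeddings)
```

A utility built on the real library would swap the hard-coded dictionary for model-generated vectors and keep the same comparison step.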
FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices.
Pre-trained word vectors can be downloaded.
Uses sdxl-emoji to turn natural language descriptions into custom emoji.
Github: https://github.com/cbh123/emoji
Free and Open Source Machine Translation API, entirely self-hosted. Unlike other APIs, it doesn't rely on proprietary providers such as Google or Azure to perform translations. Instead, its translation engine is powered by the open source Argos Translate library.
Supports per-user rate limits: you can issue API keys so that specific users get higher request limits per minute. By default all users are rate-limited according to --req-limit, but passing the optional api_key parameter to the REST endpoints gives that user a higher limit. To use API keys, start LibreTranslate with the --api-keys option.
There are also F/OSS mobile clients for Android and browser plugins.
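A rough sketch of how a client might pass the optional api_key parameter, using only the Python standard library. Assumptions to check against your own instance: the server is reachable at http://localhost:5000, the endpoint is /translate, and it accepts a JSON body with q, source, target and format fields. The key value and the helper name are placeholders. The request is only built here, not sent.

```python
import json
import urllib.request


def build_translate_request(base_url, text, source, target, api_key=None):
    """Build (but do not send) a POST request for a LibreTranslate-style API."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    if api_key is not None:
        # Optional: identifies the caller so the server can apply that key's limit.
        payload["api_key"] = api_key
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "example-key" is a placeholder, not a real key.
request = build_translate_request(
    "http://localhost:5000", "Hello", "en", "es", api_key="example-key"
)
```

Sending it is a matter of passing `request` to `urllib.request.urlopen` and decoding the JSON response; leaving out `api_key` should fall back to the default rate limit.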
A curated list of delightful Conversational AI resources.
Reddit Persona is a Python module that extracts personality insights, sentiment and interests from a user account. Subreddit analysis is currently not working because of the praw update from v3 to v5 (fix incoming).
Text is collected via Reddit's Python API wrapper, praw, and NLP is powered by the indico.io API.
Intellexer™ is a linguistic platform developed by EffectiveSoft.
Our API and SDK incorporate powerful linguistic tools for analyzing text in natural language. We encourage both developers and integrators to use them for improving existing or creating new Document/Knowledge management systems.
Our API and SDK provide effective capabilities for the development of various semantics-based solutions. The solutions can vary in the number and algorithmic complexity of the linguistic instruments used, depending on the customer's needs.
Free API key.
High-performance NLP models as a service. Pre-trained. You can upload and run your own spaCy models as well. Seems to be GPU-accelerated on the back end because they're an NVIDIA partner.
Named entity recognition, classification, summarization, question in context answering, sentiment analysis, part of speech tagging.
Free tier: All pre-trained models, 3 API requests per minute.
Starter tier: All pre-trained models, 15 API requests per minute, $39 USD/month.
Lingua Franca is our multilingual Natural Language Processing library. It allows Mycroft to both understand and respond with naturally expressed entities such as numbers, dates and times. Stand-alone Python module. Ready-to-use and currently has support for Danish, Dutch, English, French, German, Hungarian, Italian, Portuguese, Spanish, and Swedish. Heuristic parsing routines to extract numbers, dates, times, or durations from a spoken language transcription. Natural language formatters for numbers, dates, times and durations as well as utilities for working with lists in multiple languages. Can reformat figures so they can be better pronounced by a synthesizer. Extract information from text to use in figuring out what the user wants and grab the stuff needed to do it.
Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library. With the fundamentals --- tokenization, part-of-speech tagging, dependency parsing, etc. --- delegated to another library, textacy focuses primarily on the tasks that come before and follow after. Abstracts away the boilerplate for the stuff you actually care about.
Quickstart: https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html
A bot implemented as a Github App which analyzes the interactions a user has had elsewhere on Github and uses sentiment analysis to figure out how toxic the user is likely to be in their interactions with your project.
Uses the Probot framework.
A F/OSS natural language translation system that seems to want to give Google Translate a run for its money. The corpora used for training appear to be crowdsourced, and I think you can download the trained models on their own. Aims to be self-hosted.
Github: https://github.com/apertium
Installation docs: http://wiki.apertium.org/wiki/Installation
An NLP deep learning toolkit for building training pipelines. Tries to minimize the effort for constructing the training and inference stages. Defines modular building blocks of neural network components, and a suite of NLP models. The end goal is to make building a neural network as easy as playing with Legos. Supports English and Chinese.
A deep learning NLP modeling framework based on PyTorch. Text classifiers, sequence taggers, joint intent-slot models.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is fully open-sourced under the MIT License. Incorporated into NLTK.
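For a feel of what "lexicon and rule-based" means, here is a deliberately tiny sketch. It is not VADER's code: the word scores, the booster amounts and the sign-flipping negation rule are all invented for illustration, whereas the real tool ships a large human-rated lexicon and more nuanced rules.

```python
# Made-up valence scores; NOT taken from VADER's lexicon.
LEXICON = {"good": 1.9, "great": 3.1, "love": 3.2, "bad": -2.5, "terrible": -2.1}
NEGATIONS = {"not", "never", "no"}
BOOSTERS = {"very": 0.3, "really": 0.3}  # illustrative intensifier amounts


def score(text):
    """Sum word valences, applying toy intensifier and negation rules."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        valence = LEXICON[token]
        # Rule 1: an intensifier right before the word strengthens it.
        if i > 0 and tokens[i - 1] in BOOSTERS:
            boost = BOOSTERS[tokens[i - 1]]
            valence += boost if valence > 0 else -boost
        # Rule 2: a negation within the two preceding tokens flips the sign
        # (a simplification chosen for this sketch).
        if any(word in NEGATIONS for word in tokens[max(0, i - 2):i]):
            valence = -valence
        total += valence
    return total


positive = score("This is great!")
negated = score("This is not good")
```

For real work, the version bundled with NLTK is the one to reach for; the sketch only shows why no model training is involved.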
spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pre-trained statistical models and word vectors, and currently supports tokenization for 45+ languages. It features the fastest syntactic parser in the world, convolutional neural network models for tagging, parsing and named entity recognition and easy deep learning integration. It's commercial open-source software, released under the MIT license.