This is a playground for learning and testing CSS selectors visually. Start by selecting a playground, or begin with a random selector.
Web crawlers and bots often identify themselves in the user-agent string. It turns out that, up until now, a huge majority of my bandwidth usage has come from bots scraping my site thousands of times a day.
A robots.txt file can advertise that you don't want bots to crawl your site. But it's completely voluntary—a bot may happily ignore it and scrape your site anyway. And I'm fine with webcrawlers indexing my site, so that it might be more discoverable. It's the bandwidth hogs that I want to block.
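As a sketch, a robots.txt can single out specific crawlers while leaving the rest alone; the bot name below is purely illustrative, and cooperative bots match these tokens against their own user-agent string:

```
# Ask a (hypothetical) bandwidth-hogging crawler to stay away entirely
User-agent: ExampleHeavyBot
Disallow: /

# Let everything else crawl the whole site
User-agent: *
Disallow:
```

Since compliance is voluntary, actually blocking a bot has to happen server-side, e.g. by matching the user-agent header and returning an error status.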
A blog post detailing OpenAI's IP ranges and suggestions for blocking them.
Carbonyl is a Chromium-based browser built to run in a terminal. It supports pretty much all Web APIs including WebGL, WebGPU, audio and video playback, animations, etc. It's snappy, starts in less than a second, runs at 60 FPS, and idles at 0% CPU usage. It does not require a window server (i.e. works in a safe-mode console), and even runs through SSH. Carbonyl originally started as html2svg and is now the runtime behind it.
pyquery allows you to make jQuery-style queries on XML documents. The API is as similar to jQuery's as possible. pyquery uses lxml for fast XML and HTML manipulation.
This is not (or at least not yet) a library to produce or interact with JavaScript code. The author just liked the jQuery API and missed it in Python, so he told himself "Hey, let's make jQuery in Python." This is the result.
A list of (almost) all headless web browsers in existence.
A collection of applications able to interact with websites without requiring the user to open them in a browser. It also provides well-defined APIs for talking to websites that lack one, automating access to and extraction of data from sites that don't make it easy or possible. Has applications for adding accessibility to sites that are unfriendly to the visually impaired. Tries to focus on quality of results. Multiple interfaces.
This code demonstrates how to scrape the Doomsday Clock to get the current value. It has a CSS selector, a source, and a regular expression to extract the current time.
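As a sketch of the regex step, run against a made-up snippet of markup (the real page's structure and class names may differ):

```python
import re

# Hypothetical sample of the announcement markup
sample = '<div class="clock">It is 89 seconds to midnight</div>'

# Extract the current time-to-midnight; a real scraper would first
# fetch the page and narrow it down with the CSS selector
match = re.search(r'(\d+)\s+(second|minute)s?\s+to\s+midnight', sample)
value = int(match.group(1))  # 89
unit = match.group(2)        # "second"
```

Pairing a CSS selector (to isolate the right element) with a regex (to pull the number out of its text) keeps the pattern short and tolerant of surrounding markup changes.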
A Python module that tries to make parsing HTML as easy as Requests makes HTTP requests. Written by the same developer, in fact, and built on top of Requests, so you don't have to juggle both. Python 3.6 and later only. Full JavaScript support(!), CSS selectors, XPath selectors, user-agent spoofing, automatic redirects.
SelectorGadget is a bookmarklet (which works on pretty much any browser) or a Chrome Plugin that makes it easy to generate a CSS selector for writing CSS or scraping web pages to pick out only the bits you're interested in. Click on the part of the page you want, then click on a part you don't want (if there is one). Lets you pick out multiple bits and stack them into a single selector.
A Python module (Python 3 only; Python 2 support has been dropped) that tries to be the Requests of HTML scraping. Designed with news sites in mind: it picks articles out of websites and extracts author names, publication dates, article text, image URLs, and any embedded media. Also does keyword analysis and other NLP, URL extraction, and category detection, with i18n support.
Documentation here: https://newspaper.readthedocs.io/en/latest/
An excellent summary of CSS selectors!