Eric A. Meyer has been working with the web since late 1993 and is an internationally recognized expert on the subjects of HTML, CSS, and web standards. A widely read author, he was technical lead at Rebecca’s Gift, a 501(c)(3) non-profit organization dedicated to providing healing family vacations after the death of a child; and was, along with Jeffrey Zeldman, co-founder of the web conference series An Event Apart (2005–2021).
RSS: https://meyerweb.com/eric/thoughts/feed/
ATOM: https://meyerweb.com/eric/thoughts/feed/atom/
A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Supports both headful and headless modes, with proxy rotation.
Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.
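Crawlee's own API is best taken from its documentation; as a stdlib-only sketch of the most basic task a crawler automates (pulling links out of fetched HTML), something like this works (the HTML here is inlined, where a real crawler would download it):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In a real crawler this HTML would come from an HTTP response.
html = '<p><a href="/about">About</a> <a href="https://example.com">Ext</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/about', 'https://example.com']
```

A framework like Crawlee layers queueing, retries, storage, and bot-evasion on top of this kind of extraction loop.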
The web version of the classic Linux productivity tool xroach. Cockroaches will scamper around the page until they find an image to hide behind.
I'm pushing these thoughts into the public sphere for two primary reasons:
(1) To facilitate a discussion around web search in which I can learn from others. I believe what I write here has solid merits but I also believe that we do our best work when we are challenged, encouraged, and refocused by others.
(2) To create a community of individuals who are interested in the future of web search. Particularly individuals who are interested in actively participating in this future.
Note that you needn't be part of the second for me to value your input on the first. I don't want to miss out on wisdom from those who have other commitments/priorities than this project.
A webpage bookmarking and snapshotting service.
Omnom consists of two parts: a multi-user web application that accepts bookmarks and snapshots, and a browser extension responsible for bookmark and snapshot creation.
Omnom is a rebooted implementation of @stef's original omnom project; big thanks for the original work.
Library of Alexandria (LoA in short) is a project that aims to collect and archive documents from the internet.
In our modern age, new text documents are born in the blink of an eye, then (often just as quickly) disappear from the internet. We find it a noble task to save these documents for future generations.
This project aims to support this noble goal in a scalable way. We want to make the archival activity streamlined and easy to do even at a huge (terabyte/petabyte) scale. This way, we hope more and more people can start their own collections and help the archiving effort.
tinysearch is a lightweight, fast, full-text search engine. It is designed for static websites.
tinysearch is written in Rust, and then compiled to WebAssembly to run in a browser. It can be used together with static site generators such as Jekyll, Hugo, Zola, Cobalt, or Pelican.
The test index file of my blog with around 40 posts creates a WASM payload of 99kB (49kB gzipped, 40kB brotli).
tinysearch only finds entire words. As a consequence, there are no search suggestions (yet). This is a necessary tradeoff for reducing memory usage: a trie data structure was about 10x bigger than the xor filters. New research on compact data structures for prefix searches might lift this limitation in the future.
Since all search indices for all articles are bundled into one static binary, we recommend using it only for small- to medium-sized websites. Expect around 2 kB uncompressed per article (~1 kB compressed).
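The index is built from a JSON file describing your posts. To the best of my recollection (verify against the tinysearch README), each entry carries "title", "url", and "body" fields; a hypothetical generator:

```python
import json

# Hypothetical posts; tinysearch's README describes an index of
# objects with "title", "url", and "body" fields (the schema here
# is from memory -- check the current docs before relying on it).
posts = [
    {"title": "Hello World", "url": "/posts/hello", "body": "First post."},
    {"title": "Rust and WASM", "url": "/posts/wasm", "body": "Compiling to WebAssembly."},
]

with open("index.json", "w") as f:
    json.dump(posts, f)
```

You would then feed index.json to the tinysearch CLI, which emits the WASM payload; see its README for the exact flags.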
WarcDB is an SQLite-based file format that makes web crawl data easier to share and query. It is based on the standardized Web ARChive (WARC) format used by web archivers.
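Because it is plain SQLite, any sqlite3 client can query a WarcDB file. The table and column names below are stand-ins (inspect the real schema with `.schema` in the sqlite3 shell); the sketch builds a tiny mock table just to show the query shape:

```python
import sqlite3

# Open a WarcDB file like any other SQLite database. "response",
# "target_uri", and "payload" are illustrative names, not WarcDB's
# actual schema; this demo uses an in-memory stand-in table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE response (target_uri TEXT, payload BLOB)")
con.execute("INSERT INTO response VALUES ('https://example.com/', X'48690A')")

rows = con.execute(
    "SELECT target_uri, length(payload) FROM response"
).fetchall()
print(rows)  # [('https://example.com/', 3)]
```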
An online three-card tarot draw web page. Plain JavaScript; no Node.js required.
Online demo: https://lmorchard.github.io/tarot-thing/
Pyodide is a Python distribution for the browser and Node.js based on WebAssembly. It makes it possible to install and run Python packages in the browser with micropip; any pure Python package with a wheel available on PyPI is supported, and many packages with C extensions have also been ported for use with Pyodide. Comes with a robust JavaScript ⟺ Python foreign function interface so that you can freely mix these two languages in your code with minimal friction. This includes full support for error handling (throw an error in one language, catch it in the other), async/await, and much more.
When used inside a browser, Python has full access to the Web APIs.
GitHub: https://github.com/pyodide/pyodide
Online REPL console: https://pyodide.org/en/stable/console.html
uBlacklist subscription list for developers.
Subscribe this list to block useless websites from Google Search results, such as machine-translated Stack Overflow clones.
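For context, a subscription list is a plain-text file of rules, one per line; to my recollection uBlacklist accepts browser-style match patterns, and lines starting with `#` are comments (the syntax here is from memory; check the uBlacklist docs before relying on it):

```
# Hypothetical rules blocking machine-translated Stack Overflow mirrors
*://*.example-so-clone.com/*
*://translated-docs.example.net/*
```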
Transforms tkinter, Qt, Remi, and WxPython into portable, people-friendly, Pythonic interfaces, which helps especially if you primarily build CLI tools. Tries to make building GUIs for applications easy, because ordinarily the process sucks. Supports several toolkits, including Qt, WxPython, and Remi (if you want to turn something into a webapp), and you can switch between them with a single line. No callback functions; that's all handled for you. Has a built-in debugger.
A company that does web scraping for you. Automatic retries, lots of proxies, geolocation, CAPTCHA bypass (eh?), JavaScript support. Has a library of scrapers for different online services.
The free tier includes only 1,000 API calls. Multiple paid tiers of features.
CORS (Cross-Origin Resource Sharing) is hard. It's hard because it's part of how browsers fetch stuff, and that's a set of behaviours that started with the very first web browser over thirty years ago. Since then, it has seen constant development: adding features, improving defaults, and papering over past mistakes without breaking too much of the web.
Anyway, I figured I'd write down pretty much everything I know about CORS, and to make things interactive, I built an exciting new app.
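Before clicking through, it helps to remember what the server's half of CORS looks like: a response header opting a resource in to cross-origin reads. A minimal stdlib Python sketch (the allowed origin is illustrative):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class CORSHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Without this header, a browser on another origin could still
        # send the request but would be blocked from reading the reply.
        # "https://app.example" is an illustrative origin.
        self.send_header("Access-Control-Allow-Origin", "https://app.example")
        self.end_headers()
        self.wfile.write(b'{"ok": true}')

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), CORSHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/") as resp:
    allow = resp.headers["Access-Control-Allow-Origin"]
print(allow)  # https://app.example
server.shutdown()
```

Preflighted requests (OPTIONS, `Access-Control-Allow-Methods`, and friends) add further headers on top of this, which is where the app linked above earns its keep.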
Raindrop.io is the best place to keep all your favorite books, songs, articles or whatever else you come across while browsing.
We're not trying to reinvent the wheel; we're working on a tool that does everything you expect from a modern bookmark manager.
Collections of links. Folksonomy tags. Filters. Finds duplicates and broken links for you. Full text search. Automatically makes copies of every page you bookmark to prevent link rot.
Unlimited bookmarks, collections, and devices indefinitely at the free level. Additional features (probably collaboration) at paid tiers.
BadWolf is a minimalist and privacy-oriented WebKitGTK+ browser.
Privacy-oriented - No browser-level tracking, each new unrelated tab gets its own ephemeral isolated session, JavaScript off by default.
Minimalist - Small codebase (~1,500 LoC), reuses existing components when available or makes them available.
Customizable - WebKitGTK native extensions; interface customizable through CSS.
Powerful & Usable - Stable user interface; the common shortcuts are available, and no vi-style modal editing or single-key shortcuts are used.
No annoyances - Dialogs are only used when required (save file, print, …), and JavaScript popups open in a background tab.
Git repo: https://hacktivis.me/git/badwolf/
In the AUR.
A multithreaded hyperlink checker that crawls a site and looks for 404s. Unfortunately, it is no longer maintained and is written in Python 2. Still useful.
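The core of such a tool is small enough to sketch with the stdlib alone. Assuming nothing about the original's internals, a toy threaded checker might look like this (the throwaway local server just gives it something to hit):

```python
import threading
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy server standing in for a real site: /ok exists, everything else 404s.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200 if self.path == "/ok" else 404)
        self.end_headers()

    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

def check(url):
    """Return (url, HTTP status); an HTTPError carries the status too."""
    try:
        with urllib.request.urlopen(url) as resp:
            return url, resp.status
    except urllib.error.HTTPError as e:
        return url, e.code

urls = [f"{base}/ok", f"{base}/missing"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(check, urls))

broken = [u for u, status in results.items() if status == 404]
print(broken)  # only /missing is broken
server.shutdown()
```

A real checker would also parse each page for further links and feed them back into the queue; the concurrency shape stays the same.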
Join the most popular Internet of Things platform with free Cloud, iOS and Android mobile apps, Web dashboard, and Machine Learning. Has mobile apps for interacting with interfaced devices. Assemble custom apps with a drag-and-drop builder. If it's networked and you can mess with it, you can get it talking to Blynk.
If you want to use their service, developer accounts are free but are limited to five (5) devices at a time. Paid service starts at US$415.
Open source: You can download the server's source code and run it yourself if you want. It's written in Java.
Wiby is a search engine for older-style pages: lightweight ones, each built around a subject of interest. It aims to build a web more reminiscent of the early internet.
Futuristic sci-fi and cyberpunk graphical user interface framework for web apps. If you ever wanted to build a theme that looks like JARVIS or something out of Blade Runner, this seems like a good place to start.
GitHub repo: https://github.com/arwes/arwes