

Web Scraping 101

Published on 09/26/2021 07:15 PM by Eph Baum


So, listen, sometimes you need to scrape some data, right?

Like, okay, let’s say you’re looking to build a list of folks to email from some public page, and you’re thinking, “boy, it would be nice to do this without having to copy and paste each one.”

Mayhaps you also have a contact system, or maybe you drop them all in a BCC field on an email and tell a bunch of people about your interest in connecting with them and why.

Your reasons are your own. Don’t be evil, please.

Anyway, here’s what you might do:

Let’s assume an HTML structure for visualization’s sake.

<a class="important-email-contact"
   href="mailto:someone@example.com"
   data-label="...

This is nice because there’s an easy way to grab them through something like $('.important-email-contact') on the console (this assumes the page loads jQuery). But even if there isn’t a convenient class, in this case they all share the characteristic of pointing at a mailto: URI, so you should still be able to iterate through the matching DOM nodes, for instance with an attribute selector such as a[href^="mailto:"].
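If there’s no handy class (or no jQuery), matching on the mailto: prefix might look something like the sketch below. Treat it as a rough illustration: mailtoAddresses is a name I made up, and stripping a trailing ?subject=... query is my own assumption about what you’d want. The typeof document guard is only there so the helper can also be tried outside a browser.

```javascript
// Hypothetical helper: keep only mailto: links and strip the scheme.
// Works on anything with an `href` string, not just real DOM nodes.
function mailtoAddresses(links) {
    const prefix = 'mailto:';
    return Array.from(links)
        .map((link) => link.href)
        .filter((href) => typeof href === 'string' && href.startsWith(prefix))
        // Assumption: drop the scheme and any ?subject=... query after the address.
        .map((href) => href.slice(prefix.length).split('?')[0]);
}

// In a browser console, select every anchor whose href starts with "mailto:".
if (typeof document !== 'undefined') {
    console.log(mailtoAddresses(document.querySelectorAll('a[href^="mailto:"]')));
}
```

Because the selector keys off the href rather than a class name, it should keep working even when the markup has no convenient hook.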

For me, building a simple list of the emails took a one-liner:

copy(
    $('.important-email-contact')
        // jQuery's .map() passes (index, element)
        .map((i, e) => {
            // drop the leading "mailto:" (7 characters)
            return e.href.slice(7);
        })
);

The above is well suited to my use case: it gets the list of nodes I’m looking for, reads the href of every element, cuts the leading mailto: off each result, and returns the lot as an object that I copy to the system’s clipboard.

In my case, I’ll then fire up nvim, lop the top and bottom off the object, and run a couple of quick search-and-replace commands to get a cleaned-up list I can paste somewhere useful.

:%s/ \+"\d\+": "//g

:%s/",//g

You could drop the comma from the second command (so it only strips the closing quote) if you wanted a comma-separated list with which to do whatever.
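If nvim isn’t your thing, the same cleanup can be sketched in plain JavaScript before anything hits the clipboard. This is only one way to go about it: toAddressList is a hypothetical helper, and de-duplicating the addresses is an assumption on my part, not something the one-liner above does.

```javascript
// Hypothetical helper: turn a list of mailto: hrefs into a single string,
// one address per line by default.
function toAddressList(hrefs, separator = '\n') {
    const prefix = 'mailto:';
    // Assumption: repeated addresses are unwanted, so a Set removes them
    // while keeping first-seen order.
    const unique = new Set(
        hrefs
            .filter((href) => href.startsWith(prefix))
            .map((href) => href.slice(prefix.length))
    );
    return [...unique].join(separator);
}

// In the console, with jQuery on the page, .get() unwraps the jQuery object
// into a plain array before it is handed to the helper:
// copy(toAddressList($('.important-email-contact').map((i, e) => e.href).get()));
```

Passing ', ' as the second argument should give the comma-separated flavor instead.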

There are, of course, tools

For anything that might be visible on the page, there are tools you could use (or build) to get at the data.

An example that I’ve used is Simplescraper, which works pretty well for grabbing any text you can see on a page from any repeating node.
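If you’d rather build than install, the general idea of grabbing the text of every node that matches a repeating selector fits in a few lines. This is a minimal sketch of that idea, not a description of how any particular tool works: textFromNodes and the '.listing .title' selector are placeholders I made up.

```javascript
// Hypothetical helper: collect the visible text of each repeating node.
function textFromNodes(nodes) {
    return Array.from(nodes)
        // Collapse runs of whitespace so multi-line markup reads as one line.
        .map((node) => (node.textContent || '').replace(/\s+/g, ' ').trim())
        // Skip nodes that contain nothing but whitespace.
        .filter((text) => text.length > 0);
}

if (typeof document !== 'undefined') {
    // '.listing .title' is a placeholder; swap in whatever repeats on your page.
    console.log(textFromNodes(document.querySelectorAll('.listing .title')));
}
```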
