Web Scraping 101

Published on 09/26/2021 07:15 PM by Eph Baum

So, listen, sometimes you needs to scrape some data, right?

Like, okay, let’s say you’re looking to build a list of folks to email from some public page, and you’re thinking, “boy it would be nice to do with with having to copy and paste each one”.

Mayhaps you also have a contact system or maybe you drop all in a BCC field on an email and tell a bunch of people about your interest in connecting with them and why.

You’re reasons are for you. Don’t be evil, please.

Anyway, here’s what you might do:

Let’s assume an html structure for visualizations’ sake.

<a class="important-email-contact"
   href="mailto:someone@example.com"
   data-label="...

This is nice because there’s an easy way to grab them through something like $('important-email-contact') on the console, but even if there isn’t a convenient class, in this care they all share the characteristic of pointing at a mailto: URI, so you should still be able to use some method to iterate through the DOM nodes.

For me I could build a simple list of the emails with a simple one liner:

copy(
    $('.important-email-contact')
        .map((i, e) => { 
            return e.href.substr(7); 
    });
);

The above is a pretty well optimized for my use case that gets the list of nodes I’m looking for, then gets the href content of every element and cuts the string off after mailto: in each result and then returns it as an object that I copy to the system’s clipboard.

In my case I then I’ll fire up nvim and maybe lop the top and bottom of the object and then a couple quick search and replace commands will have a cleaned up list I can paste somewhere useful.