Web Scraping 101

Sometimes you might want to get some data from a webpage, and sometimes you might want to do that using some code.

Web Scraping 101
Photo by Pankaj Patel / Unsplash

So, listen, sometimes you needs to scrape some data, right?

Like, okay, let's say you're looking to build a list of folks to email from some public page, and you're thinking, "boy it would be nice to do with with having to copy and paste each one".

Mayhaps you also have a contact system or maybe you drop all in a BCC field on an email and tell a bunch of people about your interest in connecting with them and why.

You're reasons are for you. Don't be evil, please.

Anyway, here's what you might do:

Let's assume an html structure for visualizations' sake.

<a class="important-email-contact"

This is nice because there's an easy way to grab them through something like $('important-email-contact') on the console, but even if there isn't a convenient class, in this care they all share the characteristic of pointing at a mailto: URI, so you should still be able to use some method to iterate through the DOM nodes.

For me I could build a simple list of the emails with a simple one liner:

        .map((i, e) => { 
            return e.href.substr(7); 

The above is a pretty well optimized for my use case that gets the list of nodes I'm looking for, then gets the href content of every element and cuts the string off after mailto: in each result and then returns it as an object that I copy to the system's clipboard.

In my case I then I'll fire up nvim and maybe lop the top and bottom of the object and then a couple quick search and replace commands will have a cleaned up list I can paste somewhere useful.

:%s/ \+"\d\+": "//g

:%s/ ",//g

You could just lose the " from the second command if you wanted a comma separated list with which to do whatever.

There are, of course, tools

For anything that might be visible on the page, there are tools you could use (or build) to get at the data.

An example that I've used is Simplescraper which works pretty well to grab any text you can see on a page from any repeating node.