eph baum dot dev


Web Scraping 101

Published on 09/26/2021 07:15 PM by Eph Baum


So, listen, sometimes you need to scrape some data, right?

Like, okay, let’s say you’re looking to build a list of folks to email from some public page, and you’re thinking, “boy, it would be nice to do this without having to copy and paste each one”.

Mayhaps you also have a contact system, or maybe you drop them all in a BCC field on an email and tell a bunch of people about your interest in connecting with them and why.

Your reasons are your own. Don’t be evil, please.

Anyway, here’s what you might do:

Let’s assume an HTML structure for visualization’s sake.

<a class="important-email-contact"
   href="mailto:someone@example.com"
   data-label="...

This is nice because there’s an easy way to grab them with something like $('.important-email-contact') in the console (assuming jQuery is available on the page), but even if there isn’t a convenient class, in this case they all share the characteristic of pointing at a mailto: URI, so you should still be able to use some method to iterate through the matching DOM nodes.
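If there’s no handy class, here’s a minimal sketch of that fallback using a plain attribute selector instead of jQuery. The helper names findMailtoLinks and listMailtoHrefs are made up for illustration; root is assumed to be document or any element that supports querySelectorAll.

```javascript
// Hypothetical helper: find every anchor whose href starts with "mailto:".
// `root` is assumed to be `document` or any element with querySelectorAll.
function findMailtoLinks(root) {
    return Array.from(root.querySelectorAll('a[href^="mailto:"]'));
}

// Hypothetical helper: pull the raw href off each matching link.
function listMailtoHrefs(root) {
    return findMailtoLinks(root).map((link) => link.href);
}

// In the browser console you would call: listMailtoHrefs(document)
```

The attribute selector only cares about where the link points, so it keeps working when the class names change.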

For me, I could build a list of the emails with a simple one-liner:

copy(
    $('.important-email-contact')
        // e.href looks like "mailto:someone@example.com"; drop the 7-character "mailto:" prefix
        .map((i, e) => e.href.slice(7))
);

The above is pretty well optimized for my use case: it gets the list of nodes I’m looking for, reads the href of every element, strips the leading mailto: from each result, and returns the whole thing as an object that I copy to the system’s clipboard.

In my case, I’ll then fire up nvim, maybe lop the top and bottom off the object, and run a couple of quick search-and-replace commands to get a cleaned-up list I can paste somewhere useful.
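If your links are messier than mine (a ?subject= tacked on the end, the same address listed twice), a slightly more defensive take on the same idea might look like this. It’s a sketch, not what I actually ran: mailtoToAddress and toEmailList are hypothetical names, and the console line at the bottom assumes jQuery plus the DevTools copy() helper.

```javascript
// Hypothetical helper: turn a mailto: href into a bare address.
function mailtoToAddress(href) {
    // Drop the scheme, then anything after "?" (subject, cc, body, ...).
    const [address] = href.replace(/^mailto:/i, '').split('?');
    // Undo percent-encoding such as %40 for "@".
    return decodeURIComponent(address).trim();
}

// Hypothetical helper: de-duplicate and join one address per line.
function toEmailList(hrefs) {
    return [...new Set(hrefs.map(mailtoToAddress))].join('\n');
}

// In the browser console (assumes jQuery and DevTools' copy()):
// copy(toEmailList($('.important-email-contact').get().map((e) => e.href)));
```

Because this copies a plain string rather than an object, there’s nothing to tidy up after pasting.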

:%s/ \+"\d\+": "//g

:%s/",//g

If you want a comma-separated list with which to do whatever, have the second command strip only the closing quote and leave the comma in place.
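If nvim isn’t your thing, the same cleanup can be roughed out in JavaScript. This sketch assumes the pasted text has one entry per line in the form "0": "someone@example.com", which is the shape those two substitutions are written for; cleanCopiedLines is a made-up name.

```javascript
// Hypothetical helper: keep only the values from lines like
//   "0": "someone@example.com",
// and skip everything else (braces, non-numeric keys).
function cleanCopiedLines(text, separator = '\n') {
    // Numeric key, then the quoted value, then an optional trailing comma.
    const entry = /^\s*"\d+": "(.*)",?\s*$/;
    return text
        .split('\n')
        .map((line) => line.match(entry))
        .filter((match) => match !== null)
        .map((match) => match[1])
        .join(separator);
}
```

Pass ', ' as the second argument to get the comma-separated flavor.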

There are, of course, tools

For anything that might be visible on the page, there are tools you could use (or build) to get at the data.

An example that I’ve used is Simplescraper, which works pretty well to grab any text you can see on a page from any repeating node.

Written by Eph Baum
