How to write a simple web spider in Python

After the fairly heavy nature of the last few articles I thought I would do something more fun. I wrote a simple web spider to check for bad links on my website and thought I would show it to you.

Here's a summary of what it does in a pseudo-code I just made up:

crawl(link):
    html = get_page(link)
    links = extract_links(html)
    for link in links:
        crawl(link)

Wow, that pseudo-code is dangerously close to Python! 🐍

Recursion

You'll see the process is simple, and recursive - that is, we repeat a process by having the function call itself. When you make a function call, the current state (or context, as it is known) is pushed onto the process's stack. The function does its thing, and when it returns the previous context is popped off the stack so that execution can continue. In the recursive case you get repeated function calls with context piling up on the stack; then, when the procedure starts to unwind and the calls return, a saved context is popped off the stack for each return. Where a process has a deep level of recursion the stack can grow quite large. I'll come back to this later...

That's basically it, the rest is just details! For example, we need to keep track of the pages we visit so we don't crawl them repeatedly. This can happen on a site where a lot of the pages carry the same link (for example, to the index page).
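
To see why keeping track of visited pages matters, here is a minimal sketch that runs the same recursive idea over a made-up, in-memory "site" instead of real pages. The SITE dictionary and the crawl_site function are hypothetical names used only for illustration and are not part of the spider. Every page links back to the index, so without the visited check the recursion would never finish.

```python
# A made-up site: each key is a page, each value is the list of links on it.
# (Hypothetical data, for illustration only.)
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/", "/blog/post1"],
    "/blog/post1": ["/", "/blog"],
}

def crawl_site(page, visited=None):
    """Recursively visit every page reachable from 'page', each exactly once."""
    if visited is None:
        visited = set()
    visited.add(page)
    for link in SITE.get(page, []):
        if link not in visited:  # Without this check "/" and "/about" bounce forever
            crawl_site(link, visited)
    return visited

print(sorted(crawl_site("/")))  # ['/', '/about', '/blog', '/blog/post1']
```

A set is used here rather than a list because membership tests on a set stay fast as the number of pages grows.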

The code for the web spider

So, without further ado, here's the code:

import sys
import requests
from urllib.parse import urljoin
from html.parser import HTMLParser

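class MyHTMLParser(HTMLParser):

    def __init__(self):
        super().__init__()
        self.links = []  # Per-instance list; a class-level list would be shared by every parser

    # 'attrs' is a list of (name, value) tuples
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:  # Skip an href that has no value
                    self.links.append(value)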

def write_log(f, link, status):
    msg = "{link} -- {status}\n".format(link=link, status=status)
    f.write(msg)
    f.flush()

def debug_print(msg):
    if debug_mode:
        print(msg)

def get_links(link):
    links = []
    try:
        headers = {'user-agent': 'gagamba/0.0.1'}
        # Without a timeout a dead server can hang the crawl indefinitely
        r = requests.get(link, headers=headers, timeout=10)
        print(r.status_code)
        if r.status_code == 200:
            if link.startswith(base): # We do want to check offsite links, but we don't want to crawl them
                parser.reset()     # Discard parser state left over from the previous page
                parser.links = []  # Otherwise links from every earlier page pile up in the list
                parser.feed(r.text)
                for href in parser.links:
                    url = urljoin(base, href)
                    links.append(url)
                    debug_print(url)
                links = list(set(links))  # Remove duplicates
        else:
            write_log(fout, link, r.status_code)
    except Exception as e:
        print("FATAL")
        write_log(fout, link, "FATAL EXCEPTION: {e}".format(e=e)) # Only log errors
    return links

def crawl(link):
    msg = "Checking page --> {link} -- ".format(link=link)
    print(msg, end='')
    links = get_links(link)
    for link in links:
        if link in visited:
            continue
        visited.append(link)
        crawl(link)

#######    MAIN    ########

debug_mode = False
parser = MyHTMLParser()

if len(sys.argv) < 2:
    print("Usage: gagamba.py base")
    sys.exit()

base = sys.argv[1]
print("Crawling from base --> ", base)
visited = []
filename = "errors.log"
with open(filename, "w") as fout:
    crawl(base)
    print("Number of links checked: --> ", len(visited))

Now to get into the nitty gritty.

Using HTMLParser

First, you'll notice I used HTMLParser - why? Well I need to grab all the links off a web page. Originally I used a regex like:

import re

regex = r'<a[\s\S]*?href=["\'](\S*?)["\']>'
m = re.findall(regex, r.text, re.MULTILINE)

The problem is the little edge cases you don't think about. For example, the first one I hit was where I'd forgotten that an href (and other HTML attributes) can use single or double quotes. I had originally only handled double quotes.

The thing is, knowing me, there are probably quite a few such things I had not thought of. HTMLParser already has these factored in and can also handle broken HTML quite well. It's simple to use and it's part of the standard library. Thumbs up for HTMLParser, as it takes all the regex grunt work out of the program for you.

One slight wrinkle that you need to wrap your head around is that HTMLParser passes the attributes of a tag to handle_starttag as a list of (name, value) tuples. I had originally expected a dictionary, so you could do something like attrs['href']. Nope. Once you know you need to deal with a list of tuples, though, it's fairly simple, as you can see from the code.
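
If you do prefer dictionary-style access, one option is to convert the list of tuples with dict() inside the handler. The sketch below is a variation on the parser above, not the code the spider uses; LinkParser and extract_links are names made up for this example. Be aware that if a tag repeats an attribute, dict() keeps only the last value.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')  # None if the tag has no href
            if href:
                self.links.append(href)

def extract_links(html):
    """Parse 'html' with a fresh parser and return the hrefs found."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links

sample = '<p><a class="x" href="/about">About</a> <a href=\'post.html\'>Post</a> <a name="top">Top</a></p>'
print(extract_links(sample))  # ['/about', 'post.html']
```

Note that both the double-quoted and the single-quoted href are picked up, and the anchor without an href is skipped.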

Handling snafus in the Interwebs

When getting a page it's important to wrap requests.get in a try...except. The reason is that sometimes you'll get a link and it is just Wrong. For example, the site might not exist at all, or it might have a dodgy certificate. In this case no error code is returned; instead you'll get an exception, which you should handle, otherwise you might be in for a very short crawl.

It's also important to set the user-agent before making requests. You will get some interesting error codes back from some servers if you don't set it. For example, I got back a 418 from flightradar24.com. If you're not familiar with 418, look it up and try not to laugh! I also got back a 999 from LinkedIn, which was interesting! I don't think anyone else returns that code.

Stop crawling all over me!

Because I use this spider as a link checker for my site, I want to make sure it checks the sites I link to, but doesn't crawl them. A simple check does the job: if a link doesn't start with the base URL, the spider still requests it to see whether it's alive, but doesn't extract or follow any of its links. This lets you check your own site without crawling over anyone else's. That is basically being a good Netizen.
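
A plain prefix check can be fooled by a host name that merely begins with yours. If that worries you, a stricter alternative is to compare host names using urlparse from the standard library. This is a sketch of that alternative, not what the spider does; is_same_site is a made-up helper name and the URLs are placeholders.

```python
from urllib.parse import urlparse

def is_same_site(base, link):
    """True if 'link' points at the same host as 'base' (scheme and port are ignored)."""
    base_host = urlparse(base).hostname  # hostname is lower-cased, with any port removed
    link_host = urlparse(link).hostname
    return base_host is not None and base_host == link_host

base_url = "https://example.com"
print(is_same_site(base_url, "https://example.com/blog/"))         # True
print(is_same_site(base_url, "https://example.com.evil.org/"))     # False
print("https://example.com.evil.org/".startswith(base_url))        # True - the prefix check is fooled
```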

Normalizing links

When you grab the links from a site, you need to normalize them, as they will be fed to requests, and requests needs an absolute URL to work with - a relative URL is not enough. For this reason I normalize the links from a page, and as I only grab links from my own domain I can just use urljoin(base, link). This handles a lot of edge cases for you, such as trailing forward slashes and other wrinkles. urljoin just seems to do the right thing with the various odd combos of base and link I threw at it during testing. One caveat: joining against the base works when your links are absolute or root-relative (starting with /); if your pages use links relative to the current page, you would need to join against the URL of the page the link was found on instead. Another thumbs up for using a library - initially I didn't - I rolled my own link handling and soon realized there were edge cases I was not thinking about. Moral of the story - use a library if there is one!
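
To get a feel for what urljoin does with different kinds of link, here is a small sketch. The normalize helper is a made-up name, the example.com URLs are placeholders, and dropping the #fragment with urldefrag is an extra step the spider above does not perform.

```python
from urllib.parse import urljoin, urldefrag

def normalize(page_url, href):
    """Resolve 'href' against the page it was found on and drop any #fragment."""
    absolute = urljoin(page_url, href)
    return urldefrag(absolute).url

page = "https://example.com/blog/index.html"
print(normalize(page, "post.html#comments"))   # https://example.com/blog/post.html
print(normalize(page, "/about"))               # https://example.com/about
print(normalize(page, "../contact/"))          # https://example.com/contact/
print(normalize(page, "https://other.org/x"))  # https://other.org/x

# Joining a page-relative link against the site root gives a different answer:
print(urljoin("https://example.com/", "post.html"))  # https://example.com/post.html
```

The last line shows why the choice of the first argument matters when a site uses page-relative links.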

Don't forget to flush

You'll notice that I only write errors to the log file to make analysis easier. I make sure I flush the buffer after each write with f.flush(). This makes sure that if something goes badly wrong the log file is written. This is also why I use with to open the output file. If there's an exception that crashes the program the output file should still be closed.

Stack overflows and the hazards of recursion

This is just a really basic spider, and is not meant for heavy-duty web crawling. On the other hand it is quite capable of checking the links on a site of modest size. If you attempt to crawl anything too big you are likely to grind to a halt with a stack overflow - remember this spider is recursive and it eats stack as it keeps calling itself to try and make its way through all the grabbed links...

To increase the stack size allocated to a Python process you can do:

# Run me as sudo
import resource
resource.setrlimit(resource.RLIMIT_STACK, (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

That would, in theory, give you unlimited stack size. That's not necessarily a good thing though: if you have an errant process, things could get crazy as it runs up an enormous stack. I prefer to leave the defaults alone and let an error tell me I've bitten off more than I can chew. In practice the error you are most likely to see first is Python's RecursionError ("maximum recursion depth exceeded"), which is raised when the call depth passes the interpreter's recursion limit (1000 by default), usually well before the operating system's stack runs out. You can raise that limit with sys.setrecursionlimit() if you must, but set it too high and a deep enough recursion can overflow the real stack and crash the interpreter outright rather than raise an exception.
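
If you would rather sidestep the recursion limit altogether, the usual alternative is to replace the recursion with a loop and an explicit queue of pages still to visit. The sketch below is not how the spider above works; crawl_iterative, chain_links and DEPTH are names invented for this example, and the "site" is just a chain of numbered pages that is deeper than Python's default recursion limit.

```python
from collections import deque

def crawl_iterative(start, get_links):
    """Visit everything reachable from 'start' without using recursion."""
    visited = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in get_links(page):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return visited

DEPTH = 5000  # Deeper than the default recursion limit of 1000

def chain_links(page):
    """A pretend site where page n links only to page n + 1."""
    return [page + 1] if page < DEPTH else []

print(len(crawl_iterative(0, chain_links)))  # 5001
```

Here the pending pages live in an ordinary deque on the heap, so the depth of the site no longer has anything to do with the depth of the call stack.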

Things to do with the code

You might also extract other data from a web page besides the links (you'll still need the links to crawl, though). You could extend your HTMLParser to look for data in a certain format. Perhaps you need to spider a site looking for specific data - the price of a certain watch, for example. You could also improve the code. I don't handle non-http protocols well (or at all), so you could, for example, make sure mailto: is handled gracefully. As I only spider my own site looking for bad links, I don't check the robots.txt file. You could add code to check it and not spider the site if it rejects robots.
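
As a starting point for those last two ideas, here is a minimal sketch using only the standard library. The helper names is_http_link and make_robot_rules and the EXAMPLE_ROBOTS text are made up for illustration. The robots.txt content is parsed from a string here so the example runs without touching the network.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_http_link(url):
    """True only for absolute http(s) URLs, so mailto:, javascript: and so on are skipped."""
    return urlparse(url).scheme in ("http", "https")

def make_robot_rules(robots_txt):
    """Build a RobotFileParser from the text of a robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

# Hypothetical robots.txt content, for illustration only
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

rules = make_robot_rules(EXAMPLE_ROBOTS)
print(is_http_link("mailto:me@example.com"))                                 # False
print(is_http_link("https://example.com/index.html"))                        # True
print(rules.can_fetch("gagamba", "https://example.com/private/notes.html"))  # False
print(rules.can_fetch("gagamba", "https://example.com/index.html"))          # True
```

Against a live site you would instead point the parser at the real file with set_url() and load it with read(), and you would apply is_http_link after the link has been made absolute, since a relative link has no scheme.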

Really what you do is up to you as long as it's legal!

I think that's about it. Have fun and happy hacking! 🐱‍💻

IMPORTANT: Always spider responsibly to avoid putting a load on other people's servers. I only ever spider my own site to check for broken links. Thanks.

p.s. In case you were wondering about my user agent - gagamba is the name of my spider - because it's Tagalog for spider.

p.p.s If you reuse any of the code on this site, which is released under the MIT license, don't forget to check my Legal page. Basically you can use any of the code here for free, but don't come crying to me if it goes wrong (which it probably will).