Mon 30 September 2013

Downloading Imgur Posts Linked From Reddit with Python

UPDATE - I have updated this article to use BeautifulSoup to parse the HTML rather than regular expressions. This makes it much easier.

Reddit is a popular site that allows users to post and vote on interesting web links. It is divided into several topical subreddits. Many Redditors use Imgur to host their images (and I highly recommend it: Imgur is free and easy to use). This tutorial tells you how to write a Python script that can scan Reddit and download images from Imgur submissions you find. This tutorial is for beginner-level programmers with a small amount of Python experience.

You can download the source code directly or view the GitHub repo.

This post will cover:

Basic web scraping concepts.
Command line options.
Accessing Reddit with the PRAW module.
Using regular expressions to find text patterns in a web page.
Downloading files with the Requests module.
Detecting which files are on our computer with the os and glob modules.
Opening files using Python's with statement.

Installing the PRAW, Requests, and Beautiful Soup Modules

The PRAW (Python Reddit API Wrapper) module is available on GitHub, but you can also install it using pip or easy_install (on Windows, these programs will be in the C:\Python27\Scripts folder). You can install Requests and Beautiful Soup the same way, using either pip or easy_install:

pip install praw

pip install requests

pip install beautifulsoup4

easy_install praw

easy_install requests

easy_install beautifulsoup4

To make sure that these were installed successfully, try to import them from the Python interactive shell:

Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32

Type "help", "copyright", "credits" or "license" for more information.

>>> import praw

>>> import requests

>>> from bs4 import BeautifulSoup

>>>

If you see no error messages, then the installation worked.

The full set of modules that our script will import is:

import re, praw, requests, os, glob, sys

from bs4 import BeautifulSoup

Command Line Options with sys.argv

We normally run the Python script from the command line, like this:

> python imgur-hosted-reddit-posted-downloader.py

(You can change the script name if that's too long for you.) However, we'd also like to specify on the command line the subreddit and the minimum Reddit score a post (also called a submission) needs before it will be downloaded. The subreddit is required and the minimum score is optional (defaulting to 100).

The sys.argv list contains the arguments passed from the command line. sys.argv[0] will be set to the string 'imgur-hosted-reddit-posted-downloader.py', and each subsequent item in the list will be an argument.

The following code displays help (and then exits the program) if no arguments were passed. Otherwise, the target subreddit is set to sys.argv[1], and the minimum score is set to sys.argv[2] if it was given (if not, it stays at the default of 100).

MIN_SCORE = 100 # the default minimum score before it is downloaded

if len(sys.argv) < 2:

    # no command line options sent:

    print('Usage:')

    print('  python %s subreddit [minimum score]' % (sys.argv[0]))

    sys.exit()

elif len(sys.argv) >= 2:

    # the subreddit was specified:

    targetSubreddit = sys.argv[1]

    if len(sys.argv) >= 3:

        # the desired minimum score was also specified:

        MIN_SCORE = int(sys.argv[2]) # command line arguments are strings, so convert to an int
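One detail worth noting is that the items in sys.argv are always strings, so a minimum score read from the command line needs int() before it can be compared with a numeric score. Here is a minimal sketch of the same logic wrapped in a function so it can be tried without running the whole script. The names parseArgs and DEFAULT_MIN_SCORE are my own for this illustration and are not part of the script.

```python
DEFAULT_MIN_SCORE = 100  # hypothetical name for the default threshold


def parseArgs(argv):
    """Return (subreddit, minScore), or None if no subreddit was given."""
    if len(argv) < 2:
        # no command line options sent
        return None
    subreddit = argv[1]
    minScore = DEFAULT_MIN_SCORE
    if len(argv) >= 3:
        # Arguments arrive as strings; convert before comparing with scores.
        minScore = int(argv[2])
    return subreddit, minScore


print(parseArgs(['downloader.py', 'cats']))         # ('cats', 100)
print(parseArgs(['downloader.py', 'cats', '250']))  # ('cats', 250)
```

Passing the argument list in as a parameter (instead of reading sys.argv directly) makes the function easy to experiment with from the interactive shell.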

The 3 Types of Submissions for Imgur

There are three types of Imgur links on Reddit that we are interested in:

Links to albums, such as http://imgur.com/a/VqUKy
Links to a page with a single image, such as http://imgur.com/4fVCo5v
Links that go directly to the image file, such as http://i.imgur.com/4fVCo5v.jpg

We can tell what kind of submission it is from the URL. Albums will always have /a/ in them, direct links to images are in the i.imgur.com domain, and single image pages are in the imgur.com domain but don't have /a/ in them.
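These rules can be sketched as a small function. This is only an illustration: classifyImgurUrl() and its return values ('album', 'direct', 'page') are names I made up here, and the script itself performs these checks inline.

```python
def classifyImgurUrl(url):
    """Return 'album', 'direct', 'page', or None for non-Imgur URLs."""
    if 'imgur.com/' not in url:
        return None
    if 'imgur.com/a/' in url:
        return 'album'   # albums always have /a/ in them
    if 'i.imgur.com/' in url:
        return 'direct'  # direct links live on the i.imgur.com domain
    return 'page'        # anything else on imgur.com is a single-image page


print(classifyImgurUrl('http://imgur.com/a/VqUKy'))        # album
print(classifyImgurUrl('http://imgur.com/4fVCo5v'))        # page
print(classifyImgurUrl('http://i.imgur.com/4fVCo5v.jpg'))  # direct
```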

Parsing the HTML

Open the example album URL, http://imgur.com/a/VqUKy, in a web browser and then right-click on the page and select View Source. This will display the HTML source for the page. We need to figure out the pattern in the HTML for links to the images on this page. Notice that for each image in the album, there is the HTML:



            View full resolution

This pattern does not appear anywhere else on the page except for images in the album. This is important, because we don't want false positives that make us accidentally download non-featured images (such as the Imgur.com logo or other images on the page). We can have Beautiful Soup parse this HTML for us by creating a BeautifulSoup object (passing in the HTML) and then calling its select() method with a CSS selector string that specifies the HTML elements we want to grab:

soup = BeautifulSoup(htmlSource)

matches = soup.select('.album-view-image-link a')

This CSS selector is used for the web pages of albums. CSS selectors are beyond the scope of this article, but the Beautiful Soup documentation has great examples. Suffice it to say, '.album-view-image-link a' will find all the HTML tags that are <a> tags that are descended from a tag with the album-view-image-link CSS class. (The dot means it is a CSS class name.)

The CSS selector string you need to use will need to be customized for the site you are downloading from. If the site ever changes their web page's HTML format, you may need to update the CSS selector strings you use.

The return value of soup.select() will be a list of BeautifulSoup "tag" objects. If you want to get the href attribute of the first match, your code will look like this:

matches[0]['href']

For parsing the URLs of directly-linked Imgur images, we need to use regular expressions. (Beautiful Soup parses HTML; it is not meant for matching patterns in general text such as URLs.) Regular expressions are beyond the scope of this article, but Google has a good tutorial on Python regular expressions.

Note: Also, regular expressions are great for finding general patterns in text, but for HTML you are always much better off using an HTML-specific pattern matching library such as BeautifulSoup. (This guy explains why you shouldn't use regexes to parse HTML better than I can.)

The regex we use will match the image filename in a directly-linked image URL:

imgurUrlPattern = re.compile(r'(http://i\.imgur\.com/(.*))(\?.*)?')
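To see what the groups contain, here is a short sketch you can paste into the interactive shell. It uses the pattern with the dots escaped so that they match literal dots. The second URL is the example URL with a "?1" suffix added purely for illustration.

```python
import re

imgurUrlPattern = re.compile(r'(http://i\.imgur\.com/(.*))(\?.*)?')

mo = imgurUrlPattern.search('http://i.imgur.com/4fVCo5v.jpg')
print(mo.group(2))  # 4fVCo5v.jpg

# The (.*) in group 2 is greedy, so it also swallows any trailing "?..." part
# and the optional third group never gets to match anything.
mo = imgurUrlPattern.search('http://i.imgur.com/4fVCo5v.jpg?1')
print(mo.group(2))  # 4fVCo5v.jpg?1
print(mo.group(3))  # None
```

This is why a trailing question mark, if present, still has to be stripped from the filename with ordinary string slicing after the regex has run.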

Downloading Image Files with the Requests Module

The Requests module is an incredibly easy to use module for downloading files off the web via HTTP. A string of the URL is passed to the requests.get() function, which returns a "Response" object containing the downloaded file. We'll create a separate downloadImage() function for our program to use that takes the url of the image and the filename to use when we save it locally to our computer:

def downloadImage(imageUrl, localFileName):

    response = requests.get(imageUrl)

We can examine the status_code attribute to see if the download was successful. An integer value of 200 indicates success. (A full list of HTTP status codes can be found on Wikipedia.)

if response.status_code == 200:

    print('Downloading %s...' % (localFileName))

The only output from our program is a single line telling us the file that it is downloading. Now that the downloaded image exists in our Python program in the Response object, we need to write it out to a file on the hard drive:

with open(localFileName, 'wb') as fo:

    for chunk in response.iter_content(4096):

        fo.write(chunk)

The with statement handles opening and closing the file. (Effbot has a good tutorial called "Understanding Python's with Statement".) The response object's iter_content() method returns "chunks" of up to 4096 bytes of the image at a time, which are written to the opened file. (This part of the code may be a bit confusing, but just understand that it writes the image information in the Response object to the hard drive.)

We will call this function whenever we get the URL of an image to download.
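If the chunk-writing loop is unclear, it may help to try it without any network access. The sketch below is not part of the script: writeChunks() is a hypothetical helper, and fakeChunks is a made-up list of byte strings standing in for what response.iter_content(4096) would produce.

```python
import os
import tempfile


def writeChunks(chunks, localFileName):
    """Write an iterable of byte strings to a file; return the bytes written."""
    total = 0
    with open(localFileName, 'wb') as fo:
        for chunk in chunks:
            fo.write(chunk)
            total += len(chunk)
    # Leaving the with block closes the file, even if a write raised an error.
    return total


# Stand-in for response.iter_content(4096): two made-up chunks of bytes.
fakeChunks = [b'\x89PNG', b'rest-of-image']
path = os.path.join(tempfile.mkdtemp(), 'example.png')
print(writeChunks(fakeChunks, path))  # 17
```

The file is opened in 'wb' (write binary) mode because image data is raw bytes, not text.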

Accessing Reddit with the PRAW Module

Using the PRAW module to get a subreddit's front page is simple:

Import the praw module.
Create a Reddit object with a unique user agent.
Call the get_subreddit() and get_hot() methods.

(You can also read the full documentation for PRAW.)

The code looks like this:

# Connect to reddit and download the subreddit front page

r = praw.Reddit(user_agent='CHANGE THIS TO A UNIQUE VALUE') # Note: Be sure to change the user-agent to something unique.

submissions = r.get_subreddit(targetSubreddit).get_hot(limit=25)

A user agent is a string of text that identifies what type of web browser (or type of software in general) is accessing a web site. One of the Reddit API rules is to use a unique value for your user agent, preferably one that references your Reddit username (if you have one). The PRAW module handles throttling the rate of requests you make, so you don't have to worry about that.

You can type javascript:alert(navigator.userAgent); into your browser's address bar (or just click the link) to see what your current user agent is.

The get_subreddit() method returns a "Subreddit" object, which has a get_hot() method which will return a list of "Submission" objects. (Actually, it returns a generator for Submission objects, but you can effectively think of it as a list.)

The Submission attributes we are interested in are:

id - A string like '1n49by' which uniquely identifies the submission in the subreddit.
score - An int of the net number of up-votes the submission has.
url - A string of the URL for the submission. (For our program, this will always be a URL to Imgur.)

Skipping Files

We will loop through each of the Submission objects stored in submissions. At the start of the loop, we will check if the submission is one we should skip. This can be because:

It is not an imgur.com submission.
The submission's score is less than MIN_SCORE.
We have already downloaded the image.

The code for looping through all the submissions and the checks to continue to the next submission is:

# Process all the submissions from the front page

for submission in submissions:

    # Check for all the cases where we will skip a submission:

    if "imgur.com/" not in submission.url:

        continue # skip non-imgur submissions

    if submission.score < MIN_SCORE:

        continue # skip submissions that haven't reached MIN_SCORE (though this should be rare if we're collecting the "hot" submissions)

    if len(glob.glob('reddit_%s_%s_*' % (targetSubreddit, submission.id))) > 0:

        continue # we've already downloaded files for this reddit submission
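The skip checks can be tried out without contacting Reddit by using a stand-in object. In this sketch, FakeSubmission is a hypothetical namedtuple that only mimics the three attributes we use (it is not a PRAW class), shouldSkip() is a name of my own, and the URLs and scores are made up for the example.

```python
from collections import namedtuple

# Hypothetical stand-in for a PRAW Submission, with only the fields we need.
FakeSubmission = namedtuple('FakeSubmission', ['id', 'score', 'url'])


def shouldSkip(submission, minScore, alreadyDownloaded):
    """Return True if the submission should not be downloaded."""
    if 'imgur.com/' not in submission.url:
        return True  # not an Imgur submission
    if submission.score < minScore:
        return True  # score is too low
    return alreadyDownloaded  # skip only if the files are already on disk


good = FakeSubmission('1n49by', 250, 'http://i.imgur.com/4fVCo5v.jpg')
lowScore = FakeSubmission('1n49by', 12, 'http://i.imgur.com/4fVCo5v.jpg')
notImgur = FakeSubmission('1n49by', 250, 'http://example.com/cat.jpg')

print(shouldSkip(good, 100, False))      # False
print(shouldSkip(lowScore, 100, False))  # True
print(shouldSkip(notImgur, 100, False))  # True
```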

The glob Module

The images that we download will have filenames formatted as reddit_[subreddit name]_[reddit submission id]_album_[album id]_imgur_[imgur id]. A glob is sort of a simplified regular expression, where the * asterisk is a "wildcard character" that matches any text. The glob.glob() function will return a list of files that match the glob string it is passed.

For example, say one of the submissions in the cats subreddit has an id of 1n3p6o (which is this submission) then the filenames we use for it will begin with "reddit_cats_1n3p6o_".

Calling glob.glob('reddit_cats_1n3p6o_*') will return a list of filenames that match this pattern. If this returned list is not empty (that is, its length is greater than zero) then we know that these files already exist on the hard drive and should not be downloaded again.
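Here is a small sketch of that check that you can run safely, since it works in a temporary folder rather than your download folder. The alreadyDownloaded() helper and its directory parameter are assumptions made for this demonstration (the script simply globs in the current directory), and the file created here is an empty placeholder with a made-up Imgur id.

```python
import glob
import os
import tempfile

# Create an empty placeholder file in a temporary folder.
workDir = tempfile.mkdtemp()
placeholder = os.path.join(workDir, 'reddit_cats_1n3p6o_album_None_imgur_example.jpg')
open(placeholder, 'wb').close()


def alreadyDownloaded(directory, subreddit, submissionId):
    """Return True if any file for this submission exists in directory."""
    pattern = os.path.join(directory, 'reddit_%s_%s_*' % (subreddit, submissionId))
    return len(glob.glob(pattern)) > 0


print(alreadyDownloaded(workDir, 'cats', '1n3p6o'))  # True
print(alreadyDownloaded(workDir, 'cats', '1n49by'))  # False
```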

Parsing Imgur Album Pages

First we will handle the album downloads. The id for the album is the part of the url right after "http://imgur.com/a/", so we can use string slicing to extract it from submission.url. (We will use the album id later in the local filename.)

The Submission object's url string is passed to requests.get() to download the album page's html. We immediately save the text of this download to a variable htmlSource:

if 'http://imgur.com/a/' in submission.url:

    # This is an album submission.

    albumId = submission.url[len('http://imgur.com/a/'):]

    htmlSource = requests.get(submission.url).text

This code finds all the instances of the image url pattern in the html source:

    soup = BeautifulSoup(htmlSource)

    matches = soup.select('.album-view-image-link a')

    for match in matches:

        imageUrl = match['href']

        if '?' in imageUrl:

            imageFile = imageUrl[imageUrl.rfind('/') + 1:imageUrl.rfind('?')]

        else:

            imageFile = imageUrl[imageUrl.rfind('/') + 1:]

(Some URLs end with ?=1 on the imgur site, so we cut those off.)
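The slicing is easier to follow when pulled out into a function of its own. This is just a sketch: filenameFromUrl() is a hypothetical name, and the two URLs are illustrative (based on the example image from earlier), written without the "http:" part in the way the album page's links are assumed to appear.

```python
def filenameFromUrl(imageUrl):
    """Return the part after the last slash, minus any ?query suffix."""
    lastSlash = imageUrl.rfind('/')
    if '?' in imageUrl:
        # Slice from just after the last slash up to (not including) the "?".
        return imageUrl[lastSlash + 1:imageUrl.rfind('?')]
    return imageUrl[lastSlash + 1:]


print(filenameFromUrl('//i.imgur.com/4fVCo5v.jpg'))    # 4fVCo5v.jpg
print(filenameFromUrl('//i.imgur.com/4fVCo5v.jpg?1'))  # 4fVCo5v.jpg
```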

The select() method returns a list of all the tags in htmlSource that match the CSS selector. For each match, we take the part of its href after the last slash as the image's filename on Imgur, cutting off everything from the question mark onward if there is one.

We use the match['href'] string to get the URL of the image, which is then used for the local filename and telling the Requests module what to download on the next couple of lines:

localFileName = 'reddit_%s_%s_album_%s_imgur_%s' % (targetSubreddit, submission.id, albumId, imageFile)

downloadImage('http:' + match['href'], localFileName)

Downloading Directly-Linked Images

The next type of download will be for directly-linked images. For this type, submission.url is already the complete url of the file to download, but we need the filename on Imgur.com to use in the local filename. The imgurUrlPattern regex will be used to grab this part from submission.url:

elif 'http://i.imgur.com/' in submission.url:

    # The URL is a direct link to the image.

    mo = imgurUrlPattern.search(submission.url)



    imgurFilename = mo.group(2)

For some reason, some of the images on Imgur.com have an additional "?1" at the end of their filenames. We'll need some code to check for this and strip it out of imgurFilename using slicing:

if '?' in imgurFilename:

    # The regex doesn't catch a "?" at the end of the filename, so we remove it here.

    imgurFilename = imgurFilename[:imgurFilename.find('?')]

Now that we have the imgurFilename correctly formatted, we can download the image:

localFileName = 'reddit_%s_%s_album_None_imgur_%s' % (targetSubreddit, submission.id, imgurFilename)

downloadImage(submission.url, localFileName)

Downloading Images Off a Single-Image Web Page

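The third type of download is when the Reddit post links to an Imgur page that contains one image. As with albums, the page's HTML is downloaded into htmlSource and parsed with Beautiful Soup, this time with a CSS selector that matches the single image's link: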

soup = BeautifulSoup(htmlSource)

imageUrl = soup.select('.image a')[0]['href']

The program then downloads the image, using the filename portion of imageUrl (stored in imageFile, as in the album case) in the local filename:

localFileName = 'reddit_%s_%s_album_None_imgur_%s' % (targetSubreddit, submission.id, imageFile)

downloadImage(imageUrl, localFileName)

Once the loop has gone through all the submissions, the program terminates. All of the files it found will have been downloaded.

Example Run

Here's the output when I ran the program:

C> python imgur-hosted-reddit-posted-downloader.py cats 100

Downloading reddit_cats_1n8zs8_album_None_imgur_ktHbtZL.jpg...

Downloading reddit_cats_1n836l_album_None_imgur_tWNf47w.jpg...

Downloading reddit_cats_1n8p3g_album_None_imgur_nrRNoiF.jpg...

Downloading reddit_cats_1n6cr0_album_None_imgur_rA10E3s.jpg...

Downloading reddit_cats_1n89mg_album_None_imgur_0UqnMd6.png...

Downloading reddit_cats_1n80j6_album_None_imgur_ZuRbyxp.jpg...

Voila! Now you can use cron or Windows Task Scheduler to automatically download images from your favorite subreddit! Some of my recommendations:

Good luck!