Downloading Imgur Posts Linked From Reddit with Python

UPDATE – I have updated this article to use BeautifulSoup to parse the HTML rather than regular expressions. This makes it much easier.

Reddit is a popular site that allows users to post and vote on interesting web links. It is divided into many topical subreddits. Many Redditors use Imgur to host their images (and I highly recommend it: Imgur is free and easy to use). This tutorial shows you how to write a Python script that scans a subreddit and downloads the images from any Imgur submissions it finds. It is aimed at beginner-level programmers with a small amount of Python experience.

You can download the source code directly or view the GitHub repo.

This post will cover:

  • Basic web scraping concepts.
  • Command line options.
  • Accessing Reddit with the PRAW module.
  • Using regular expressions to find text patterns in a web page.
  • Downloading files with the Requests module.
  • Detecting which files are on our computer with the os and glob modules.
  • Opening files using Python’s with statement.

Installing the PRAW, Requests, and Beautiful Soup Modules

The PRAW (Python Reddit API Wrapper) module is available on GitHub, but you can also install it using pip or easy_install (on Windows, these programs will be in the C:\Python27\Scripts folder). You can install Requests and Beautiful Soup the same way, with either pip or easy_install:

pip install praw

pip install requests

pip install beautifulsoup4

or

easy_install praw

easy_install requests

easy_install beautifulsoup4

To make sure that these were installed successfully, try to import them from the Python interactive shell:

Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32

Type "help", "copyright", "credits" or "license" for more information.

>>> import praw

>>> import requests

>>> from bs4 import BeautifulSoup

>>>

If you see no error messages, then the installation worked.
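If you would rather check from a script than from the interactive shell, the same idea can be sketched as below. This is only an illustration: missing_modules is a hypothetical helper name, not part of the downloader script, and it simply tries each import and collects the names that fail.

```python
import importlib


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The module is not installed (or not on the import path).
            missing.append(name)
    return missing


# 'bs4' is the import name used for Beautiful Soup in the shell session above.
not_installed = missing_modules(['praw', 'requests', 'bs4'])
if not_installed:
    print('Missing modules: %s' % ', '.join(not_installed))
else:
    print('All modules imported successfully.')
```

An empty result means every module imported without an error, which is the same signal as seeing no error messages in the shell.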

The full set of modules that our script will import is:

import re, praw, requests, os, glob, sys

from bs4 import BeautifulSoup

Command Line Options with sys.argv

We normally run the Python script from the command line, like this:

> python imgur-hosted-reddit-posted-downloader.py

(You can change the script name if that’s too long for you.) However, we’d also like to specify on the command line the subreddit and the minimum Reddit score a post (also called a submission) needs before it will be downloaded. The subreddit is required, and the minimum score is optional (defaulting to 100).

The sys.argv list contains the arguments passed from the command line. sys.argv[0] will be set to the string 'imgur-hosted-reddit-posted-downloader.py', and each subsequent item in the list will be an argument.

The following code displays help (and then exits the program) if no arguments were passed. Otherwise, the target subreddit is set to sys.argv[1], and the minimum score is set to sys.argv[2] if it was given (if not, it stays at the default of 100).
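As a minimal sketch of the logic just described (not necessarily the script's own listing), the argument handling might look like this. The names parse_args and DEFAULT_MIN_SCORE are assumptions made for illustration, and the logic is wrapped in a function so that it can be tried with an explicit list instead of the real sys.argv.

```python
import sys

DEFAULT_MIN_SCORE = 100  # used when no minimum score is given


def parse_args(argv):
    """Return (subreddit, min_score) from a sys.argv-style list.

    Prints a usage message and exits if no subreddit was given.
    """
    if len(argv) < 2:
        # Only the script name is present, so show help and stop.
        print('Usage:')
        print('  python %s subreddit [minimum score]' % argv[0])
        sys.exit()

    subreddit = argv[1]
    # The minimum score is optional; fall back to the default.
    # Command line arguments are strings, so convert the score to an int.
    min_score = int(argv[2]) if len(argv) >= 3 else DEFAULT_MIN_SCORE
    return subreddit, min_score


# Example with an explicit list; a real script would pass sys.argv instead.
subreddit, min_score = parse_args(
    ['imgur-hosted-reddit-posted-downloader.py', 'pics'])
print('%s %d' % (subreddit, min_score))
```

In the real script you would call parse_args(sys.argv). Note that int() raises a ValueError if the second argument is not a whole number, which this sketch does not handle.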
