Mon 21 February 2022

Downloading Web Pages and Files in Python 3 and 2 Without the Requests Module

I created a Python module called whatismyip that allows Python programs to easily figure out what their internet protocol (IP) address is. It works by connecting to one of several public websites that return this information, such as https://icanhazip.com/.
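As a rough illustration of the "try one of several websites" idea, here is a minimal sketch. It is not the actual whatismyip code: the names SERVICES, fetch_text, and get_public_ip are hypothetical, the second URL in the list is an assumption, and it targets Python 3 only for brevity.

```python
import ipaddress
from urllib.request import Request, urlopen

# Assumed list of plain-text IP services; only icanhazip.com is named above.
SERVICES = ['https://icanhazip.com/', 'https://ifconfig.me/ip']


def fetch_text(url, timeout=5):
    """Download `url` and return its body as text."""
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode('utf-8', 'replace')


def get_public_ip(services=SERVICES, fetch=fetch_text):
    """Return the first valid IP address reported by a service, or None."""
    for url in services:
        try:
            candidate = fetch(url).strip()
            # ip_address() raises ValueError if the reply is not an IP address.
            return str(ipaddress.ip_address(candidate))
        except (OSError, ValueError):
            continue  # This service failed or replied with junk; try the next.
    return None
```

Passing the download function in as the fetch parameter is a design choice for this sketch: it lets you swap in a fake function to test the fallback logic without touching the network.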

Because it was a module to be included in other programs, I wanted it to have as few dependencies as possible. Normally I would use the requests module to download these web pages, but I wanted to stick to just the Python standard library. Here's how I used only the standard library on Python 3 and 2 to download the HTML of a webpage:

import sys

if sys.version_info[0] == 3:  # Python 3
    from urllib.request import Request, urlopen
elif sys.version_info[0] == 2:  # Python 2
    from urllib2 import Request, urlopen

# Supply a user-agent header of a common browser, since some web servers will refuse to reply to scripts without one.
# 'https://ifconfig.me/all' is a page that returns simple info about your request. Replace this with the page you want to download.
requestObj = Request('https://ifconfig.me/all', headers={'User-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0'})
responseObj = urlopen(requestObj)

# To figure out how to decode the downloaded binary data to text, we need to get the character set encoding:
if sys.version_info[0] == 3:  # Python 3
    charsets = responseObj.info().get_charsets()
    if len(charsets) == 0 or charsets[0] is None:
        # Character set encoding could not be determined.
        charset = 'utf-8'  # Use the utf-8 encoding by default.
    else:
        # Use the first character set encoding listed. (It's often the only one.)
        charset = charsets[0]
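elif sys.version_info[0] == 2:  # Python 2
    # getparam('charset') reads the charset parameter of the Content-Type header.
    # (getencoding() returns the Content-Transfer-Encoding, such as '7bit', not the character set.)
    charset = responseObj.headers.getparam('charset')
    if charset is None:
        # Character set encoding could not be determined.
        charset = 'utf-8'  # Use the utf-8 encoding by default.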

content = responseObj.read().decode(charset)
print(content)  # The HTML of the web page.
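Both branches above are looking for the charset parameter of the Content-Type header. As a minimal sketch, that lookup can be pulled into a helper that takes the header value as a plain string, which makes it easy to test without a network connection. The function names charset_from_content_type and decode_body are hypothetical, and the fallback to utf-8 for unknown encodings is my assumption rather than anything required by the standard library.

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset named in a Content-Type header value, or `default`."""
    if not content_type:
        return default
    message = Message()
    message['Content-Type'] = content_type
    # get_content_charset() returns the lowercased charset parameter, or None.
    return message.get_content_charset() or default


def decode_body(raw_bytes, content_type, default='utf-8'):
    """Decode downloaded bytes using the charset from the Content-Type header."""
    charset = charset_from_content_type(content_type, default)
    try:
        return raw_bytes.decode(charset)
    except LookupError:
        # The server named an encoding Python does not know; fall back.
        return raw_bytes.decode(default, 'replace')
```

With the response object from the listing above, this could be called as decode_body(responseObj.read(), responseObj.headers.get('Content-Type')).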

And here is code for downloading a binary file (such as a .png image or .zip file) from a URL:

import sys

if sys.version_info[0] == 3:  # Python 3
    from urllib.request import Request, urlopen
elif sys.version_info[0] == 2:  # Python 2
    from urllib2 import Request, urlopen

# Replace https://inventwithpython.com/images/cover_automate2_thumb.jpg with the file you want to download.
url = 'https://inventwithpython.com/images/cover_automate2_thumb.jpg'

# Supply a user-agent header of a common browser, since some web servers will refuse to reply to scripts without one.
requestObj = Request(url, headers={'User-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0'})
responseObj = urlopen(requestObj)

content = responseObj.read()
# Use the last part of the URL as the local filename, or replace this with any filename you want:
filename = url.split('/')[-1]
with open(filename, 'wb') as fileObj:
    fileObj.write(content)
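The code above reads the whole file into memory before writing it, and url.split('/')[-1] would include any query string in the filename. For large files or messier URLs, a variation along these lines may help. This is only a sketch: filename_from_url and save_stream are hypothetical helper names, and the 'download' fallback name and 64 KB chunk size are arbitrary choices.

```python
import posixpath
import shutil

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2


def filename_from_url(url, default='download'):
    """Return the last path component of `url`, ignoring any query string."""
    name = posixpath.basename(urlsplit(url).path)
    return name or default


def save_stream(source, filename, chunk_size=64 * 1024):
    """Copy a file-like object to `filename` in chunks of `chunk_size` bytes."""
    with open(filename, 'wb') as file_obj:
        shutil.copyfileobj(source, file_obj, chunk_size)
```

Since the object returned by urlopen() is file-like, something like save_stream(urlopen(requestObj), filename_from_url(url)) should write the download to disk a chunk at a time.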

In Python 2, urllib was the original downloading module in the standard library (added in Python 1.2), and urllib2, added in Python 1.6, offered additional features. Python 3 reorganized both into a single package called urllib, with submodules such as urllib.request, which is why the Python 3 branches above import from urllib.request. There are also third-party modules named urllib3 and requests (which uses urllib3), but these aren't in the Python standard library, nor will they be added to it.