Mon 11 January 2021

Text Recognition in Python with pytesseract

Extracting text as string values from images is called optical character recognition (OCR) or simply text recognition. This blog post tells you how to run the Tesseract OCR engine from Python. For example, if you have the following image stored in diploma_legal_notes.png, you can run OCR over it to extract the string of text.

' \n\n \n\nCLASS OF 2019!\n\nYOUR DIPLOMA GRANTS YOU MANY NEW\nPOWERS AND PRIVILEGES. THESE INCLUDE:\n\n+ YOU MAY NOW LEGALLY PERFORM\nMARRIAGES AND ARREST PEOPLE.\n\n+ IF YOU HAVE YOUR DIPLOMA WITH YOU,\nYOU CAN USE. GROCERY STORE EXPRESS\nLANES WITH ANY NUMBER OF ITEMS.\n\n+ ALL GRADUATES ARE ENTITLED To\nDELETE ONE WORD OF THEIR CHOICE.\nFROM THE OXFORD ENGUSH DICTIONARY.\n+ THE UNIVERSITY WILL MAIL YOU YOUR’\nWORKING LIGHTSABER WITHIN 6-8 WEEKS.\n* YOU CAN SEND MAIL WITHOUT STAMPS.\n+ YOU HAVE EARNED THE RIGHT TO\nCHALLENGE THE BRITISH ROYAL FAMILY\nTO TRIAL BY COMBAT. IF YOU DEFEAT\nTHEM ALL, THE THRONE |S YOURS.\n\n* YOU MAY NOL IGNORE. "DO NOT PET”\nWARNINGS ON AIRPORT SECURITY DOGS.\n\n= CONGRATULATIONS,\n\n \n\n \n\x0c'

Notice that the text recognition isn't quite perfect. The "CONGRATULATIONS" at the top is mistakenly placed at the bottom, the word "NOW" is recognized as "NOL", the word "IS" has a vertical pipe character "|" instead of a capital "I", and there seems to be some confusing punctuation marks added in. Cleaning up your OCR scans depends on your particular project. This blog post will simply show you the code to do OCR from Python.

Installing Tesseract

First, run pip install pytesseract. The pytesseract package is a Python wrapper for the Tesseract OCR engine. If you need help running pip, see A Quick Pip Guide or What Is Pip? A Guide for New Pythonistas.

At this point, if you tried to use the pytesseract module, you'd get a TesseractNotFoundError message that says, tesseract is not installed or it's not in your path. (You can learn about the PATH environment variable in Chapter 2 of my free book, Beyond the Basic Stuff with Python.) Let's install the Tesseract OCR engine itself next.

On Windows, you can download the installer for version 5.0.0 of Tesseract and run the installer. (To get the latest version of Tesseract, go to the Tesseract at UB Mannheim web page.)

On macOS, according to this article, you can install Tesseract with Brew by opening a Terminal window and running brew install tesseract --all-languages. You can also install Tesseract on macOS with MacPorts by running sudo port install Tesseract, then run sudo port install tesseract-eng to install the English language. For other languages, use the language codes listed in this link.

On Linux, Tesseract may already be installed. If it isn't, according to this article, you can run the following:

On Ubuntu, run sudo apt-get install tesseract-ocr and then sudo apt-get install tesseract-ocr-all to install all languages. You also need to install ImageMagick for its convert program by running sudo apt-get install imagemagick.

On Fedora, run sudo dnf install tesseract

On Manjaro, run sudo pacman -Syu tesseract

Installing Pillow

The pytesseract module also requires the Pillow module for Python. You can install with pip by running pip install pillow on Windows or pip3 install pillow on macOS and Linux.

Running Tesseract from Python

To extract text from an image file named image.png, run the following code:

import pytesseract as tess
from PIL import Image
img = Image.open('image.png')
text = tess.image_to_string(img)
print(text)

The recognized text in the image is returned as a string value from image_to_string().

Additional Resources

To see an example of all of these steps, check out this YouTube video: Image to Text with pytesseract