Extracting text as string values from images is called optical character recognition (OCR) or simply text recognition. This blog post tells you how to run the Tesseract OCR engine from Python. For example, if you have the following image stored in diploma_legal_notes.png, you can run OCR over it to extract the string of text.
' \n\n \n\nCLASS OF 2019!\n\nYOUR DIPLOMA GRANTS YOU MANY NEW\nPOWERS AND PRIVILEGES. THESE INCLUDE:\n\n+ YOU MAY NOW LEGALLY PERFORM\nMARRIAGES AND ARREST PEOPLE.\n\n+ IF YOU HAVE YOUR DIPLOMA WITH YOU,\nYOU CAN USE. GROCERY STORE EXPRESS\nLANES WITH ANY NUMBER OF ITEMS.\n\n+ ALL GRADUATES ARE ENTITLED To\nDELETE ONE WORD OF THEIR CHOICE.\nFROM THE OXFORD ENGUSH DICTIONARY.\n+ THE UNIVERSITY WILL MAIL YOU YOUR’\nWORKING LIGHTSABER WITHIN 6-8 WEEKS.\n* YOU CAN SEND MAIL WITHOUT STAMPS.\n+ YOU HAVE EARNED THE RIGHT TO\nCHALLENGE THE BRITISH ROYAL FAMILY\nTO TRIAL BY COMBAT. IF YOU DEFEAT\nTHEM ALL, THE THRONE |S YOURS.\n\n* YOU MAY NOL IGNORE. "DO NOT PET”\nWARNINGS ON AIRPORT SECURITY DOGS.\n\n= CONGRATULATIONS,\n\n \n\n \n\x0c'
Notice that the text recognition isn't quite perfect. The "CONGRATULATIONS" at the top is mistakenly placed at the bottom, the word "NOW" is recognized as "NOL", the word "IS" has a vertical pipe character "|" instead of a capital "I", and there seems to be some confusing punctuation marks added in. Cleaning up your OCR scans depends on your particular project. This blog post will simply show you the code to do OCR from Python.
pip install pytesseract. The pytesseract package is a Python wrapper for the Tesseract OCR engine. If you need help running pip, see A Quick Pip Guide or What Is Pip? A Guide for New Pythonistas.
At this point, if you tried to use the
pytesseract module, you'd get a
TesseractNotFoundError message that says,
tesseract is not installed or it's not in your path. (You can learn about the PATH environment variable in Chapter 2 of my free book, Beyond the Basic Stuff with Python.) Let's install the Tesseract OCR engine itself next.
On macOS, according to this article, you can install Tesseract with Brew by opening a Terminal window and running
brew install tesseract --all-languages. You can also install Tesseract on macOS with MacPorts by running
sudo port install Tesseract, then run
sudo port install tesseract-eng to install the English language. For other languages, use the language codes listed in this link.
On Linux, Tesseract may already be installed. If it isn't, according to this article, you can run the following:
On Ubuntu, run
sudo apt-get install tesseract-ocr and then
sudo apt-get install tesseract-ocr-all to install all languages. You also need to install ImageMagick for its
convert program by running
sudo apt-get install imagemagick.
On Fedora, run
sudo dnf install tesseract
On Manjaro, run
sudo pacman -Syu tesseract
The pytesseract module also requires the Pillow module for Python. You can install with pip by running
pip install pillow on Windows or
pip3 install pillow on macOS and Linux.
Running Tesseract from Python
To extract text from an image file named image.png, run the following code:
import pytesseract as tess from PIL import Image img = Image.open('image.png') text = tess.image_to_string(img) print(text)
The recognized text in the image is returned as a string value from
To see an example of all of these steps, check out this YouTube video: Image to Text with pytesseract