Text Recognition in Python with pytesseract
Mon 11 January 2021 Al Sweigart
Extracting text as string values from images is called optical character recognition (OCR) or simply text recognition. This blog post tells you how to run the Tesseract OCR engine from Python. For example, if you have the following image stored in diploma_legal_notes.png, you can run OCR over it to extract the string of text.
' \n\n \n\nCLASS OF 2019!\n\nYOUR DIPLOMA GRANTS YOU MANY NEW\nPOWERS AND PRIVILEGES. THESE INCLUDE:\n\n+ YOU MAY NOW LEGALLY PERFORM\nMARRIAGES AND ARREST PEOPLE.\n\n+ IF YOU HAVE YOUR DIPLOMA WITH YOU,\nYOU CAN USE. GROCERY STORE EXPRESS\nLANES WITH ANY NUMBER OF ITEMS.\n\n+ ALL GRADUATES ARE ENTITLED To\nDELETE ONE WORD OF THEIR CHOICE.\nFROM THE OXFORD ENGUSH DICTIONARY.\n+ THE UNIVERSITY WILL MAIL YOU YOUR’\nWORKING LIGHTSABER WITHIN 6-8 WEEKS.\n* YOU CAN SEND MAIL WITHOUT STAMPS.\n+ YOU HAVE EARNED THE RIGHT TO\nCHALLENGE THE BRITISH ROYAL FAMILY\nTO TRIAL BY COMBAT. IF YOU DEFEAT\nTHEM ALL, THE THRONE |S YOURS.\n\n* YOU MAY NOL IGNORE. "DO NOT PET”\nWARNINGS ON AIRPORT SECURITY DOGS.\n\n= CONGRATULATIONS,\n\n \n\n \n\x0c'
Notice that the text recognition isn't quite perfect. The "CONGRATULATIONS" at the top is mistakenly placed at the bottom, the word "NOW" is recognized as "NOL", the word "IS" has a vertical pipe character "|" instead of a capital "I", and there seems to be some confusing punctuation marks added in. Cleaning up your OCR scans depends on your particular project. This blog post will simply show you the code to do OCR from Python.
Installing Tesseract
First, run pip install pytesseract
. The pytesseract package is a Python wrapper for the Tesseract OCR engine. If you need help running pip, see A Quick Pip Guide or What Is Pip? A Guide for New Pythonistas.
At this point, if you tried to use the pytesseract
module, you'd get a TesseractNotFoundError
message that says, tesseract is not installed or it's not in your path
. (You can learn about the PATH environment variable in Chapter 2 of my free book, Beyond the Basic Stuff with Python.) Let's install the Tesseract OCR engine itself next.
On Windows, you can download the installer for version 5.0.0 of Tesseract and run the installer. (To get the latest version of Tesseract, go to the Tesseract at UB Mannheim web page.)
On macOS, according to this article, you can install Tesseract with Brew by opening a Terminal window and running brew install tesseract --all-languages
. You can also install Tesseract on macOS with MacPorts by running sudo port install Tesseract
, then run sudo port install tesseract-eng
to install the English language. For other languages, use the language codes listed in this link.
On Linux, Tesseract may already be installed. If it isn't, according to this article, you can run the following:
On Ubuntu, run sudo apt-get install tesseract-ocr
and then sudo apt-get install tesseract-ocr-all
to install all languages. You also need to install ImageMagick for its convert
program by running sudo apt-get install imagemagick
.
On Fedora, run sudo dnf install tesseract
On Manjaro, run sudo pacman -Syu tesseract
Installing Pillow
The pytesseract module also requires the Pillow module for Python. You can install with pip by running pip install pillow
on Windows or pip3 install pillow
on macOS and Linux.
Running Tesseract from Python
To extract text from an image file named image.png, run the following code:
import pytesseract as tess from PIL import Image img = Image.open('image.png') text = tess.image_to_string(img) print(text)
The recognized text in the image is returned as a string value from image_to_string()
.
Additional Resources
To see an example of all of these steps, check out this YouTube video: Image to Text with pytesseract