While the Pillow package from the previous chapter can easily create images containing text, extracting text from images is an advanced topic in computer science. Fortunately, the PyTesseract package handles the details of machine learning and image processing for you. With a small amount of preparation, your programs can convert screenshots and scanned documents into text strings using just a few lines of code.
Practice Questions

These questions test your knowledge of PyTesseract and the NAPS2 scanner application.
To work with PyTesseract, you must install the free Tesseract optical character recognition (OCR) engine software on your Windows, macOS, or Linux computer. You can also choose to install language packs for non-English languages. Then, you must install the PyTesseract package so that your Python scripts can interact with Tesseract.
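One setup detail worth knowing: if the Tesseract executable isn't on your PATH, PyTesseract can't find it, and you may need to point PyTesseract at it yourself. Here's a sketch (the path shown is a common Windows default, not a guarantee):

import pytesseract

# Tell PyTesseract where the Tesseract executable lives if it isn't on
# your PATH. This is a common Windows default path; adjust it as needed.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'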
Using PyTesseract and the Pillow image library, you can extract text from an image in four lines of code. OCR has limitations, however, and you need to understand what kinds of images are suitable for it.
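Those four lines look something like this minimal sketch ('example.png' is a placeholder filename):

import pytesseract
from PIL import Image

img = Image.open('example.png')  # 'example.png' is a placeholder filename.
text = pytesseract.image_to_string(img)  # The extracted text, as a string.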
5. Does PyTesseract require Pillow to be installed?
6. What PyTesseract function takes an Image object argument and returns a string of the text in that image?
7. Can PyTesseract identify fonts, font sizes, and font colors?
8. Can PyTesseract extract text from a scanned document of typed text?
9. Can PyTesseract extract text from a scanned document of handwritten text?
10. Can PyTesseract extract the text of a license plate from a photo of a car?
11. In general, will PyTesseract preserve the layout of the source text, such as hyphenated words broken across lines?
12. In general, how reliable are LLMs at cleaning up the extracted text from PyTesseract?
13. Can you usually use a spellchecker to identify incorrectly extracted words in PyTesseract's output?
14. What about identifying incorrectly extracted numbers?
Tesseract can extract text in languages other than English if you install additional language packs and then specify which language PyTesseract should recognize.
15. Tesseract identifies characters of what language by default?
16. How can you view a list of all the languages that Tesseract supports?
17. What keyword argument would you pass to make the image_to_string() function recognize Japanese characters?
18. What happens if you don’t pass this keyword argument to image_to_string() while passing it an image of Japanese characters?
19. What keyword argument would you pass to make the image_to_string() function recognize English and Japanese characters in the same document?
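For reference, here's a sketch of how the language setting looks in code (this assumes you've already installed the Japanese language pack; 'jpn' and 'eng+jpn' are standard Tesseract language codes, and the filename is a placeholder):

import pytesseract
from PIL import Image

img = Image.open('japanese_scan.png')  # Placeholder filename.
print(pytesseract.get_languages())  # Lists the installed language packs.
textJpn = pytesseract.image_to_string(img, lang='jpn')  # Japanese only.
textBoth = pytesseract.image_to_string(img, lang='eng+jpn')  # English and Japanese.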
A common use case for OCR is creating PDF documents of scanned images with searchable text. Although there are apps to do this, they often don’t offer the flexibility needed to automate PDF generation for hundreds or thousands of images. For tasks like these, you can use the open source Not Another PDF Scanner 2 (NAPS2) application, which runs Tesseract and adds text to PDF documents.
20. How much does the NAPS2 app cost?
21. Which operating systems is the NAPS2 app available on?
22. What Python module allows you to run NAPS2 from your Python program?
23. What does the command line flag -i followed by frankenstein.png mean to the NAPS2 app?
24. What does the command line flag -o followed by output.pdf mean to the NAPS2 app?
25. If you already have the English language pack installed, what does the command line flag --install followed by ocr-eng do?
26. What command line flags would you pass to install the Japanese language pack for NAPS2?
27. What does the command line flag -n followed by 0 mean to the NAPS2 app?
28. What does the command line flag -i followed by page1.png;page2.png mean to the NAPS2 app?
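Putting those flags together, here's a minimal sketch of running NAPS2 from a Python script with the subprocess module (the executable path shown is a typical Windows install location, not a guarantee; adjust it for your system):

import subprocess

# Typical Windows install path for the NAPS2 console program; this
# path is an assumption, so adjust it to match your installation.
naps2 = r'C:\Program Files\NAPS2\NAPS2.Console.exe'

# Import frankenstein.png, perform zero new scans, and write the
# result to output.pdf.
subprocess.run([naps2, '-i', 'frankenstein.png', '-n', '0', '-o', 'output.pdf'])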
Practice Projects

In the following projects, you'll extract text from a collection of comic images in order to search them, and you'll automate resizing those images to improve OCR accuracy.
I like to download the images of various web comics and memes I find online. I have a large collection of these—so large that I have trouble finding specific ones. I can’t do a text search for the contents of images, but I could use PyTesseract to extract the text from the images, then search that text. It won’t be perfect, but it should work most of the time.
Create a program that runs PyTesseract on every .png image in the current working directory and creates a dictionary that maps the image filename to its extracted text. You can store this dictionary as JSON in a file named imageText.json so that you need to run the extraction program only once. Then, you can open the JSON file in any editor and ctrl-F for the text you are looking for.
You can download a selection of images to use from this book’s downloadable contents at https://nostarch.com/automate-workbook.
Save this program in a file named makeImageTextJSON.py.
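Here's a minimal sketch of one way this program could look (a sketch, not the only valid solution; it assumes the .png files are in the current working directory, as described above):

import json
from pathlib import Path

import pytesseract
from PIL import Image

imageText = {}
for pngFile in Path.cwd().glob('*.png'):
    # Map each image's filename to the text PyTesseract extracts from it.
    imageText[pngFile.name] = pytesseract.image_to_string(Image.open(pngFile))

# Save the dictionary as JSON so the slow extraction step runs only once.
with open('imageText.json', 'w', encoding='utf-8') as file:
    json.dump(imageText, file)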
Let’s extend the program from the previous practice project. Web comic images are usually smaller and simpler than high-resolution photos. Sometimes their text is too small for PyTesseract to accurately recognize. One trick you can try is increasing the size of the web comic image using Pillow and checking if this improves PyTesseract’s text recognition.
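A sketch of that enlarging trick looks like this ('comic.png' is a placeholder filename):

import pytesseract
from PIL import Image

img = Image.open('comic.png')  # 'comic.png' is a placeholder filename.
# resize() returns a new Image object at twice the original dimensions.
enlarged = img.resize((img.width * 2, img.height * 2))
text = pytesseract.image_to_string(enlarged)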
For example, when I run PyTesseract on the original image at https://xkcd.com/1968/, it returns the following string:
'fo SOMES AD NCED A UELOES SELRQURE\nEROUEE fb OMB, NGREEPARLE PRD REBELS AGRO\nee HUMAN CONTROL 22\n\n+ >\n\n2\nTHE PART LOTS OF PEOPLE\nSEEM To WORRY ABOUT\n\nTHE PART I WORRY ABOUT\n\n'
If I first use Pillow to double the size of the image and then run PyTesseract, it gives me more accurate results:
'Al BECOMES SELF-AWARE\nAND REBELS AGAINST\nHUMAN CONTROL\n\nA\nTHE PART LOTS OF PEOPLE\nSEEM To WORRY ABOUT\n\nTHE PART I WORRY ABOUT\n\n'
Many people misread my name, Al, as the term AI, but PyTesseract seems to make the opposite mistake in this case. Your machine may produce slightly different text depending on the versions of Tesseract and PyTesseract you have installed.
Create an updated version of the program from the previous practice project that automatically resizes the image to twice the original width and height and then performs OCR on the enlarged image. Save the dictionary mapping the enlarged filenames to the extracted text as JSON in a file named imageTextEnlarged.json. Compare the accuracy of the text in this file with that of the text in imageText.json from the previous project.
Save this program in a file named makeImageTextEnlargedJSON.py.