17 PDF AND WORD DOCUMENTS

The PyPDF and Python-Docx packages can read and write PDF and Word documents, respectively, saving you from needing to edit files yourself. If you learn how to use these packages, you’ll be able to automate document tasks efficiently by writing quick and accurate Python programs.

Practice Questions

The following questions test your ability to read and modify PDF and Word documents using Python.

PDF Documents

PDF stands for Portable Document Format and uses the .pdf file extension. Although PDFs support many features, the questions in this section will focus on three common tasks: extracting a document’s text content, extracting its images, and crafting new PDFs from existing documents.

1. What do you pass to the pypdf.PdfReader() function to open a PDF file?
2. Where can you find the individual Page objects of a PdfReader object?
3. Write the code for a function named get_num_pages() that accepts a PDF’s filename as a string and returns the number of pages it has.
4. Which method of Page objects extracts text from a PDF?
5. Write code that extracts the text of page 2 of a PDF file. Assume a variable named reader contains the PdfReader object.
6. Which pdfminer function extracts text from a PDF file, and which argument must you pass to this function?
7. How can you automatically clean up the extracted text strings from a PDF in a way that respects the context of the text?
8. Which pypdf function lets you create new PDF files?
9. Can the PyPDF package write arbitrary text to a PDF file in the same way that Python can write arbitrary text to a .txt file?
10. Can you rotate a page in a PDF by 45 degrees using pypdf or pdfminer?
11. Write code that creates a new file named rotated.pdf that has the contents of example.pdf, but rotated clockwise by 90 degrees.
12. What Page method allows you to add a watermark to a page?
13. What PdfWriter method adds a blank page to the end of a PDF document?
14. Write the code to insert a blank page as the new page 3 in a PDF document. Assume a variable named writer contains the PdfWriter object.
15. Which modern encryption algorithm do experts recommend you use to encrypt your PDF files?
16. Why would elephant be a poor password to use to encrypt your PDF files?
17. What two types of passwords do PDF files support for encryption?

Word Documents

Python can create and modify Microsoft Word documents, which have the .docx file extension, with the Python-Docx package. Compared to plaintext files, .docx files have many structural elements, which Python-Docx represents using three different data types. At the highest level, a Document object represents the entire document. The Document object contains a list of Paragraph objects for the paragraphs in the document. (A new paragraph begins whenever the user presses enter or return in a Word document.) Each of these Paragraph objects contains a list of one or more Run objects.

Answer the following questions about Word documents and the Python-Docx package. Where relevant, assume that a variable named doc stores the Document object.

18. Write code that opens a file named demo.docx and stores the Document object in a variable named doc.
19. What code would get a string value of the text in the second paragraph of a Document object?
20. What code would retrieve the number of paragraphs in a Document object?
21. True or false: Document objects contain Paragraph objects, which in turn contain Run objects.
22. True or false: To italicize some text in a paragraph and bold some other text in that same paragraph, you must set the bold and italic attributes of the Paragraph object to True.
23. Which of the following have a text attribute: Document objects, Paragraph objects, or Run objects?
24. To what three values can you set attributes of Run objects such as bold, italic, and strike, and what do these values mean?
25. What code adds a paragraph to a document with the text “Hello, world!” in the built-in Title style?
26. What kind of objects have the add_paragraph() method?
27. What kind of objects have the add_run() method?
28. Create a blank .docx document in either Microsoft 365 or another application. Then, open it with Python-Docx. How many Paragraph objects does this empty document contain? How many Run objects does this empty document contain?
29. Write a program that creates a Word document named millionstars.docx that has exactly one million asterisks; no more, no less.
30. Write a program that creates a Word document named countdown.docx that counts down from 1,000 to 0, with one number per paragraph.

Practice Projects

Each of these projects re-creates a feature that is already available in a PDF or word processing app, but implementing them as Python code lets you automatically process hundreds or thousands of documents.

PDF Document Word Counter

Write a function named pdf_word_count(pdf_filename) that opens the given PDF file, extracts the text from it, and returns a word count of the document. To calculate the word count, call the split() method on the text. Different PDF files and packages may produce different word counts, but a rough value suffices for the purposes of this project.
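As a starting point, the counting step alone might look like the following minimal sketch. It assumes the text has already been extracted into a string; the count_words() name and the sample string are hypothetical and only illustrate how split() behaves on uneven whitespace.

```python
def count_words(text):
    """Return a rough word count for an already-extracted string."""
    # With no arguments, split() breaks on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
    return len(text.split())


# Hypothetical sample standing in for text extracted from a PDF.
sample_text = "PDF stands for\nPortable   Document Format."
print(count_words(sample_text))
```

Because split() treats any run of whitespace as one separator, extra spaces and line breaks in the extracted text don't inflate the count.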

Searching All PDFs in a Folder

While PDF apps allow you to search for text in a PDF file with ctrl-F, most won’t allow you to search an entire folder of files all at once. Write a program that extracts the text from every PDF in a folder, searches for some given text, and returns each instance where it’s found.

Define a function named search_all_PDFs(text, folder='.', case_sensitive=False) that searches for the text string argument in PDF files in the folder named folder. The case_sensitive parameter should have a default value of False, but if passed True, the function should report only matches in the same case as text.

The function should return a list of strings formatted as 'In {filename} on page {page_number}'.
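The matching and formatting logic can be sketched separately from the PDF extraction. The following is one possible approach, not a full solution: the find_matches() helper and its pages_by_file dictionary are hypothetical, the page text is assumed to be already extracted, and reporting each matching page once with page numbers starting at 1 is an assumption as well.

```python
def find_matches(pages_by_file, text, case_sensitive=False):
    """Return 'In {filename} on page {page_number}' for each matching page.

    pages_by_file is a hypothetical dict mapping a filename to a list of
    page text strings that have already been extracted from that PDF.
    """
    results = []
    # For a case-insensitive search, compare lowercase copies of both strings.
    needle = text if case_sensitive else text.lower()
    for filename, pages in pages_by_file.items():
        # Assumption: page numbers are reported starting at 1.
        for page_number, page_text in enumerate(pages, start=1):
            haystack = page_text if case_sensitive else page_text.lower()
            if needle in haystack:
                results.append(f'In {filename} on page {page_number}')
    return results


# Hypothetical sample data standing in for extracted PDF text.
sample_pages = {
    'a.pdf': ['Hello world', 'nothing here'],
    'b.pdf': ['say HELLO'],
}
print(find_matches(sample_pages, 'hello'))
print(find_matches(sample_pages, 'hello', case_sensitive=True))
```

Keeping the search logic apart from the extraction step makes it easy to test the case_sensitive behavior with plain strings before involving any PDF files.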

Word Document Logger for Guess the Number

Your boss wants to see the output of the Guess the Number game in Chapter 3 of Automate the Boring Stuff with Python. They have the peculiar demand that the text be presented in a Word document. Their personal assistant will print the Word document, add it to a pile on their desk, and throw it away next week, unread.

Take the Guess the Number game and add code to it to generate a guessWordLog.docx file. You can find this source code in the downloadable resources link at https://nostarch.com/automate-boring-stuff-python-3rd-edition. After each print() function call in the original code, insert code that writes the printed text to the Word document as a new paragraph. For example, your code could contain something like this:

print('I am thinking of a number between 1 and 20.')
doc.add_paragraph('I am thinking of a number between 1 and 20.')

Follow every call to input() with code that adds the player’s input to the Word document as well. If the guessWordLog.docx file already exists, your program should add the new paragraphs to it, after the existing text.

Save this program in a file named guessWordLog.py.

Converting Text Files to Word Documents

Write a function named str_to_docx(text, word_filename). The text argument should be a multiline string of contents to write to a new Word document, while the word_filename argument should be a string representing the Word document’s filename. Each line in the multiline string should become its own Paragraph object in the Word file.

Next, write code for a program that calls str_to_docx() to create Word documents for every .txt file in the current working directory. The program should add the .docx extension to the end of a file, saving the contents of spam.txt as a file named spam.txt.docx, for example.

Save this program in a file named txt2docx.py.
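The file-handling half of this project needs only the standard library. The sketch below shows one way to find the .txt files and build the output names; the docx_name_for() and list_txt_files() helpers are hypothetical, and the call to str_to_docx() is left out because writing that function is the exercise.

```python
from pathlib import Path


def docx_name_for(txt_path):
    """Return the output path: the original name with .docx appended."""
    txt_path = Path(txt_path)
    # Append to the full name so spam.txt becomes spam.txt.docx,
    # rather than replacing the .txt extension.
    return txt_path.with_name(txt_path.name + '.docx')


def list_txt_files(folder='.'):
    """Return a sorted list of the .txt files directly inside folder."""
    return sorted(p for p in Path(folder).glob('*.txt') if p.is_file())


for txt_file in list_txt_files():
    print(txt_file, '->', docx_name_for(txt_file))
```

Each pair printed by the loop gives you the text file to read and the word_filename to pass along.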

Bolding Words in a Word Document

Write a function named bold_words(filename, word) that opens the Word document in the filename file and formats every occurrence of the string in word as bold text. The function shouldn’t modify the original document’s filename; instead, it should write the results to a file with .bold.docx appended to the end. For example, calling bold_words('demo.docx', 'hello') would create a demo.docx.bold.docx file in which every case-sensitive match of 'hello' has been bolded. The original Word document should remain the same.

For simplicity, you may assume that the original Word document has no styling in it and uses only the default font. Your bold_words() function should construct the new Word document by creating Paragraph objects with separate Run objects that each have the bold attribute set to True or False. For example, if demo.docx contained a single paragraph with the text “Say hello to Alice,” calling bold_words('demo.docx', 'hello') would create a Word document with one Paragraph object and three Run objects for the text 'Say ', 'hello', and ' to Alice'. The middle Run object containing 'hello' would be set to bold.

Save this program in a file named boldWords.py.
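Deciding where one run ends and the next begins is a plain string problem, so you can work it out before touching any Word files. The following is a minimal sketch of one approach; the split_for_bold() helper is hypothetical and returns (text, is_bold) pairs rather than real Run objects.

```python
def split_for_bold(paragraph_text, word):
    """Split text into (segment, is_bold) pairs around case-sensitive matches."""
    if not word:
        # An empty search string matches nothing; avoid looping forever.
        return [(paragraph_text, False)] if paragraph_text else []
    segments = []
    start = 0
    while True:
        index = paragraph_text.find(word, start)
        if index == -1:
            break
        if index > start:
            # Text between the previous match and this one stays unbolded.
            segments.append((paragraph_text[start:index], False))
        segments.append((word, True))
        start = index + len(word)
    if start < len(paragraph_text):
        # Whatever follows the last match is an unbolded segment.
        segments.append((paragraph_text[start:], False))
    return segments


print(split_for_bold('Say hello to Alice', 'hello'))
```

Each pair could then become one Run object in the new paragraph, with the second value assigned to that run's bold attribute.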