24
TEXT-TO-SPEECH AND SPEECH RECOGNITION ENGINES

Python’s powerful libraries for working with audio enable you to automate tasks involving both text-to-speech and speech recognition. Using the pyttsx3 package, your programs can convert text into the spoken word and generate audio files. By contrast, the Whisper speech recognition package can transcribe spoken language from audio files into text strings.

LEARNING OBJECTIVES

  • Produce audio files of speech based on arbitrary string values or text files.
  • Know the settings and limitations of pyttsx3’s text-to-speech capabilities.
  • Install Whisper and perform speech recognition on your local computer with Whisper’s different training models.
  • Create subtitles from audio and video files with timestamps that match the words spoken.
  • Download video files from YouTube and other video websites with the yt-dlp package.

Practice Questions

The following questions test your ability to work with the pyttsx3 and Whisper packages to automate tasks like generating audio feedback, transcribing voice memos, or integrating speech capabilities into your Python projects.

Text-to-Speech Engine

Producing a computerized voice is a complex topic in computer science, so the pyttsx3 third-party package uses your operating system’s built-in text-to-speech engine: Microsoft Speech API (SAPI5) on Windows, NSSpeechSynthesizer on macOS, and eSpeak on Linux.
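The basic pyttsx3 workflow looks something like the following sketch. (This assumes pyttsx3 is installed and your operating system has a working speech engine; the property values shown are illustrative.)

```python
import pyttsx3

engine = pyttsx3.init()  # Initialize the OS's text-to-speech engine.

# The three available properties: rate (words per minute), volume
# (a float from 0.0 to 1.0), and voice.
engine.setProperty('rate', 150)
engine.setProperty('volume', 1.0)

# Queue up speech; nothing happens until runAndWait() is called:
engine.say('Hello. How are you doing?')

# Queue saving speech to a WAV file instead of the speakers:
engine.save_to_file('Is it raining today?', 'raining.wav')

engine.runAndWait()  # Play and save everything that was queued.
```

Note that say() and save_to_file() only queue their work; the runAndWait() call is what actually produces the audio.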

  1. What does the tts in pyttsx3 stand for?

  2. Does pyttsx3 require an online service to work?

  3. How does pyttsx3 produce speech on Windows, macOS, and Linux?

  4. After you’ve imported the pyttsx3 module, how do you initialize the text-to-speech engine?

  5. If you call engine.say('Hello. How are you doing?'), does the computer say anything?

  6. In what audio file format does pyttsx3 save its audio?

  7. What are the three properties that pyttsx3 makes available?

  8. What does engine.setProperty('rate', 300) do?

  9. What does engine.setProperty('volume', 2.0) do?

  10. Write code that could save the audio of “Is it raining today?” to an audio file named raining.wav. (You can ignore the required runAndWait() call.)

  11. What code creates a hello.wav file of “Hello. How are you doing?” (You can ignore the required runAndWait() call.)

  12. Does the voice that speaks your text sound the same across Windows, macOS, and Linux?

Speech Recognition

Whisper is a speech recognition system that can recognize multiple languages. Given an audio or video file, Whisper can return the speech as text in a Python string. It also returns the start and end times for groups of words, which you can use to generate subtitle files.
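A minimal transcription sketch looks like this, assuming the package is installed (with pip install openai-whisper) and that input.mp3 is a stand-in for your own audio file:

```python
import whisper

# Load one of Whisper's models; larger models are slower but more accurate.
model = whisper.load_model('base')

# Transcribe; pass language= to skip automatic language detection.
result = model.transcribe('input.mp3', language='English')

print(result['text'])         # The full transcription as one string.
print(result['segments'][0])  # Start/end times for the first group of words.
```

The returned dictionary's 'segments' list holds the per-phrase timing data used to build subtitle files.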

  13. What is the correct package name to use when installing Whisper with the pip tool?

  14. What function must you call after importing the whisper module but before supplying the audio filename to transcribe?

  15. What are the string values of the five models that Whisper provides?

  16. Between the tiny model and the large-v3 model, which uses less of the computer’s memory?

  17. Between the tiny model and the large-v3 model, which transcribes audio more quickly?

  18. Between the tiny model and the large-v3 model, which transcribes audio more accurately?

  19. What is the recommended model to use for most transcriptions?

  20. Write code that transcribes the English speech in an audio file named input.mp3. (Assume you’ve imported Whisper and loaded a model.)

  21. Write code that transcribes the Spanish speech in an audio file named input.mp3. (Assume you’ve imported Whisper and loaded a model.)

  22. Does Whisper insert punctuation into the text it transcribes?

  23. What two subtitle text file formats does Whisper produce? What are their file extensions?

  24. Say that the dictionary returned by model.transcribe() is stored in a variable named result. What two lines of code would write a subtitle file named podcast.srt to the current working directory?

  25. If your computer has an Intel or Apple brand of GPU, can you make Whisper use the GPU to do speech recognition?

  26. What code loads the “base” model and uses the GPU to perform speech recognition?

Creating Subtitle Files

In addition to the transcribed audio, Whisper’s results dictionary contains timing information that identifies the text’s location in the audio file. You can use this text and timing data to generate subtitle files that other software can ingest.
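Whisper ships writer helpers (whisper.utils.get_writer) that produce these subtitle files for you, so you rarely build them by hand. Still, the SRT layout itself is simple enough to sketch in pure Python. The segment dictionaries below are hypothetical sample data shaped like the entries in Whisper’s result['segments']:

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f'{hours:02}:{minutes:02}:{secs:02},{millis:03}'

def segments_to_srt(segments):
    """Build SRT text from segments shaped like Whisper's result['segments']."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        blocks.append(
            f"{number}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return '\n'.join(blocks)

print(segments_to_srt([
    {'start': 0.0, 'end': 2.5, 'text': ' Knock knock.'},
    {'start': 2.5, 'end': 4.0, 'text': " Who's there?"},
]))
```

Each numbered block pairs a start-and-end timestamp range with the text spoken during it, which is exactly the information the questions below ask about.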

  27. The .srt and .vtt files produced by Whisper are plaintext file formats. What information do these files contain?

  28. What does SRT stand for?

  29. What does VTT stand for?

  30. In addition to .srt and .vtt files, what other kinds of files is Whisper capable of producing?

  31. Say the variable result contains the value returned from model.transcribe('audio.wav'). What code produces a subtitle file named subtitles.srt?

  32. What are the column headings in the TSV-formatted subtitles that Whisper produces?

Downloading Videos from Websites

Video websites such as YouTube often don’t make it easy to download their content. The yt-dlp module allows Python scripts to download videos from YouTube and hundreds of other video websites so that you can watch them offline.
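A basic download looks something like this sketch, which assumes the package is installed (with pip install yt-dlp) and uses the example URL from the questions below:

```python
import yt_dlp  # The package installs as yt-dlp but imports as yt_dlp.

url = 'https://www.youtube.com/watch?v=kSrnLbioN6w'
with yt_dlp.YoutubeDL() as ydl:
    # Fetch the video's metadata without downloading it:
    info = ydl.extract_info(url, download=False)
    print(info['title'], info['duration'])

    # Download the video to the current working directory:
    ydl.download([url])
```

By default the downloaded file is named after the video's title.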

  33. What is the module name of the yt-dlp package you must use in import statements? (It’s not “yt-dlp.”)

  34. Write the Python code to download the video at https://www.youtube.com/watch?v=kSrnLbioN6w.

  35. How is the filename of the downloaded video selected by default?

  36. What kind of data does a .m4a file contain?

  37. What method returns a video’s title, duration, channel name, and other metadata?

Practice Projects

Write knock-knock jokes, make your computer sing, and create a word search for podcasts.

Knock-Knock Jokes

Write a program that uses pyttsx3 to tell a knock-knock joke using two different voices. Here’s an example joke you could use:

VOICE 1: “Knock knock.”

VOICE 2: “Who’s there?”

VOICE 1: “Lettuce.”

VOICE 2: “Lettuce who?”

VOICE 1: “Lettuce in, it’s cold out here!”

You’ll need to set the 'voice' property before calling say() and runAndWait() for each line of the joke.
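Switching voices follows a pattern like this sketch; which voices are installed depends on your operating system, and this assumes at least two are available:

```python
import pyttsx3

engine = pyttsx3.init()
voices = engine.getProperty('voices')  # The voices installed on this OS.

for line, voice in [('Knock knock.', voices[0]),
                    ("Who's there?", voices[1])]:
    engine.setProperty('voice', voice.id)  # Switch voices before say().
    engine.say(line)
    engine.runAndWait()
```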

Save this program in a file named sayKnockKnock.py.

12 Days of Christmas

While a text-to-speech package like pyttsx3 can make your computer talk, it can’t make your computer sing. We’ll forgive that deficiency for this project, though.

Write a program that sings the carol “The 12 Days of Christmas.” This is an example of a cumulative song; the first verse is “On the first day of Christmas, my true love gave to me a partridge in a pear tree.” The second verse builds on top of this: “On the second day of Christmas, my true love gave to me two turtle doves and a partridge in a pear tree.”

This pattern continues for 12 days. In total, the song comprises 90 lines, but your program can be much shorter. Rather than typing the song’s full lyrics, you should generate the verses with code. Use the following lists in your program:

days = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh',
        'eighth', 'ninth', 'tenth', 'eleventh', 'twelfth']

verses = ['And a partridge in a pear tree.', 'Two turtle doves,',
          'Three French hens,', 'Four calling birds,', 'Five gold rings,',
          'Six geese a-laying,', 'Seven swans a-swimming,',
          'Eight maids a-milking,', 'Nine ladies dancing,',
          'Ten lords a-leaping,', 'Eleven pipers piping,',
          'Twelve drummers drumming,']

Your program should both print the verses to the screen and make pyttsx3 speak them out loud. Place a time.sleep(2) call at the end of each day’s verses to pause the program before it continues to the next day.

Note that the first day’s verse is “A partridge in a pear tree,” while the subsequent days use “And a partridge in a pear tree.” Feel free to hardcode the verse for the first day and then automatically generate the verses beginning on the second day.

Podcast Word Search

Say you want to find every instance of a particular word being spoken in a podcast. Podcasts can be over an hour long, and this task would require you to listen to the full thing. You could play the podcast at double speed to make the process faster, but you might miss occurrences of the word you’re searching for.

The srt module available at https://pypi.org/project/srt/ can parse SRT files. Review this module’s documentation, then install it. Next, create a function named find_in_audio(audio_filename, search_word) that takes two string arguments: the podcast filename and the word to search for in that podcast.

The function should use Whisper to create a .srt subtitle file of the words in the podcast audio file. Then, the function should use the srt module to parse the subtitle objects and locate instances of the search word argument. For example, the following function call would find every instance of the word amino spoken in an audio file named DNA_lecture.mp3:

find_in_audio('DNA_lecture.mp3', 'amino')

The function should return a list of starting timestamps for each instance. The srt module uses timedelta objects for these timestamps, but your function should convert them to strings before putting them in the returned list. For example, if the word amino is spoken six times in the audio file, the return value could look something like this:

['0:00:37.792000', '0:00:42.332000', '0:01:37.389000', '0:02:45.497000',
'0:05:55.576000', '0:07:41.252000']
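Timestamp strings in that 'H:MM:SS.ffffff' form are simply what str() produces for a datetime.timedelta object, which is the type the srt module uses for a subtitle's start time:

```python
from datetime import timedelta

# The srt module stores each subtitle's start time as a timedelta;
# calling str() on one yields the form shown above.
start = timedelta(seconds=37, microseconds=792000)
print(str(start))  # '0:00:37.792000'
```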

Because transcribing the audio and creating the subtitle file is the computationally expensive part of this function, have your function check whether this file already exists before transcribing the audio file. If it already exists, skip the transcription and simply search this subtitle file. Give the .srt file the same name as the audio file. For example, passing the argument 'DNA_lecture.mp3' should create a subtitle file named DNA_lecture.srt.

Here is a template for a possible solution, if you wish to use it:

import whisper, srt, os

def find_in_audio(audio_filename, search_word):
    # Convert search_word to lowercase for case-insensitive matching:
    # INSERT CODE HERE.
    # Check if the subtitle file already exists:
    if not os.path.exists(audio_filename[:-4] + '.srt'):
        # Transcribe the audio file:
        # INSERT CODE HERE.

        # Create the subtitle file:
        # INSERT CODE HERE.

    # Read in the text contents of the subtitle file:
    with open(audio_filename[:-4] + '.srt', encoding='utf-8') as file_obj:
        # INSERT CODE HERE.

    # Go through each subtitle and collect timestamps of matches:
    found_timestamps = []
    for subtitle in srt.parse(content):
        if search_word in subtitle.content.lower():
            # INSERT CODE HERE.

    # Return the list of timestamps:
    # INSERT CODE HERE.

print(find_in_audio('DNA_lecture.mp3', 'amino'))

You can download an example audio file from https://autbor.com/DNA_lecture.mp3 or use your own.