How PyAutoGUI is Used, According to Google Scholar
Posted by Al Sweigart in misc
PyAutoGUI is a GUI automation Python package I created and maintain. It allows your Python scripts to control the mouse and keyboard on Windows, macOS, and Linux. PyAutoGUI is featured in my book, Automate the Boring Stuff with Python (available for free online), and has gained a toehold of popularity. With the rise of RPA (robotic process automation, which has nothing to do with robots and as far as I can tell is just another term for scripts that control the mouse and keyboard), I've been meaning to bring PyAutoGUI up to mature standards of features and documentation. As a first step in that direction, I decided to take a look at academic use of PyAutoGUI according to Google Scholar search results and write up my findings in this blog post.
PyAutoGUI is meant to be used by non-professional developers, though I don't want it to fall into the trap of no-code/low-code tools by creating its own set of conventions. That is, I want PyAutoGUI to be accessible yet still recognizably "pythonic".
The Heart Team
Original link: Implementing PyautoGUI for Enhanced Heart Team Protocol Creation: Improving Efficiency in Cardiovascular Patient Management (2025)
Some German academics in the medical field used PyAutoGUI for automating data entry. I should make a note to emphasize that the canonical capitalization is "PyAutoGUI" but this isn't a critical detail.
PyautoGUI utilizes screen coordinates dependent on the respective monitor resolution, predefined to 1080x1920 pixels, and is designed for programmatically control the mouse and keyboard. The screen coordinate system is based on this resolution, with the origin point (0,0) in the top left corner and the maximum coordinate (1079,1919) at the bottom right corner. This fixed framework ensures consistency and precision for all interactions.
Yeah, this is unfortunately best practice for PyAutoGUI as it is now. I'd like PyAutoGUI to have features that make it less reliant on screen resolution and absolute coordinates, something where it can find application windows and interact with their menus and UI widgets directly instead of blindly clicking and typing into whatever field has focus.
Perhaps in the documentation I should advise storing XY coordinates in constants rather than hard-coding them in function calls. Or perhaps PyAutoGUI should have an internal means of storing these. Such a store could hold not just XY coordinates, but also the RGB color of the pixel to click on, pausing the script if the pixel at those coordinates doesn't match the expected color.
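A minimal sketch of what that documentation advice could look like (the coordinates and color below are placeholders; pixelMatchesColor() is an existing PyAutoGUI function):

import pyautogui

# Name the coordinates and record the expected pixel color of the target.
SUBMIT_BUTTON_XY = (846, 588)       # placeholder coordinates
SUBMIT_BUTTON_RGB = (0, 120, 215)   # expected RGB color at that point

# Stop the script if the screen doesn't look the way we expect.
if pyautogui.pixelMatchesColor(*SUBMIT_BUTTON_XY, SUBMIT_BUTTON_RGB, tolerance=10):
    pyautogui.click(SUBMIT_BUTTON_XY)
else:
    raise RuntimeError('Submit button pixel mismatch; aborting.')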
The AS operates exclusively based on pre-defined coordinates – without recognizing or interpreting sensitive patient data, therefore, ensuring strict adherence to data security and ethical standards.
I assume AS stands for "automation software". The hacker in me realizes that this software could easily be coded to read in patient data. (The current best practice is to send Ctrl-A to Select All and Ctrl-C to copy the contents of a text field to the clipboard, then use pyperclip to read the clipboard's contents.) But there might be some security solace in a program that doesn't directly read in the whole of the sensitive data. This is a niche feature, but I'll keep it in mind.
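For reference, that clipboard-reading pattern looks like this (the coordinates are placeholders):

import pyautogui
import pyperclip

pyautogui.click(500, 300)           # focus the text field (placeholder coords)
pyautogui.hotkey('ctrl', 'a')       # Select All
pyautogui.hotkey('ctrl', 'c')       # Copy
field_contents = pyperclip.paste()  # read the copied text from the clipboard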
After completing this setup we conducted a coordinate mapping of the relevant fields, ensuring an alignment of the scrolling process of the mouse. Particular attention was given to synchronizing checkbox interactions and text field matching to establish a seamless flow between the source patient data and the target protocol fields.
Scrolling a window to get the UI widgets in place has always been a tedious and brittle process.
The automated process group demonstrated a mean protocol generation time of 151.8 seconds (SD, 12.66), while the human processing group exhibited a longer mean time of 193.6 seconds (SD, 80.33).
I assume the automated process group was that slow because PyAutoGUI ran while the humans were working, instead of running entirely on its own. I do want PyAutoGUI scripts to have some measure of interruptibility: being able to be paused, or to back up a few steps if things get out of hand. Perhaps a first step would be to change the fail-safe feature (which raises an exception if the mouse cursor is moved to one of the corners of the screen) so that it brings up a dialog window asking to quit rather than raising the exception. There's an idea.
What keeps me from just adding it is that it's a breaking change in behavior. Perhaps I should add some kind of environment variable check to see if this should be enabled.
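A rough sketch of how that opt-in could work today, using the existing FailSafeException and pyautogui.confirm() dialog (the PYAUTOGUI_FAILSAFE_DIALOG environment variable is made up for illustration):

import os
import pyautogui

def run_automation(task):
    try:
        task()
    except pyautogui.FailSafeException:
        if os.environ.get('PYAUTOGUI_FAILSAFE_DIALOG') != '1':
            raise  # current behavior: the exception propagates
        # Opt-in behavior: show a dialog instead of crashing outright.
        choice = pyautogui.confirm('Fail-safe triggered. Quit or resume?',
                                   buttons=['Quit', 'Resume'])
        if choice == 'Quit':
            raise
        run_automation(task)  # naive "resume": run the task again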
Automated Login on the Telegram Desktop App
Original link: Hybrid Desktop Automation Using Selenium and PyAutoGUI (2025) PDF link
Additional features include logging mechanisms for traceability and automated screenshot capturing for debugging and reporting.
This is something I've wanted to add to PyAutoGUI too: improved logging, including screenshot logging (especially so you can tell exactly where a mouse click happened).
Traditional tools like AutoIt, SikuliX, and WinAppDriver depend on static coordinates or image matching. Such methods are fragile and fail under UI changes or resolution mismatches.
Currently PyAutoGUI also has these shortcomings. This paper is about using PyAutoGUI in conjunction with Selenium (also covered in Chapter 13 of Automate the Boring Stuff with Python). I should look at Selenium, Playwright, and other web drivers for feature ideas for PyAutoGUI.
Selenium, although a web tool, has strong capabilities for timing, modularity, and control logic.
Yes. One reason I covered Playwright in the third edition of my Automate book is that it has features for detecting and waiting for UI elements to load before attempting to click on them. PyAutoGUI should have features to wait for expected UI conditions too. Right now, the best practice is unfortunately to throw in a time.sleep() call (or a pyautogui.sleep() call) and hope the UI is ready at the end of the wait.
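Until then, users can write a polling helper themselves. Here's a sketch of what a built-in version might look like (wait_for_image() is hypothetical, not a current PyAutoGUI function):

import time
import pyautogui

def wait_for_image(image_path, timeout=10, interval=0.5):
    # Poll the screen until the image appears or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            box = pyautogui.locateOnScreen(image_path)
        except pyautogui.ImageNotFoundException:
            box = None  # recent versions raise; older versions return None
        if box is not None:
            return box
        time.sleep(interval)
    raise TimeoutError(f'{image_path} did not appear within {timeout} seconds')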
Automating Data Entry for the US Federal Highway Administration's Traffic Noise Model Software
Original link: Automate TNM Input Process Using Python (2021)
Since 2004, we have been using TNM 2.5 to analyze the noise impacts from traffic. The decades-old Graphical User Interface (GUI) is inadequate to handle large scale projects with hundreds of receptors and roadways. TNM 3.0 has a vastly better user interface, but its import function is still buggy as of early 2021.
This is something I specifically built PyAutoGUI for: interacting with desktop applications that don't have APIs. I created a wrapper for the PosteRazor app (which makes multipage PDFs of images so you can print them out and tape them together to create large posters). PyAutoGUI should be designed so that making such wrappers is easy.
In addition to the batch file and input files, users are also required to take certain screenshots of TNM on their machine. These buttons include “New,” “Apply,” and the first row for coordinate inputs on the barrier input screen, and more. These screenshots are used by PyAutoGUI to find the location of these buttons on the screen. With differing screen resolutions, zoom levels, and TNM setups, this allows the script to dynamically find the buttons without requiring users to manually set anything up.
Yes, this is another feature that PyAutoGUI has: finding screenshotted UI elements on the screen. It can take up to a second to do a full-screen search. There's also an undocumented feature where OpenCV can be used for fuzzy matching. I need to look into making this a standard feature, as well as automatically adjusting for screen resolutions.
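For the record, the undocumented fuzzy matching works by passing a confidence argument to locateOnScreen(), which requires the opencv-python package to be installed ('new_button.png' is a placeholder filename):

import pyautogui

# With OpenCV installed, confidence=0.8 tolerates inexact pixel matches.
# (Recent versions raise pyautogui.ImageNotFoundException if nothing matches.)
location = pyautogui.locateOnScreen('new_button.png', confidence=0.8)
pyautogui.click(pyautogui.center(location))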
The only drawback is that the keyboard and mouse will be controlled by the program and the computer cannot be used during the time of automation. However, this automation process can be implemented during after-work hours when the noise analyst is not at their desk. Therefore, the benefits of using this program far outweigh the potential drawback.
Okay, this has convinced me: PyAutoGUI definitely needs features so that it can be interrupted and restarted. I've been meaning to look at Temporal, an open source workflow engine with a Python SDK that seems to do something like this.
This script can potentially be applied to TNM 3.0 if TNM 3.0 is backward compatible. The image matching technique from PyAutoGUI can also be applied to TNM 3.0 should any repetitive process be identified during the input process.
Earlier I mentioned how PyAutoGUI should advise storing XY coordinates in constants; this should also apply to the screenshot images for buttons, etc. In fact, it may be ideal to have some mechanism for storing image data as base64-encoded strings so they can be included in the .py Python script itself. I've always had a preference for single-file scripts because they can be easily shared as email attachments or pastebins. PyAutoGUI should have a feature to easily make base64 strings of images (and vice versa), preview these images, and adjust them for different screen resolutions.
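A sketch of how that round trip could work with today's tools (the base64 string is a placeholder; PyAutoGUI's locate functions already accept a Pillow Image object):

import base64
import io
import pyautogui
from PIL import Image

# One-time step: print the base64 string to paste into your script.
with open('button.png', 'rb') as f:
    print(base64.b64encode(f.read()).decode('ascii'))

# In the single-file script: rebuild the image and hand it to PyAutoGUI.
BUTTON_PNG_B64 = '...paste the base64 string here...'
button_image = Image.open(io.BytesIO(base64.b64decode(BUTTON_PNG_B64)))
center = pyautogui.locateCenterOnScreen(button_image)
if center is not None:
    pyautogui.click(center)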
Clicking on Links
Original link: An Automated Testing Tool Based on Graphical User Interface with Exploratory Behavioural Analysis (2022)
We have used the PyAutoGUI automation python library for controlling the mouse to click the numerous networks in the certain n URL [1]. By way of scrutinizing the subsequent screenshot using the OpenCV object recognition method for new discrepancies and GUI classification can be done by beyond engaging a DNN classifier.
Heh. It looks like they just use the click() function. Fair enough. I'm glad that doing simple things in PyAutoGUI is simple.
A Thesis on Test Automation Tools for Windows GUI Applications
Original link: Test automation for Windows GUI application (2023) PDF link
The thesis covers various aspects of test automation, including different automation methodologies, tools used in automation, implementation, and example of test automation in use. The goal of thesis is to create interface automation for Windows applications using Python and Robot Framework.
Robot Framework is another GUI automation tool, an old-school one from 2008 that is still used today. Another task I've given myself is to evaluate all the other GUI automation and RPA tools to compare and contrast them. Actually, this Wikipedia article is probably a good place to start: https://en.wikipedia.org/wiki/Comparison_of_GUI_testing_tools
(Huh. Why isn't PyAutoGUI listed there? The tools listed are more applications than programming-language libraries, but perhaps a productized form of PyAutoGUI would be a good idea to give it visibility in this space?)
The thesis in part looks at using PyAutoGUI to automate the Bittium Tough VoIP Call Center app for Windows. It looks like a call center phone app, with a GUI that looks like it was meant for tablets (mostly large buttons).
PyWinCtl is a library for Python that provides functionality for window manipulation in Windows and Linux environments. PyWinCtl is a fork of PyGetWindow but with cross platform support that the original library is missing. Original library is created by the same developer who created PyAutoGUI, and this makes it easy to use it in conjunction with PyAutoGUI to create more complex automation. Notable functionality that PyWinCtl provides are opening, closing, resizing, and moving windows into specific locations [18.]
Ack, my bad. I've been meaning to extend PyGetWindow to non-Windows platforms for years now. This thesis chose PyAutoGUI for its cross-platform support (Windows and Linux specifically), so I should bump this up in priority. Being able to run on any OS has always been an important feature; when I created PyAutoGUI, the other Python packages each only worked on one OS, or only worked for Python 2 or 3. PyAutoGUI has since dropped Python 2 support, but I should look at how it can be used on multiple platforms (including the mobile platforms Android and iOS).
The author mentions that PyWinAuto was passed up because, even though it can access menu items (PyAutoGUI cannot), it only runs on Windows, and Linux support was needed. This strengthens the case for making every feature cross-platform.
For the automation to be able to interact with GUI elements the application needs to be started. There are multiple ways to start an application, easiest way would be the use CMD or key combination of WIN+R to launch Run command. Inside Run command user can give .exe file name and windows will launch given program. However, in this use case the application requires .bat files to be ran at the start-up so the Run command method would not work. While these methods can be easier to create using similar process to (Code Block 4-1) the automation uses method more akin to an end-user’s method of interacting with the application. The automation starts the application using virtual keyboard to give key input of ‘WIN’ key and then typing the applications name to the search bar. After inputting applications name into the search filed automation presses ‘ENTER’ key and wait until the application can be seen by Windows and the automation. If there is a problem during the start-up like the application not starting after specified time an error is raised, and that test step is set as failure.
Oh wow, this is less than ideal. Launching the program from the Start menu is brittle and just begging for some system change to cause it to fail. Though this does give me an idea: why not have PyAutoGUI be able to run applications, like the subprocess module does? And not just executable programs, but scripts and batch files and Start menu items. PyAutoGUI should also be able to check whether the application is already running.
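A sketch of what that could look like built on today's pieces (the path and window title are placeholders; getWindowsWithTitle() is re-exported from PyGetWindow when it's available):

import subprocess
import pyautogui

APP_EXE = r'C:\CallCenter\callcenter.exe'  # placeholder path
APP_TITLE = 'Call Center'                  # placeholder window title

# Only launch the application if no window with its title already exists.
if not pyautogui.getWindowsWithTitle(APP_TITLE):
    subprocess.Popen([APP_EXE])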
On Windows, PyAutoGUI uses the Microsoft UI automation framework, while on Linux it uses the Xlib library. PyAutoGUI provides a high-level, easy-to-use API that abstracts away the platform- specific implementation details and provides a consistent interface for GUI automation. This makes it easy to develop tests that can be run on multiple platforms without having to worry about platform-specific implementation details. PyAutoGUI also provides several built-in functions for common GUI automation tasks, such as clicking, entering text, and capturing screenshots, making it a powerful library for GUI automation [15.]
Aww, I like reading flattery. But PyAutoGUI doesn't use Microsoft's UI Automation framework; it calls the Windows API directly. I think this was just misstated.
This code from the thesis (cleaned up a bit from the PDF, with the imports and the HTML image tag restored) may provide a clue as to how users want to use the screenshot feature:
import base64
import logging
import os

import pyautogui
from robot.libraries.BuiltIn import BuiltIn

def take_screenshot(filename=None, subfolder=None):
    try:
        default_directory = os.path.join(os.getcwd(), "robot_screenshots")
        if not os.path.exists(default_directory):
            os.mkdir(default_directory)
        if subfolder is not None:
            subfolder_path = os.path.join(default_directory, subfolder)
            if not os.path.exists(subfolder_path):
                os.mkdir(subfolder_path)
            default_directory = subfolder_path
        if filename is None:
            filename = pyautogui.datetime.datetime.now().strftime("%d-%m-%Y %H-%M-%S") + ".png"
        else:
            filename = pyautogui.datetime.datetime.now().strftime(f"{filename}_%d-%m-%Y %H-%M-%S.png")
        screenshot_path = os.path.join(default_directory, filename)
        pyautogui.screenshot(screenshot_path)
        with open(screenshot_path, "rb") as f:
            imgdata = base64.b64encode(f.read()).decode("utf-8")
        # Embed the screenshot as a base64 data URI in the HTML log.
        imgtag = f'<img src="data:image/png;base64,{imgdata}">'
        BuiltIn().log(f'Screenshot: {imgtag}', html=True)
        return screenshot_path
    except Exception as e:
        logging.error(f"Error while taking screenshot: {e}")
        return None
Ha! The author does something similar to the "base64-encode image data" idea I mentioned earlier, though they use it to create a single-file .html report. It's good to know my idea resonates with other people and isn't just in my own head.
The pyautogui.datetime.datetime.now() is interesting; I guess they assume datetime is part of PyAutoGUI instead of the Python standard library? (It only works because PyAutoGUI itself imports datetime, which then leaks through as a module attribute.) I should take a look at the __all__ setting in PyAutoGUI.
Hand Gesture Recognition for a Virtual Mouse
Paper: A Vision-Based Virtual Mouse Using Hand Gesture Recognition (2025)
Paper: Development of a Python-Driven Wireless Mouse with Integrated Gesture Recognition Technology
Paper: Application Controlling Using Hand Gestures Through Yolov5s (2024)
Paper: An AI-Powered Multi-Functional Text-Based Assistive System for People with Disabilities (2024)
Paper: Implementation of Contact free Human Computer Interface: Virtual Mouse (2025)
Paper: Hand Gesture Controlled Whiteboard Using OpenCv
Paper: Enhancing Accessibility: A Natural Human-Computer Interaction System Using Hand Gestures and Voice for Disabled Populations
Paper: Handcrafted AI: Designing Virtual Hardware for Hand Gesture-Based Interaction (2025)
It seems like PyAutoGUI was used for its click() and moveTo() functions in these. There are quite a few of these "virtual mouse with hand gesture recognition" papers.
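The common pattern in these papers is to map a fingertip position detected in the camera frame onto screen coordinates. A minimal sketch (the gesture-detection side is stubbed out as hypothetical arguments):

import pyautogui

screen_w, screen_h = pyautogui.size()

def on_frame(frame_w, frame_h, fingertip_x, fingertip_y, is_pinch):
    # Scale camera-frame coordinates up to screen coordinates.
    x = fingertip_x * screen_w / frame_w
    y = fingertip_y * screen_h / frame_h
    pyautogui.moveTo(x, y)
    if is_pinch:  # treat a pinch gesture as a click
        pyautogui.click()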
I'm always wary of these "software for the disabled" projects because oftentimes the engineers don't actually talk to the disabled people they are creating things for. You see this with the "we made a glove that interprets ASL and then speaks the words out loud" projects, even though anyone who knows ASL can tell you that there's so much more to ASL than just the hand movements. I find it odious enough that I don't want to directly link to the papers here.
Headless Mode
Original link: A comparison of rpa tool and python programming language for a bom digitalization project in automobile industry
The main library used for this digitalization study is the pyautogui library. This library can interact with other programs with keyboard and mouse control. Pyautogui is available for web and desktop applications. It is advantageous to use the pyautogui library as many operations such as saving screenshots, clicking different buttons, printing data, showing warning and message boxes, closing an application, resizing the page, moving the mouse to the desired location and clicking can be done with the pyautogui library. Because the library uses mouse and keyboard commands, the person who works on the computer screen cannot do any other operation. This feature of the library is a disadvantage for the user because it blocks the computer for the duration of the transaction. There are libraries that allow the user to do different operations on the computer while the tool is running, without blocking the mouse and keyboard of user's computer.
A headless mode for PyAutoGUI is far out of scope and very OS-dependent, but having interruptibility as a feature would be nice.
Gacha Game Cheating
Original link: Clicker Bot for Gacha Games Using Image Recognition
I've used PyAutoGUI to automate playing a "diner dash" style Flash game called Sushi Go Round after seeing someone else do so. I have the bot's source code online and you can watch a video of it playing.
The author used OpenCV to do some pre-processing of screenshots to figure out where to click to grind levels and farm resources in some gacha game (or, as I call them, casino games for kids). I don't approve of these wastes of time, so I approve of using PyAutoGUI to circumvent them.
I've been meaning to add DirectX support for Windows video games and Wayland support for Linux applications.
Agentic PyAutoGUI
Original link: ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows (2025) PDF link
The authors put an LLM in control of the computer by having it generate code that calls PyAutoGUI:
You are required to use `pyautogui` to perform the action grounded to the observation, but DO NOT use the `pyautogui.locateCenterOnScreen` function to locate the element you want to operate with since we have no image of the element you want to operate with. DO NOT USE `pyautogui.screenshot()` to make screenshot.
You can replace x, y in the code with the tag of elements you want to operate with, such as:

```
pyautogui.moveTo(tag_3)
pyautogui.click(tag_2)
pyautogui.dragTo(tag_1, button='left')
```
Not much else besides that, but it is interesting to know that people are using PyAutoGUI this way.
Bioinformatics Software Automation
Original link: Big Data Bot with a Special Reference to Bioinformatics
Seems like they were creating an automation wrapper for some bioinformatics software called MARVIN.
Different software and websites need different periods of time to respond, and since this time is not predictable and depends on many factors, such as the internet speed and server speed, this problem was resolved using the Python built-in function “pyautogui.getActiveWindowTitle()” from the PyAutoGUI library to prevent the execution of the next instruction until the data are available for the extraction.
Ah, once again having features that provide guide rails seems like it would be a good idea.
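The pattern the authors describe can be sketched with today's API (wait_for_window() is a hypothetical helper; getActiveWindowTitle() already exists in PyAutoGUI on supported platforms):

import time
import pyautogui

def wait_for_window(title_substring, timeout=30, interval=0.5):
    # Block until the active window's title contains the expected text.
    deadline = time.time() + timeout
    while time.time() < deadline:
        title = pyautogui.getActiveWindowTitle()
        if title and title_substring in title:
            return title
        time.sleep(interval)
    raise TimeoutError(f'No active window matching {title_substring!r}')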
Macros for Homeschooling Software Navigation
Original link: https://www.cs.odu.edu/~cpi/old/411/orangs21/assets/lab1/V2/Lab1_Sangwoo_V2.pdf (2021)
It seems like this was some pandemic-era homeschooling software. PyAutoGUI was used for "macros," though no other details were given.
Parents can record macros of navigation to course materials with PyAutoGUI.
I'm not sure how parents could have created macros unless they knew Python programming, but maybe the Little Learners software did that part for them? Not much detail is given in this paper.
Controlling Desktop Apps from Mobile
Original link: Android App Development Applied to Remote Tasks Simplification
If I understand this right, this is basically using your smartphone as a remote control for tasks to carry out on a desktop app. Interesting use case.
This paper aims to present a methodology that responds to identified needs of users allowing them to execute desktop tasks from mobile applications. Based on real-time database, picture-driven computing, task automation and mobile interactive systems the approach aims to reduce the challenges that users face while trying to perform a task, as well as improving the efficiency of task performance. The approach combines the features and capabilities of both PyAutoGUI and Firebase tools to simplify the process used by such users to perform tasks.
Unfortunately the paper itself is locked behind IEEE's paywall.
Conclusion
I didn't come away with many feature ideas I hadn't already considered, but this process has shed some light on how PyAutoGUI is being used, and I'm encouraged that my own feature ideas and priorities are on the mark. If you maintain an open source project, I advise searching for it on Google Scholar to see how it's being used in academia.