Carpe Diem - A Case Study of Text and Consistency in AI-Generated Images
Posted by Al Sweigart in misc
I'd like to think of myself as an AI realist: someone who admits that there are real uses for AI (specifically LLMs and image generators) while also pointing out the blatantly ridiculous lies and hype that characterize most of the industry and its reporting. As an example, in this blog post I describe my tests of ChatGPT 5's image generation capabilities, which are both incredible technological achievements and utterly useless for their claimed purpose. In particular, I wanted to focus on images that contain text, and on consistency of the subject across multiple images. What I found is that while there are impressive results, there continue to be what could be written off as "minor glitches" if it weren't for the fact that generative AI has been unable to solve them. It may be that these "minor glitches" are fundamentally unsolvable by generative AI. My conclusion: AI-generated imagery is great... as long as the details don't matter.
Summary:
- ChatGPT 5 can produce images with actual, legible text characters. However, it still has trouble keeping details consistent across multiple images.
A couple of years ago I played around with stable diffusion tools running locally on my gaming PC to generate art. The images they produced were... inconsistent. Funnily enough, their tendency to produce deformed and surreal images worked great for horror-themed images. There's a sort of dreamlike quality to them, in a "I'm back in high school taking finals but I haven't attended class all semester and now I'm going to fail" kind of way. In the dream, we don't stop to think about how unrealistic the situation is, and it's only after we wake up and start questioning it that we notice all the faults.
One problem is that "AI art" is a poor name. Stable diffusion is a machine learning algorithm; there's no "AI" thinking about the meaning behind your prompt before it begins drawing. A stable diffusion model doesn't "understand" the text prompts you give it. This is evident when you ask it to produce text: the algorithm doesn't know what words mean, but it can produce pictures that gesture at letter-like shapes and where you'd find them in an image. Here are some examples generated with DiffusionBee from the prompt "Movie poster":
Which is why I was surprised that ChatGPT 5 could create images with clear, perfectly legible text. I gave it the prompt: Create an image of an alphabet poster suitable for a classroom, showing the uppercase and lowercase letters in the correct order in a fun, colorful, cartoony font.
While the letters are now indeed letter-shaped, it's still clear that the system doesn't really understand language:
At first glance, this seems like the kind of alphabet poster you'd find in any kindergarten classroom. But on close inspection, well, I hope I don't have to point out why it's utterly wrong.
My guess is that ChatGPT 5 includes a separate subsystem that produces text characters along with their styles, orientations, and positions, and later digitally composites them into the AI-drawn image. I had much more success generating an SVG image. Scalable vector graphic "images" are actually written in a text-based language that describes individual lines and shapes, unlike regular bitmap images, which are two-dimensional grids of pixels. This makes SVGs great for illustrations, though you can't use them for photographic images. I gave ChatGPT the same alphabet poster prompt, but added that I wanted it as an SVG:
Here's the original SVG file: carpe-diem-alphabet.svg
Forcing the image to be an SVG means that this is more of a text-generation task than an image-generation one, so it doesn't surprise me that the actual text came out more accurate.
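To make concrete what "text-based" means here, this is a toy sketch of my own (not the file ChatGPT produced): in an SVG, each letter is just a `<text>` element with explicit coordinates, which helps explain why a text-generation model handles the format so much better than a grid of pixels:

```python
# A toy illustration of why SVG generation is "more text than image":
# each glyph is an explicit <text> element with coordinates, not pixels.
letters = "ABC"
parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="300" height="100">']
for i, ch in enumerate(letters):
    x = 20 + i * 90  # space the letters out horizontally
    parts.append(
        f'<text x="{x}" y="60" font-size="48" '
        f'font-family="sans-serif" fill="purple">{ch}</text>'
    )
parts.append("</svg>")
svg_markup = "\n".join(parts)
print(svg_markup)
```

Getting the alphabet right in this format is a matter of emitting the correct characters in the correct order, which is exactly the kind of task LLMs are trained for.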
Let's move on to a task that requires a photographic image that includes text. I gave ChatGPT 5 this prompt: Create an image of a t-shirt with the words "Carpe diem" on it, like something you'd see on an ecommerce shop.
It gave me a perfectly suitable image, one that looks almost exactly like the simple Photoshop job a human would produce from a stock image of a white t-shirt:
The text is perfect, and doesn't suffer from the dreamlike quality of stable diffusion text. But another longstanding problem of AI-generated imagery has been consistency across multiple images. For example, it can be hard or impossible to illustrate a children's book with AI art, since the characters are drawn differently from picture to picture.
I also wanted to check that it wasn't merely using text from a font, so I gave ChatGPT 5 this prompt: Make the text look like it was drawn by a child in crayon.
One detail immediately leapt out at me: the entire image wasn't being regenerated. The folds of the shirt in the new image matched the previous image exactly (albeit with a slight tint). This tells me that the system is isolating and remembering distinct objects so it can reuse them in future images. This is not stable diffusion generating images whole cloth; it is identifying the imagery of individual objects for reuse.
I continued with the theme of an online t-shirt ecommerce site. I asked ChatGPT: Give me an image that looks like a photo of a stock photography model wearing this shirt.
This looks acceptable: a photorealistic human figure against a plain backdrop, wearing the shirt. I'll pause a moment to take in how incredible this is. Ten years ago, this was only imaginable: we give a brief text description, and within a minute we have an image. This isn't an algorithm assembling a collage from existing photos; the image is completely unique and original.
It's incredible, but also completely unusable for our ecommerce site.
The shirt has changed. The colors of many of the letters are distinctly different. This sort of detail might not matter sometimes, but misrepresenting your product's appearance can have legal and reputational consequences.
I asked ChatGPT to fix it: Make sure the color of the text matches the color of the crayon drawing in the shirt-only image.
We see another great accomplishment: the person and their clothes are the same, down to the folds in the clothing and individual strands of hair. There is only a slight discoloration and a tiny change to the shirt near the text.
However, while the text has changed, it hasn't been fixed. It has simply adopted new wrong colors. We cannot maintain consistency with the design in the previous shirt-only picture.
Let's move on. I wanted to see how ChatGPT handled modification requests to the image: Make the shirt lime green instead of white.
Again, at first look, the AI has done the job. But the colors of the letters have changed yet again, and on closer inspection, even the shapes of the letters have changed.
Let's go for a larger change: Have the model wear a cat mask on her head.
Okay, this is weird, but it's my fault. Let's note a couple of differences. The subject has moved down to accommodate the cat head, and her facial expression has changed. When I layer the images on top of each other in a graphics app, I notice that everything has changed: hair strands, clothing folds, backdrop color. It seems the image has been regenerated entirely, yet it has retained remarkable consistency: it looks like the same person. The colors of the letters have changed slightly, again.
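For anyone who wants to replicate the layering check without a graphics app, here's a minimal sketch of the same idea in code. The tolerance value and the toy pixel data are my own, standing in for the actual ChatGPT outputs:

```python
# A minimal sketch of the "layer the images" check: given two equal-sized
# images as grids of (R, G, B) tuples, report what fraction of pixels
# differ by more than a small tolerance (to ignore slight tint shifts).
def changed_fraction(img_a, img_b, tolerance=8):
    total = changed = 0
    for row_a, row_b in zip(img_a, img_b):
        for (r1, g1, b1), (r2, g2, b2) in zip(row_a, row_b):
            total += 1
            if max(abs(r1 - r2), abs(g1 - g2), abs(b1 - b2)) > tolerance:
                changed += 1
    return changed / total

# Two toy 2x2 "images": one pixel shifts slightly, one changes clearly.
before = [[(255, 255, 255), (200, 200, 200)],
          [(100, 100, 100), (0, 0, 0)]]
after  = [[(255, 255, 255), (200, 205, 200)],   # within tolerance
          [(100, 100, 100), (60, 0, 0)]]        # clearly changed
print(changed_fraction(before, after))  # → 0.25
```

A regenerated image will show differences nearly everywhere, while a composited edit (like the crayon-text shirt earlier) leaves most pixels untouched.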
Albeit wearing a decapitated cat. Let's fix that: No, she should be wearing a cat-themed mask on her face, not a cat animal's head on her head.
That's less horrifying. But after layering the two photos on top of each other, I can see that the figure has again been moved up and completely regenerated. The colors of the letters have changed again. The folds in the clothing and strands of hair are very similar, but not identical.
TODO
Make this photo a side profile instead of facing directly to the camera.
TODO
Make the model in the middle of doing a cartwheel.
TODO
Make the jeans red denim and the cat mask should be of an orange cat.
TODO