Software Engineering

AI Image Generation With GPT and Diffusion Models

The world is captivated by artificial intelligence (AI), particularly by recent advances in natural language processing (NLP) and generative AI—and for good reason. These breakthrough technologies have the potential to enhance day-to-day productivity across tasks of all kinds. For example, GitHub Copilot helps developers rapidly code entire algorithms, OtterPilot automatically generates meeting notes for executives, and Mixo allows entrepreneurs to rapidly launch websites.

This article will give a brief overview of generative AI, including relevant AI technology examples, then put theory into action with a generative AI tutorial in which we’ll create artistic renderings using GPT and diffusion models.

Six AI-generated images of the author in various animated and artistic styles, created using the techniques in this tutorial.

Brief Overview of Generative AI

Note: Those familiar with the technical concepts behind generative AI may skip this section and continue to the tutorial.

In 2022, many foundation model implementations came to the market, accelerating AI advances across many sectors. We can better define a foundation model after understanding a few key concepts:

  • Artificial intelligence is a generic term describing any software that can intelligently work toward a specific task.
  • Machine learning is a subset of artificial intelligence that uses algorithms that learn from data.
  • A neural network is a subset of machine learning that uses layered nodes modeled after the human brain.
  • A deep neural network is a neural network with many layers and learning parameters.

A foundation model is a deep neural network trained on huge amounts of raw data. In more practical terms, a foundation model is a highly successful type of AI that can easily adapt and accomplish various tasks. Foundation models are at the core of generative AI: Both text-generating language models like GPT and image-generating diffusion models are foundation models.

Text: NLP Models

In generative AI, natural language processing (NLP) models are trained to produce text that reads as though it were composed by a human. In particular, large language models (LLMs) are especially relevant to today’s AI systems. LLMs, characterized by their training on vast amounts of text data, can recognize and generate text and other content.

In practice, these models may serve as writing—or even coding—assistants. Natural language processing applications include restating complex concepts simply, translating text, drafting legal documents, and even creating workout plans (though such uses have certain limitations).

Lex is one example of an NLP writing tool with many functions: proposing titles, completing sentences, and composing entire paragraphs on a given topic. The most instantly recognizable LLM of the moment is GPT. Developed by OpenAI, GPT can respond to almost any question or command in a matter of seconds with high accuracy. OpenAI’s various models are available through a single API. Unlike Lex, GPT can work with code, programming solutions to functional requirements and identifying in-code issues to make developers’ lives notably easier.

Images: AI Diffusion Models

A diffusion model is a deep neural network trained to remove noise from images: during training, increasing amounts of noise are added to an image, and the network learns to reverse the process step by step. After a model’s network is trained to “know” the concept abstraction behind an image, it can create new variations of that image. For example, by removing the noise from an image of a cat, the diffusion model “sees” a clean image of the cat, learns how the cat looks, and applies this knowledge to create new cat image variations.
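The noise-adding (“forward”) side of this training process can be sketched numerically. The following is a minimal illustration, assuming a simple linear noise schedule; the function and variable names here are illustrative, not taken from any particular library:

```python
import math
import random

def noise_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Linear schedule of per-step noise levels (betas)."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]

def alpha_bar(betas, t):
    """Cumulative product of (1 - beta) up to step t: the fraction of signal that survives."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod

def add_noise(pixel, t, betas, eps):
    """Forward diffusion: blend the original pixel value with Gaussian noise eps."""
    ab = alpha_bar(betas, t)
    return math.sqrt(ab) * pixel + math.sqrt(1.0 - ab) * eps

betas = noise_schedule(1000)
eps = random.gauss(0.0, 1.0)
# Early steps barely change the pixel; by the final step it is almost pure noise.
slightly_noisy = add_noise(0.8, 10, betas, eps)
very_noisy = add_noise(0.8, 999, betas, eps)
```

The denoising network is trained to predict `eps` from the noisy value; image generation then runs this process in reverse, starting from pure noise.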

Diffusion models can be used to denoise or sharpen images (enhancing and refining them), manipulate facial expressions, or generate face-aging images to suggest how a person might come to look over time. You may browse the Lexica search engine to witness these AI models’ powers when it comes to generating new images.

Tutorial: Diffusion Model and GPT Implementation

To demonstrate how to implement and use these technologies, let’s practice generating anime-style images using a HuggingFace diffusion model and GPT, neither of which requires any complex infrastructure or software. We will begin with a ready-to-use model (i.e., one that’s already created and pre-trained) that we will only need to fine-tune.

Note: This article explains how to use generative AI images and language models to create high-quality images of yourself in interesting styles. The information in this article should not be (mis)used to create deepfakes in violation of Google Colaboratory’s terms of use.

Setup and Photo Requirements

To prepare for this tutorial, register for accounts with Google (to run the companion Colab notebook) and Hugging Face (to access the diffusion model).

You’ll also need 20 photos of yourself—or even more for improved performance—saved on the device you plan to use for this tutorial. For best results, photos should:

  • Be no smaller than 512 x 512 px.
  • Be of you and only you.
  • Share the same file extension.
  • Be taken from a variety of angles.
  • Include three to five full-body shots and two to three midbody shots at a minimum; the remainder should be facial photos.

That said, the photos do not need to be perfect—it can even be instructive to see how straying from these requirements affects the output.
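The basic requirements above can be sanity-checked with a short script before uploading. This is an illustrative helper, not part of the tutorial’s notebook; checking pixel dimensions would additionally require an image library such as Pillow, so this sketch only verifies the photo count and extension consistency:

```python
from pathlib import Path

MIN_PHOTOS = 20  # Per the tutorial: at least 20 photos of yourself

def validate_photos(folder):
    """Return a list of problems found with the photo set (empty list means it looks OK)."""
    problems = []
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    if len(files) < MIN_PHOTOS:
        problems.append(f"only {len(files)} photos; at least {MIN_PHOTOS} recommended")
    extensions = {p.suffix.lower() for p in files}
    if len(extensions) > 1:
        problems.append(f"mixed file extensions: {sorted(extensions)}")
    return problems
```

Running `validate_photos("my_photos")` before step 5 of the notebook can save a failed training run caused by an inconsistent photo set.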

AI Image Generation With the HuggingFace Diffusion Model

To get started, open this tutorial’s companion Google Colab notebook, which contains the required code.

  1. Run cell 1 to connect Colab with your Google Drive to store the model and save its generated images later on.
  2. Run cell 2 to install the needed dependencies.
  3. Run cell 3 to download the HuggingFace model.
  4. In cell 4, type “How I Look” in the Session_Name field, and then run the cell. Session name typically identifies the concept that the model will learn.
  5. Run cell 5 and upload your photos.
  6. Go to cell 6 to train the model. By checking the Resume_Training option before running the cell, you can retrain it many times. (This step may take around an hour to complete.)
  7. Finally, run cell 7 to test your model and see it in action. The system will output a URL where you will find an interface to produce your images. After entering a prompt, press the Generate button to render images.
The model’s user interface for image generation, with configuration options, an input text box, a Generate button, and an example output image.

With a working model, we can now experiment with prompts that produce different visual styles (e.g., “me as an animated character” or “me as an impressionist painting”). However, using GPT to write character prompts is optimal: it yields more detail than typical user-written prompts and maximizes the potential of our model.

Effective Diffusion Model Prompts With GPT

We’ll add GPT to our pipeline via OpenAI, though Cohere and the other options offer similar functionality for our purposes. To begin, register on the OpenAI platform and create your API key. Now, in the Colab notebook’s “Generating good prompts” section, install the OpenAI library:

!pip install openai

Next, load the library and set your API key:

import openai
openai.api_key = "YOUR_API_KEY"

We will produce optimized prompts from GPT to generate our image in the style of an anime character, replacing YOUR_SESSION_NAME with “How I Look,” the session name set in cell 4 of the notebook:

ASKING_TO_GPT = 'Write a prompt to feed a diffusion model to generate beautiful images '\
                'of YOUR_SESSION_NAME styled as an anime character.'
response = openai.Completion.create(model="text-davinci-003", prompt=ASKING_TO_GPT,
                                    temperature=0, max_tokens=1000)
print(response["choices"][0]["text"])  # The generated diffusion model prompt

The temperature parameter ranges between 0 and 2, and it determines whether the model should strictly adhere to the data it trained on (values close to 0), or be more creative with its outputs (values close to 2). The max_tokens parameter caps the amount of text returned, with one token corresponding to roughly four characters, or about three-quarters of an average English word.
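To see why a low temperature makes output more deterministic, consider how temperature rescales a model’s token scores before sampling. Here is a toy illustration with invented scores for three candidate tokens (real models do this over tens of thousands of tokens):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then normalize into a probability distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                      # raw scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.1)  # near-greedy: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: sampling is more varied
```

A temperature of exactly 0 would divide by zero in this formula; in practice, APIs treat it as greedy decoding, always picking the highest-scoring token, which is why our request above with `temperature=0` returns consistent prompts.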

In my case, the GPT model output reads:

"Juan is styled as an anime character, with large, expressive eyes and a small, delicate mouth.
His hair is spiked up and back, and he wears a simple, yet stylish, outfit. He is the perfect
example of a hero, and he always manages to look his best, no matter the situation."

Finally, by feeding this text as input into the diffusion model, we achieve our final output:

Six AI-generated images of the author styled as various anime characters, refined with GPT-generated prompts.

Getting GPT to write diffusion model prompts means that you don’t have to think in detail about the nuances of what an anime character looks like—GPT will generate an appropriate description for you. You can always tweak the prompt further according to taste. With this tutorial completed, you can create complex creative images of yourself or any concept you want.

The Advantages of AI Are Within Your Reach

GPT and diffusion models are two essential modern AI implementations. We have seen how to apply them in isolation and multiply their power by pairing them, using GPT output as diffusion model input. In doing so, we have created a pipeline of two foundation models that is more capable than either one alone.

These AI technologies will impact our lives profoundly. Many predict that large language models will drastically affect the labor market across a diverse range of occupations, automating certain tasks and reshaping existing roles. While we can’t predict the future, it is indisputable that the early adopters who leverage NLP and generative AI to optimize their work will have a leg up on those who do not.

The editorial team of the Toptal Engineering Blog extends its gratitude to Federico Albanese for reviewing the code samples and other technical content presented in this article.