An Exploration of AI Art Generation

I recently discussed AI art generation and how I discovered it a while back. Since then, I’ve become involved in the AI art community and have been learning about the different models and methods out there. I’ll primarily focus on four: DALL-E, StyleGAN3, VQGAN, and Wander (previously BigGAN). All of these rely on CLIP to guide or rank the images they produce. I apologize in advance if any of this goes over the reader’s head.

Before I start, I’ll go through the terminology as simply as I can:

  • CLIP – Contrastive Language-Image Pre-training. Created by OpenAI, this basically produces a score for how well an image matches a text description.
  • DALL-E – A portmanteau of the artist Salvador Dalí and Pixar’s WALL-E. Also created by OpenAI, it generates an image from a text prompt.
  • StyleGAN3 – NVIDIA’s image generator, which can be steered by a text prompt when paired with CLIP. GAN means Generative Adversarial Network, and the 3 means it’s their third iteration.
  • VQGAN – VQ stands for Vector Quantized, and GAN is explained above. It was created by machine learning researchers at Heidelberg University in Germany.
  • BigGAN – A different GAN, created by researchers at DeepMind, which is owned by Google.
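
To make the CLIP "score" a bit more concrete, here is a minimal sketch in Python. It assumes the text and the image have already been turned into lists of numbers (embeddings) and compares them with cosine similarity. The tiny three-number vectors and their values are made up for illustration; the real CLIP produces much longer vectors with trained encoders, and this is not its actual code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-made embeddings (assumption: real ones come from CLIP's encoders).
text_embedding = [0.9, 0.1, 0.3]         # stands in for the prompt
good_image_embedding = [0.8, 0.2, 0.4]   # stands in for an image that matches the prompt
bad_image_embedding = [-0.5, 0.9, 0.0]   # stands in for an image that does not

good_score = cosine_similarity(text_embedding, good_image_embedding)
bad_score = cosine_similarity(text_embedding, bad_image_embedding)

print(f"matching image:  {good_score:.3f}")
print(f"unrelated image: {bad_score:.3f}")
```

The image generator's job is then, roughly, to keep adjusting its picture so that this kind of score goes up.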

Now that you’re an expert in taming transformers for image synthesis, I’ll get to the point of this blog, or article, or whatever you want to call it. My preferred method is VQGAN, because it consistently gives me something close to what I prompt it for. That isn’t a knock on the other methods, though. I’ll break down what I believe each one works best for, based on my experience:

  • DALL-E – Best with people or places
  • StyleGAN3 – Best with portrait or photography styles
  • VQGAN – Good for imitation or original thought
  • BigGAN – Simulating specific objects

Now for the examples:

Prompt: “A robot standing in a haunted graveyard”


Prompt: “Hogwarts by Claude Monet”


As you can see, each one has a different style. DALL-E did a great job with the robot, but I don’t see much of a graveyard, unlike StyleGAN3 and VQGAN. StyleGAN3 gave it a feel that reminds me of a video game from the 32-bit era, while VQGAN made it feel like impressionism.

With Hogwarts in the style of Monet, VQGAN nailed it. DALL-E gave it a photorealistic feel, which was a very surprising outcome. StyleGAN3 almost looks like Denver International Airport, while BigGAN looks like some castle in front of a forest.


Different people build different trained models. A model starts with someone pairing an image with text like “this is a bear eating grass in a meadow”. They do this hundreds to thousands of times, and training on those pairs creates an image model. Each of the methods I described above draws on such a model and builds its rendition of the prompt from what the model has learned. This, of course, can cause problems. For example, if I use a model trained on anime faces and ask it to create a steampunk submarine, it will probably look ridiculous.

steampunk submarine of an anime face

Okay, I cheated here since I don’t have a model for anime faces, but it turned out cool regardless.

Since models differ from creator to creator, we can expect different results. Below are different models using the same Monet prompt above on VQGAN. The model used above is called ImageNet 16384.

WikiArt 16384
ImageNet 1024
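
A note on the numbers in these model names: as far as I can tell, they are the size of the VQGAN "codebook", the fixed set of building-block vectors the model is allowed to use (that reading is my assumption from how these checkpoints are usually named). The "vector quantized" part means each piece of the image gets snapped to its nearest codebook entry. Here is a toy sketch of that snapping step in Python, with a made-up four-entry codebook instead of 16384 entries:

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (index, entry) for the codebook entry nearest to `vector`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical codebook: 4 entries of 2 numbers each (real ones are far larger).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical vector describing one small patch of an image.
patch = [0.9, 0.2]

index, entry = quantize(patch, codebook)
print(f"patch {patch} snaps to codebook entry {index}: {entry}")
```

A bigger codebook gives the model more building blocks to choose from, which may be part of why the same prompt looks different from one model to the next.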

Prompt: Steampunk windmill by a river

ImageNet 16384
Gumbel 8192
WikiArt 16384

Cool. I use two Colab notebooks that I created for this:

I’ll also be selling canvas, metal, and acrylic prints of my favorite creations at