It Chats, It Draws – AI Gets Creative – Part II

Part II: Dall-E from OpenAI

In their quest for increasingly complex AI applications, scientists are now combining image recognition with text recognition. In this second part, we explore Dall-E.

Google DeepMind’s Flamingo and OpenAI’s Dall-E are two recent developments in creative AI in which image identification and text interpretation algorithms have been merged to create multimodal Machine Learning models. Depending on the architecture, the end result is either text output from image-based inputs or new image outputs. In Part I, we discussed Flamingo. This episode is all about Dall-E, which produces images based on text instructions.

Why the name?

The name, Dall-E, is a direct tribute to the heart-rending animation movie WALL-E (a Disney-Pixar collaboration about a lonely autonomous robot) and also to the great surrealist painter Salvador Dalí. Why this bizarre mix, one might ask. To understand that, you need to know what Dall-E does. It is an autonomous ML algorithm that can create neat photo-perfect, or painting-style, image outputs based on any textual instruction you feed in.

How does Dall-E work?

Dall-E is an AI system that can create realistic images and art from a description in natural language. However wild and fantastic the instruction might be, Dall-E will attempt to produce the image. Sample some of the instructions tried during the testing phase, and you will grasp its performance range:

  • a monkey buying banana in a mall next to a bear dancing
  • a penguin wearing a Christmas sweater
  • a painting of a fox sitting in a field during winter
  • an armchair in the shape of an avocado
  • a green leather purse shaped like a pentagon
  • a cube with the texture of a porcupine
  • an astronaut riding a horse in a photorealistic style

For each of these requests, Dall-E came up with wonderfully realistic images. It can also generate images in the specific painting style of great masters if so instructed. It also provides several options for each output.

Fig 1: Dall-E 2 output for the instruction: “an astronaut riding a horse in a photorealistic style”; Source:

OpenAI first introduced Dall-E in January 2021. The improved second version, Dall-E 2, was released in April 2022. Dall-E 2 image outputs are far more realistic, with 4x greater resolution, and match the input instructions much more closely. According to OpenAI: “Dall-E 2 has learned the relationship between images and the text used to describe them. It uses a process called ‘diffusion,’ which starts with a pattern of random dots and gradually alters that pattern towards an image when it recognizes specific aspects of that image.”
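The diffusion idea quoted above can be illustrated with a toy sketch. This is not OpenAI's implementation: a real diffusion model uses a trained neural network to predict how to remove noise, whereas the sketch below assumes the final picture (`target`) is already known and simply stands in for that prediction. The function names and the four-value "image" are hypothetical and chosen only for illustration.

```python
import random


def toy_reverse_diffusion(target, steps=50, seed=0):
    """Start from random noise and nudge it toward `target` step by step.

    ASSUMPTION: `target` stands in for what a trained denoising network
    would predict; a real diffusion model learns that prediction from data.
    """
    rng = random.Random(seed)
    # Step 0: pure noise, one random value per "pixel".
    image = [rng.gauss(0.0, 1.0) for _ in target]
    history = [list(image)]
    for step in range(steps):
        remaining = steps - step
        # Move a fraction of the remaining distance toward the target.
        image = [x + (t - x) / remaining for x, t in zip(image, target)]
        # Add a little fresh noise that shrinks to zero by the last step.
        noise_scale = 0.1 * (remaining - 1) / steps
        image = [x + rng.gauss(0.0, noise_scale) for x in image]
        history.append(list(image))
    return history


def distance(a, b):
    """Euclidean distance between two equally sized lists of numbers."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


if __name__ == "__main__":
    target = [0.0, 1.0, 1.0, 0.0]  # hypothetical 4-pixel "image"
    history = toy_reverse_diffusion(target)
    for step in (0, 10, 25, 50):
        print(step, round(distance(history[step], target), 4))
```

Printing the distance at a few steps shows the random pattern drifting toward the target picture, which is the intuition behind "gradually alters that pattern towards an image".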

According to OpenAI's data, when evaluators were asked to compare 1,000 image generations from each model, they preferred Dall-E 2 over Dall-E 1 71.7% of the time for caption matching and 88.8% of the time for photorealism. This shows how much the model has been refined in the span of just one year, and it is an indicator of how fast research is moving in the Machine Learning domain.
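The arithmetic behind such a preference figure is simple to sketch. The tallies below are hypothetical, chosen only so that they reproduce the percentages reported above; OpenAI's raw judgments are not given in this article, and the labels `dalle1` and `dalle2` are illustrative names.

```python
def preference_rate(judgments, model):
    """Share of pairwise comparisons in which `model` was preferred."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(1 for choice in judgments if choice == model) / len(judgments)


# HYPOTHETICAL tallies: each entry records which model an evaluator preferred.
caption_votes = ["dalle2"] * 717 + ["dalle1"] * 283
photo_votes = ["dalle2"] * 888 + ["dalle1"] * 112

if __name__ == "__main__":
    print(f"caption matching: {preference_rate(caption_votes, 'dalle2'):.1%}")
    print(f"photorealism:     {preference_rate(photo_votes, 'dalle2'):.1%}")
```

A rate of 50% would mean evaluators saw no difference between the two models, so both figures indicate a clear preference for the newer one.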

For what use?

Generating offbeat, imaginative images is not really what the developers are aiming for. It is just a fun way of demonstrating the capabilities of Dall-E. The developers sum up the basic features of Dall-E 2 in just three points:

  • Dall-E 2 can create original, realistic images and art from a text description. It can combine concepts, attributes, and styles.
  • Dall-E 2 can make realistic edits to existing images from a natural language caption. It can add and remove elements while taking shadows, reflections, and textures into account.
  • Dall-E 2 can take an image and create different variations of it inspired by the original.

Within these three crisp bullet points lies an immense possibility of ground-breaking advances in imaging and visual-based research. Working with images will never be the same again!

In somewhat more technical terms, the original Dall-E uses a 12-billion-parameter multimodal version of the GPT-3 Transformer model to interpret natural language inputs and generate corresponding images, both of realistic objects and of objects that do not exist in reality. It uses zero-shot learning: it generates output from a description and cue without any further training for that specific task. The high point is that, although other neural networks could already generate realistic images, Dall-E is the first to generate them from natural language prompts, interpreting the instructions much as a human would.

Not yet open to all

The developers still consider Dall-E 2 to be an ongoing project. Hence, it is not yet available in OpenAI’s API section. Although OpenAI has not released the source code, a “controlled” version is available on OpenAI’s website. In February this year, OpenAI invited 23 external researchers as a “red team” to flag inherent flaws and vulnerabilities. The red team recommended releasing Dall-E 2 only to trusted users.

Currently, 400 people (including OpenAI developers plus selected external academics and researchers) have access. The company states: “We’ve been working with external experts and are previewing Dall-E 2 to a limited number of trusted users who will help us learn about the technology’s capabilities and limitations.”

The use is strictly limited to non-commercial purposes. So, why this iron curtain?

Caution is the key

Let us admit it: Dall-E 2 can take disinformation to new levels. Until now, creating fake images required considerable skill in Photoshop or similar photo- and video-editing software. With Dall-E 2, anyone can generate stunningly authentic-looking images simply through instructions in everyday language! And that is scary indeed. Scary because, like many large neural networks, Dall-E has been trained on images and text sourced from the public internet. Online information always contains a lot of misinformation, plus all the inherent human biases (social, gender, ethnic, and everything else). Naturally, Dall-E has inherited all of that.

OpenAI admits this openly. They give examples of how Dall-E 2 returns images of South Asian women in response to the prompt “a flight attendant”, and images of macho white men for the prompt “a builder”. To mitigate such biases, OpenAI needs to train prompt classifiers conditioned both on the content the prompts lead to and on any explicit language included in the prompt.

Fig 2: Dall-E 2 outputs for the instructions: “a flight attendant” and: “a builder”; Source:

OpenAI writes that, as part of their effort to develop and deploy AI responsibly, these limitations are being studied in collaboration with a select group of users. Safety mitigations already developed include preventing harmful generations, such as violent, hateful, or adult images, by removing the most explicit content from the training data. The company is also working on curbing misuse of generated content. They also assure users that the system “…won’t generate images if our filters identify text prompts and image uploads that may violate our policies”.
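To make the idea of a prompt filter concrete, here is a deliberately simplistic sketch. It is not how OpenAI's filters work: the article describes trained classifiers, while this example only checks a prompt against a small word list. The blocklist, the function names, and the placeholder return value are all hypothetical.

```python
import re

# HYPOTHETICAL blocklist for illustration only; not OpenAI's policy terms.
BLOCKED_TERMS = frozenset({"gore", "nude", "massacre"})


def is_prompt_allowed(prompt, blocked_terms=BLOCKED_TERMS):
    """Return True if no blocked term appears as a whole word in the prompt."""
    words = set(re.findall(r"[a-z']+", prompt.lower()))
    return words.isdisjoint(blocked_terms)


def generate_image(prompt):
    """Refuse filtered prompts; otherwise return a placeholder result.

    ASSUMPTION: the string returned here stands in for a real image model.
    """
    if not is_prompt_allowed(prompt):
        return None  # the request is rejected before any image is made
    return f"<image for: {prompt}>"


if __name__ == "__main__":
    print(generate_image("an astronaut riding a horse"))
    print(generate_image("a gore scene"))
```

A word list like this is easy to evade with synonyms or misspellings, which is one reason a production system would rely on trained classifiers rather than simple keyword matching.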

All in all, creative multimodal AI systems like Dall-E and Flamingo are fast expanding the horizons of machine learning in ways we never dreamt of.

Anyone interested can explore Dall-E at DALL·E 2 (



© 2024 Praxis. All rights reserved.