DALL-E: From Caption to Image
What connects a twentieth-century surrealist artist, a Pixar-animated film from 2008 and an AI-backed neural engine? The richest man in the world, of course.
Salvador Dali was a surrealist painter born in Spain in 1904. Just over a century later, in 2008, Pixar Animation Studios, a subsidiary of Walt Disney Studios, released Wall-E, an animated film about a lone robot left behind on Earth. Thirteen years on, the Elon Musk-backed AI laboratory OpenAI has brought the two together in ‘DALL-E’, a portmanteau of ‘Dali’ and ‘Wall-E’.
The Dali of the AI world
DALL-E is essentially a piece of AI-backed software from the labs of OpenAI that generates images from a short caption alone. The neural engine behind the application is a 12-billion-parameter model trained on a dataset of images paired with their captions, gathered from the wide abyss of the internet. The images it produces are quirky and innovative, ranging from armchairs in the shape of avocados to baby daikon radishes in tutus walking dogs. In a recently published blog post, OpenAI writes: “We’ve found that it [DALL-E] has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.”
Source: OpenAI blog
The ingenuity behind DALL-E lies in how coherently it generates images while relying solely on text inputs. Whilst there are several AI or machine learning-based image and video generators on the market, few can produce plausible images across such a wide range of concepts from captions alone. In general, the production of synthetic images and videos has gained much popularity in the recent past, leading to the creation of several ‘deepfakes’, for example. These generally use Generative Adversarial Networks (GANs), which pit two neural networks against each other: a generator that produces samples and a discriminator that tries to tell them apart from real data.
According to a report from CNBC USA, “OpenAI acknowledged that DALL-E has the ‘potential for significant, broad societal impacts,’ adding that it plans to analyze how models like DALL-E ‘relate to societal issues like economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer-term ethical challenges implied by this technology.’”
OpenAI on a roll
The release of DALL-E from the house of OpenAI comes only a few months after the launch of GPT-3, currently regarded as the world’s most advanced natural language processing (NLP) AI software. GPT-3 is a language-generation tool capable of producing high-quality, human-like text, and can even write its own news articles, short stories and poetry.
OpenAI writes: “Like GPT-3, DALL-E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens and is trained using maximum likelihood to generate all of the tokens, one after another. This training procedure allows DALL-E to not only generate an image from scratch, but also to regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt.” As an extension of the GPT-3 engine, DALL-E is an adept Text-to-Image system that has been trained not just on text, but on images as well.
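To make the "single stream" idea concrete, here is a minimal Python sketch of how a caption and an image could be laid out as one token sequence, and why regenerating a rectangle that reaches the bottom-right corner fits generation "one after another". It is an illustration under stated assumptions, not OpenAI's implementation: the split of the 1280 tokens into 256 caption tokens and a 32 x 32 grid of image tokens, the padding value, and every function name are assumptions made here, and toy_next_token is a hypothetical stand-in for a trained transformer.

```python
from typing import Callable, List, Sequence

# Assumed layout (not stated in the quoted passage): 256 caption tokens
# followed by a 32 x 32 grid of image tokens, giving 1280 in total.
TEXT_LEN = 256
GRID = 32
IMAGE_LEN = GRID * GRID
STREAM_LEN = TEXT_LEN + IMAGE_LEN
PAD = 0  # assumed padding token for short captions

NextToken = Callable[[Sequence[int]], int]


def build_stream(text_tokens: Sequence[int], image_tokens: Sequence[int]) -> List[int]:
    """Concatenate a padded caption and the image tokens into one stream."""
    if len(text_tokens) > TEXT_LEN:
        raise ValueError("caption is longer than the text budget")
    if len(image_tokens) != IMAGE_LEN:
        raise ValueError("image must contain exactly GRID * GRID tokens")
    padded = list(text_tokens) + [PAD] * (TEXT_LEN - len(text_tokens))
    return padded + list(image_tokens)


def regenerate_region(stream: Sequence[int], top: int, left: int,
                      next_token: NextToken) -> List[int]:
    """Regenerate the rectangle from (top, left) to the bottom-right corner.

    Positions are visited in raster order. Tokens outside the rectangle are
    kept as they are; tokens inside it are predicted from everything that
    comes earlier in the stream (the caption plus all earlier image tokens).
    """
    out = list(stream)
    for row in range(top, GRID):
        for col in range(left, GRID):
            pos = TEXT_LEN + row * GRID + col
            out[pos] = next_token(out[:pos])
    return out


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for a model: a deterministic function of the prefix."""
    return (sum(prefix) + len(prefix)) % 8192


stream = build_stream([5, 6, 7], [1] * IMAGE_LEN)
regenerated = regenerate_region(stream, top=16, left=16, next_token=toy_next_token)
print(len(stream), sum(a != b for a, b in zip(stream, regenerated)))
```

Because every regenerated token only looks at tokens that precede it in the stream, the caption and the untouched part of the image both act as context, which is one way to read the claim that the result stays consistent with the text prompt.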
From the standpoint of artificial engines showing creativity through the ability to coherently blend concepts, this is a notable step forward. According to Neil Lawrence, former director of machine learning at Amazon, DALL-E looks like a ‘very impressive’ engine that demonstrates the ability of AI-based models “to store information about our world and generalize in ways that humans find very natural.”
We have been writing captions for images for a long time; it seems the time has now come to create images out of our captions!
To find out more, you can visit: https://openai.com/blog/dall-e/