AI is learning to complete images from half a picture
“Today, the United Nations has called for the immediate withdrawal of all nuclear weapons from the world.” This sentence wasn’t written by a human but by an artificial-intelligence algorithm code-named GPT-2 (the acronym stands for Generative Pretrained Transformer 2). Demonstrating this stunning development last year, the creator of GPT-2, San Francisco-based research lab OpenAI, claimed that it had supplied only the word “Today” to the program; the rest of the sentence was constructed by the engine on its own. Now the same approach is being applied to images. The success of unsupervised learning methods and transformer models in natural language processing (NLP) inspired OpenAI researchers to explore this new direction.
According to OpenAI: “We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples. By establishing a correlation between sample quality and image classification accuracy, we show that our best generative model also contains features competitive with top convolutional nets in the unsupervised setting.”
GPT-2 is part of a new breed of text-generation systems that have impressed experts with their ability to generate coherent text from minimal prompts. The system was trained on eight million text documents scraped from the web and responds to text snippets supplied by users. Feed the beginning of a sentence or paragraph into GPT-2, and it can continue the thought for as long as an essay with almost human-like coherence. OpenAI initially judged the full model too dangerous to release, and so held it back.
OpenAI’s image-completion results with this approach, which received an honourable mention for best paper at the International Conference on Machine Learning, open up a new avenue for image generation, one carrying tremendous opportunities and consequences.
At its core, GPT-2 is a powerful prediction engine. It learned to grasp the structure of the English language by looking at billions of examples of words, sentences, and paragraphs – all scraped from the corners of the internet. With that structural framework in place, it could then manipulate words into new sentences by statistically predicting the order in which they should appear.
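The idea of statistically predicting the next word can be illustrated with a deliberately tiny model. The sketch below is not GPT-2’s architecture (which is a large transformer); it is a toy bigram counter, a hypothetical minimal example of learning word-to-word statistics from raw text and using them to continue a prompt.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows another: a toy stand-in for
    the statistical next-word prediction GPT-2 performs at vast scale."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the continuation most often seen after `word` in training."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

# A miniature "internet" of training text.
model = train_bigram("the cat sat on the mat and the cat slept")
print(predict_next(model, "the"))  # 'cat' followed 'the' twice, 'mat' once
```

A real language model replaces these raw counts with learned parameters and conditions on a long context rather than a single word, but the underlying task, predicting what comes next, is the same.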
So, researchers at OpenAI decided to switch the words for pixels and train the same algorithm on images in ImageNet, the most popular image bank for deep learning. Because the algorithm was designed to work with one-dimensional data (i.e., strings of text), they unfurled the images into a single sequence of pixels. They found that the new model, named iGPT, was still able to grasp the two-dimensional structures of the visual world. Given the sequence of pixels for the first half of an image, it could predict the second half in ways that a human would deem sensible.
The results are startlingly impressive and demonstrate a new path for using unsupervised learning, which trains on unlabelled data, in developing computer vision systems. While early computer vision systems in the mid-2000s tested such techniques before, they fell out of favour as supervised learning – which uses labelled data – proved far more successful. The benefit of unsupervised learning, however, is that it allows an AI system to learn about the world without a human filter, and significantly reduces the manual labour involved in labelling data meticulously.
The fact that iGPT uses the same algorithm as GPT-2 also shows its promising adaptability. This is in line with OpenAI’s ultimate ambition of achieving more generalizable machine intelligence.
At the same time, the method presents a disconcerting new way to create deepfake images. Generative adversarial networks (GANs), the category of algorithms most commonly used to create deepfakes in the past, must be trained on highly curated data: if you want a GAN to generate a face, its training data should contain only faces. iGPT, by contrast, simply learns enough of the structure of the visual world from millions of examples to spit out images that could feasibly exist within it. While training the model is still computationally expensive, offering a natural barrier to access, that may not be the case for long. It is quite possible that the next stage in this algorithm’s evolution could generate videos on its own, a prospect fraught with unwanted consequences. And that will fuel the debate on ethical AI yet again.