It Chats, It Draws – AI Gets Creative – Part I

Part I: Flamingo from Google Deepmind

In their quest for increasingly complex AI applications, scientists are now mixing image recognition with text recognition. Let’s explore Flamingo in this first part.

We all know that artificial intelligence (AI) can identify text. We also know that AI can identify images. In each case, the algorithm first had to be trained on a vast repository of text or images, depending on what the model needed to identify. However, in their quest for increasingly complex AI applications, scientists are now working on mixing the two. The obvious result is text output from image-based inputs, or new image outputs based on text instructions. Such multimodal machine learning research is expected to open up new frontiers in AI algorithms. Two of the biggest names in AI are already working on their own versions of such creative AI: Google Deepmind’s Flamingo and OpenAI’s Dall-E.

While Flamingo is a “few-shot visual language model” that can return text outputs after being trained on images, Dall-E can create stunningly realistic images and art from any text description. Nor do they stop at mixing input and output types. Both are still works in progress and are continuously being refined, and both are poised to push past quite a few boundaries in AI research.

Let’s talk about Flamingo in this first part and Dall-E in the next.

What can Flamingo do?

Deepmind’s Flamingo has combined a visual AI model with a language model. It can perform image analysis after being trained on two to three example images with explanatory text tags – and then answer questions about new images or videos through natural language text output. To put it plainly, when trained with a picture of a cheerful puppy along with an explanatory text that read, “This is a very cute dog”, and then presented with the image of a grave cat and asked, “This is: _ _ _”, Flamingo could reply “This is a very serious cat.”

Figure 1: Flamingo responds to the cat image using the dog example; Source: Deepmind
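The dog-and-cat exchange can be pictured as a single prompt in which images and text are interleaved. The snippet below is only a rough sketch of that idea, not Deepmind's code: the `<image>` placeholder, the function name `build_interleaved_prompt`, and the file names are all hypothetical and chosen purely for illustration.

```python
# Hypothetical placeholder marking where an image sits in the text stream.
IMAGE_TOKEN = "<image>"


def build_interleaved_prompt(support, query_image, query_prefix):
    """Interleave (image, caption) examples with a final image to be described.

    Returns the ordered list of images and the text that accompanies them.
    """
    images = []
    parts = []
    for image_path, caption in support:
        images.append(image_path)
        parts.append(f"{IMAGE_TOKEN} {caption}")
    # The query image comes last, followed by the text the model should continue.
    images.append(query_image)
    parts.append(f"{IMAGE_TOKEN} {query_prefix}")
    return images, " ".join(parts)


# One worked example (the puppy) followed by the new image (the cat).
images, text = build_interleaved_prompt(
    [("puppy.jpg", "This is a very cute dog.")],
    "cat.jpg",
    "This is",
)
print(images)
print(text)
```

A visual language model given such a prompt would be expected to continue the final "This is" with a description of the last image, in the style set by the earlier caption.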

Flamingo can also hold surprisingly meaningful conversations by processing information from pictures and text. In the example below, while chatting with a human, the AI model corrected itself when the person pointed out a possible error.

Figure 2: Flamingo corrects itself during chat; Source: Deepmind

A “few-shot” learner

The effectiveness of learning depends on how quickly a learner can assimilate new knowledge and how much instruction it needs to do so. In short: learning curve and training effort. Usually, a human child recognizes a real-life object after seeing a few images of it, in picture books perhaps. Machine learning, however, is not that efficient. Typically, an image-recognition algorithm has to be trained on thousands of images that have been elaborately annotated with tags. As Deepmind explains in its blog:

“If the goal is to count and identify animals in an image, as in “three zebras”, one would have to collect thousands of images and annotate each image with their quantity and species. This process is inefficient, expensive, and resource-intensive, requiring large amounts of annotated data and the need to train a new model each time it’s confronted with a new task.”

However, large language models like OpenAI’s GPT-3 are few-shot learners, which means they can be set up to perform a task from very few examples. For example, a GPT-3 model can be set up for German-to-English translation with just two to three example translations. Of course, this works only because GPT-3 has already been pre-trained on an enormous amount of data. The three examples simply steer the model toward the job so that it can draw upon the correct bank of pre-fed learning.
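To make the translation example concrete, here is a minimal sketch of how a few-shot prompt could be assembled: the example pairs are simply placed in the prompt ahead of the new sentence. The function name `build_few_shot_prompt`, the "German:"/"English:" layout, and the sample sentences are assumptions made for illustration; no particular model API is called.

```python
def build_few_shot_prompt(examples, query):
    """Place a handful of worked translations ahead of the new sentence."""
    lines = []
    for german, english in examples:
        lines.append(f"German: {german}")
        lines.append(f"English: {english}")
        lines.append("")  # blank line between examples
    # The new sentence is left unanswered so the model completes the pattern.
    lines.append(f"German: {query}")
    lines.append("English:")
    return "\n".join(lines)


# Three illustrative example pairs, then the sentence to translate.
examples = [
    ("Guten Morgen", "Good morning"),
    ("Wie geht es dir?", "How are you?"),
    ("Ich habe Hunger", "I am hungry"),
]
prompt = build_few_shot_prompt(examples, "Wo ist der Bahnhof?")
print(prompt)
```

The resulting text would be sent to the language model as-is; the pattern set by the three pairs is what signals the task.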

What makes Flamingo unique is that it combines a language model and a visual model to perform image analysis using few-shot learning. The developers claim that it outperforms all previous few-shot learning approaches, and in several cases even methods fine-tuned on vastly more task-specific data. The Deepmind blog says:

“Flamingo beats all previous few-shot learning approaches when given as few as four examples per task. In several cases, the same Flamingo model outperforms methods that are fine-tuned and optimized for each task independently and use multiple orders of magnitude more task-specific data. This should allow non-expert people to quickly and easily use accurate visual language models on new tasks at hand.”

The end result is a family of machine learning models that could do more work with far less costly and time-consuming training.

The Flamingo architecture

In practice, Flamingo combines a pre-trained language model with powerful visual representations through novel architecture components. Deepmind builds Flamingo on Chinchilla, its recently released 70-billion-parameter pre-trained language model, obviating the need for any extra task-specific fine-tuning. The Flamingo team “fused” the Chinchilla LM with visual learning elements “by adding novel architecture components in-between” that keep the pre-trained models isolated and frozen, giving them the 80-billion-parameter Flamingo visual language model (VLM).
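The article does not spell out what the in-between components look like. The Flamingo paper describes cross-attention layers whose output passes through a tanh gate that starts at zero, so the frozen language model initially behaves exactly as it did before. The following is a heavily simplified sketch of that gating idea only, assuming a single attention head and no learned projections; the function names and toy vectors are illustrative and are not taken from Deepmind's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_vecs, visual_vecs):
    """Each text vector attends over all visual vectors (toy, single head)."""
    dim = len(visual_vecs[0])
    attended = []
    for query in text_vecs:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visual_vecs
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, visual_vecs)) for i in range(dim)]
        )
    return attended


def gated_cross_attention(text_vecs, visual_vecs, alpha):
    """Add visual information to the text stream, scaled by tanh(alpha)."""
    gate = math.tanh(alpha)
    attended = cross_attention(text_vecs, visual_vecs)
    return [
        [t + gate * a for t, a in zip(text_row, att_row)]
        for text_row, att_row in zip(text_vecs, attended)
    ]


text = [[1.0, 0.0], [0.0, 1.0]]      # toy text representations
visual = [[0.5, 0.5], [2.0, -1.0]]   # toy visual representations

unchanged = gated_cross_attention(text, visual, alpha=0.0)  # gate closed
mixed = gated_cross_attention(text, visual, alpha=1.0)      # gate partly open
print(unchanged == text)
print(mixed != text)
```

With the gate at zero the text vectors pass through untouched, which is why the language model can stay frozen while only the new layers learn how much visual information to let in.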

After this training, the model can be applied directly to visual tasks. Deepmind uses an in-house dataset created especially for multimodal ML research. The 43.3 million-item training dataset, a mix of complementary unlabelled multimodal data comprising 185 million images and 182 GB of text, was sourced entirely from the internet.

Confident, yet cautious

Deepmind has announced Flamingo through the preprint of an academic paper authored by the development team.

The potential uses of this machine learning model are readily apparent, and aren’t restricted to what Flamingo is able to do with data – the model could also help the general state of machine learning, which is facing a problem of growing energy and computing needs to train newer models. According to one estimate, a single Google BERT training session emitted the same amount of carbon as a trans-American jet flight.

However, the researchers admit “there is no ‘golden’ few-shot method that would work well in all scenarios.” They acknowledge that there are too many variables to account for when a training dataset is so small. But they are confident that Flamingo can be quickly adapted to low-resource environments and activities, such as analyzing data for PII, societal prejudices, stereotypes, and other variables.

Although highly optimistic, the Deepmind blog closes with an ominous word of caution:

“We also tested the model’s qualitative capabilities beyond our current benchmarks. As part of this process, we compared our model’s performance when captioning images related to gender and skin colour, and ran our model’s generated captions through Google’s Perspective API, which evaluates the toxicity of text. While the initial results are positive, more research towards evaluating ethical risks in multimodal systems is crucial and we urge people to evaluate and consider these issues carefully before thinking of deploying such systems in the real world.”

You can access the full paper by the Flamingo team at: flamingo.pdf

(Read about OpenAI’s Dall-E in the second part)


© 2023 Praxis. All rights reserved.