Text-to-video is the next frontier for GenAI, and although current output is rudimentary, recent developments hold great promise.
The global AI investment market is projected to reach $422.37 billion by 2028, yet until recently machines had no chance of competing with humans at creative work. Now they have started to get better at creating original and aesthetically pleasing things. A powerful new class of large generative models is making it possible for machines to write, code, draw and create with wonderfully creative outputs. This new category is called “Generative AI” (GenAI), meaning the machine is generating something new rather than analysing something that already exists.
Of late, Generative AI has been synonymous with creative tools like DALL-E and ChatGPT. However, there is a lot more to the world of generative AI beyond these known names. While text-to-image AI is becoming quite mainstream now, we are just about to cross the next threshold with generative AI that is capable of converting text to video. This is the next frontier for GenAI, and although current output is rudimentary, the goal is for a user to key in a textual description and have the tool generate a corresponding video in the preferred style. In theory, it is just like DALL-E, only the output is video, and all you invest is your words!
Big names in the fray
A text-to-video model is a machine learning model that takes a natural language description as input and produces a video matching that description. One approach to video prediction, aimed at rendering realistic objects against a stable background, uses a recurrent neural network as a sequence-to-sequence model, connected to a convolutional neural network that encodes and decodes each frame pixel by pixel.
Attempts to create text-to-video algorithms have been under way for a couple of years now. But the first major announcement was made by Facebook parent Meta when, in September 2022, it showcased a tool named Make-A-Video. With just a few words or lines of textual input, Make-A-Video could create AI-generated videos, but they were just five seconds long and had no audio.
Barely a week later, Google caught up with Meta by announcing a similar text-to-video solution called Imagen Video. Later, it also showcased another tool named Phenaki, which was capable of creating long-form videos on the basis of text inputs. According to Google’s own description: “Given a text prompt, Imagen Video generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models.”
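The cascade in Google's description can be pictured as a low-resolution base clip passed through alternating upsampling stages. The following is a minimal sketch of that idea that tracks only the shape of the clip. The stage order, the scaling factors, the starting size and the names VideoShape, spatial_sr, temporal_sr and run_cascade are all illustrative assumptions, not Imagen Video's actual configuration or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    """Shape of a clip: number of frames and frame size in pixels."""
    frames: int
    height: int
    width: int


def spatial_sr(shape: VideoShape, factor: int) -> VideoShape:
    """Spatial super-resolution: same frame count, larger frames."""
    return VideoShape(shape.frames, shape.height * factor, shape.width * factor)


def temporal_sr(shape: VideoShape, factor: int) -> VideoShape:
    """Temporal super-resolution: more frames at the same frame size."""
    return VideoShape(shape.frames * factor, shape.height, shape.width)


def run_cascade(base: VideoShape, stages):
    """Apply each (stage_function, factor) pair in order, recording every shape."""
    history = [base]
    for stage, factor in stages:
        history.append(stage(history[-1], factor))
    return history


# Hypothetical numbers, chosen only for illustration.
base_clip = VideoShape(frames=16, height=24, width=48)
stages = [
    (temporal_sr, 2),  # more frames
    (spatial_sr, 2),   # bigger frames
    (temporal_sr, 2),
    (spatial_sr, 4),
]
history = run_cascade(base_clip, stages)
for step in history:
    print(step)
```

In a real system each stage would be a learned model conditioned on the text prompt. Here a stage only changes the shape, which is enough to show how interleaving the two kinds of upsampling grows a small base clip into a longer, sharper one.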
A “Runway” success!
Despite these big names, Runway, an AI start-up, has recently announced a more practical text-to-video solution. Called Gen-1, this tool can convert existing videos into new ones by applying any style specified by a text prompt or reference image. It is a video-to-video model: it focuses on transforming existing footage, letting users input a rough 3D animation or shaky smartphone clip and apply an AI-generated overlay.
Though a very small start-up, with about 45 employees in all, Runway is no pushover when it comes to GenAI. This is the company that co-created the open-source text-to-image model Stable Diffusion. And after releasing Gen-1 this February, it has already moved on to its next release: Gen-2. While Gen-1 was a video-to-video model, Gen-2 is just the text-to-video sci-fi stuff we have been talking about.
Gen-2 is focused on generating videos from scratch. Here, you can generate a video with a simple text prompt, just the way we generate images in DALL-E. Refinement, though, is still a long way off. Access is yet another big issue. Bloomberg reports that users will have to sign up to join a waitlist for Gen-2.
Challenges and concerns
It appears from the demo clips released by Runway that the video outputs are very short, unstable, and not photorealistic. And with all text-to-video tools, the major issue is generating a video from words alone. Prompts will have to be incredibly precise, or the tool could generate the video equivalent of gibberish. Just imagine how detailed the text prompt would have to be for the tool to analyse the situation and create an action sequence on video with every realistic detail! As Meta CEO Mark Zuckerberg has underlined in a post: “It’s much harder to generate video than photos because beyond correctly generating each pixel, the system also has to predict how they’ll change over time.”
As with all GenAI, security and ethical concerns will plague text-to-video solutions. Realistic deepfakes will be even easier for a layperson to create using such tools. For now, Google has kept its Imagen Video tool purely as a research project to avoid harmful outcomes. And Phenaki has in-built safety features, like not generating images of people.
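A rough back-of-the-envelope calculation gives a sense of the scale behind Zuckerberg's point. The sketch below counts the raw colour values in a single image and in a short clip. The resolution and frame rate are assumed purely for illustration and are not taken from any of the tools discussed here.

```python
def values_per_image(height: int, width: int, channels: int = 3) -> int:
    """Number of colour values in one frame (three channels for RGB)."""
    return height * width * channels


def values_per_clip(seconds: int, fps: int, height: int, width: int,
                    channels: int = 3) -> int:
    """Number of colour values in a clip of the given length and frame rate."""
    return seconds * fps * values_per_image(height, width, channels)


# Assumed figures: a 512x512 frame and a 5-second clip at 24 frames per second.
image_values = values_per_image(512, 512)
clip_values = values_per_clip(5, 24, 512, 512)
ratio = clip_values // image_values

print(f"one image: {image_values:,} values")
print(f"one clip:  {clip_values:,} values ({ratio}x the image)")
```

Even this understates the difficulty, because the frames cannot be generated independently: each one has to stay consistent with its neighbours as the scene changes over time.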
A take-off point
A lot of companies have already joined the race for text-to-video GenAI solutions in the commercial space. These include DeepBrain AI, ModelScope, VEED.io, Lumen5, Designs.AI, Synthesia.io, InVideo.io and a few others. All are in the experimental phase: good to play with but not yet practical.
Text-to-video solutions are still uncharted territory. Perhaps they will never replace the traditional video-making process. But such tools can be an excellent low-cost or no-cost way to create draft videos and model scenes before the final version is shot.