Text-to-video model

 A text-to-video model is a machine learning model that uses a natural language description as input to produce a video relevant to the input text.[1] Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.[2]

Models

There are several models, including open-source ones. CogVideo, which accepts Chinese-language input,[3] is the earliest text-to-video model to be developed, with 9.4 billion parameters; a demo version with open-source code was first presented on GitHub in 2022.[4] That year, Meta Platforms released a partial text-to-video model called "Make-A-Video",[5][6][7] and Google Brain (later Google DeepMind) introduced Imagen Video, a text-to-video model with a 3D U-Net.[8][9][10][11][12]

In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation.[13] The VideoFusion model decomposes the diffusion process into two components: a base noise shared across frames, which ensures temporal coherence, and a per-frame residual noise. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generated high-quality and coherent videos. Fine-tuning the pre-trained model on video data addressed the domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences.[14] In the same month, Adobe introduced its Firefly generative AI features.[15]
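
The decomposition above can be sketched numerically: a single base noise map shared by all frames provides temporal coherence, while independent per-frame residual noise adds frame-specific variation. The function below is an illustrative sketch, not the paper's implementation; the function name, mixing ratio, and shapes are assumptions.

```python
import numpy as np

def decomposed_video_noise(num_frames, shape, base_ratio=0.5, seed=0):
    """Sample per-frame noise as a mix of one shared base noise map
    (temporal coherence) and independent residual noise (per-frame detail).
    base_ratio is the fraction of variance taken by the shared component."""
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(shape)  # shared across all frames
    frames = []
    for _ in range(num_frames):
        residual = rng.standard_normal(shape)  # unique to this frame
        # Square-root weights keep the mixture at unit variance.
        frames.append(np.sqrt(base_ratio) * base
                      + np.sqrt(1.0 - base_ratio) * residual)
    return np.stack(frames)

noise = decomposed_video_noise(num_frames=8, shape=(32, 32), base_ratio=0.6)
```

Because all frames share the base component, the noise in adjacent frames is positively correlated, which is what encourages the denoiser to produce temporally coherent video.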

In January 2024, Google announced the development of a text-to-video model named Lumiere, which is anticipated to integrate advanced video editing capabilities.[16] Matthias Niessner and Lourdes Agapito at AI company Synthesia work on developing 3D neural rendering techniques that can synthesise realistic video by using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars.[17] In June 2024, Luma Labs launched its Dream Machine video tool.[18][19] That same month, Kuaishou extended its Kling AI text-to-video model to international users.[20] In July 2024, TikTok owner ByteDance released Jimeng AI in China through its subsidiary Faceu Technology.[21] By September 2024, the Chinese AI company MiniMax debuted its video-01 model, joining other established AI model companies such as Zhipu AI, Baichuan, and Moonshot AI, which contribute to China's involvement in AI technology.[22]

Alternative approaches to text-to-video models include[23] Google's Phenaki, Hour One, Colossyan,[3] Runway's Gen-3 Alpha,[24][25] and OpenAI's Sora.[26][27] Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and Tune-A-Video, have emerged.[28] Google is also preparing to launch a video generation tool named Veo for YouTube Shorts in 2025.[29] FLUX.1 developer Black Forest Labs has announced its text-to-video model SOTA.[30]

Architecture and training

Several architectures have been used to create text-to-video models. Similar to text-to-image models, these can be trained using recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks, which have been used for pixel-transformation models and stochastic video generation models, aiding consistency and realism respectively.[31] Transformer models are an alternative. Generative adversarial networks (GANs), variational autoencoders (VAEs), which can aid in the prediction of human motion,[32] and diffusion models have also been used to develop the image-generation aspects of these models.[33]
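
As a concrete illustration of the diffusion approach, the sketch below runs a toy DDPM-style reverse process conditioned on a text embedding. The denoiser here is a placeholder (real systems use a large trained network such as a U-Net or transformer), and the names, schedule, and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder denoiser: predicts the noise in x_t given the timestep and a
# text embedding. In a real model this is a large trained neural network.
def toy_denoiser(x_t, t, text_emb):
    return x_t - text_emb.mean()  # illustrative only

# Standard DDPM bookkeeping: linear beta schedule over T steps.
T = 10
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def reverse_step(x_t, t, text_emb):
    """One reverse-diffusion step toward a clean sample, with the text
    embedding carried along as the conditioning signal."""
    eps_hat = toy_denoiser(x_t, t, text_emb)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps_hat) \
           / np.sqrt(alphas[t])
    if t > 0:  # add sampling noise on all but the final step
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

x = rng.standard_normal((8, 8))   # start from pure Gaussian noise
emb = rng.standard_normal(4)      # stand-in text embedding
for t in reversed(range(T)):
    x = reverse_step(x, t, emb)
```

Video models apply the same idea over a stack of frames, with extra machinery (such as temporal attention or shared noise) to keep frames consistent.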

Text-video datasets used to train models include, but are not limited to, WebVid-10M, HDVILA-100M, CCV, ActivityNet, and Panda-70M.[34][35] These datasets contain millions of original videos of interest, generated videos, captioned videos, and textual information that help train models for accuracy. Text prompt datasets used to train models include, but are not limited to, PromptSource, DiffusionDB, and VidProM.[34][35] These datasets provide the range of text inputs needed to teach models how to interpret a variety of textual prompts.

The video generation process involves synchronizing the text inputs with video frames, ensuring alignment and consistency throughout the sequence.[35] This predictive process tends to decline in quality as video length increases, owing to resource limitations.[35]

Limitations

Despite the rapid evolution of text-to-video models in their performance, a primary limitation is that they are very computationally heavy, which limits their capacity to provide high-quality and lengthy outputs.[36][37] Additionally, these models require a large amount of specific training data to generate high-quality and coherent outputs, which raises issues of accessibility.[37][36]

Moreover, models may misinterpret textual prompts, resulting in video outputs that deviate from the intended meaning. This can occur due to limitations in capturing semantic context embedded in text, which affects the model’s ability to align generated video with the user’s intended message.[37][35] Various models, including Make-A-Video, Imagen Video, Phenaki, CogVideo, GODIVA, and NUWA, are currently being tested and refined to enhance their alignment capabilities and overall performance in text-to-video generation.[37]

Ethics

The deployment of Text-to-Video models raises ethical considerations related to content generation. These models have the potential to create inappropriate or unauthorized content, including explicit material, graphic violence, misinformation, and likenesses of real individuals without consent.[38] Ensuring that AI-generated content complies with established standards for safe and ethical usage is essential, as content generated by these models may not always be easily identified as harmful or misleading. The ability of AI to recognize and filter out NSFW or copyrighted content remains an ongoing challenge, with implications for both creators and audiences.[38]

Impacts and applications

Text-to-Video models offer a broad range of applications across educational, promotional, and creative industries. These models can streamline content creation for training videos, movie previews, gaming assets, and visualizations, making it easier to generate high-quality, dynamic content.[39] Such capabilities can reduce production time and cost for users. The feature film The Reality of Time, described as the world's first full-length movie to fully integrate generative AI for video, was completed in 2024 and is narrated in part by John de Lancie (known for his role as "Q" in Star Trek: The Next Generation). Its production utilized advanced AI tools, including Runway Gen-3 Alpha and Kling 1.6, as described in the book Cinematic A.I., which explores the limitations of text-to-video technology, the challenges of implementing it, and how image-to-video techniques were employed for many of the film's key shots.

Comparison of existing models

Model/Product | Company | Year released | Status | Key features | Capabilities | Pricing | Video length | Supported languages
Synthesia | Synthesia | 2019 | Released | AI avatars, multilingual support for 60+ languages, customization options[40] | Specialized in realistic AI avatars for corporate training and marketing[40] | Subscription-based, starting around $30/month | Varies based on subscription | 60+
InVideo AI | InVideo | 2021 | Released | AI-powered video creation, large stock library, AI talking avatars[40] | Tailored for social media content with platform-specific templates[40] | Free plan available, paid plans starting at $16/month | Varies depending on content type | Multiple (not specified)
Fliki | Fliki AI | 2022 | Released | Text-to-video with AI avatars and voices, extensive language and voice support[40] | Supports 65+ AI avatars and 2,000+ voices in 70 languages[40] | Free plan available, paid plans starting at $30/month | Varies based on subscription | 70+
Runway Gen-2 | Runway AI | 2023 | Released | Multimodal video generation from text, images, or videos[41] | High-quality visuals, various modes like stylization and storyboard[41] | Free trial, paid plans (details not specified) | Up to 16 seconds | Multiple (not specified)
Pika Labs | Pika Labs | 2024 | Beta | Dynamic video generation, camera and motion customization[42] | User-friendly, focused on natural dynamic generation[42] | Currently free during beta | Flexible, supports longer videos with frame continuation | Multiple (not specified)
Runway Gen-3 Alpha | Runway AI | 2024 | Alpha | Enhanced visual fidelity, photorealistic humans, fine-grained temporal control[43] | Ultra-realistic video generation with precise key-framing and industry-level customization[43] | Free trial available, custom pricing for enterprises | Up to 10 seconds per clip, extendable | Multiple (not specified)
OpenAI Sora | OpenAI | 2024 | Alpha | Deep language understanding, high-quality cinematic visuals, multi-shot videos[44] | Capable of creating detailed, dynamic, and emotionally expressive videos; still under development with safety measures[44] | Pricing not yet disclosed | Expected to generate longer videos; duration specifics TBD | Multiple (not specified)

Text-to-image model

 A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description.

Text-to-image models began to be developed in the mid-2010s during the beginnings of the AI boom, as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models—such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney—began to be considered to approach the quality of real photographs and human-drawn art.

Text-to-image models are generally latent diffusion models, which combine a language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.[1]

History

Before the rise of deep learning,[when?] attempts to build text-to-image models were limited to collages by arranging existing component images, such as from a database of clip art.[2][3]

The inverse task, image captioning, was more tractable, and a number of image captioning deep learning models came prior to the first text-to-image models.[4]

The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences.[4] Images generated by alignDRAW were low-resolution (32×32 pixels, attained from resizing) and were considered to be 'low in diversity'. The model was able to generalize to objects not represented in the training data (such as a red school bus) and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set.[4][5]

Eight images generated from the text prompt A stop sign is flying in blue skies. by AlignDRAW (2015). Enlarged to show detail.[6]

In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task.[5][7] With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO (Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details.[5] Later systems include VQGAN-CLIP,[8] XMC-GAN, and GauGAN2.[9]

DALL·E 2's (top, April 2022) and DALL·E 3's (bottom, September 2023) generated images for the prompt A stop sign is flying in blue skies

One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021.[10] A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022,[11] followed by Stable Diffusion, which was publicly released in August 2022.[12] In August 2022, text-to-image personalization was introduced, allowing the model to be taught a new concept using a small set of images of a new object that was not included in the training set of the text-to-image foundation model. This is achieved by textual inversion, namely finding a new text term whose embedding corresponds to these images.
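
Schematically, textual inversion freezes the generator and optimizes only a single new embedding vector so that generating with it reproduces the user's example images. The toy below replaces the frozen diffusion model with a fixed linear map; all names and numbers are illustrative assumptions, not the method's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "generator": a fixed linear map standing in for the (much larger)
# frozen text-to-image model. Only the new embedding v is trained.
W = rng.standard_normal((16, 4))
target = W @ np.array([1.0, -2.0, 0.5, 3.0])  # features of the new concept

v = np.zeros(4)  # the new "word" embedding: the only trainable parameter
lr = 0.01
for _ in range(2000):
    err = W @ v - target      # reconstruction error on the example images
    v -= lr * (W.T @ err)     # gradient step on v alone; W stays frozen
```

After optimization, v behaves like the embedding of a new pseudo-word that can be inserted into prompts to reference the learned concept.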

Following other text-to-image models, language model-powered text-to-video platforms such as Runway, Make-A-Video,[13] Imagen Video,[14] Midjourney,[15] and Phenaki[16] can generate video from text and/or text/image prompts.[17]

Architecture and training

High-level architecture showing the state of AI art machine learning models, with notable models and applications

Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though transformer models have since become a more popular option. For the image generation step, conditional generative adversarial networks (GANs) have been commonly used, with diffusion models also becoming a popular option in recent years. Rather than directly training a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images, and use one or more auxiliary deep learning models to upscale them, filling in finer details.
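
The low-resolution-then-upscale pipeline can be sketched as a cascade. Both stages below are toy stand-ins (a real base generator and the super-resolution stages are separate conditioned networks); the function names and sizes are assumptions for illustration.

```python
import numpy as np

def toy_base_model(text_emb, rng):
    """Stand-in for a base generator: a low-resolution 'image' conditioned
    (trivially) on the text embedding. Not a trained model."""
    return rng.standard_normal((64, 64)) + text_emb.mean()

def toy_upscaler(img, factor=2):
    """Stand-in for a super-resolution stage: nearest-neighbour upsampling.
    Real systems use another conditioned deep model to fill in detail."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def cascaded_generate(text_emb, stages=2, seed=0):
    rng = np.random.default_rng(seed)
    img = toy_base_model(text_emb, rng)  # e.g. 64x64 base sample
    for _ in range(stages):              # 64 -> 128 -> 256
        img = toy_upscaler(img)
    return img

out = cascaded_generate(np.ones(4), stages=2)  # 256x256 result
```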

Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With their 2022 Imagen model, Google Brain reported positive results from using a large language model trained separately on a text-only corpus (with its weights subsequently frozen), a departure from the theretofore standard approach.[18]

Datasets

Examples of images and captions from three public datasets which are commonly used to train text-to-image models

Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is the COCO dataset. Released by Microsoft in 2014, COCO consists of around 123,000 images depicting a diversity of objects with five captions per image, generated by human annotators. Originally, the main focus of COCO was on the recognition of objects and scenes in images. Oxford-102 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds, respectively. It is considered less difficult to train a high-quality text-to-image model with these datasets because of their narrow range of subject matter.[7]

One of the largest open datasets for training text-to-image models is LAION-5B, containing more than 5 billion image-text pairs. This dataset was created using web scraping and automatic filtering based on similarity to high-quality artwork and professional photographs. Because of this, however, it also contains controversial content, which has led to discussions about the ethics of its use.

Some modern AI platforms not only generate images from text but also create synthetic datasets to improve model training and fine-tuning. These datasets help avoid copyright issues and expand the diversity of training data.[19]

Quality evaluation

Evaluating and comparing the quality of text-to-image models is a problem involving assessing multiple desirable properties. A desideratum specific to text-to-image models is that generated images semantically align with the text captions used to generate them. A number of schemes have been devised for assessing these qualities, some automated and others based on human judgement.[7]

A common algorithmic metric for assessing image quality and diversity is the Inception Score (IS), which is based on the distribution of labels predicted by a pretrained Inceptionv3 image classification model when applied to a sample of images generated by the text-to-image model. The score is increased when the image classification model predicts a single label with high probability, a scheme intended to favour "distinct" generated images. Another popular metric is the related Fréchet inception distance, which compares the distribution of generated images and real training images according to features extracted by one of the final layers of a pretrained image classification model.[7]
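
The Inception Score can be computed directly from the classifier's predicted label distributions: IS = exp(E_x[KL(p(y|x) || p(y))]), where p(y) is the marginal over the generated sample. A minimal sketch, with the pretrained classifier assumed external:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs[i] is the label distribution p(y|x_i) predicted by a pretrained
    classifier (e.g. Inception-v3) for generated image x_i.
    Returns exp of the mean KL divergence from p(y|x) to the marginal p(y):
    high when predictions are confident (quality) and varied (diversity)."""
    probs = np.clip(probs, eps, 1.0)
    p_y = probs.mean(axis=0)  # marginal label distribution over the sample
    kl = (probs * (np.log(probs) - np.log(p_y))).sum(axis=1)
    return float(np.exp(kl.mean()))

sharp = np.eye(4)             # confident and diverse predictions
flat = np.full((4, 4), 0.25)  # uniform, uninformative predictions
```

With four classes, perfectly confident and evenly spread predictions give the maximum score of 4, while uniform predictions give 1.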

Impact and applications

AI has the potential for a societal transformation, which may include enabling the expansion of noncommercial niche genres (such as cyberpunk derivatives like solarpunk) by amateurs, novel entertainment, fast prototyping,[20] increasing art-making accessibility,[20] and artistic output per effort or expenses or time[20]—e.g., via generating drafts, draft-definitions, and image components (inpainting). Generated images are sometimes used as sketches,[21] low-cost experiments,[22] inspiration, or illustrations of proof-of-concept-stage ideas. Additional functionalities or improvements may also relate to post-generation manual editing (i.e., polishing), such as subsequent tweaking with an image editor.[22]

List of notable text-to-image models

Name | Release date | Developer | License
DALL-E | January 2021 | OpenAI | Proprietary
DALL-E 2 | April 2022 | OpenAI | Proprietary
DALL-E 3 | September 2023 | OpenAI | Proprietary
Ideogram 2.0 | August 2024 | Ideogram |
Imagen | April 2023 | Google |
Imagen 2 | December 2023[23] | Google |
Imagen 3 | May 2024 | Google |
Parti | Unreleased | Google |
Firefly | March 2023 | Adobe Inc. |
Midjourney | July 2022 | Midjourney, Inc. |
Stable Diffusion | August 2022 | Stability AI | Stability AI Community License[note 1]
Flux | August 2024 | Black Forest Labs | Apache License[note 2]
Aurora | December 2024 | xAI | Proprietary
RunwayML | 2018 | Runway AI, Inc. | Proprietary