By Kent Yang | Staff Writer
Imagine a technology so advanced that words alone can conjure breathtaking worlds, complete with sights that mirror reality. Well, now it is possible, or at least it soon will be, with Sora by OpenAI.
OpenAI is an AI research and deployment company that aims to ensure artificial intelligence benefits all of humanity. Sora is an AI model that translates text into videos up to one minute long while simulating the physics and dynamics of our reality. The model not only comprehends what a prompt says but also grasps how the things it describes exist and move in the real world.
Before delving too deeply, here’s a fun fact: in Japanese, “Sora” means “sky” and often carries metaphorical and spiritual meanings such as transcendence, boundlessness, freedom, and the infinite expansion of the universe. With Sora and the future of AI, the potential is limitless.
Judging from its demos, Sora appears almost perfect. However, like all early AI models, it has its flaws. Users with early access have reported issues such as a lack of attention to detail, inaccurate motion, unrealistic transformations, difficulty with complex scenes, and unnatural facial expressions.
So, how does Sora really work? Sora is trained on a massive dataset of high-quality images and videos paired with highly descriptive captions. It learns from this material through a neural network, layers of interconnected nodes that process data in a way loosely inspired by how the human brain processes and understands information. When Sora receives a written prompt from a user, it generates a video by combining several models, dedicated programs with specific functions, that find patterns and make decisions based on what was learned during training.
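To make the idea of nodes concrete, here is a minimal, purely illustrative sketch in Python. All of the dimensions and weights below are made up for illustration; Sora's actual network is vastly larger and structured very differently, but each node performs the same basic operation: weight its inputs, sum them, and apply a nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, weights, bias):
    # One layer of nodes: each node takes a weighted sum of its
    # inputs, then applies a ReLU nonlinearity (clip negatives to 0).
    return np.maximum(0, x @ weights + bias)

# Toy dimensions: a 4-number input flowing through two layers of nodes
x = rng.normal(size=4)                    # stand-in for encoded input features
w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

hidden = layer(x, w1, b1)                 # first layer of nodes fires
output = layer(hidden, w2, b2)            # second layer refines the signal
print(output)                             # the network's raw prediction
```

During training, the weights are adjusted over and over until the network's outputs match the patterns in the data, which is how the "learning" in machine learning actually happens.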
One technique is cascaded diffusion. It starts by creating a rough, low-resolution version of an image or video, then passes it through a series of stages that progressively sharpen it until it reaches a high-resolution output. Another technique, latent diffusion, uses an encoder to compress the image into a lower-dimensional latent space (in AI, a compressed representation that captures the essential characteristics of the data), iteratively refines that latent representation, and finally uses a decoder to expand the finished result into a high-resolution image.
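As a rough sketch of the latent diffusion idea, the following Python outline shows the three stages: start from noise in latent space, refine it iteratively, then decode. The functions here are simplistic stand-ins for the large trained networks a real system uses, and the prompt is a hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(latent, prompt, step):
    # A real model predicts and subtracts noise conditioned on the
    # prompt; here we just nudge the latent toward a fixed target
    # for illustration (the prompt is unused in this toy version).
    target = np.full_like(latent, 0.5)
    return latent + 0.1 * (target - latent)

def decode(latent):
    # A real decoder is a trained network mapping latent -> pixels;
    # here we simply upsample the latent into a larger grid.
    return np.repeat(np.repeat(latent.reshape(4, 4), 16, 0), 16, 1)

prompt = "a corgi surfing at sunset"   # hypothetical prompt
latent = rng.normal(size=16)           # 1. start from pure noise in latent space
for step in reversed(range(50)):       # 2. iteratively refine the latent
    latent = denoise(latent, prompt, step)
image = decode(latent)                 # 3. decode the clean latent into pixels
print(image.shape)                     # (64, 64) toy "high-resolution" output
```

Working in the small latent space rather than on full-size pixels is what makes this approach efficient enough to be practical.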
Sora also employs diffusion transformers, known for their flexibility and scalability in handling data and compute. In this context, a transformer is a type of neural network designed to convert one sequence of information into another. A diffusion transformer approaches generation like solving a giant puzzle: instead of producing everything at once, it breaks the problem into smaller pieces, making estimates and gradual refinements until the puzzle is solved. During training, the model adds noise to real data and learns to remove it; at generation time it runs that process in reverse, starting from pure noise and refining it step by step into a coherent result.
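Here is a simplified, hypothetical view of one diffusion training step in Python. The model_predict_noise function is a stand-in for the diffusion transformer, and the noise schedule is drastically simplified, but the core idea matches: corrupt clean data with a known amount of noise and train the network to predict exactly that noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_predict_noise(noisy, t):
    # Stand-in for the diffusion transformer; a real one is a trained
    # network that examines the noisy input and the timestep t.
    return noisy * 0.5

clean = rng.normal(size=(4, 4))        # a clean training patch ("puzzle piece")
t = 0.3                                # how far along the noising process we are
noise = rng.normal(size=(4, 4))
noisy = np.sqrt(1 - t) * clean + np.sqrt(t) * noise  # add a known amount of noise

predicted = model_predict_noise(noisy, t)
loss = np.mean((predicted - noise) ** 2)  # train to predict the exact noise added
print(f"training loss: {loss:.3f}")
```

Once the network can reliably predict the noise at every step, subtracting its predictions repeatedly, starting from pure static, is what reconstructs each piece of the puzzle.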
Beyond text-to-video generation, Sora is versatile enough to transform still photos into videos, create looped sequences, and merge elements from two videos into one. As of March 2024, Sora has not been released to the public, but according to Mira Murati, OpenAI’s Chief Technology Officer, it will be available later this year. As AI technology evolves, Sora stands at the forefront of text-to-video artificial intelligence. Perhaps someday it will generate videos well beyond the one-minute limit, and even full-length, full-quality movies. That advancement, however, may bring its own challenges in data ethics, a discussion for another time.