Sora – A Text-to-Video Model

On February 15, 2024 (US local time), OpenAI announced Sora, a text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adhering to the user's prompt.

Today, Sora is becoming available to red teamers to assess critical areas for harms or risks. We are also granting access to a number of visual artists, designers, and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals.

We’re sharing our research progress early to start working with and getting feedback from people outside of OpenAI and to give the public a sense of what AI capabilities are on the horizon.

Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.

The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and visual style.

The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.

The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.

Safety

We’ll be taking several important safety steps ahead of making Sora available in OpenAI’s products. We are working with red teamers — domain experts in areas like misinformation, hateful content, and bias — who will be adversarially testing the model.

We’re also building tools to help detect misleading content such as a detection classifier that can tell when a video was generated by Sora. We plan to include C2PA metadata in the future if we deploy the model in an OpenAI product.

In addition to us developing new techniques to prepare for deployment, we're leveraging the existing safety methods that we built for our products that use DALL·E 3, which are applicable to Sora as well.

For example, once in an OpenAI product, our text classifier will check and reject text input prompts that are in violation of our usage policies, like those that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others. We’ve also developed robust image classifiers that are used to review the frames of every video generated to help ensure that it adheres to our usage policies, before it’s shown to the user.
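
For a sense of what such prompt screening looks like in code, the sketch below gates generation on OpenAI's public moderation endpoint. This is illustrative only: whether Sora's production text classifier works this way, or uses this endpoint at all, is an assumption.
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt passes a moderation check.
    The public moderation endpoint stands in here for Sora's
    (unpublished) production classifier."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=prompt,
    ).results[0]
    return not result.flagged

if screen_prompt("A calm timelapse of clouds over a mountain lake"):
    print("Prompt accepted; proceed to video generation.")
else:
    print("Prompt rejected under usage policies.")
```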

We’ll be engaging policymakers, educators and artists around the world to understand their concerns and to identify positive use cases for this new technology. Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time.

Research techniques

Sora is a diffusion model, which generates a video by starting off with one that looks like static noise and gradually transforming it by removing the noise over many steps.
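
As a rough sketch of that loop, the toy sampler below starts from pure Gaussian noise and repeatedly subtracts a predicted-noise estimate. The denoiser, the noise schedule, and every dimension here are illustrative stand-ins, not Sora's actual design:
```python
import torch

def toy_denoiser(video: torch.Tensor, t: int) -> torch.Tensor:
    """Stand-in for Sora's learned network, which would predict the noise
    in the input conditioned on the timestep and the text prompt."""
    return 0.1 * video

@torch.no_grad()
def sample_video(steps: int = 50, frames: int = 16, h: int = 32, w: int = 32) -> torch.Tensor:
    # Start from pure noise shaped like a video clip:
    # (frames, channels, height, width).
    video = torch.randn(frames, 3, h, w)
    for t in reversed(range(steps)):
        eps = toy_denoiser(video, t)        # predicted noise at this step
        video = video - eps / steps         # strip away a little of it
        if t > 0:                           # DDPM-style samplers re-inject a
            video += 0.01 * torch.randn_like(video)  # small amount of noise
    return video

clip = sample_video()
print(clip.shape)  # torch.Size([16, 3, 32, 32])
```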

Sora is capable of generating entire videos all at once or extending generated videos to make them longer. By giving the model foresight of many frames at a time, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily.
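
One way to picture joint denoising over known and unknown frames (an assumption for illustration, not a published detail of Sora) is inpainting-style extension: append noise frames to an existing clip and denoise the whole thing together, clamping the known frames back each step:
```python
import torch

@torch.no_grad()
def extend_video(known: torch.Tensor, new_frames: int, steps: int = 50) -> torch.Tensor:
    # Append pure-noise frames after the known ones, then denoise the whole
    # clip jointly so the model "sees" past and future frames together.
    T = known.shape[0]
    video = torch.cat([known, torch.randn(new_frames, *known.shape[1:])])
    for t in reversed(range(steps)):
        eps = 0.1 * video           # stand-in for the learned noise prediction
        video = video - eps / steps
        video[:T] = known           # inpainting-style: clamp known frames back
    return video

longer = extend_video(torch.randn(16, 3, 32, 32), new_frames=8)
print(longer.shape)  # torch.Size([24, 3, 32, 32])
```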

Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance.
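
A minimal stand-in for such a backbone, with illustrative dimensions rather than Sora's: a stack of standard transformer encoder layers mapping a sequence of patch tokens to per-token noise predictions:
```python
import torch
import torch.nn as nn

patch_dim = 256   # illustrative token width, not Sora's
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=patch_dim, nhead=8, batch_first=True),
    num_layers=4,
)
noise_head = nn.Linear(patch_dim, patch_dim)  # predict noise for each token

tokens = torch.randn(1, 64, patch_dim)        # one clip as 64 patch tokens
noise_pred = noise_head(backbone(tokens))
print(noise_pred.shape)                       # torch.Size([1, 64, 256])
```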

We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.
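
The sketch below shows what cutting a video into patch tokens might look like in practice; the patch sizes and the use of raw pixels (rather than a learned latent space) are simplifying assumptions:
```python
import torch

def patchify(video: torch.Tensor, pt: int = 2, ph: int = 16, pw: int = 16) -> torch.Tensor:
    """Cut a video of shape (T, C, H, W) into spacetime patches and flatten
    each patch into a vector, analogous to a token in GPT."""
    T, C, H, W = video.shape
    v = video.unfold(0, pt, pt)   # time   -> (T/pt, C, H, W, pt)
    v = v.unfold(2, ph, ph)       # height -> (T/pt, C, H/ph, W, pt, ph)
    v = v.unfold(3, pw, pw)       # width  -> (T/pt, C, H/ph, W/pw, pt, ph, pw)
    v = v.permute(0, 2, 3, 1, 4, 5, 6)      # group the patch dims together
    return v.reshape(-1, C * pt * ph * pw)  # (num_patches, patch_dim)

tokens = patchify(torch.randn(16, 3, 256, 256))
print(tokens.shape)  # torch.Size([2048, 1536]) -- 8*16*16 patches of dim 3*2*16*16
```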

Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user’s text instructions in the generated video more faithfully.
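
As an illustration of recaptioning, the sketch below asks a vision-language model for an exhaustive caption of a training frame. OpenAI has not published Sora's captioner; gpt-4o is used here purely as a stand-in:
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def recaption(image_url: str) -> str:
    """Generate a highly descriptive caption for a training frame,
    in the spirit of DALL·E 3's recaptioning technique."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in exhaustive visual detail."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Example (hypothetical URL):
# print(recaption("https://example.com/frame.png"))
```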

In addition to being able to generate a video solely from text instructions, the model is able to take an existing still image and generate a video from it, animating the image’s contents with accuracy and attention to small detail. The model can also take an existing video and extend it or fill in missing frames. Learn more in our technical report.

Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.
(Source: OpenAI)
