MagicVideo: Efficient Video Generation With Latent Diffusion Models

바세린용자 (yongja)

Mar 29, 2026

Task

Generate a video that follows a text prompt

Challenges

Video-text datasets are scarce
Video is more complex (more dynamic) than images
The computation cost is much higher than for images

Existing approaches

Generate a low-resolution video, then upscale it

However, even generating the low-resolution video requires a lot of computation

For example, when generating a coarse video clip of 16 frames and 64 × 64 resolution, the recent video diffusion model [13] would take 6-10 seconds with 75G GPU memory for each diffusion iteration.

The first video diffusion model to apply LDM (latent diffusion model)

Model architecture

2D + adaptor structure

3D convolution requires a lot of computation
So earlier work used a 2D (image diffusion) + 1D (temporal dimension) structure
This paper instead tries a 2D + adaptor approach

video distribution adaptor

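A 2D convolution is applied to each of the F frames to extract spatial features
Then the mean and variance of the features are adjusted for every frame and channel

The reasoning is that the semantic content within a single video clip is similar across frames, so small differences may not carry much meaning.
However, the same S and B are applied to a given frame and channel for every video, so I'm not sure how meaningful this really is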
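
To make the adaptor idea concrete, here is a minimal sketch in plain Python. It assumes that S is a per-frame, per-channel scale and B a per-frame, per-channel bias, applied as S * x + B; the notes do not spell out the exact formula, so the function name, the data layout, and the affine form are illustrative assumptions rather than the paper's implementation.

```python
# Toy sketch of a per-frame, per-channel affine adaptor.
# Assumption: S (scale) and B (bias) are learned scalars applied as S * x + B.
def apply_adaptor(features, scale, bias):
    """features[f][c] is a flat list of spatial values for frame f, channel c.

    scale[f][c] and bias[f][c] are shared across all videos, which is the
    property questioned in the notes above.
    """
    return [
        [
            [scale[f][c] * x + bias[f][c] for x in channel]
            for c, channel in enumerate(frame)
        ]
        for f, frame in enumerate(features)
    ]


# 2 frames, 1 channel, 2 spatial positions per channel
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
scale = [[1.0], [0.5]]  # hypothetical learned S
bias = [[0.0], [1.0]]   # hypothetical learned B
adapted = apply_adaptor(features, scale, bias)
```

In this toy input the first frame passes through unchanged, while the second frame is rescaled and shifted, which changes its mean and variance.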

spatial and directed temporal attention

directed temporal attention

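The same spatial position in each frame is treated as a sequence ordered in time
Temporal self-attention is applied to this sequence
But it is not bidirectional: a frame cannot see the future

lower triangular mask
An inductive bias that follows the nature of video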
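
The masking described above can be sketched as follows. This is a simplified single-head version for one spatial position, written in plain Python; the learned query/key/value projections of a real attention layer are omitted, and the function names are illustrative.

```python
import math


def causal_mask(num_frames):
    """Lower triangular mask: frame i may attend to frame j only if j <= i."""
    return [[1 if j <= i else 0 for j in range(num_frames)]
            for i in range(num_frames)]


def directed_temporal_attention(q, k, v):
    """q, k, v: per-frame vectors for ONE spatial position, ordered in time."""
    n, d = len(q), len(q[0])
    mask = causal_mask(n)
    out = []
    for i in range(n):
        # Scaled dot-product scores; future frames get -inf so softmax gives 0.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            if mask[i][j] else -math.inf
            for j in range(n)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 frames, feature dim 2
out = directed_temporal_attention(frames, frames, frames)
```

Because the first frame can only attend to itself, its output equals its own value vector, and changing a later frame never affects the output of an earlier one.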

Training strategy

frame sampling and training objective

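Frames are sampled at a ratio of 1/Ls from videos of varying length to build a 16-frame clip

For example, if Ls=1, all 16 frames out of 16 are used
If Ls=2, every second frame out of 32 is used, giving 16 frames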
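
The sampling rule can be sketched in a few lines of Python. The function name is hypothetical, and starting from the first frame of the video is an assumption; the notes only state the 1/Ls ratio and the 16-frame clip length.

```python
def sample_clip(video_frames, stride, clip_len=16):
    """Keep one frame out of every `stride` (Ls) to build a clip_len-frame clip.

    Assumption: sampling starts at the first frame of the video.
    """
    needed = stride * clip_len
    if len(video_frames) < needed:
        raise ValueError(f"need at least {needed} frames, got {len(video_frames)}")
    return video_frames[:needed:stride]


clip_ls1 = sample_clip(list(range(16)), stride=1)  # uses all 16 frames
clip_ls2 = sample_clip(list(range(32)), stride=2)  # every second frame of 32
```

Both calls return 16 frames; with Ls=2 the clip covers twice as much source video, so motion appears twice as fast.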

The sampling value, FPS v, is passed through a sinusoidal embedding and fed to the model

This tells the model how fast the training video progresses
It is a kind of positional embedding
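
A sinusoidal embedding of a scalar such as v can be sketched as below. This follows the common Transformer-style formulation; the embedding dimension and the max_period constant are assumptions, since the notes do not give the values used in the paper.

```python
import math


def sinusoidal_embedding(value, dim=8, max_period=10000.0):
    """Map a scalar to `dim` sine/cosine features (dim is assumed to be even)."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return ([math.sin(value * f) for f in freqs]
            + [math.cos(value * f) for f in freqs])


emb_slow = sinusoidal_embedding(1)  # e.g. v for Ls=1
emb_fast = sinusoidal_embedding(2)  # e.g. v for Ls=2
```

Different sampling values map to different vectors, which gives the model a smooth signal it can condition on.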

training objective

Sum of the per-frame losses
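
As a rough illustration of summing per-frame losses, the sketch below assumes the usual diffusion objective, a mean squared error between predicted and target noise, computed per frame. The notes only say that the per-frame losses are summed, so the choice of MSE and the function names are assumptions.

```python
def frame_loss(pred, target):
    """Mean squared error for one frame (assumed noise-prediction loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def clip_loss(pred_frames, target_frames):
    """Total loss for a clip: the sum of the per-frame losses."""
    return sum(frame_loss(p, t) for p, t in zip(pred_frames, target_frames))


pred = [[1.0, 2.0], [0.0, 0.0]]    # predicted noise, 2 frames
target = [[0.0, 0.0], [0.0, 2.0]]  # target noise, 2 frames
total = clip_loss(pred, target)
```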

unsupervised training scheme

Video-text pair datasets are scarce
So for pretraining, the text condition is replaced with embeddings obtained by feeding video frames into CLIP
The model is then fine-tuned on real video-text pairs

frame interpolation

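To make the video smoother, frames are interpolated between existing frames (extra frames are filled in)
The preceding and following frames are given as conditions, and 3 frames are added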
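
The frame bookkeeping of this step can be sketched as follows. Note that the paper generates the new frames with a model conditioned on the neighboring frames; the linear blend below is only a stand-in so that the example runs, and the function name is hypothetical.

```python
def interpolate_clip(key_frames, num_new=3):
    """Insert `num_new` frames between each pair of adjacent key frames.

    Placeholder: new frames are linear blends of their two neighbors, NOT the
    learned interpolation described in the notes.
    """
    out = []
    for prev, nxt in zip(key_frames, key_frames[1:]):
        out.append(prev)
        for i in range(1, num_new + 1):
            t = i / (num_new + 1)
            out.append([(1 - t) * a + t * b for a, b in zip(prev, nxt)])
    out.append(key_frames[-1])
    return out


key_frames = [[float(i)] for i in range(16)]  # 16 toy one-pixel frames
smooth = interpolate_clip(key_frames)
```

With 3 new frames in each of the 15 gaps, 16 key frames become 16 + 15 × 3 = 61 frames.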

video vae decoder

Decoding video frames to RGB with the VAE decoder reportedly causes dithering

It is said to get worse as the latent resolution gets lower

With directed temporal attention, information from earlier frames is available, so the dithering is reportedly reduced

super resolution

The video is generated at 256×256 and upsampled to 1024×1024

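Overall pipeline order

Stage 1: latent video generation

key frame generation
interpolation → produces the content and motion of the video

Stage 2: VideoVAE decoder

Restores the latents to RGB
Enforces temporal consistency during decoding to reduce dithering

Stage 3: super-resolution

Upsamples the 256×256 RGB video to 1024×1024
Improves detail and sharpness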

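Experiments

Training data

LDM (pretrained on LAION-5B) → initialization
A subset of HD-VILA-100M (10 million videos) and WebVid-10M → unsupervised training
7 million video-text pairs collected by the authors → fine-tuning

Results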

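Evaluation metrics

FID

Images are passed through a feature extractor, and the distance between the feature distributions of generated and real images is compared

FVD

Videos are passed through a feature extractor, and the distance between the feature distributions of generated and real videos is compared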
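
Both metrics are Fréchet distances between Gaussians fitted to the extracted features. The sketch below shows the one-dimensional special case in plain Python; real FID and FVD use high-dimensional features from a pretrained network and full covariance matrices, so this toy version only illustrates the formula, and the function names are illustrative.

```python
import math


def mean_var(xs):
    """Mean and (population) variance of a list of scalar features."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v


def frechet_distance_1d(real_feats, gen_feats):
    """Squared Frechet distance between two 1-D Gaussians fitted to the inputs.

    d^2 = (m1 - m2)^2 + v1 + v2 - 2 * sqrt(v1 * v2)
    """
    m1, v1 = mean_var(real_feats)
    m2, v2 = mean_var(gen_feats)
    return (m1 - m2) ** 2 + v1 + v2 - 2 * math.sqrt(v1 * v2)


same = frechet_distance_1d([0.0, 2.0], [0.0, 2.0])
shifted = frechet_distance_1d([0.0, 2.0], [1.0, 3.0])
```

Identical feature distributions give a distance of zero, and the distance grows as the generated features drift away from the real ones, so lower is better for both metrics.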

zero shot

The training split of the test dataset is not used

+ Directed Attention

A mask lets each frame see only past frames, not future ones

+ Adapter

Per-frame mean/variance adjustment

+ Unsupervised pretraining

Because text-video pairs are scarce, the model is first pretrained on video-only data using embeddings extracted with CLIP
