SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

바세린용자 (yongja)

Jan 24, 2026

Task

Generating images on-device on mobile

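Limitations of prior work

In on-device image generation on mobile, the bottleneck is the UNet.
However, prior work did not tailor compression to the UNet's structure. It relied on generic post-training methods such as pruning and architecture search.

These methods alter the model architecture, so a long fine-tuning run is needed afterward.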

SnapFusion performance

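The UNet is the bottleneck in overall runtime, and it is even slower because the number of denoising steps was previously 50.
SnapFusion reduces both the latency and the number of steps.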

UNet architecture

[Figure: UNet architecture]

Efficient UNet

Pruning or architecture search changes the model structure, which can degrade performance, and recovering it requires fine-tuning.

Robust Training

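The idea is said to be borrowed from elastic depth.
The authors describe it as a training augmentation.
With some probability, each Cross Attention or ResNet block is replaced by an identity mapping (skipped).

The model is then said to be robust to structural changes, even when pruning or architecture search alters the architecture later.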
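
The block-skipping idea can be sketched roughly as below. This is a minimal illustration, not the paper's training code: `stochastic_forward`, `keep_prob`, and the toy blocks are hypothetical names, and the "blocks" here are plain functions on a number rather than real UNet layers.

```python
import random

def stochastic_forward(x, blocks, keep_prob, rng):
    """Apply blocks in order; each one is skipped (identity) with probability 1 - keep_prob."""
    executed = []
    for name, block in blocks:
        if rng.random() < keep_prob:
            x = block(x)            # run the block as usual
            executed.append(name)
        # otherwise: identity mapping, x passes through unchanged
    return x, executed

# Toy stand-ins for ResNet / Cross Attention blocks (assumption: functions on a float).
toy_blocks = [
    ("resnet_1", lambda x: x + 1.0),
    ("cross_attn_1", lambda x: x * 2.0),
    ("resnet_2", lambda x: x - 0.5),
]

rng = random.Random(0)
out, ran = stochastic_forward(1.0, toy_blocks, keep_prob=0.7, rng=rng)
print(out, ran)
```

Because the network sees many "partially removed" versions of itself during training, dropping a block later is less of a shock. With `keep_prob=1.0` this reduces to an ordinary forward pass.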

Evaluation and Architecture Evolving

The following action set is defined and each action is evaluated:

[Figure: action set]

+ means the block is executed, - means it is not.

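Each action is scored by ΔCLIP/ΔLatency. A high score means the block contributes a lot of quality for its latency cost, so it is kept. A low score means it contributes little quality for its latency cost, so it becomes a removal candidate.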
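
A rough sketch of this scoring is below. The block names and all numbers are made up for illustration, and treating both deltas as positive magnitudes (CLIP score lost, latency saved) is an assumption of this sketch, not something taken from the paper.

```python
def value_scores(measurements):
    """Return (block, score) pairs sorted from lowest to highest CLIP-per-latency score."""
    scores = []
    for block, (clip_drop, latency_saved_ms) in measurements.items():
        if latency_saved_ms <= 0:
            raise ValueError(f"{block}: latency saving must be positive")
        scores.append((block, clip_drop / latency_saved_ms))
    return sorted(scores, key=lambda pair: pair[1])

# Hypothetical numbers: (CLIP score lost when the block is removed, latency saved in ms).
measurements = {
    "down_0.cross_attn": (0.002, 40.0),
    "mid.cross_attn": (0.030, 10.0),
    "up_2.resnet": (0.010, 20.0),
}

ranked = value_scores(measurements)
removal_candidate = ranked[0][0]   # lowest quality per unit of latency
print(ranked)
print("remove first:", removal_candidate)
```

In this toy data the first block costs a lot of latency but barely affects CLIP score, so it ranks lowest and would be removed first.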

Latency is computed as follows and used as a reference:

[Figure: latency computation]

Algorithm

Robust training is needed so that the model can still generate images that match the prompt even after cross attention blocks are removed.

Performance also depends on which blocks are removed.

Efficient Image Decoder

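50% uniform channel pruning: the number of channels in each conv layer of the decoder is halved.

A conv layer's parameter count and MACs are roughly proportional to (input channels × output channels), so halving both gives 0.5 × 0.5 = 0.25, i.e. about 1/4 of the MACs.
Compared with the Stable Diffusion decoder, it has 3.8× fewer parameters and is 3.2× faster.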
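
The 1/4 arithmetic can be checked with a small calculation. The layer sizes below (128→256 channels, 3×3 kernel, 64×64 output) are arbitrary example values, not the actual decoder configuration.

```python
def conv2d_macs(c_in, c_out, kernel_size, h_out, w_out):
    """Multiply-accumulates of a standard (non-grouped) 2D convolution, ignoring bias."""
    return c_in * c_out * kernel_size * kernel_size * h_out * w_out

def conv2d_params(c_in, c_out, kernel_size):
    """Weight count of the same convolution, ignoring bias."""
    return c_in * c_out * kernel_size * kernel_size

full_macs = conv2d_macs(128, 256, 3, 64, 64)
pruned_macs = conv2d_macs(64, 128, 3, 64, 64)     # both channel counts halved
macs_ratio = pruned_macs / full_macs

params_ratio = conv2d_params(64, 128, 3) / conv2d_params(128, 256, 3)
print(macs_ratio, params_ratio)
```

Both ratios come out to 0.25 for a single layer. The whole-decoder figure (3.8×) is slightly below 4×, presumably because some parts cannot shrink quadratically, for example biases, or layers whose input or output channel count is fixed.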

Step Distillation

First, Stable Diffusion is converted from an ε (noise) prediction model to a v (velocity) prediction model.

Prior work reportedly found that step distillation works better with v-prediction models.
v is the (scaled) noise minus the (scaled) clean sample: v = α_t·ε − σ_t·x
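
The relationship between ε, x, and v can be checked numerically. This sketch assumes the usual v-parameterization, v = α_t·ε − σ_t·x with α_t² + σ_t² = 1, and a cosine schedule chosen only for the example. The function names are hypothetical.

```python
import math

def noisy_latent(x, eps, alpha, sigma):
    """Forward diffusion: z_t = alpha * x + sigma * eps."""
    return alpha * x + sigma * eps

def to_v(x, eps, alpha, sigma):
    """Velocity target: v = alpha * eps - sigma * x."""
    return alpha * eps - sigma * x

def recover_x_eps(z, v, alpha, sigma):
    """Invert the two equations above (valid when alpha^2 + sigma^2 = 1)."""
    x_hat = alpha * z - sigma * v
    eps_hat = sigma * z + alpha * v
    return x_hat, eps_hat

t = 0.3
alpha, sigma = math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)
x, eps = 0.8, -1.2                       # toy scalar "image" and noise
z = noisy_latent(x, eps, alpha, sigma)
v = to_v(x, eps, alpha, sigma)
x_hat, eps_hat = recover_x_eps(z, v, alpha, sigma)
print(x_hat, eps_hat)
```

So a v-prediction model carries the same information as an ε-prediction model: given z_t and v, both the clean sample and the noise can be recovered.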

Distillation is then performed.

vanilla distillation

Empirically, the following schedule reportedly worked well:

teacher: Stable Diffusion 1.5 (originally 50 steps)
student: Efficient UNet
When the teacher runs 32 steps, distill the student to 16 steps
When the teacher runs 16 steps, distill the student to 8 steps
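
The core of step distillation, one student step matching two teacher steps, can be sketched on scalars as below. Everything here is a simplifying assumption: the cosine schedule, the DDIM-style update, the toy `oracle` model, and comparing latents directly with a squared error. The paper's actual loss and weighting are not reproduced.

```python
import math

def alpha_sigma(t):
    """Cosine schedule (an assumption): alpha^2 + sigma^2 = 1 for t in [0, 1]."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def ddim_step(v_model, z, t, s):
    """One deterministic DDIM-style step from time t to an earlier time s with a v-prediction model."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    v = v_model(z, t)
    x_hat = a_t * z - s_t * v      # predicted clean sample
    eps_hat = s_t * z + a_t * v    # predicted noise
    return a_s * x_hat + s_s * eps_hat

def step_distill_loss(teacher, student, z, t, student_step):
    """Squared error between one student step and two teacher half-steps."""
    mid, end = t - student_step / 2, t - student_step
    target = ddim_step(teacher, ddim_step(teacher, z, t, mid), mid, end)
    pred = ddim_step(student, z, t, end)
    return (pred - target) ** 2

def oracle(clean):
    """Toy v-model that always predicts the same clean sample (valid for t > 0)."""
    def v_model(z, t):
        a, s = alpha_sigma(t)
        return (a * z - clean) / s
    return v_model

teacher = oracle(0.3)
untrained_student = lambda z, t: 0.0     # hypothetical student that has learned nothing

loss_matched = step_distill_loss(teacher, teacher, 1.0, 1.0, 0.5)
loss_untrained = step_distill_loss(teacher, untrained_student, 1.0, 1.0, 0.5)
print(loss_matched, loss_untrained)
```

A student that already behaves like the teacher gets a loss near zero, while the untrained one does not. Repeating this with the step count halved each time gives the 32→16→8 style schedule.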

CFG-aware step distillation

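With vanilla distillation, FID (image generation quality) does not degrade, but the CLIP score (how well the image follows the prompt) drops.
So the UNet output is replaced with its classifier-free guidance (CFG) version.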
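
The CFG combination itself is simple and can be sketched as below, using the standard form w·v_cond − (w − 1)·v_uncond on plain Python lists. `cfg_aware_loss` is a hypothetical single-step simplification that only shows where the guided outputs enter the loss. It omits the multi-step teacher target.

```python
def cfg_output(v_cond, v_uncond, guidance_scale):
    """Classifier-free guidance: w * v_cond - (w - 1) * v_uncond, elementwise."""
    w = guidance_scale
    return [w * c - (w - 1.0) * u for c, u in zip(v_cond, v_uncond)]

def cfg_aware_loss(teacher_cond, teacher_uncond, student_cond, student_uncond, guidance_scale):
    """Mean squared error between the guided student output and the guided teacher output."""
    target = cfg_output(teacher_cond, teacher_uncond, guidance_scale)
    pred = cfg_output(student_cond, student_uncond, guidance_scale)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

# Toy two-element "UNet outputs" (made-up numbers).
guided = cfg_output([1.0, 2.0], [0.0, 1.0], guidance_scale=3.0)
print(guided)
print(cfg_aware_loss([1.0, 2.0], [0.0, 1.0], [0.9, 2.1], [0.0, 1.0], guidance_scale=3.0))
```

With `guidance_scale=1.0` the guided output equals the conditional output, so the extra prompt-following signal comes from scales above 1.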

loss mixing

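Vanilla distillation is better for FID and CFG-aware step distillation is better for CLIP score, so the two are mixed stochastically during training.
So far there seems to be only one or so prior study on CFG-aware step distillation.

That study always runs both the vanilla and the CFG-aware distillation, so its training and evaluation cost is higher.
Here only one of the two is run at a time, so the training and evaluation cost is lower.
Because vanilla and CFG-aware distillation trade off FID against CLIP score, this paper focuses on that trade-off.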
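
The stochastic mixing can be sketched as a per-iteration coin flip. The names `training_step` and `cfg_prob` and the constant toy losses are assumptions for illustration. The actual probability used in the paper is not stated in this post.

```python
import random

def pick_distill_loss(cfg_prob, rng):
    """Choose which loss to use this iteration: CFG-aware with probability cfg_prob."""
    return "cfg_aware" if rng.random() < cfg_prob else "vanilla"

def training_step(vanilla_loss_fn, cfg_loss_fn, cfg_prob, rng):
    """Compute exactly one of the two losses, so only one is paid for per iteration."""
    if pick_distill_loss(cfg_prob, rng) == "cfg_aware":
        return "cfg_aware", cfg_loss_fn()
    return "vanilla", vanilla_loss_fn()

rng = random.Random(0)
counts = {"vanilla": 0, "cfg_aware": 0}
for _ in range(1000):
    kind, _loss = training_step(lambda: 0.1, lambda: 0.2, cfg_prob=0.3, rng=rng)
    counts[kind] += 1
print(counts)
```

Raising `cfg_prob` pushes training toward better CLIP score, and lowering it pushes toward better FID, which is the knob behind the trade-off described above.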

Experimental results

FID is better than the other models, and CLIP score is slightly worse.

The other models use the SD 1.5 architecture, so they are reportedly much slower.

Distillation improves performance.

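[Figure: panels (a) to (d)]

(a) Direct vs. progressive distillation: no big difference
(b) The paper's method performs better than prior work (w-conditioned)
(c)(d) Show the trade-off between FID and CLIP score under CFG-aware distillation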

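Limitations

The model is still large. Reducing the model size might make it even faster.
Experiments were run on an iPhone 14 Pro. Since it has relatively strong compute, the approach may not work as well on other phone models.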
