KVQ: Kwai Video Quality Assessment for Short-form Videos
Yiting Lu^{1*}, Xin Li^{1*}, Yajing Pei^{1,2,*}, Kun Yuan^{2,†}, Qizhi Xie^{2,3}, Yunpeng Qu^{2,3}, Ming Sun^{2}, Chao Zhou^{2}, Zhibo Chen^{1,†}
^{1}University of Science and Technology of China, ^{2}Kuaishou Technology, ^{3}Tsinghua University
{luyt31415,lixin666,peiyj}@mail.ustc.edu.cn, [email protected]
{yuankun03,xieqizhi,quyunpeng,sunming03,zhouchao}@kuaishou.com
[Figure 1 image: (a) Special Effect Video, (b) Subtitled Video, (c) Three-Stage Video, and (d) Live Video across person, portrait, crowd, food, stage, CG, caption, night, and landscape scenarios (top); before/after examples of (I) Enhancement (de-artifact, de-noise, de-blur at the global and ROI level), (II) Pre-Processing, and (III) Transcoding (QP30, QP42) (bottom).]
Figure 1. The two primary challenges of short-form videos: the kaleidoscope content with various creation modes (top) and complicated distortions arising from sophisticated video processing workflows (bottom). Regions with distortions are indicated by red boxes.
Abstract

Short-form UGC video platforms, like Kwai and TikTok, have become an emerging and irreplaceable mainstream media form, thriving on user-friendly engagement, kaleidoscopic creation, and more. However, advancing content-generation modes, e.g., special effects, and sophisticated processing workflows, e.g., de-artifacting, have introduced significant challenges for recent UGC video quality assessment: (i) ambiguous content hinders the identification of quality-determining regions; (ii) the diverse and complicated hybrid distortions are hard to distinguish. To tackle these challenges and assist the development of short-form videos, we establish the first large-scale Kaleidoscope short Video database for Quality assessment, termed KVQ, which comprises 600 user-uploaded short videos and 3600 videos processed through diverse practical workflows, including pre-processing, transcoding, and enhancement. For each video, an absolute quality score is provided, along with partial ranking scores among indistinguishable samples, by a team of professional researchers specializing in image processing. Based on this database, we propose the first short-form video quality evaluator, KSVQE, which identifies quality-determining semantics through the content understanding of a large vision-language model (CLIP) and distinguishes distortions with a dedicated distortion understanding module. Experimental results show the effectiveness of KSVQE on our KVQ database and on popular VQA databases. The project can be found at https://lixinustc.github.io/projects/KVQ/.

*Equal contribution. †Corresponding authors.
1. Introduction
Recent years have witnessed significant advances in short-form UGC video platforms, where billions of users actively upload and share user-generated content (UGC) videos encompassing personal life, professional skills, education, and more. Unlike traditional video platforms such as YouTube, short-form video platforms aim to simplify content creation for users and to enhance the accessibility and conciseness of video content for viewers by limiting video length; they have achieved great success thanks to mobile-friendly broadcasting, user-friendly engagement, kaleidoscopic content creation, and snackable content. Despite that, the variable and uncertain subjective quality caused by non-professional shooting [9, 60] or bitrate constraints [12, 38, 64] urgently calls for video quality assessment (VQA) tailored to short-form UGC (S-UGC) videos.
Recently, most existing databases [13, 47, 51, 56, 67] and associated studies [5, 28, 46, 53, 54, 68, 69] for UGC video quality assessment target in-the-wild UGC videos from general media platforms (e.g., YouTube). These databases can be divided into two main streams. One stream [13, 51, 67] focuses solely on the quality of UGC videos acquired from traditional streaming-media clients. Another line of UGC databases [26, 74] delves into the impact of compression on UGC videos. In contrast, two primary challenges in the quality assessment of S-UGC videos prevent the application of existing UGC methods: (i) the presence of various special creation/generation modes, e.g., special effects (see Fig. 1), and kaleidoscopic content, including portrait, landscape, food, etc., which confuses VQA models and impedes their ability to accurately identify quality-determining regions/content; (ii) sophisticated processing flows, e.g., transcoding and enhancement, along with the intricate distortions already present in user-uploaded videos, which make it significantly harder for a VQA model to distinguish and determine video quality.
To further improve the quality assessment of S-UGC videos, we establish the first large-scale kaleidoscope short-form video database, named KVQ. In particular, 4200 S-UGC videos are collected to cover the primary creation modes (e.g., special effects and the three-stage form) and content scenarios (e.g., food, stage, night) on a popular short-form UGC video platform; the database comprises 600 user-uploaded S-UGC videos and 3600 S-UGC videos processed via several practical video processing workflows [4, 29, 59, 62] (i.e., pre-processing, enhancement, transcoding). Notably, the selection of content and processing strategies is determined by practical statistics from the popular S-UGC platform, which is significant for the development and measurement of S-UGC VQA. To provide accurate annotations for KVQ, a team of professional researchers specializing in image processing labels the quality of each S-UGC video on the range [1, 5] with an interval of 0.5. Even so, some videos have similar subjective quality, which makes it hard to judge which is better. To endow KVQ with more fine-grained quality estimation capability, we select 500 indistinguishable S-UGC video pairs and provide their ranked annotations, which are not considered by existing UGC datasets.
Based on our KVQ benchmark, we introduce the first Kaleidoscope Short-form UGC Video Quality Evaluator (KSVQE). In particular, to identify quality-determining regions and mitigate the impact of quality-unrelated content, it is necessary to strengthen the content understanding capability of KSVQE. Considering the powerful fine-grained semantic understanding of the pre-trained large vision-language model CLIP [39], we propose the quality-aware region selection module (QRS) and content-adaptive modulation (CaM) for KSVQE. In QRS, a learnable quality adapter adapts the fine-grained semantics from pre-trained CLIP as guidance to identify and retain quality-determining regions while dropping quality-unrelated content. CaM enables KSVQE to perceive the content semantics of each region, since subjective quality is also associated with content. To address the indistinguishability of distortions in S-UGC videos caused by video shooting and sophisticated processing workflows, we enhance the distortion understanding and adaptation capability of KSVQE by incorporating a distortion prior captured with the distortion-aware model CONTRIQUE [35]. Here, CONTRIQUE is efficiently fine-tuned toward the distortion distribution of our KVQ database with a distortion adapter under a contrastive loss function.
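To make the QRS idea concrete, below is a minimal PyTorch-style sketch of CLIP-guided patch selection under our own assumptions (frozen ViT patch tokens, a two-layer adapter, hard top-k with a hypothetical keep ratio of 0.5); it is not the paper's exact architecture, and KSVQE may use a differentiable selection mechanism rather than hard top-k:

```python
import torch
import torch.nn as nn

class QualityAwareRegionSelector(nn.Module):
    """Sketch of QRS-style patch selection: a small learnable adapter scores
    patch tokens from a frozen CLIP-like encoder, and only the top-k patches
    are kept. Hard top-k is used here for clarity; the paper's module may
    instead rely on a differentiable selection mechanism."""

    def __init__(self, dim: int = 768, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio  # hypothetical ratio of patches to keep
        # Lightweight quality adapter: patch feature -> quality-relevance logit.
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, D) patch tokens from a frozen CLIP image encoder.
        scores = self.adapter(patch_feats).squeeze(-1)            # (B, N)
        k = max(1, int(self.keep_ratio * patch_feats.size(1)))
        top_idx = scores.topk(k, dim=1).indices                   # (B, k)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1))
        # Retain quality-determining patches; drop quality-unrelated ones.
        return patch_feats.gather(1, gather_idx)                  # (B, k, D)

feats = torch.randn(2, 196, 768)                  # e.g., a 14x14 ViT patch grid
print(QualityAwareRegionSelector()(feats).shape)  # torch.Size([2, 98, 768])
```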
With the above innovations, KSVQE achieves state-of-the-art performance on our proposed KVQ dataset, outperforming the current best method, DOVER (retrained on KVQ), by 0.032 in PLCC and 0.034 in SROCC. Moreover, KSVQE generalizes well to commonly used UGC-VQA datasets. The contributions of this paper are summarized below:
• We build the first large-scale kaleidoscope short-form video database, termed KVQ, composed of 4200 user-uploaded or processed short-form videos collected from a popular short-form UGC video platform. Reliable absolute quality labels, along with partial ranking labels for indistinguishable samples, are annotated by a group of professional researchers specializing in image processing.
• We propose the first kaleidoscope short-form video quality evaluator, termed KSVQE, to solve two primary challenges in KVQ: (i) unidentified quality-determining regions/content caused by various creation/generation modes and kaleidoscopic content scenarios; (ii) indistinguishable distortions caused by sophisticated processing flows and unprofessional video shooting.
• To enable the content understanding capability of KSVQE, we propose the quality-aware region selection module (QRS) and content-adaptive modulation (CaM) based on the pre-trained large vision-language model CLIP. Beyond that, we enhance the distortion understanding of KSVQE by designing distortion-aware modulation (DaM) via a pre-trained distortion extractor.
• We provide a thorough analysis of KVQ, and extensive experiments on our proposed KVQ database and commonly used UGC VQA datasets demonstrate the effectiveness and applicability of our proposed KSVQE.
2. Related Work
2.1. UGC-VQA databases
In recent years, to develop more realistic and challenging video quality assessment (VQA) for user-generated content (UGC), many UGC databases [11, 37, 44, 47, 56, 67, 74] have been established, collecting videos with authentic distortions. These databases can be categorized into two types based on their collection scope. The first category [13, 51, 67] contains UGC databases collected from real-world media platforms; notably, LSVQ [67] includes a substantial 39,076 videos. The second category [26, 74] involves UGC databases with simulated distortions approximating realistic online video platforms, containing both originally distorted and post-compressed videos. Our proposed KVQ database, gathered from a short-form video platform, is similar to the second category but has two key differences. First, KVQ focuses extensively on short-form videos with various creation modes and kaleidoscopic content. Second, KVQ videos undergo sophisticated processing workflows involving pre-processing, enhancement, and transcoding.
2.2. UGC-VQA methods
There are two main streams of user-generated content video quality assessment (UGC-VQA) [7, 17, 28, 41, 46, 53, 54, 58, 68–71]. The first comprises traditional methods [18, 19, 36, 41], which are constrained by the limitations of handcrafted features and lack the adaptability to handle more complex UGC databases. With the advancement of deep learning, the second stream, learning-based methods, often delivers superior performance and can be categorized into three main types: temporal fusion, multi-prior fusion, and fragment extraction. Temporal-fusion-based methods [5, 21, 55, 69] aim to adaptively fuse quality features in the temporal domain. Multi-prior-based methods [20, 30, 46, 74] typically incorporate multiple priors into quality-aware features for final regression. Fragment-based methods [52, 53] extract texture-level information and eliminate substantial spatio-temporal redundancy. However, the above methods do not incorporate content-distortion understanding into the feature extraction process, which hinders their ability to address the two challenges of short-form video platforms.
3. Our proposed KVQ Database
To advance short-form video quality assessment, we build the first large-scale KVQ database, intended to assist algorithm development. In contrast to traditional UGC VQA databases [13, 37, 51, 67], our KVQ database exhibits the following distinctive features and advantages: (i) a special but crucial application scenario, i.e., the short-form video platform; (ii) advancing content creation/generation modes and kaleidoscopic content; (iii) practical and sophisticated processing workflows; (iv) a unique scoring strategy, i.e., the combination of absolute and ranked quality scores. In the following sections, we clarify these features/advantages in detail.
3.1. Dataset Collection
Our dataset is composed of 4200 S-UGC videos, collected following two principles: (i) ensure content diversity and distortion diversity as much as possible, and (ii) satisfy the practical online statistics and application requirements of popular short-form video platforms. The pipeline of our dataset collection is shown in Fig. 2. Notably, in practical applications, previous UGC-VQA methods usually perform poorly on content generated with advancing creation modes, such as special effects. Considering that, we collect the data from several typical creation modes, including the three-stage, special-effect, subtitled, and live modes (see Fig. 1), as well as other traditional creation modes. The data cover nine primary content scenarios on the practical short-form video platform: landscape, crowd, person, food, portrait, computer graphics (termed CG), caption, night, and stage. In this way, the original user-uploaded content covers almost all existing creation modes and scenarios, and the ratio of each content category matches the practical online statistics. To further align with the video characteristics on the practical platform, we make fine-grained video content adjustments based on six typical video features, i.e., sharpness, complexity, blurriness, noise, blockiness, and colorfulness. Based on the above collection strategy, we collect 600 original user-uploaded S-UGC videos for next-stage processing.
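The paper does not state how these six features are computed; as an illustration, here is a small NumPy sketch of common proxies for two of them, Laplacian-variance sharpness and the Hasler-Susstrunk colorfulness metric (the metric choices are our assumption, not the authors' measurement pipeline):

```python
import numpy as np

def sharpness(gray: np.ndarray) -> float:
    """Variance of a finite-difference Laplacian, a common sharpness proxy."""
    lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0)
           + np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4.0 * gray)
    return float(lap.var())

def colorfulness(rgb: np.ndarray) -> float:
    """Hasler-Susstrunk colorfulness metric on an RGB frame in [0, 255]."""
    r, g, b = (rgb[..., c].astype(np.float64) for c in range(3))
    rg, yb = r - g, 0.5 * (r + g) - b
    return float(np.hypot(rg.std(), yb.std()) + 0.3 * np.hypot(rg.mean(), yb.mean()))

frame = np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8)
print(sharpness(frame.mean(axis=-1)), colorfulness(frame))
```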
Most UGC databases, e.g., UGC-VIDEO [26], simulate the video processing pipeline of UGC videos with a single or simple processing tool, such as transcoding. However, on practical short-form video platforms, the video processing pipeline is sophisticated, comprising different pre-processing, transcoding, and enhancement tools intended to enhance subjective quality and reduce the coding bitrate. Moreover, the processing pipeline is adapted to each video based on its content and quality. Therefore, to build an applicable database, we adopt the representative video processing strategy of a practical short-form video platform for our KVQ database, as shown in Fig. 2, where enhancement φ_e(·), pre-processing φ_p(·), and transcoding φ_t(·) work in a cascaded manner. Concretely, the 50% of videos that are of high quality are processed with six transcoding modes only, since they need neither enhancement nor pre-processing. The other 50%, the low-quality videos, receive one enhancement tool selected from the pools of de-artifacting, de-noising, and de-blurring; pre-processing is then applied with probability 0.5 to the enhanced low-quality data, followed by transcoding. In this way, 3600 processed S-UGC videos are obtained, which can be divided into three groups corresponding to three typical workflows, i.e., φ_t(·), φ_t(φ_e(·)), and φ_t(φ_p(φ_e(·))). Based on the above collection strategy, we collect 4200 S-UGC videos as our database. There are no licensing concerns in this work, since the data collection was authorized by the short-form video platform and the content owners.

Figure 2. The overview of establishing the KVQ dataset involves several key steps. Initially, we collect the original short-form videos to cover the primary creation modes and content scenarios. Subsequently, we make fine-grained video content adjustments based on six video features. Finally, sophisticated video processing workflows are applied to introduce various hybrid distortions.
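A minimal sketch of this cascaded sampling follows, assuming hypothetical transcoding QPs (only QP30 and QP42 are confirmed in Fig. 1) and uniform random choices wherever the paper does not state a distribution:

```python
import random

ENHANCE_POOL = ["de-artifact", "de-noise", "de-blur"]             # phi_e tool pool
TRANSCODE_MODES = [f"QP{qp}" for qp in (22, 26, 30, 34, 38, 42)]  # hypothetical QPs

def sample_chain(is_high_quality: bool) -> list:
    """One processing chain per output video, following the stated rules:
    high-quality sources are transcoded only; low-quality sources get one
    enhancement tool, optional pre-processing (p = 0.5), then transcoding."""
    chain = []
    if not is_high_quality:
        chain.append(random.choice(ENHANCE_POOL))   # phi_e
        if random.random() < 0.5:
            chain.append("pre-process")             # phi_p
    chain.append(random.choice(TRANSCODE_MODES))    # phi_t
    return chain

# 600 source videos, half treated as high quality, six processed outputs each.
sources = [{"id": i, "hq": i < 300} for i in range(600)]
processed = [(v["id"], sample_chain(v["hq"])) for v in sources for _ in range(6)]
print(len(processed))  # 3600, matching the number of processed KVQ videos
```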
3.2. Human Study
The human study is carried out with 15 professional researchers specializing in image processing, in a standard environment for quality assessment. Even with professional labeling, it is still hard to achieve fine-grained absolute scoring with single-stimulus (SS) methods [14]. To enable fine-grained evaluation, we propose mixed scoring: an absolute Mean Opinion Score (MOS) is provided for each video on the range [1, 5] with an interval of 0.5, and a ranking score is provided for indistinguishable S-UGC videos. For the absolute MOS, we follow the standard subjective procedure in ITU-R BT.500-13 [3]. Each participant receives training with unified instructions. After scoring, a data cleaning process is performed for each video.

We notice two representative indistinguishable scenarios. The first occurs between different video contents (i.e., non-homogeneous video pairs) whose MOS difference is less than 0.5. The second occurs when the transcoding levels do not match the assessed quality order for the same content (i.e., homogeneous video pairs), owing to adaptive enhancement and pre-processing. Therefore, to improve fine-grained evaluation capability, we select 250 homogeneous video pairs and 250 non-homogeneous video pairs for ranking labeling, as sketched below.
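A toy sketch of how such candidate pairs could be mined from annotation records; the record layout and per-record values are illustrative assumptions, not the authors' exact selection procedure:

```python
from itertools import combinations

# Illustrative records: (content_id, transcode_level, mos).
annotations = [
    ("vidA", 1, 3.5), ("vidA", 2, 3.7),   # same content, quality order mismatch
    ("vidB", 1, 4.0), ("vidC", 1, 3.8),   # different content, close MOS
]

homogeneous, non_homogeneous = [], []
for (c1, l1, m1), (c2, l2, m2) in combinations(annotations, 2):
    if c1 == c2:
        # Homogeneous pair: a heavier transcode scored at least as high
        # as a lighter one, so the rating order mismatches the level order.
        if (l1 < l2) != (m1 > m2):
            homogeneous.append((c1, l1, l2))
    elif abs(m1 - m2) < 0.5:
        # Non-homogeneous pair: different contents with MOS difference < 0.5.
        non_homogeneous.append((c1, c2))

print(homogeneous)      # [('vidA', 1, 2)]
print(non_homogeneous)  # candidate cross-content pairs
```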
Figure 3. The MOS distribution of different semantic categories
(a) and the histogram of the overall MOS distribution (b).
3.3. Subjective Quality Analysis
In this subsection, we conduct a thorough analysis of the subjective quality scores in our KVQ. Specifically, we visualize the MOS distributions of the 9 content scenarios in Fig. 3. We observe that the MOS distributions of different contents are similar except for the night and stage scenarios, because dark night scenes and complex stage motion tend to degrade the perceptual experience.

To investigate the impact of different processing workflows on subjective quality, we visualize the MOS distributions of three video groups. As stated in Section 3.1, based on the distortions in the 600 original S-UGC videos, we divide them into three groups, where the high-quality video group 1 is processed only with different transcoding modes. From Fig. 5, we observe that subjective quality decreases as QP increases, owing to stronger compression artifacts. Comparing the subjective quality of original videos and processed ones in the first and second QP intervals of video group 2 (i.e., processed with enhancement and transcoding), we find that the enhancement tools can improve the subjective quality effectively de-
[The remaining 18 pages of the paper are not included in this excerpt.]