KVQ: Kwai Video Quality Assessment for Short-form Videos
Yiting Lu^{1*}, Xin Li^{1*}, Yajing Pei^{1,2,*}, Kun Yuan^{2,†}, Qizhi Xie^{2,3}, Yunpeng Qu^{2,3}, Ming Sun^{2}, Chao Zhou^{2}, Zhibo Chen^{1,†}
^{1}University of Science and Technology of China, ^{2}Kuaishou Technology, ^{3}Tsinghua University
{luyt31415,lixin666,peiyj}@mail.ustc.edu.cn, [email protected]
{yuankun03,xieqizhi,quyunpeng,sunming03,zhouchao}@kuaishou.com
[Figure 1 image: (a) Special Effect Video, (b) Subtitled Video, (c) Three-Stage Video, and (d) Live Video across person, portrait, crowd, food, stage, CG, caption, night, and landscape scenarios (top); before/after examples of (I) Enhancement (de-artifact, de-noise, de-blur at the global and ROI level), (II) Pre-Processing, and (III) Transcoding (QP30, QP42) (bottom).]
Figure 1. The two primary challenges of short-form videos: the kaleidoscope content with various creation modes (top) and complicated distortions arising from sophisticated video processing workflows (bottom). Regions with distortions are indicated by red boxes.
Abstract

Short-form UGC video platforms, like Kwai and TikTok, have become an emerging and irreplaceable mainstream media form, thriving on user-friendly engagement, kaleidoscopic creation, and more. However, advancing content-generation modes, e.g., special effects, and sophisticated processing workflows, e.g., de-artifacting, have introduced significant challenges for recent UGC video quality assessment: (i) ambiguous content hinders the identification of quality-determining regions; (ii) the diverse and complicated hybrid distortions are hard to distinguish. To tackle these challenges and assist the development of short-form videos, we establish the first large-scale Kaleidoscope short Video database for Quality assessment, termed KVQ, which comprises 600 user-uploaded short videos and 3600 videos processed through diverse practical workflows, including pre-processing, transcoding, and enhancement. For each video, an absolute quality score is provided, along with partial ranking scores among indistinguishable samples, by a team of professional researchers specializing in image processing. Based on this database, we propose the first short-form video quality evaluator, KSVQE, which identifies quality-determining semantics through the content understanding of a large vision-language model (CLIP) and distinguishes distortions with a dedicated distortion understanding module. Experimental results show the effectiveness of KSVQE on our KVQ database and on popular VQA databases. The project can be found at https://lixinustc.github.io/projects/KVQ/.

*Equal contribution. †Corresponding authors.
1. Introduction
Recent years have witnessed significant advances in short-form UGC video platforms, where billions of users actively upload and share user-generated content (UGC) videos encompassing personal life, professional skills, education, and more. Unlike traditional video platforms such as YouTube, short-form video platforms aim to simplify content creation for users and to enhance the accessibility and conciseness of video content for viewers by limiting video length; they have achieved great success thanks to mobile-friendly broadcasting, user-friendly engagement, kaleidoscopic content creation, and snackable content. Despite that, the variable and uncertain subjective quality caused by non-professional shooting [9, 60] or bitrate constraints [12, 38, 64] urgently calls for video quality assessment (VQA) tailored to short-form UGC (S-UGC) videos.
Recently, most existing databases [13, 47, 51, 56, 67] and associated studies [5, 28, 46, 53, 54, 68, 69] for UGC video quality assessment target in-the-wild UGC videos from general media platforms (e.g., YouTube). These databases can be divided into two main streams. One stream [13, 51, 67] focuses solely on the quality of UGC videos acquired from traditional streaming-media clients. Another line of UGC databases [26, 74] delves into the impact of compression on UGC videos. In contrast, two primary challenges in the quality assessment of S-UGC videos prevent the application of existing UGC methods: (i) the presence of various special creation/generation modes, e.g., special effects (see Fig. 1), and kaleidoscopic content, including portrait, landscape, food, etc., which confuses VQA models and impedes their ability to accurately identify quality-determining regions/content; (ii) sophisticated processing flows, e.g., transcoding and enhancement, along with the intricate distortions already present in user-uploaded videos, which make it significantly harder for a VQA model to distinguish and determine video quality.
To further improve the quality assessment of S-UGC videos, we establish the first large-scale kaleidoscope short-form video database, named KVQ. In particular, 4200 S-UGC videos are collected to cover the primary creation modes (e.g., special effects and the three-stage form) and content scenarios (e.g., food, stage, night) on a popular short-form UGC video platform; the database comprises 600 user-uploaded S-UGC videos and 3600 S-UGC videos processed via several practical video processing workflows [4, 29, 59, 62] (i.e., pre-processing, enhancement, transcoding). Notably, the selection of content and processing strategies is determined by practical statistics from the popular S-UGC platform, which is significant for the development and measurement of S-UGC VQA. To provide accurate annotations for KVQ, a team of professional researchers specializing in image processing labels the quality of each S-UGC video on the range [1, 5] with an interval of 0.5. Even so, some videos have similar subjective quality, which makes it hard to judge which is better. To endow KVQ with more fine-grained quality estimation capability, we select 500 indistinguishable S-UGC video pairs and provide their ranked annotations, which are not considered by existing UGC datasets.
Based on our KVQ benchmark, we introduce the first Kaleidoscope Short-form UGC Video Quality Evaluator (KSVQE). In particular, to identify quality-determining regions and mitigate the impact of quality-unrelated content, it is necessary to strengthen the content understanding capability of KSVQE. Considering the powerful fine-grained semantic understanding of the pre-trained large vision-language model CLIP [39], we propose the quality-aware region selection module (QRS) and content-adaptive modulation (CaM) for KSVQE. In QRS, a learnable quality adapter adapts the fine-grained semantics from pre-trained CLIP as guidance to identify and retain quality-determining regions while dropping quality-unrelated content. CaM enables KSVQE to perceive the content semantics of each region, since subjective quality is also associated with content. To address the indistinguishability of distortions in S-UGC videos caused by video shooting and sophisticated processing workflows, we enhance the distortion understanding and adaptation capability of KSVQE by incorporating a distortion prior captured with the distortion-aware model CONTRIQUE [35]. Here, CONTRIQUE is efficiently fine-tuned toward the distortion distribution of our KVQ database with a distortion adapter under a contrastive loss function.
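To make the QRS idea concrete, below is a minimal PyTorch-style sketch of CLIP-guided patch selection under our own assumptions (frozen ViT patch tokens, a two-layer adapter, hard top-k with a hypothetical keep ratio of 0.5); it is not the paper's exact architecture, and KSVQE may use a differentiable selection mechanism rather than hard top-k:

```python
import torch
import torch.nn as nn

class QualityAwareRegionSelector(nn.Module):
    """Sketch of QRS-style patch selection: a small learnable adapter scores
    patch tokens from a frozen CLIP-like encoder, and only the top-k patches
    are kept. Hard top-k is used here for clarity; the paper's module may
    instead rely on a differentiable selection mechanism."""

    def __init__(self, dim: int = 768, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio  # hypothetical ratio of patches to keep
        # Lightweight quality adapter: patch feature -> quality-relevance logit.
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, D) patch tokens from a frozen CLIP image encoder.
        scores = self.adapter(patch_feats).squeeze(-1)            # (B, N)
        k = max(1, int(self.keep_ratio * patch_feats.size(1)))
        top_idx = scores.topk(k, dim=1).indices                   # (B, k)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1))
        # Retain quality-determining patches; drop quality-unrelated ones.
        return patch_feats.gather(1, gather_idx)                  # (B, k, D)

feats = torch.randn(2, 196, 768)                  # e.g., a 14x14 ViT patch grid
print(QualityAwareRegionSelector()(feats).shape)  # torch.Size([2, 98, 768])
```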
With the above innovations, KSVQE achieves state-of-the-art performance on our proposed KVQ dataset, outperforming the current best method, DOVER (retrained on KVQ), by 0.032 in PLCC and 0.034 in SROCC. Moreover, KSVQE generalizes well to commonly used UGC-VQA datasets. The contributions of this paper are summarized below:
• We build the first large-scale kaleidoscope short-form video database, termed KVQ, composed of 4200 user-uploaded or processed short-form videos collected from a popular short-form UGC video platform. Reliable absolute quality labels, along with partial ranking labels for indistinguishable samples, are annotated by a group of professional researchers specializing in image processing.
• We propose the first kaleidoscope short-form video quality evaluator, termed KSVQE, to solve two primary challenges in KVQ: (i) unidentified quality-determining regions/content caused by various creation/generation modes and kaleidoscopic content scenarios; (ii) indistinguishable distortions caused by sophisticated processing flows and unprofessional video shooting.
• To enable the content understanding capability of KSVQE, we propose the quality-aware region selection module (QRS) and content-adaptive modulation (CaM) based on the pre-trained large vision-language model CLIP. Beyond that, we enhance the distortion understanding of KSVQE by designing distortion-aware modulation (DaM) via a pre-trained distortion extractor.
• We provide a thorough analysis of KVQ, and extensive experiments on our proposed KVQ database and commonly used UGC VQA datasets demonstrate the effectiveness and applicability of our proposed KSVQE.
2. Related Work
2.1. UGC-VQA databases
In recent years, to develop more realistic and challenging video quality assessment (VQA) for user-generated content (UGC), many UGC databases [11, 37, 44, 47, 56, 67, 74] have been established, collecting videos with authentic distortions. These databases can be categorized into two types based on their collection scope. The first category [13, 51, 67] contains UGC databases collected from real-world media platforms; notably, LSVQ [67] includes a substantial 39,076 videos. The second category [26, 74] involves UGC databases with simulated distortions approximating realistic online video platforms, containing both originally distorted and post-compressed videos. Our proposed KVQ database, gathered from a short-form video platform, is similar to the second category but has two key differences. First, KVQ focuses extensively on short-form videos with various creation modes and kaleidoscopic content. Second, KVQ videos undergo sophisticated processing workflows involving pre-processing, enhancement, and transcoding.
2.2. UGC-VQA methods
There are two main streams of user-generated content video quality assessment (UGC-VQA) [7, 17, 28, 41, 46, 53, 54, 58, 68–71]. The first comprises traditional methods [18, 19, 36, 41], which are constrained by the limitations of handcrafted features and lack the adaptability to handle more complex UGC databases. With the advancement of deep learning, the second stream, learning-based methods, often delivers superior performance and can be categorized into three main types: temporal fusion, multi-prior fusion, and fragment extraction. Temporal-fusion-based methods [5, 21, 55, 69] aim to adaptively fuse quality features in the temporal domain. Multi-prior-based methods [20, 30, 46, 74] typically incorporate multiple priors into quality-aware features for final regression. Fragment-based methods [52, 53] extract texture-level information and eliminate substantial spatio-temporal redundancy. However, the above methods do not incorporate content-distortion understanding into the feature extraction process, which hinders their ability to address the two challenges of short-form video platforms.
3. Our proposed KVQ Database
To advance short-form video quality assessment, we build the first large-scale KVQ database, intended to assist algorithm development. In contrast to traditional UGC VQA databases [13, 37, 51, 67], our KVQ database exhibits the following distinctive features and advantages: (i) a special but crucial application scenario, i.e., the short-form video platform; (ii) advancing content creation/generation modes and kaleidoscopic content; (iii) practical and sophisticated processing workflows; (iv) a unique scoring strategy, i.e., the combination of absolute and ranked quality scores. In the following sections, we clarify these features/advantages in detail.
3.1. Dataset Collection
Our dataset is composed of 4200 S-UGC videos, collected following two principles: (i) ensure content diversity and distortion diversity as much as possible, and (ii) satisfy the practical online statistics and application requirements of popular short-form video platforms. The pipeline of our dataset collection is shown in Fig. 2. Notably, in practical applications, previous UGC-VQA methods usually perform poorly on content generated with advancing creation modes, such as special effects. Considering that, we collect the data from several typical creation modes, including the three-stage, special-effect, subtitled, and live modes (see Fig. 1), as well as other traditional creation modes. The data cover nine primary content scenarios on the practical short-form video platform: landscape, crowd, person, food, portrait, computer graphics (termed CG), caption, night, and stage. In this way, the original user-uploaded content covers almost all existing creation modes and scenarios, and the ratio of each content category matches the practical online statistics. To further align with the video characteristics on the practical platform, we make fine-grained video content adjustments based on six typical video features, i.e., sharpness, complexity, blurriness, noise, blockiness, and colorfulness. Based on the above collection strategy, we collect 600 original user-uploaded S-UGC videos for next-stage processing.
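The paper does not state how these six features are computed; as an illustration, here is a small NumPy sketch of common proxies for two of them, Laplacian-variance sharpness and the Hasler-Susstrunk colorfulness metric (the metric choices are our assumption, not the authors' measurement pipeline):

```python
import numpy as np

def sharpness(gray: np.ndarray) -> float:
    """Variance of a finite-difference Laplacian, a common sharpness proxy."""
    lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0)
           + np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4.0 * gray)
    return float(lap.var())

def colorfulness(rgb: np.ndarray) -> float:
    """Hasler-Susstrunk colorfulness metric on an RGB frame in [0, 255]."""
    r, g, b = (rgb[..., c].astype(np.float64) for c in range(3))
    rg, yb = r - g, 0.5 * (r + g) - b
    return float(np.hypot(rg.std(), yb.std()) + 0.3 * np.hypot(rg.mean(), yb.mean()))

frame = np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8)
print(sharpness(frame.mean(axis=-1)), colorfulness(frame))
```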
Most UGC databases, e.g., UGC-VIDEO [26], simulate the video processing pipeline of UGC videos with a single or simple processing tool, such as transcoding. However, on practical short-form video platforms, the video processing pipeline is sophisticated, comprising different pre-processing, transcoding, and enhancement tools intended to enhance subjective quality and reduce the coding bitrate. Moreover, the processing pipeline is adapted to each video based on its content and quality. Therefore, to build an applicable database, we adopt the representative video processing strategy of a practical short-form video platform for our KVQ database, as shown in Fig. 2, where enhancement φ_e(·), pre-processing φ_p(·), and transcoding φ_t(·) work in a cascaded manner. Concretely, the 50% of videos that are of high quality are processed with six transcoding modes only, since they need neither enhancement nor pre-processing. The other 50%, the low-quality videos, receive one enhancement tool selected from the pools of de-artifacting, de-noising, and de-blurring; pre-processing is then applied with probability 0.5 to the enhanced low-quality data, followed by transcoding. In this way, 3600 processed S-UGC videos are obtained, which can be divided into three groups corresponding to three typical workflows, i.e., φ_t(·), φ_t(φ_e(·)), and φ_t(φ_p(φ_e(·))). Based on the above collection strategy, we collect 4200 S-UGC videos as our database. There are no licensing concerns in this work, since the data collection was authorized by the short-form video platform and the content owners.

Figure 2. The overview of establishing the KVQ dataset involves several key steps. Initially, we collect the original short-form videos to cover the primary creation modes and content scenarios. Subsequently, we make fine-grained video content adjustments based on six video features. Finally, sophisticated video processing workflows are applied to introduce various hybrid distortions.
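A minimal sketch of this cascaded sampling follows, assuming hypothetical transcoding QPs (only QP30 and QP42 are confirmed in Fig. 1) and uniform random choices wherever the paper does not state a distribution:

```python
import random

ENHANCE_POOL = ["de-artifact", "de-noise", "de-blur"]             # phi_e tool pool
TRANSCODE_MODES = [f"QP{qp}" for qp in (22, 26, 30, 34, 38, 42)]  # hypothetical QPs

def sample_chain(is_high_quality: bool) -> list:
    """One processing chain per output video, following the stated rules:
    high-quality sources are transcoded only; low-quality sources get one
    enhancement tool, optional pre-processing (p = 0.5), then transcoding."""
    chain = []
    if not is_high_quality:
        chain.append(random.choice(ENHANCE_POOL))   # phi_e
        if random.random() < 0.5:
            chain.append("pre-process")             # phi_p
    chain.append(random.choice(TRANSCODE_MODES))    # phi_t
    return chain

# 600 source videos, half treated as high quality, six processed outputs each.
sources = [{"id": i, "hq": i < 300} for i in range(600)]
processed = [(v["id"], sample_chain(v["hq"])) for v in sources for _ in range(6)]
print(len(processed))  # 3600, matching the number of processed KVQ videos
```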
3.2. Human Study
The human study is carried out with 15 professional researchers specializing in image processing, in a standard environment for quality assessment. Even with professional labeling, it is still hard to achieve fine-grained absolute scoring with single-stimulus (SS) methods [14]. To enable fine-grained evaluation, we propose mixed scoring: an absolute Mean Opinion Score (MOS) is provided for each video on the range [1, 5] with an interval of 0.5, and a ranking score is provided for indistinguishable S-UGC videos. For the absolute MOS, we follow the standard subjective procedure in ITU-R BT.500-13 [3]. Each participant receives training with unified instructions. After scoring, a data cleaning process is performed for each video.

We notice two representative indistinguishable scenarios. The first occurs between different video contents (i.e., non-homogeneous video pairs) whose MOS difference is less than 0.5. The second occurs when the transcoding levels do not match the assessed quality order for the same content (i.e., homogeneous video pairs), owing to adaptive enhancement and pre-processing. Therefore, to improve fine-grained evaluation capability, we select 250 homogeneous video pairs and 250 non-homogeneous video pairs for ranking labeling, as sketched below.
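A toy sketch of how such candidate pairs could be mined from annotation records; the record layout and per-record values are illustrative assumptions, not the authors' exact selection procedure:

```python
from itertools import combinations

# Illustrative records: (content_id, transcode_level, mos).
annotations = [
    ("vidA", 1, 3.5), ("vidA", 2, 3.7),   # same content, quality order mismatch
    ("vidB", 1, 4.0), ("vidC", 1, 3.8),   # different content, close MOS
]

homogeneous, non_homogeneous = [], []
for (c1, l1, m1), (c2, l2, m2) in combinations(annotations, 2):
    if c1 == c2:
        # Homogeneous pair: a heavier transcode scored at least as high
        # as a lighter one, so the rating order mismatches the level order.
        if (l1 < l2) != (m1 > m2):
            homogeneous.append((c1, l1, l2))
    elif abs(m1 - m2) < 0.5:
        # Non-homogeneous pair: different contents with MOS difference < 0.5.
        non_homogeneous.append((c1, c2))

print(homogeneous)      # [('vidA', 1, 2)]
print(non_homogeneous)  # candidate cross-content pairs
```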
Figure 3. The MOS distribution of different semantic categories
(a) and the histogram of the overall MOS distribution (b).
3.3. Subjective Quality Analysis
In this subsection, we conduct a thorough analysis of the subjective quality scores in our KVQ. Specifically, we visualize the MOS distributions of the 9 content scenarios in Fig. 3. We observe that the MOS distributions of different contents are similar except for the night and stage scenarios, because dark night scenes and complex stage motion tend to degrade the perceptual experience.

To investigate the impact of different processing workflows on subjective quality, we visualize the MOS distributions of three video groups. As stated in Section 3.1, based on the distortions in the 600 original S-UGC videos, we divide them into three groups, where the high-quality video group 1 is processed only with different transcoding modes. From Fig. 5, we observe that subjective quality decreases as QP increases, owing to stronger compression artifacts. Comparing the subjective quality of original videos and processed ones in the first and second QP intervals of video group 2 (i.e., processed with enhancement and transcoding), we find that the enhancement tools can improve the subjective quality effectively de-
[The remaining 18 pages of the paper are not included in this excerpt.]